It is true that automaton-like, character-by-character processing
might be a more natural way of doing this in a classical programming
language. In XSLT 2.0 (unlike XSLT 1.0), such a procedural approach
could certainly be realized with custom-defined functions (even with
everything stored in a single expression, without XML syntax
interfering much). Still, processing regexes with regexes seems to
provide a more general solution that depends less on its
implementation in any particular language. And it's more fun... :-)
I am dealing here only with XSLT 2.0 regexes that are known to have
only fixed-length matches. Their definition is deliberately somewhat
too strict: it excludes certain types of regexes whose matches are in
fact of fixed length. (For instance, for simplicity's sake every regex
that is split into branches is considered to have potentially
variable-length matches, even if it has only top-level branches with
fixed-length matches.)
For a discussion of the criterion used to distinguish "fixed-length
match regexes" from "variable-length match regexes", see the thread
"Meta-regex for variable-length matching regexes (XSLT)".
Given a fixed-length matching XSLT 2.0 regex as defined above, the
constant length of its matches appears to be determinable by applying
the following operations to the regex, in this order:
<!-- 1) reduce pairs of backslashes, i.e. literal backslashes, to single chars -->
<xsl:variable name="r1"
select="replace($regex, '\\\\', '.')"/>
<!-- 2) reduce Category Escapes (variable length of
Unicode-related Property value) to single chars -->
<xsl:variable name="r2"
select="replace($r1, '\\[pP]\{[a-zA-Z0-9-]+\}', '.')"/>
<!-- 3) reduce escaped literals to single chars -->
<xsl:variable name="r3"
select="replace($r2, '\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]', '.')"/>
<!-- 4) reduce Character Class Subtractions, written [a-z-[aeiou]] in
XSLT 2.0 regexes, to single chars (must precede step 5) -->
<xsl:variable name="r4"
select="replace($r3, '\[[^\[\]]+-\[[^\[\]]+\]\]', '.')"/>
<!-- 5) reduce Negative and Positive Character Classes to single chars -->
<xsl:variable name="r5"
select="replace($r4, '\[[^\]]+\]', '.')"/>
<!-- 6) remove parentheses -->
<xsl:variable name="r6"
select="replace($r5, '[()]', '')"/>
<!-- finally, count length of the expression -->
<xsl:sequence select="string-length($r6)"/>
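To try the six steps on sample regexes, they could be wrapped in a stylesheet function. What follows is only a sketch: the function name my:fixed-length, the my namespace URI and the named template main are placeholders that do not come from the steps above, the input is assumed to have passed the fixed-length criterion already, and step 4 is written for the [a-z-[aeiou]] form of subtraction.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical wrapper around steps 1 to 6; assumes $regex is a
       fixed-length matching regex without {n} quantifiers. -->
  <xsl:function name="my:fixed-length" as="xs:integer">
    <xsl:param name="regex" as="xs:string"/>
    <xsl:variable name="r1" select="replace($regex, '\\\\', '.')"/>
    <xsl:variable name="r2" select="replace($r1, '\\[pP]\{[a-zA-Z0-9-]+\}', '.')"/>
    <xsl:variable name="r3" select="replace($r2, '\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]', '.')"/>
    <xsl:variable name="r4" select="replace($r3, '\[[^\[\]]+-\[[^\[\]]+\]\]', '.')"/>
    <xsl:variable name="r5" select="replace($r4, '\[[^\]]+\]', '.')"/>
    <xsl:variable name="r6" select="replace($r5, '[()]', '')"/>
    <xsl:sequence select="string-length($r6)"/>
  </xsl:function>

  <!-- Entry point for a quick trial, e.g. via Saxon's -it:main option. -->
  <xsl:template name="main">
    <xsl:value-of select="my:fixed-length('\p{Lu}[a-z]\\(\d)x')"/>
  </xsl:template>

</xsl:stylesheet>
```

For the sample regex the intermediate strings are \p{Lu}[a-z].(\d)x, then .[a-z].(\d)x, then .[a-z].(.)x, then ...(.)x, and finally ....x, so the call should yield 5: one character each for \p{Lu}, [a-z], the literal backslash, \d and x.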
This simplified method, however, has to exclude regexes that contain
a quantifier with a fixed number, i.e. regexes that are themselves
matched by
\{\d+\}
because replacements alone cannot provide the arithmetic that is
required to find the proper length of a series of atoms repeated
this way.
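A concrete case: for a{3} none of the six steps changes the string, so the method reports 4, although every match is exactly 3 characters long. One way to reject such input might be a guard placed between steps 5 and 6. This is a sketch only: the message text is invented, and $r5 is assumed to be the variable produced by step 5 above.

```xml
<!-- Hypothetical guard between steps 5 and 6. After step 5, braces that
     belonged to escapes, category escapes or character classes are gone,
     so any {n} still present in $r5 is a real quantifier. -->
<xsl:if test="matches($r5, '\{[0-9]+\}')">
  <xsl:message terminate="yes">Fixed-number quantifier found; the simplified method does not apply.</xsl:message>
</xsl:if>
```

Testing $r5 rather than the raw regex avoids false alarms for inputs such as \{3\} or [{3}], where the braces are literal characters.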
It seems to be rather complicated to integrate that into the method
described above. Below, I have tried to sketch the required additional
steps, while repeating the steps from above to show clearly how the
modifications fit in.
<!-- 1) reduce pairs of backslashes, i.e. literal backslashes, to single chars -->
<xsl:variable name="r1"
select="replace($regex, '\\\\', '.')"/>
1a) Shift any parenthesized group with a fixed number of repetitions
into the number indicator for later retrieval, so replace
(\([^)]+\))\{([0-9]+)\}
with
{$1$2}
Here $1 will contain the parentheses as well, which serve to
unambiguously delimit the sequence to repeat from the number of
repetitions. The regex above, however, cannot deal with nested
parenthesized groups, and I don't see any easy way to restrict the
first matching group to the atomic expression that is actually
quantified. E.g., in
(abc)(d(.)f){3}
the first matched group should of course contain "(d(.)f)". IIRC,
ensuring that an arbitrary number of parentheses is properly paired is
beyond the realm of regular languages. But it would be possible to
allow, e.g., up to 2 pairs of parentheses:
\((.*?\([^)]+\).*?|[^)]+)\)
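Step 1a might be written as the fragment below, which continues from $r1. It is a sketch, not a tested solution: the variable names r1a and r1a-nested are placeholders, and the second variable tries a stricter two-level pattern that uses [^()] in place of . so that a match cannot run across the boundary between two neighbouring groups.

```xml
<!-- 1a) simple form: only groups without nested parentheses.
     '(abc)(d.f){3}' becomes '(abc){(d.f)3}' -->
<xsl:variable name="r1a"
  select="replace($r1, '(\([^)]+\))\{([0-9]+)\}', '{$1$2}')"/>

<!-- 1a) variant allowing one nested group. The extra pair of parentheses
     makes $2 the group content, so the repetition count moves to $3.
     '(abc)(d(.)f){3}' becomes '(abc){(d(.)f)3}' -->
<xsl:variable name="r1a-nested"
  select="replace($r1, '(\(([^()]*\([^()]+\)[^()]*|[^()]+)\))\{([0-9]+)\}', '{$1$3}')"/>
```

Since this runs before escaped literals are reduced in step 3, an escaped parenthesis inside a quantified group can still mislead both patterns.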
<!-- 2) reduce Category Escapes (variable length of
Unicode-related Property value) to single chars -->
<xsl:variable name="r2"
select="replace($r1, '\\[pP]\{[a-zA-Z0-9-]+\}', '.')"/>
<!-- 3) reduce escaped literals to single chars -->
<xsl:variable name="r3"
select="replace($r2, '\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]', '.')"/>
<!-- 4) reduce Character Class Subtractions, written [a-z-[aeiou]] in
XSLT 2.0 regexes, to single chars (must precede step 5) -->
<xsl:variable name="r4"
select="replace($r3, '\[[^\[\]]+-\[[^\[\]]+\]\]', '.')"/>
<!-- 5) reduce Negative and Positive Character Classes to single chars -->
<xsl:variable name="r5"
select="replace($r4, '\[[^\]]+\]', '.')"/>
5a) Calculate the lengths of fixed-number repetitions separately
Simple replacements won't suffice in this step, as real arithmetic
is required to evaluate the numbers and perform the multiplications.
First, all of the substrings in the regex (as it is now) describing
fixed-number quantified parenthesized groups, i.e. those that match
the regex
\{\((.*?[^\\])\)([0-9]+)\}
are processed by multiplying the string length of $1 with the number
given in $2, and adding up the results for all of them. (In this
regex, a group content of ".*?[^\\]" helps to avoid confusing a
literalized closing parenthesis, i.e. one preceded by a backslash,
with the parenthesis that actually closes the group. Requiring at
least one character to occur within the group should be OK.)
Second, all of the substrings in the regex (as it is now) that match the
regex
\{([0-9]+)\}
i.e. the quantifiers of single atoms, are processed by adding the
number given in each match's $1 to the results previously obtained.
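In XSLT 2.0 the arithmetic of step 5a could be done with xsl:analyze-string, which hands each match to a sequence constructor. The stylesheet below is a sketch under stated assumptions: the function name my:repetition-lengths, the my namespace URI and the template main are placeholders, and the input is assumed to be the regex as it stands after step 5, with quantified groups already rewritten by step 1a and free of nested parentheses.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper for step 5a: sums the lengths contributed by
       all fixed-number repetitions found in $r5. -->
  <xsl:function name="my:repetition-lengths" as="xs:integer">
    <xsl:param name="r5" as="xs:string"/>
    <xsl:variable name="lengths" as="xs:integer*">
      <!-- Groups shifted in step 1a, e.g. {(d.f)3}: content length times
           count. The regex attribute is an attribute value template, so
           literal curly brackets are doubled. -->
      <xsl:analyze-string select="$r5" regex="\{{\((.*?[^\\])\)([0-9]+)\}}">
        <xsl:matching-substring>
          <xsl:sequence select="string-length(regex-group(1)) * xs:integer(regex-group(2))"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
      <!-- Quantified single atoms, e.g. x{2}: the count itself. -->
      <xsl:analyze-string select="$r5" regex="\{{([0-9]+)\}}">
        <xsl:matching-substring>
          <xsl:sequence select="xs:integer(regex-group(1))"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="sum($lengths)"/>
  </xsl:function>

  <xsl:template name="main">
    <!-- 3 * 3 for {(d.f)3} plus 2 for x{2} -->
    <xsl:value-of select="my:repetition-lengths('(abc){(d.f)3}x{2}')"/>
  </xsl:template>

</xsl:stylesheet>
```

For the sample string this should yield 11. If a group still contains nested parentheses, they would have to be stripped from regex-group(1) before measuring, for instance with replace(regex-group(1), '[()]', '').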
5b) Remove all fixed-number quantified atoms from the regex (they must
be deleted before measuring the length of the remaining regex
because their lengths will be added separately), so replace by the
empty string all matches for fixed-number quantified single
characters available as
.\{[0-9]+\}
(The preceding quantified character must also be removed here.)
After this, we can just remove all remaining matches for sequences
enclosed in curly brackets to catch the (previously modified)
fixed-number quantified parenthesized groups:
\{.*?\}
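The two removals of step 5b could look like the fragment below. It is a sketch: the variable names r5b-atoms and r5b are placeholders, and $r5 is assumed to be the result of step 5.

```xml
<!-- 5b) drop quantified single atoms together with their {n}, e.g. x{2} -->
<xsl:variable name="r5b-atoms"
  select="replace($r5, '.\{[0-9]+\}', '')"/>
<!-- then drop what step 1a left behind, e.g. {(d.f)3} -->
<xsl:variable name="r5b"
  select="replace($r5b-atoms, '\{.*?\}', '')"/>
```

With (abc){(d.f)3}x{2} as the value of $r5, the first replacement leaves (abc){(d.f)3} and the second leaves (abc). Step 6 would then have to read $r5b instead of $r5.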
<!-- 6) remove parentheses -->
<xsl:variable name="r6"
select="replace($r5, '[()]', '')"/>
<!-- finally, count length of the expression -->
<xsl:sequence select="string-length($r6)"/>
7) Add the additional lengths obtained from fixed-number quantified
atoms in step 5a to the string length of our regex.
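Put together, steps 1 to 7 might be assembled as below, mainly to show how each variable feeds the next. This is a sketch and not a verified implementation: the names my:fixed-length-ext, r1a, extra and r5b are placeholders, only the simple form of step 1a (no nested groups) is used, and step 4 is written for the [a-z-[aeiou]] form of subtraction.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical assembly of steps 1 to 7 for fixed-length regexes
       whose quantified groups contain no nested parentheses. -->
  <xsl:function name="my:fixed-length-ext" as="xs:integer">
    <xsl:param name="regex" as="xs:string"/>
    <xsl:variable name="r1" select="replace($regex, '\\\\', '.')"/>
    <!-- 1a) (group){n} becomes {(group)n} -->
    <xsl:variable name="r1a" select="replace($r1, '(\([^)]+\))\{([0-9]+)\}', '{$1$2}')"/>
    <xsl:variable name="r2" select="replace($r1a, '\\[pP]\{[a-zA-Z0-9-]+\}', '.')"/>
    <xsl:variable name="r3" select="replace($r2, '\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]', '.')"/>
    <xsl:variable name="r4" select="replace($r3, '\[[^\[\]]+-\[[^\[\]]+\]\]', '.')"/>
    <xsl:variable name="r5" select="replace($r4, '\[[^\]]+\]', '.')"/>
    <!-- 5a) lengths of the repetitions; regex is an attribute value
         template, so literal curly brackets are doubled -->
    <xsl:variable name="extra" as="xs:integer*">
      <xsl:analyze-string select="$r5" regex="\{{\((.*?[^\\])\)([0-9]+)\}}">
        <xsl:matching-substring>
          <xsl:sequence select="string-length(regex-group(1)) * xs:integer(regex-group(2))"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
      <xsl:analyze-string select="$r5" regex="\{{([0-9]+)\}}">
        <xsl:matching-substring>
          <xsl:sequence select="xs:integer(regex-group(1))"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- 5b) remove quantified atoms, then the rewritten groups -->
    <xsl:variable name="r5b"
      select="replace(replace($r5, '.\{[0-9]+\}', ''), '\{.*?\}', '')"/>
    <!-- 6) remove parentheses, now from the result of step 5b -->
    <xsl:variable name="r6" select="replace($r5b, '[()]', '')"/>
    <!-- 7) remaining length plus the separately computed lengths -->
    <xsl:sequence select="string-length($r6) + sum($extra)"/>
  </xsl:function>

  <xsl:template name="main">
    <xsl:value-of select="my:fixed-length-ext('(abc)(d\d[x-z]){3}\p{L}{2}')"/>
  </xsl:template>

</xsl:stylesheet>
```

For the sample regex, $r5 is (abc){(d..)3}.{2}, step 5a contributes 3 * 3 + 2 = 11, step 5b leaves (abc), and step 6 leaves abc, so the call should yield 14. That agrees with counting by hand: 3 for abc, 9 for the repeated group and 2 for \p{L}{2}.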
Any opinions on this? Where did I miss something? Will this order of
execution of the steps always yield the exact match length for any
"fixed-length matching" XSLT 2.0 regex?
Yves