In the context of the regex dialect of XSLT 2.0, I am looking for two "meta-regexes" that analyse other regexes. Each is discussed in a part of its own.
(Part 2 posted as: Meta-regex to find fixed length of regex match (XSLT))
Part 1
Aim: A regex that would identify all those regexes that might yield matches of varying length, as opposed to regexes whose matches will always be of fixed length. (Considering only the case that the regex matches at all.)
The idea, of course, is to look for metacharacters in constructs that allow repetition, alternation or other length variations, while
keeping out uses of metacharacters required for escaping that do not change the length of the matches.
The syntax of XSLT 2.0 regexes is described here:
http://www.w3.org/TR/xpath-functions/#regex-syntax , which is based on
the XML Schema Datatypes Spec at
http://www.w3.org/TR/xmlschema-2/#regexs .
XSLT 2.0 permits these metacharacters in regexes:
. \ ? * + { } [ ] ( ) | ^ $
(And "-" within Character Ranges and with Character Class
Subtractions.)
Among those, these metacharacters never influence the length of the
match (parentheses are only used for grouping in XSLT; pattern anchors
or character ranges by themselves cannot lead to variable lengths of
the matches):
. ( ) [ ] ^ $
The following characters are used to denote quantifiers (and
alternation) and thus most easily lead to matches of varying length:
? * + { } |
They make up these quantifiers (first greedy ones, then reluctant ones):
?
*
+
{n}
{n,}
{n,m}
??
*?
+?
{n}?
{n,}?
{n,m}?
However, non-escaped curly brackets may also be used in a Category
Escape like "\p{L}" which is used to designate a set of characters
that match a one-character string.
The backslash requires special thought. In most of its uses as an
escaping metacharacter, it is just a notational device for
single-character matches. However, XSLT 2.0 allows back-references like "\1"
that can definitely influence the length of the matched string. So if followed by at
least one digit, the backslash indicates potential variable-length
matches, too.
Conclusion: An XSLT 2.0 regex is guaranteed to have only matches of fixed length iff it does not contain:
1) a non-escaped occurrence of a character from this set of characters:
? * + |
2) curly brackets that are neither a) escaped nor b) part of a
Category Escape
3) an occurrence of a backslash followed by a digit
The rationale is that only the non-escaped presence of one of those
constructs enables matches with varying length in the XSLT 2.0 regex dialect.
So this regex should find all the regexes possibly matching strings of variable length (reluctant quantifiers are covered implicitly, as
the greedy ones are prefixes of them):
[^\\][?*+{}|]|\\\d
Is my reasoning correct? To what extent does the regex fulfill the requirements stated?
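To make the intended behaviour concrete, the meta-regex above can be tried against a few sample regexes. This is only a minimal sketch in Python, used as a stand-in on the assumption that Python's `re` module treats this particular pattern the same way the XSLT 2.0 dialect would; the function name `may_vary` and the sample regexes are made up for illustration.

```python
import re

# The meta-regex from the post, unchanged.
META = re.compile(r'[^\\][?*+{}|]|\\\d')

def may_vary(regex: str) -> bool:
    """Return True if the meta-regex flags `regex` as possibly variable-length."""
    return META.search(regex) is not None

# Hypothetical sample regexes, chosen only to exercise each condition.
samples = [r'abc', r'a+b', r'a\+b', r'(a)\1', r'a|bc', r'\p{L}']
for sample in samples:
    print(sample, may_vary(sample))
```

In this sketch, `\p{L}` is flagged although it always matches exactly one character, because "p{" looks like a non-escaped curly bracket to the meta-regex.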
Unfortunately, condition 2 b is still unsatisfied. This is because I have difficulty specifying for the opening curly bracket that it
should not match when it is preceded by the string "\p" or "\P". Is
there any way to enforce the occurrence of both characters when testing,
with neither lookbehind assertions nor conditional subpatterns being available in XSLT?
Requiring a preceding character other than a backslash should be OK,
because all real quantifiers must have at least some character in
front of them. (As to "|", it does not, but it would make no sense to
start a regexp with it, and so even if it is allowed in XSLT 2.0, I won't care about that.)
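As for condition 2 b, one approach that might get by without lookbehind is a two-step check: first replace every Category Escape and every other single-character escape with a harmless placeholder, then look for whatever metacharacters remain. The following is a rough sketch in Python, not a tested XSLT solution. The names `ESCAPES`, `VARIABLE` and `may_vary_two_step` are hypothetical, and it is an unverified assumption that replace() and matches() in XSLT 2.0 would behave the same way for these patterns.

```python
import re

# Step 1: neutralise Category Escapes (\p{..}, \P{..}) and every other
# backslash escape that is not a back-reference. Scanning left to right
# means an escaped backslash is consumed as a pair first.
ESCAPES = re.compile(r'\\[pP]\{[^}]*\}|\\\D')

# Step 2: any quantifier or alternation character still present is
# non-escaped, and any backslash still present precedes a digit.
VARIABLE = re.compile(r'[?*+{}|]|\\\d')

def may_vary_two_step(regex: str) -> bool:
    """Flag `regex` as possibly variable-length after removing escapes."""
    stripped = ESCAPES.sub('x', regex)  # 'x' is an arbitrary placeholder
    return VARIABLE.search(stripped) is not None

for sample in [r'\p{L}', r'\p{L}+', r'a\+b', r'\\p{2}', r'(a)\1', r'|a']:
    print(sample, may_vary_two_step(sample))
```

Because the escapes are removed beforehand, the second pattern does not need a preceding non-backslash character, so a leading "|" would be caught as well.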
BTW, if someone can come up with the complementary "meta-regex" that identifies only those regexes with fixed-length matches, that one would serve me equally well: I would simply invert the logic of my
control flow.
Regards,
Yves