In the context of the regex dialect of XSLT 2.0, I am looking for two
"meta-regexes" that analyse other regexes. Each is discussed in a part
of its own.
(Part 1 posted as: Meta-regex for variable-length matching regexes (XSLT))
Part 2
Aim: A regex that provides for pre-calculating the length of the match
that a fixed-length matching regex will find, if it matches at all.
This essentially means normalizing the representations that use
metacharacters into single characters.
As a premise, a set of regexes should be given which are known to
always have a match of some fixed length (if they match). See Part 1
for a method of obtaining this set.
The approach proposed is to identify all multi-character sequences
that match one-character strings and normalize them into a single
replacement character before submitting the result to a string length
computing function.
The multi-character sequences in XSLT 2.0 regexes that match
one-character strings are examined below. Their syntax is described here:
http://www.w3.org/TR/xpath-functions/#regex-syntax , which is based on
the XML Schema Datatypes Spec at
http://www.w3.org/TR/xmlschema-2/#regexs .
a) Combinations of one of these characters with a preceding backslash,
representing a one-character string of a specific character
("single character escapes"):
n r t \ | . ? * + ( ) [ ] { } ^ $ -
(Caveat: Back-references are written like "\1" and may match more than
one character. Regexes with fixed-length matches will not contain any
of them, see above.)
b) "Multi-character escapes": they can only match a one-character
string, too (while allowing one out of a class of characters to
match), using one of these letters after a backslash:
s S i I c C d D w W
c) Additionally, XSLT (i.e., XML Schema) has some further constructs
that match a one-character string: "Category Escapes" and "Block
Escapes", written as:
\p{ NAME }
\P{ NAME }
where NAME is a sequence of characters from this set: a-z, A-Z, 0-9, -
Conclusion: Each of the combinations above needs to be transformed
into a single character in order to make the number of characters in
the regex equal to the fixed length of its matches.
This requires several substitution steps to be executed in order, each
using a different meta-regex to capture specific constructs:
1) Category Escapes:
\\[pP]\{[a-zA-Z0-9-]+\}
to be replaced with string "."
2) Combinations of a backslash and a single character, including
escaped square brackets, curly brackets etc.:
\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]
to be replaced with string "."
3) Character classes:
\[[^\]]+\]
to be replaced with string "."
4) Parentheses (unescaped square brackets were replaced in step 3;
unescaped curly brackets other than those replaced in step 1 are
incompatible with fixed-length matches):
[()]
to be replaced with nothing
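To make the four steps concrete, here is a minimal sketch in Python. It is an assumption-laden stand-in: Python's re.sub() plays the role of XSLT 2.0's replace(), the name normalized_length is hypothetical, and step 1 is written with a "+" quantifier so that names longer than one character are matched. It does not handle the preceding-backslash problem.

```python
import re

# The four meta-regexes, in the order given above.
# Python's re module is only a stand-in for XSLT 2.0's replace().
STEPS = [
    (r"\\[pP]\{[a-zA-Z0-9-]+\}", "."),              # 1) category/block escapes
    (r"\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]", "."),  # 2) backslash + one character
    (r"\[[^\]]+\]", "."),                           # 3) character classes
    (r"[()]", ""),                                  # 4) grouping parentheses
]

def normalized_length(regex: str) -> int:
    """Apply the substitution steps in order and count what is left."""
    for pattern, replacement in STEPS:
        regex = re.sub(pattern, replacement, regex)
    return len(regex)
```

For example, normalized_length(r"\p{Lu}[a-z]\d(ab)\.") normalizes the regex to "...ab." and returns 6, which is the length of any string that regex matches.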
Is this a reasonable approach? Are the regexes suited to the task of
normalizing each fixed-length match regex into a string of exactly
this length?
What about the order of the steps? For instance, step 2 replaces
literal (i.e. escaped) square brackets within character classes which
are themselves replaced in step 3.
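The effect of the order can be checked on a small case. The sketch below, again using Python's re.sub() as a stand-in for XSLT's replace() and a hypothetical helper apply_in_order, runs steps 2 and 3 in both orders on a class that contains an escaped closing bracket.

```python
import re

ESCAPES = r"\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]"  # step 2
CLASSES = r"\[[^\]]+\]"                            # step 3

def apply_in_order(regex, patterns):
    """Replace every match of each pattern, in sequence, with '.'."""
    for pattern in patterns:
        regex = re.sub(pattern, ".", regex)
    return regex

sample = r"[\]x]"  # one character class holding a literal "]" and "x"
escapes_first = apply_in_order(sample, [ESCAPES, CLASSES])  # "."
classes_first = apply_in_order(sample, [CLASSES, ESCAPES])  # ".x]"
```

With step 3 first, the class pattern stops at the escaped "]" and leaves ".x]" behind, so for this input at least, running step 2 before step 3 appears to be the safer order.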
All of the regexes above fail to watch out for a backslash preceding
their match, which would cancel the (special) meaning of the
construct they describe. Unfortunately, XSLT 2.0 offers neither
lookbehind assertions nor conditional subpatterns to take care of
this case.
Conventional testing of the preceding character depends on its
presence, which we can't be sure of. Wondering how to keep the left
side of the match always free from any backslash, I can only think of
doubling each regex into two alternatives that consider both cases
separately, like this (example for step 4):
^[()]|[^\\][()]
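One thing to watch with the doubled form is that its second alternative also consumes the character to the left of the parenthesis, so replacing the match with nothing deletes that character too. The sketch below (Python's re.sub() standing in for XSLT's replace(); GROUPED and both function names are hypothetical) compares it with a variant that captures the left neighbour and puts it back.

```python
import re

DOUBLED = r"^[()]|[^\\][()]"  # the two-alternative form shown above
GROUPED = r"(^|[^\\])[()]"    # hypothetical variant: capture the left neighbour

def strip_doubled(regex):
    # Replacing with nothing also removes the consumed neighbour.
    return re.sub(DOUBLED, "", regex)

def strip_grouped(regex):
    # \1 re-emits the captured neighbour (XSLT's replace() would use $1).
    return re.sub(GROUPED, r"\1", regex)
```

On "a(b)c", strip_doubled returns "c" while strip_grouped returns "abc". Because the neighbour is consumed in either form, directly adjacent parentheses are not all caught in one pass: strip_grouped("((a))") leaves "(a)", so the replacement would have to be repeated until nothing changes.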
Regards,
Yves