Meta-regex to find fixed length of regex match (XSLT)

  •  10-02-2007, 9:42 AM

    In the context of the regex dialect of XSLT 2.0, I am looking for two
    "meta-regexes" that analyse other regexes. Each is discussed in a part
    of its own.

    (Part 1 posted as: Meta-regex for variable-length matching regexes (XSLT))

    Part 2

    Aim: A regex that makes it possible to pre-calculate the length of the
    match that a fixed-length matching regex will find, if it matches at
    all. This essentially means normalizing the representations that use
    metacharacters into single characters.

    As a premise, a set of regexes is given which are known to always
    produce a match of some fixed length (if they match). See Part 1 for
    a method of obtaining this set.

    The approach proposed is to identify all multi-character sequences
    that match one-character strings and normalize them into a single
    replacement character before submitting the result to a string length
    computing function.

    The multi-character sequences matching one-character strings of XSLT
    2.0 regexes are examined below. Their syntax is described here:
    http://www.w3.org/TR/xpath-functions/#regex-syntax , which is based on
    the XML Schema Datatypes Spec at
    http://www.w3.org/TR/xmlschema-2/#regexs .

    a) Combinations of one of these characters with a preceding backslash,
       representing a one-character string of a specific character
       ("single character escapes"):

    n r t \ | . ? * + ( ) [ ] { } ^ $ -

    (Caveat: Back-references are written like "\1" and may match more than
    one character. Regexes with fixed-length matches will not contain any
    of them, see above.)

    b) "Multi-character escapes": they can only match a one-character
       string, too (while allowing one out of a class of characters to
       match), using one of these letters after a backslash:

    s S i I c C d D w W

    c) Additionally, XSLT (i.e., XML Schema) has some further constructs
       that match a one-character string: "Category Escapes" and "Block
       Escapes", written as:

    \p{ NAME }
    \P{ NAME }

    where NAME is a sequence of characters from this set: a-z, A-Z, 0-9, -

    Conclusion: Each of the combinations above needs to be transformed
    into a single character in order to make the number of characters in
    the regex equal to the fixed length of its matches.

    This requires several substitution steps to be executed in order, each
    using a different meta-regex to capture specific constructs:

    1) Category Escapes:

    \\[pP]\{[a-zA-Z0-9-]+\}

    to be replaced with string "."

    2) Combinations of a backslash and a single character, including
       escaped square brackets, curly brackets etc.:

    \\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]

    to be replaced with string "."

    3) Character classes:

    \[[^\]]+\]

    to be replaced with string "."

    4) Parentheses (unescaped square brackets were replaced in step 3;
       unescaped curly brackets other than those replaced in step 1 are
       incompatible with fixed-length matches):

    [()]

    to be replaced with nothing
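
    The four steps above can be sketched as a small pipeline. This is only
    an illustration under assumptions: Python's re module stands in for the
    XSLT 2.0 replace() function (the two dialects differ in details), step 1
    is written with a "+" after the NAME character class, and the helper
    name fixed_match_length is hypothetical.

```python
import re

# The four substitution steps, applied in the order given in the post.
# Assumption: Python's `re` is used here in place of XSLT 2.0 replace().
STEPS = [
    (r"\\[pP]\{[a-zA-Z0-9-]+\}", "."),              # 1) category/block escapes
    (r"\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]", "."),  # 2) backslash + one character
    (r"\[[^\]]+\]", "."),                           # 3) character classes
    (r"[()]", ""),                                  # 4) grouping parentheses
]


def fixed_match_length(regex):
    """Hypothetical helper: normalize a fixed-length regex, then count characters."""
    normalized = regex
    for pattern, replacement in STEPS:
        normalized = re.sub(pattern, replacement, normalized)
    return len(normalized)
```

    For example, fixed_match_length(r"[A-Z]\d\p{Lu}") returns 3: step 1
    turns the regex into "[A-Z]\d.", step 2 into "[A-Z]..", and step 3
    into "...".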


    Is this a reasonable approach? Are the regexes suited to the task of
    normalizing each fixed-length match regex into a string of exactly
    this length?

    What about the order of the steps? For instance, step 2 replaces
    literal (i.e. escaped) square brackets within character classes which
    are themselves replaced in step 3.
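
    One way to probe the ordering question is to run steps 2 and 3 in both
    orders on a character class that contains an escaped closing bracket.
    The sketch below again assumes Python's re module as a stand-in for
    XSLT 2.0 replace(); the helper name normalize and the sample regex are
    made up for the illustration.

```python
import re

ESCAPE = r"\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]"  # step 2
CHAR_CLASS = r"\[[^\]]+\]"                        # step 3


def normalize(regex, patterns):
    """Hypothetical helper: apply each pattern in turn, replacing with '.'."""
    for pattern in patterns:
        regex = re.sub(pattern, ".", regex)
    return regex


# A class containing an escaped "]", followed by a literal "a".
# It matches exactly two characters.
sample = r"[\]x]a"

escapes_first = normalize(sample, [ESCAPE, CHAR_CLASS])  # step 2, then step 3
classes_first = normalize(sample, [CHAR_CLASS, ESCAPE])  # step 3, then step 2
```

    With escapes replaced first, the sample becomes "[.x]a" and then ".a",
    which has the expected length 2. With classes replaced first, step 3
    stops at the escaped bracket and leaves ".x]a", so at least for this
    input the order 2-then-3 appears to be the safer one.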

    None of the regexes above checks for a backslash preceding its match,
    which would cancel the special meaning of the construct it describes.
    Unfortunately, XSLT 2.0 supports neither lookbehind assertions nor
    conditional subpatterns to handle this case.

    Testing the preceding character in the conventional way depends on its
    presence, which we can't be sure of. To make sure that the character to
    the left of the match is never a backslash, I can only think of
    doubling each regex into two alternatives that consider both cases
    separately, like this (example for step 4):

    ^[()]|[^\\][()]
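
    A sketch of how this doubled pattern might be used follows, once more
    assuming Python's re module in place of XSLT 2.0 replace(). Because the
    second alternative consumes the character to the left of the
    parenthesis, the sketch captures that character and puts it back (in
    XPath replace() the replacement string would be '$1' rather than
    r"\1"). The helper name strip_unescaped_parens is hypothetical.

```python
import re

# The doubled pattern, with the left context captured so that the replacement
# can restore it: "^" contributes nothing, "[^\\]" contributes one character.
UNESCAPED_PAREN = r"(^|[^\\])[()]"


def strip_unescaped_parens(regex):
    """Hypothetical helper: delete parentheses not preceded by a backslash."""
    while True:
        # A single pass can miss a parenthesis whose left neighbour was
        # consumed by the previous match (the second "(" in "(("), so the
        # substitution is repeated until nothing changes any more.
        regex, count = re.subn(UNESCAPED_PAREN, r"\1", regex)
        if count == 0:
            return regex
```

    For example, strip_unescaped_parens("((a))") needs two passes to arrive
    at "a", while r"\(a\)" is left unchanged. Note that a one-character
    look to the left cannot tell an escaping backslash from an escaped one
    (as in "\\(" ); if step 2 has already replaced all escape pairs, no
    backslashes should remain by step 4 and the guard may not be needed
    there at all.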


     Regards,

       Yves

