Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Meta-regex for variable-length matching regexes (XSLT)

Last post 10-11-2007, 11:27 AM by yvesforkl. 4 replies.
Sort Posts: Previous Next
  •  10-02-2007, 9:37 AM 35317

    Meta-regex for variable-length matching regexes (XSLT)

    In the context of the regex dialect of XSLT 2.0, I am looking for two
    "meta-regexes" that analyse other regexes. Each is discussed in a part
    of its own.

    (Part 2 posted as: Meta-regex to find fixed length of regex match (XSLT))

    Part 1

    Aim: A regex that would identify all those regexes that might yield
    matches of varying length, as opposed to regexes whose matches will
    always be of fixed length. (Considering only the case that the regex
    matches at all.)

    The idea, of course, is to look for metacharacters in constructs that
    allow repetition, alternation or other length variations, while
    keeping out uses of metacharacters required for escaping that do not
    change the length of the matches.

    The syntax of XSLT 2.0 regexes is described here:
    http://www.w3.org/TR/xpath-functions/#regex-syntax , which is based on
    the XML Schema Datatypes Spec at
    http://www.w3.org/TR/xmlschema-2/#regexs .

    XSLT 2.0 permits these metacharacters in regexes:

    . \ ? * + { } [ ] ( ) | ^ $

    (And "-" within Character Ranges and with Character Class
    Subtractions.)

    Among those, these metacharacters never influence the length of the
    match (parentheses are only used for grouping in XSLT; pattern anchors
    or character ranges by themselves cannot lead to variable lengths of
    the matches):

    . ( ) [ ] ^ $

    The following characters are used to denote quantifiers (and
    alternation) and thus most easily lead to matches of varying length:

    ? * + { } |

    They make up these quantifiers (first greedy ones, then reluctant ones):

    ?
    *
    +
    {n}
    {n,}
    {n,m}

    ??
    *?
    +?
    {n}?
    {n,}?
    {n,m}?

    However, non-escaped curly brackets may also be used in a Category
    Escape like "\p{L}" which is used to designate a set of characters
    that match a one-character string.

    The backslash requires special thought. Although in most uses as
    escaping metacharacter, it is just a notational device for
    single-character matches, XSLT 2.0 allows back-references like "\1"
    that can definitely influence the matched string. So if followed by at
    least one digit, the backslash indicates potential variable-length
    matches, too.

    Conclusion: An XSLT 2.0 regex is guaranteed to have only matches of
    fixed length iff it does not contain:

    1) a non-escaped occurrence of a character from this set of characters:

    ? * + |

    2) curly brackets that a) are neither escaped nor b) do belong to a
       Category Escape

    3) an occurrence of a backslash followed by a digit

    The rationale is that only the non-escaped presence of one of those
    constructs enables matches with varying length in the XSLT 2.0 regex
    dialect.

    So this regex should find all the regexes possibly matching strings of
    variable length (reluctant quantifiers are implicitly considered as
    the greedy ones are prefixes to them):

    [^\\][?*+{}|]|\\\d

    Is my reasoning correct? In how far does the regex fulfill the
    requirements stated?

    Unfortunately, condition 2 b is still unsatisified. This is because I
    have difficulty specifying for the opening curly bracket that it
    should not match when it is preceded by the string "\p" or "\P" - is
    there any way to enforce occurrence of both characters when testing,
    with neither lookbehind assertions nor conditional subpatterns being
    available in XSLT?

    Requiring a preceding character other than a backslash should be OK,
    because all real quantifiers must have at least some character in
    front of them. (As to "|", it does not, but it would make no sense to
    start a regexp with it, and so even if it is allowed in XSLT 2.0, I
    won't care about that.)

    BTW, if someone should conceive the complementary "meta-regex" that is
    identifying only those regexes with fixed-length matches, I could do
    equally well with that one, by simply inverting the logic of my
    control flow.

      Regards,

       Yves
  •  10-02-2007, 10:56 AM 35323 in reply to 35317

    Re: Meta-regex for variable-length matching regexes (XSLT)

    Try

    [^\\](([*+?|])|(\{\d,))|\\\d

    I intentional didn't test for x{n} since to me that would be a fixed length.  When I have more time I can discuss this in more detail if you'd like 


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  10-04-2007, 4:38 AM 35385 in reply to 35323

    Re: Meta-regex for variable-length matching regexes (XSLT)

    Thank you very much for this regex which also fulfills requirement 2 b, i.e. does avoid matching on "\p{name}". Only testing for minimum and maximum occurrence counts is perfect.

    However, I now have discovered that your regex as well as mine suffer from the same problem that afflicts the regexes discussed in my other posting ("part 2") - they will falsely reject input with escaped preceding backslashes.

    So, while "\+" as a literal plus sign is correctly rejected, a series of backslashes denoted as "\\+" should be accepted, but is rejected too.

    How to adapt the regex to only allow an even number of preceding backslashes while rejecting odd numbers? Replacing "[^\\]" just with a even-number enforcing
    construct like "(\\\\)*" obviously won't work, because each odd additional number of preceding backslashes will just be skipped when looking for the start of the match.

    I fear I might be bumping against a fundamental property of regexes - escaping is always possible, isn't it?

    My hope, though, is being able to anchor the match at the beginning of the input and catch all backslashes I'm passing over. But how to express that there might be, aside pairs of backslashes, as well anything else at the beginning of the input?

    Won't this: 

    ^([^\\]*|(\\\\)*)(([*+?|])|(\{\d,))|\\\d

    exclude some patterns I'd like to catch but can't think of now?

     
      Yves
     

  •  10-05-2007, 7:08 AM 35443 in reply to 35385

    Re: Meta-regex for variable-length matching regexes (XSLT)

    Oh, I see, I should have been more reluctant right from the start, before getting greedy. ;-) So the regex definitely needs to be changed into this (also reversing the first pair of alternatives to optimize the matching process a bit, and allowing pairs of backslashes to occur anywhere far before the first repetition metacharacter):

     ^((\\\\)*|[^\\]*?)*(([*+?|])|(\{\d,))|\\\d

    Will this match all cases of a series of "literal" backslashes without matching escaped metacharacters that are otherwise used for repetition? Do I need to use a reluctant quantifier for the first parenthesized group as well to prevent the repetition metacharacters from being munged. Will termination or performance problems arise from the nested reluctant quantifiers?

    Any opinions on the original question, i.e. whether that regex is capable of identifying all of the regexes that have (potentially) variable-length matches, within the XSLT regex dialect I described?

       Yves

     

  •  10-11-2007, 11:27 AM 35588 in reply to 35443

    Re: Meta-regex for variable-length matching regexes (XSLT)

    After some more research into this issue, I think I have finally found an XSLT 2.0 regex capable of identifying all regexes in that dialect which might have variable length matches. The exact function of this meta-regex is defined as follows: Each regex that it does NOT match is guaranteed to always have matches (if any) of some constant length. The regexes that it matches, however, need not all have variable-length matches.

    Here it is, an awful example of backslashitis:

    ((^\\\\|[^\\](\\\\)?)([*+?|]|(\{\d,)))|[^\\](\\\\)*\\\d

    Let's analyse it. If we remove the outer parentheses from the first part (or "branch"), it has as its first half

    (^\\\\|[^\\](\\\\)?)

    which ensures that the repetition-enabling metacharacters which follow are not escaped, while yet allowing any series of literal backslashes (written as a pair of them), anywhere in the input.

    After this comes a group which is composed of an alternative of

    [*+?|]

    which of course looks for any metacharacter that may influence on the length of the match depending on the input, and

    (\{\d,)

    which matches numerically quantified repetitions. Finally, the second major part, an alternative to the first part, is

    [^\\](\\\\)*\\\d

    which catches any backreference (a backslash followed by at least one digit) while avoiding similar character sequences with literal meaning. Fortunately, a backreference in a regex only makes sense if something (or more precisely: a parenthesized group) precedes it, so here I'm comfortable with requiring at least one (non-backslash) character to precede the backreference. (At least in XSLT 2.0, a backreference in a regex cannot refer to something external to the regex.)

    NB: In my application, I prefer to use a variant of the above regex which will classify regexes that match the regex \{\d\} as if they had potentially variable-length matches, so it reads:

     ((^\\\\|[^\\](\\\\)?)([*+?|]|(\{\d)))|[^\\](\\\\)*\\\d

    Were those regexes treated correctly as having fixed-length matches, I would need to perform multiplications when calculating the match length of a regex with such a fixed-length numerically quantified repetition. With them kept outside, determining the length of fixed-length matches is feasible by doing some regex-based replacements only, see thread Meta-regex to find fixed length of regex match (XSLT).

    Any comments on this are welcome.

       Yves

     

View as RSS news feed in XML