Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Meta-regex to find fixed length of regex match (XSLT)

Last post 10-12-2007, 6:07 AM by yvesforkl. 4 replies.
Sort Posts: Previous Next
  •  10-02-2007, 9:42 AM 35318

    Meta-regex to find fixed length of regex match (XSLT)

    In the context of the regex dialect of XSLT 2.0, I am looking for two
    "meta-regexes" that analyse other regexes. Each is discussed in a part
    of its own.

    (Part 1 posted as: Meta-regex for variable-length matching regexes (XSLT))

    Part 2

    Aim: A regex that provides for pre-calculating the length of the match
    that a fixed-length matching regex will find, if it matches at all.
    This essentially means normalizing the representationas that use
    metacharacters into single characters.

    As a premise, a set of regexes should be given which are known to
    always have a match of some fixed length (if they match). See Part 1
    for a method that permits to obtain this set.

    The approach proposed is to identify all multi-character sequences
    that match one-character strings and normalize them into a single
    replacement character before submitting the result to a string length
    computing function.

    The multi-character sequences matching one-character strings of XSLT
    2.0 regexes are examined below. Their syntax is described here:
    http://www.w3.org/TR/xpath-functions/#regex-syntax , which is based on
    the XML Schema Datatypes Spec at
    http://www.w3.org/TR/xmlschema-2/#regexs .

    a) Combinations of one of these characters with a preceding backslash,
       representing a one-character string of a specific character
       ("single character escapes"):

    n r t \ | . ? * + ( ) [ ] { } ^ $ -

    (Caveat: Back-references are written like "\1" and may match more than
    one character. Regexes with fixed-length matches will not contain any
    of them, see above.)

    b) "Multi-character escapes": they can only match a one-character
       string, too (while allowing one out of a class of characters to
       match), using one of these letters after a backslash:

    S S i I c C d D w W

    c) Additionally, XSLT (i.e., XML Schema) has some further constructs
       that match a one-character string: "Category Escapes" and "Block
       Escapes", written as:

    \p{ NAME }
    \P{ NAME }

    where NAME is a sequence of characters from this set: a-z, A-Z, 0-9, -

    Conclusion: Each of the combinations above needs to be transformed
    into a single character in order to make the number of characters in
    the regex equal to the fixed length of its matches.

    This requires several substitution steps to be executed in order, each
    using a different meta-regex to capture specific constructs:

    1) Category Escapes:

    \\[pP]\{[a-zA-Z0-9-]\}

    to be replaced with string "."

    2) Combinations of a backslash and a single character, including
       escaped angle brackets, curly brackets etc.:

    \\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]

    to be replaced with string "."

    3) Character classes:

    \[[^\]]+\]

    to be replaced with string "."

    4) Parentheses (unescaped angle brackets were replaced in step 3;
       unescaped curly brackets other than those replaced in step 1 are
       incompatible with fixed-length matches):

    [()]

    to be replaced with nothing


    Is this a reasonable approach? Are the regexes suited to the task of
    normalizing each fixed-length match regex into a string of exactly
    this length?

    What about the order of the steps? For instance, step 2 replaces
    literal (i.e. escaped) angle brackets within character classes which
    are themselves replaced in step 3.

    All of the regexes above forget to watch out for a backslash preceding
    their match, which would destroy the (special) meaning of the
    construct they describe. Unfortunately, XSLT 2.0 neither disposes of
    lookbehind assertions nor of conditional subpatterns to take care of
    this case.

    Conventional testing of the preceding character depends on its
    presence, which we can't be sure of. Wondering how to keep the left
    side of the match always free from any backslash, I just can think of
    doubling each regex into two alternatives that consider both cases
    separately, like this (example for step 4):

    ^[()]|[^\\][()]


     Regards,

       Yves


  •  10-02-2007, 1:03 PM 35332 in reply to 35318

    Re: Meta-regex to find fixed length of regex match (XSLT)

    Another approach is to do a simple parsing of the data, assuming you have a language at your disposal that can handle that.  Since you won't have any variable-length regexes, it should be a pretty simple task, and cleaner conceptually and in its execution than a bunch of regex-replaces.

     

  •  10-04-2007, 5:11 AM 35389 in reply to 35332

    Re: Meta-regex to find fixed length of regex match (XSLT)

    Could you be more elaborate on this interesting idea? Which language or set of functions would be required for this?

    Maybe I should also explain a bit more what I need to do. Before looking at any input, I need to sort a list of regex patterns by their matches' length (known to be constant, see above).

    Which kind of "simple data parsing" technique could furnish these values, other than an actual matching of the input against each regex? As my input will frequently change and I have quite a long list of regexes which need to be applied to each input, matching should occur as late (or rarely) as possible.

    As a consequence, examining the regexes before using them seems to be the most elegant solution as far as I can see.

      Yves

     

     

     

  •  10-04-2007, 11:31 AM 35408 in reply to 35389

    Re: Meta-regex to find fixed length of regex match (XSLT)

    I was referring to the actual length calculation for a given regex.

     Any regular programming language can handle it - .NET, java, C, etc, etc.

    If you have something like:

    \d[a-z]{10}\s\S{2}

    You would parse it one character at a time via a simple loop.  This would only have to be done once of course, once you have the lengths you can store them somewhere.

    something like:

    Dim s As string = "\d[a-z]{10}\s\S{2}"

    For i as Integer = 1 To s.Length

     Dim c as string = s.Substring(i,1)                     ' Fetch the next character

     ProcessChar(c)                                               ' Handle the char, update your total length, look at your current state, etc.  For example, if we see a backslash we know its an escape character so it shouldn't be counted, but it should affect the next character we look at.

    Next i

     

    I personally find this a simpler concept than using Regex replaces here, your mileage may vary.

  •  10-12-2007, 6:07 AM 35616 in reply to 35408

    Re: Meta-regex to find fixed length of regex match (XSLT)

    It is true that an automaton-like character processing might be a more
    natural way of doing this in a classical programming language.

    While now in XSLT 2.0 (other than in XSLT 1.0), such a procedural
    approach could certainly be realized with custom-defined functions
    (even with all stored in one expression only, without XML syntax
    interfering much), processing regexes with regexes seems to provide
    for a more general solution less dependent on its implementation in
    some language. And it's more fun... :-)

    I am dealing here only with XSLT 2.0 regexes that are known to have
    only fixed-length matches. Their definition is somewhat overly strict,
    as it excludes on purpose certain types of regexes with fixed-length
    matches. (For instance, each regex split into branches is considered
    having potentially variable-length matches for simplicity's sake, even
    if there are only top-level branches with fixed-length matches.)

    For a discussion of the criterion used to distinguish "fixed-length
    match regexes" from "variable-length match regexes", see thread
    Meta-regex for variable-length matching regexes (XSLT).

    Given a fixed-length matching XSLT 2.0 regex as defined above, the
    constant length of its matches appears to be determinable by applying
    the following operations to the regex, in this order:

        <!-- 1) remove pairs of backslashes, i.e. literal backslashes -->
        <xsl:variable name="r1"
          select="replace($regex, '\\\\', '.')"/>

        <!-- 2) reduce Category Escapes (variable length of
          Unicode-related Property value) to single chars -->
        <xsl:variable name="r2"
          select="replace($r1, '\\[pP]\{[a-zA-Z0-9-]+\}', '.')"/>

        <!-- 3) reduce escaped literals to single chars -->
        <xsl:variable name="r3"
          select="replace($r2, '\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]', '.')"/>

        <!-- 4) reduce Character Class Subtractions to single chars (must precede step 5) -->
        <xsl:variable name="r4"
          select="replace($r3, '\[[^\]]+\]-\[[^\]]+\]', '.')"/>

        <!-- 5) reduce Negative and Positive Character Classes to single chars -->
        <xsl:variable name="r5"
          select="replace($r4, '\[[^\]]+\]', '.')"/>

        <!-- 6) remove parentheses -->
        <xsl:variable name="r6"
          select="replace($r5, '[()]', '')"/>

        <!-- finally, count length of the expression -->
        <xsl:sequence select="string-length($r6)"/>

    This simplified method, however, needs to exclude regexes that contain
    a quantification using a fixed number, i.e. regexes that are
    themselves matched by

    \{\d\}

    because replacements alone can not provide for the arithmetics that
    are required to find the proper length of a series of atoms repeated
    this way.

    It seems to be rather complicated to integrate that into the method
    described above. Below, I have tried to sketch the required additional
    steps, while repeating the steps from above to show clearly how the
    modifications fit in.

        <!-- 1) remove pairs of backslashes, i.e. literal backslashes -->
        <xsl:variable name="r1"
          select="replace($regex, '\\\\', '.')"/>


    1a)  Shift any parenthesized group with a fixed number of repetitions
         into the number indicator for later retrieval, so replace

    (\([^)]+\))\{([0-9]+)\}

    with

    {$1$2}

    Here $1 will contain the parentheses as well, serving to unambiguously
    delimit the sequence to repeat from the number of repetitions. The
    regex above, however, can not deal with nested parenthesized groups,
    but I don't see any easy way to restrict the first matching group to
    the atomic expression that is actually quantified. E.g., in

    (abc)(d(.)f){3}

    the first matched group should of course contain "(d(.)f)". IIRC,
    assuring pairedness for an arbitrary number of parentheses is beyond
    the realm of regular languages. But it would be possible to allow,
    e.g., up to 2 pairs of parentheses:

    \((.*?\([^)]+\).*?)|[^)]+\)


        <!-- 2) reduce Category Escapes (variable length of
          Unicode-related Property value) to single chars -->
        <xsl:variable name="r2"
          select="replace($r1, '\\[pP]\{[a-zA-Z0-9-]+\}', '.')"/>

        <!-- 3) reduce escaped literals to single chars -->
        <xsl:variable name="r3"
          select="replace($r2, '\\[nrt\\|.?*+sSiIcCdDwW^$()\[\]{}-]', '.')"/>

        <!-- 4) reduce Character Class Subtractions to single chars (must precede step 5) -->
        <xsl:variable name="r4"
          select="replace($r3, '\[[^\]]+\]-\[[^\]]+\]', '.')"/>

        <!-- 5) reduce Negative and Positive Character Classes to single chars -->
        <xsl:variable name="r5"
          select="replace($r4, '\[[^\]]+\]', '.')"/>


    5a) Calculate the lengths of fixed-number repetitions separately

    Simple replacements won't suffice in this step, as some real
    arithmetics are required to evaluate the numbers and maybe perform
    some multiplications.

    First, all of the substrings in the regex (as it is now) describing
    fixed-number quantified parenthesized groups, i.e. those that match
    the regex

    \{\((.*?[^\\])\)([0-9]+)\}

    are processed by multiplying the string length of $1 with the number
    given in $2, and adding up the results for all of them. (In this
    regex, a group content of ".*?[^\\]" helps to avoid confusing a
    literalized closing parenthesis, i.e. one preceded by a backslash,
    with the parenthesis that actually closes the group. Requiring at
    least one character to occur within the group should be OK.)

    Second, all of the substrings in the regex (as it is now) that match the
    regex

    \{([0-9]+)\}

    are processed by adding the number given in each match's $2 to the
    results previously obtained.


    5b) Remove all fixed-number quantified atoms from the regex (they must
        be deleted before measuring the length of the remaining regex
        because their lengths will be added separately), so replace by the
        empty string all matches for fixed-number quantified single
        characters available as

    .\{[0-9]+\}

    (The preceding quantified character must also be removed here.)

    After this, we can just remove all remaining matches for sequences
    enclosed in curly brackets to catch the (previously modified)
    fixed-number quantified parenthesized groups:

    \{.*?\}


        <!-- 6) remove parentheses -->
        <xsl:variable name="r6"
          select="replace($r5, '[()]', '')"/>

        <!-- finally, count length of the expression -->
        <xsl:sequence select="string-length($r6)"/>


    7) Add the additional lengths obtained from fixed-number quantified
       atoms in step 5a to the string length of our regex.


    Any opinions on this? Where did I miss something? Will this order of
    execution of the steps always yield the exact match length for any
    "fixed-length matching" XSLT 2.0 regex?

      Yves


View as RSS news feed in XML