
How to know what this Rex Works?

Last post 04-08-2008, 11:22 PM by Aussie Susan. 2 replies.
  •  04-08-2008, 10:13 PM 41207

    How to know what this Rex Works?



    I just found this regex: string pattern = @"\\(?:.+)\\(.+)\.(.+)";
    I am using VS2005 C# to create a .NET web page.
    I use this regex to separate the file name from the file extension, and it works fine.

    I just want to know how this regex works. For example, what does the "@" mean, and what does (?:.+) mean?

    thank you


  •  04-08-2008, 10:36 PM 41208 in reply to 41207

    Re: How to know what this Rex Works?

    jcjcjc,

    You might start here for some information about .NET regular expressions:

    http://msdn2.microsoft.com/en-us/library/hs600312(VS.80).aspx

    As for your questions, a '@' at the beginning of a string in C# makes the string a "verbatim" string, which does not process simple escape sequences, and can even span multiple lines.  The advantage of using verbatim strings for regular expressions is that the patterns are much more readable.  For example, the regular expression @"\d+\.\w*" would need to be written "\\d+\\.\\w*" when using a normal string.
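
    A minimal sketch of that difference, in case it helps to see it run. The class name VerbatimStringDemo is only an illustration and is not part of the original question.

```csharp
using System;

public static class VerbatimStringDemo
{
    // The same regular expression written as a verbatim and as a normal string literal.
    public const string Verbatim = @"\d+\.\w*";
    public const string Escaped = "\\d+\\.\\w*";

    public static void Main()
    {
        // Both literals produce the same characters at run time.
        Console.WriteLine(Verbatim == Escaped);   // True

        // A normal string turns \n into one newline character...
        Console.WriteLine("\n".Length);           // 1
        // ...while a verbatim string keeps the backslash and the 'n'.
        Console.WriteLine(@"\n".Length);          // 2
    }
}
```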

    Any set of parentheses in a regular expression is a "group" that can be accessed later, like the filename and extension in your example.  Putting '?:' in a set of parentheses makes that particular set "non-capturing", which means you will not be able to access what was matched in that set later.   You can also name a capture using '?<...>':

    Match nameExtensionMatch = Regex.Match( path,  @"\\(?:.+)\\(?<Name>.+)\.(?<Extension>.+)" );

    string name = nameExtensionMatch.Groups[ "Name" ].Value;

    string extension = nameExtensionMatch.Groups[ "Extension" ].Value;

     
    I hope this helps.

    Jeff
     

  •  04-08-2008, 11:22 PM 41211 in reply to 41208

    Re: How to know what this Rex Works?

    Just to add my 2c in here, the OP's pattern is why some consider regexes hard to read, understand and maintain.

    The trick is to know where to break up the pattern into individual tokens.

    Also, while Jeff is correct about the impact of the '@' at the start of a C# string, you need to be clear that there are several levels of interpretation going on here. For example:

    @"\\(?:.+)\\(.+)\.(.+)"

    without the '@' would need to be written as:

    "\\\\(?:.+)\\\\(.+)\\.(.+)"

    The reason is that both the compiler and the regex pattern parser treat the '\' as a special character. The compiler looks at each '\\' pair and interprets it as a single '\', so '\\\\' comes out as '\\'. The regex parser then interprets this (also) as a literal backslash, which leaves a single, literal character of '\'.
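
    The two levels can be checked with a few lines of C#. This is only a sketch, and the class name EscapeLevelsDemo is made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeLevelsDemo
{
    // Level 1 (compiler): both literals produce the same pattern text.
    public static readonly string Verbatim = @"\\(?:.+)\\(.+)\.(.+)";
    public static readonly string Escaped = "\\\\(?:.+)\\\\(.+)\\.(.+)";

    // Level 2 (regex parser): the two-character pattern \\ matches
    // one literal backslash in the input. Returns the match length.
    public static int BackslashMatchLength()
    {
        return Regex.Match(@"a\b", @"\\").Length;
    }

    public static void Main()
    {
        Console.WriteLine(Verbatim == Escaped);        // True
        Console.WriteLine(Escaped);                    // \\(?:.+)\\(.+)\.(.+)
        Console.WriteLine(BackslashMatchLength());     // 1
    }
}
```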

    Therefore the first version can be interpreted as:

    - a literal backslash character
    - a non-capture group (the '(?:' part) that matches one or more of any character (see the note below)
    - another literal backslash character
    - a capture group that will match 1 or more of any character
    - a literal dot or full-stop character
    - another capture group that will match 1 or more of any character
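
    One way to keep that breakdown next to the pattern itself is the RegexOptions.IgnorePatternWhitespace option, which ignores unescaped whitespace in the pattern and treats '#' to the end of the line as a comment. A sketch (the class name AnnotatedPatternDemo is just for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnnotatedPatternDemo
{
    // The OP's pattern, one token per line, with the list above as comments.
    public static readonly Regex PathPattern = new Regex(@"
        \\        # a literal backslash
        (?:.+)    # non-capture group: one or more of any character
        \\        # another literal backslash
        (.+)      # capture group 1: one or more of any character
        \.        # a literal dot
        (.+)      # capture group 2: one or more of any character
        ", RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Match m = PathPattern.Match(@"\folder\file.type");
        Console.WriteLine(m.Groups[1].Value);   // file
        Console.WriteLine(m.Groups[2].Value);   // type
    }
}
```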

    Several things about this pattern need to be mentioned:
    1) firstly, the dot as 'matches any character' changes its meaning slightly depending on whether the 'single-line' option is used. The other name for this option is 'dot matches newline' which better explains what it really means
    2) this pattern will probably cause a lot of backtracking and could take some time, depending on the size and construction of the text being scanned. The ".+" part is greedy and will grab all characters up to the end of the line or text (see point 1 above). The engine then needs to match a literal backslash, so it works its way back until it finds the last backslash in the line/text (if there is none, there is no match). Having matched that backslash, it again grabs everything to the end of the line/text, and then has to backtrack once more to find the literal dot, after which the final ".+" can grab everything to the end of the line/text. If it cannot find a dot after that backslash, it returns to the backslash it has already found, resumes backtracking until it reaches the previous backslash, and repeats the whole process from there.
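
    To illustrate point 1, here is a small sketch of how the 'single-line' option (RegexOptions.Singleline in .NET) changes what the dot matches. The class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Tests whether "a.b" matches an 'a' and a 'b' separated by a newline.
    public static bool DotMatchesNewline(RegexOptions options)
    {
        return Regex.IsMatch("a\nb", "a.b", options);
    }

    public static void Main()
    {
        // By default the dot matches any character except newline.
        Console.WriteLine(DotMatchesNewline(RegexOptions.None));        // False
        // With Singleline the dot matches newline as well.
        Console.WriteLine(DotMatchesNewline(RegexOptions.Singleline));  // True
    }
}
```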

    I suspect that this is looking for a file path along the lines of:

    \folder\file.type

    which it would match, with the first capture group receiving 'file' and the second receiving 'type'

    However, if it was given

    \toplevel\folder\sub.dir\dummy

    then the first capture group would get 'sub' and the second would get 'dir\dummy'
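
    Both cases are easy to try out. A sketch, with GreedyMatchDemo and Describe as made-up names; the third input is an extra assumption of mine, showing that a path with only one backslash does not match at all because the pattern requires two.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyMatchDemo
{
    // The OP's pattern, unchanged.
    private static readonly Regex Pattern = new Regex(@"\\(?:.+)\\(.+)\.(.+)");

    // Returns "group1 | group2", or "no match" when the pattern fails.
    public static string Describe(string path)
    {
        Match m = Pattern.Match(path);
        if (!m.Success)
        {
            return "no match";
        }
        return m.Groups[1].Value + " | " + m.Groups[2].Value;
    }

    public static void Main()
    {
        Console.WriteLine(Describe(@"\folder\file.type"));               // file | type
        Console.WriteLine(Describe(@"\toplevel\folder\sub.dir\dummy"));  // sub | dir\dummy
        Console.WriteLine(Describe(@"\file.type"));                      // no match
    }
}
```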

    In the right context, this is fine, but this comes back to something that Mash keeps saying: "know your data". Otherwise, this can blow up in your face.

    Susan 
