
ShowUsYour<Regex>

Irregular Expressions Regularly

  • Is RegExLib full of "it"?

    Today I read a comment on Jeffrey Schoolcraft's regex blog from Randal L. Schwartz which I felt that I needed to respond to.  As I started writing the comment I realized that this is probably news that needs to be publicly visible, so I'm posting it to my blog and cross referencing  the original comment.  First, here is Randal's comment:


    Yup. I continue to downvote and negative-comment nearly every entry at "regex lib".

    Do not validate email addresses with a regex (unless it's the full regex, as you point out).

    Do not parse HTML with a regex. HTML is surprisingly complex.

    Do not validate a date with a regex. All these regex I see that try to compute the number of days of february based on the year number just have me going "WTF!".

    These are NOT regex tasks. These are dedicated tool tasks.

    And yet, "regexlib" is full of them. And full of "it", if you know what I mean.


    Randal...

    I hear your pain.  As the lead developer of RegExLib I also see the problems that you are mentioning and, presently we haven't really provided a good enough toolset for the newbies to really help themselves properly.  Should the newbies be randomly using regex's that they find on the site... dunno?  That's for another argument.

    We implemented the rating and comment system in the middle of last year to try and give some indication about the value of individual patterns - so I'm extremely grateful that diligent members of the community such as yourself are helping out by casting your votes.  We also implemented an RSS feed for the comments so that comments such as yours are given public visibility - http://www.regexlib.com/RssComments.aspx

    It's a hard battle to win as RegExLib continues to grow and, as of today contains nearly 1000 expressions.  There's good news though.  Over the past couple of months there's been a lot of effort put into helping solve these problems and, to that effect, users of the site will see a vastly improved set of tools to help deal with some of the problems that you've mentioned. 

    To give you a quick example, one of the new features will provide users with a shortcut way of finding useful AND ACCURATE expressions by offering a box which says: "Enter N examples of what you want to match and N that you shouldn't and we'll provide you with a list of patterns which match your requirements".  This will help to remove the hit and miss element of a NOOB scanning through 1000 patterns to find the veritable needle in the haystack.

    The tools that allow users to manage their expressions are also getting an improvement, so hopefully pattern authors will be more responsive in adjusting their patterns based on feedback received.

    I hope that, once you see the new features for yourself you will agree with me that RegExLib is a much more valuable resource than it is today.

  • compressing javascript code in perl

    While sneaking around the web tonight looking at sites which have interesting dhtml controls or cool stylesheets, I came across this article:

        http://www.bazon.net/mishoo/articles.epl?art_id=209

    I couldn't help but start pulling apart some of those regex's.  The one which strips whitespace at the beginning and end of lines is quite familiar:

        \n)\s+|\s+(?:$|\n))
       

    This regex is basically saying:

       either anchor at ^ or \n and slurp all whitespace
    or
       slurp all whitespace up until the $ or \n anchors

    so, while the alternation is costly, overall the regex is very readable and maintainable.  The C-style comment regex on the other hand seems way too complicated compared to my C-style regex which can be found at RegexLib.com here:

        http://regexlib.com/REDetails.aspx?regexp_id=445

    Overall though, it's not the regex whackiness that I wanted to highlight; it's the script.  This seems like a nice, lean way to reduce the byte size of js code files.

  • apparently this represents my dominant intelligence...

    Your Dominant Intelligence is Linguistic Intelligence
    You are excellent with words and language. You explain yourself well. An elegant speaker, you can converse well with anyone on the fly. You are also good at remembering information and convincing someone of your point of view. A master of creative phrasing and unique words, you enjoy expanding your vocabulary. You would make a fantastic poet, journalist, writer, teacher, lawyer, politician, or translator.
  • Named Groups, Unnamed Groups and Captures

    I see this question come up a bit in regex so, I thought that I'd blog about it.  It has to do with 2 things: named groups and captures.  First, an example...

    I have a set of attributes ascribed to a value and I want to match each of them and then write them out - this is similar to matching attributes off of an xml or html element:

    Example text:
    Attributes=(Animal=cat; Human=paul; Car=ford; Color=green;)

    Sample pattern:
    Attributes=\(((?'type'\w+)(=)(?'value'\w+)\;\s?)+\)


    Problem 1: Named and Unnamed Groups

    This pattern uses 2 named groups - "type" and "value" - to store each of the attributes; it also has 2 unnamed groups, one which wraps each complete attribute (for example "Animal=cat; ") and one which matches the "=" sign between type and value.

    Looking at that pattern, you know that there's going to be 5 groups (Group 0 always holds the entire match) and, using logic, you would probably expect them to appear in the order in which they open in the pattern:

    1. Group 0 : The entire match
    2. Group 1 : The unnamed attribute group
    3. Group 2 : The named "type" group
    4. Group 3 : The unnamed "=" group
    5. Group 4 : The named "value" group

    Unnamed Groups always come first

    The first important rule of .NET regex's is that unnamed groups always come before named groups when you are enumerating over a Groups collection.  So, the order of our groups will be:

    1.     Group 0 : The entire match
    2.     Group 1 : The unnamed attribute group
    3.     Group 2 : The unnamed "=" group
    4.     Group 3 : The named "type" group
    5.     Group 4 : The named "value" group

    Problem 2: Groups and Captures

    Another gotcha with this example arises when a user is attempting to write out all of the results to the screen.  As you can see, there will be:

    • 1 Match - The entire string
    • 5 Groups - as we've already seen
    • and 4 instances of the attributes.

    The question is, how to get each of those 4 attribute values?  The answer is that each Group has a Captures collection which stores every "capture" that the group made during the match.  So, the idea is to get a count of the captures for a group and then display the value at each index from 0 up to, but not including, that count.

    Here's some sample code which demonstrates how you'd do that for the example shown above:

     

    using System;
    using System.Text.RegularExpressions;

    class Program {
      static void Main() {

        string pattern = @"Attributes=\(((?'type'\w+)(=)(?'value'\w+)\;\s?)+\)" ;
        string input = @"Attributes=(Animal=cat; Human=paul; Car=ford; Color=green;)" ;

        Match m = Regex.Match(input, pattern);

        if( m.Groups["type"].Success ) {

          // this will tell us how many captures we have...
          int matchedItems = m.Groups["type"].Captures.Count ;

          // now, enumerate the Captures and render the groups for each Capture...
          for( int i=0; i<matchedItems; i++ ) {

            string name = m.Groups["type"].Captures[i].Value ;
            string val = m.Groups["value"].Captures[i].Value ;

            Console.WriteLine("{0} = {1}", name, val) ;
          }
        }

        // keep the console window open until Enter is pressed
        Console.ReadLine() ;
      }
    }

     

    And here's the output generated by the above example...

    Animal = cat
    Human = paul
    Car = ford
    Color = green

  • Usage of patterns on RegexLib

    I often get asked whether patterns found on RegexLib are free for use to which my reply is...

    They are free for use but, if you use one, I'd recommend placing a comment next to it such as:

        // pattern found at http://regexlib.com/AddressOfPattern

    The reason is that patterns get updated and also get comments left against them, so the link is a good way to leave useful info for another developer who may have to maintain your code later!

  • Regex blogger

    Just discovered another netizen with a blogging category for regex's:

        http://weblogs.asp.net/trobbins/category/6948.aspx

  • Regex Compiled vs Regex Interpreted

    David Gutierrez - one of the Regex devs on the BCL Team - has posted a great article which discusses some of the semantics behind pre-compilation of your regex's:

        http://weblogs.asp.net/BCLTeam/archive/2004/11/12/256783.aspx

  • First post with PostXING

    I downloaded and installed PostXING today:

        Project Distributor :: Blue Fenix :: PostXING

    The download comes with source code and is packed full of features.  I mostly grabbed it for the cross-posting functionality so that I can more easily maintain the 2 or 3 blogs that I currently run.  The tool not only handles cross-posting but also offers rich functionality for commonly used features such as:

    • Inserting syntax highlighted snippets
    • Importing all categories from your blog
    • Storing not only blog configurations but also FTP configs for uploading images and files

    I know that Chris did some work recently on the tool and, although there's not much more to add to make it totally awesome I've already noted that it could probably do with offline functionality.   If you use the tool and have ideas about enhancements you can offer them via the feedback form for the project:

        http://markitup.aspxconnection.com/Projects/Project.aspx?projectId=12

  • RegexLib Testing Tool - The new Details Grid

    The other day I blog'ged about the new Options panel and today I'd like to announce another part of the expanded testing tools - the Details Grid:

    The new Details Grid on RegexLib.com

    Initially, the Details Grid displays information about Matches however, you can expand each Match out to view detailed information about the Groups within the Match.

    At the Group level of data you are presented with all of the useful diagnostic information about the regex's matching behaviour such as:

    • Value : the captured value of a Group. In the case where a group has sub-captures, the Value is equivalent to the value of the last Capture in the Group.
    • Index : where the capture commenced in the input string.
    • Length : the length of the Value. This is useful when the value contains whitespace or unprintable characters.
    • Grp. Num. : The position of the Group within the Match. This is the number returned from regex.GetGroupNumbers (Refer Extra Information below). It is important to note that Named Groups *always* occur last. That is, no matter where your named group actually occurs within the pattern, it will always appear at the end of the Match's Groups collection.
    • Name : This is the name of the Group. This is the name returned from regex.GroupNameFromNumber( grpNum ).
    • Captures : This field lets you know whether there are sub-captures for this Group. If there are and you want to see their values, you can switch to TreeView mode and view the contents of the entire Match/Group/Capture hierarchy.
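
    As a rough illustration of where the Grp. Num., Name, Value and Captures columns come from, here is a minimal C# sketch that reads the same data straight from the .NET Regex API. The class and method names (GridSketch, DescribeGroups) are made up for this example and are not part of the RegexLib tool. The pattern is a variant of the attribute pattern from the "Named Groups, Unnamed Groups and Captures" post, with the outer group made non-capturing so that only one unnamed group remains.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GridSketch
{
    // Hypothetical helper: one "number:name:value:captureCount" string per group,
    // in the order returned by Regex.GetGroupNumbers().
    public static string[] DescribeGroups(string pattern, string input)
    {
        var regex = new Regex(pattern);
        Match match = regex.Match(input);
        int[] numbers = regex.GetGroupNumbers();
        var rows = new string[numbers.Length];

        for (int i = 0; i < numbers.Length; i++)
        {
            Group group = match.Groups[numbers[i]];
            string name = regex.GroupNameFromNumber(numbers[i]);

            // Group.Value is the value of the last Capture in the group.
            rows[i] = numbers[i] + ":" + name + ":" + group.Value + ":" + group.Captures.Count;
        }

        return rows;
    }

    public static void Main()
    {
        string pattern = @"Attributes=\((?:(?'type'\w+)(=)(?'value'\w+);\s?)+\)";
        string input = "Attributes=(Animal=cat; Human=paul; Car=ford; Color=green;)";

        foreach (string row in DescribeGroups(pattern, input))
            Console.WriteLine(row);
    }
}
```

    In this sketch the unnamed "=" group is listed before "type" and "value" even though it sits between them in the pattern, and each Value is the last of that group's four captures ("Color" and "green").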

    Extra Information

    Regex: Functionality about named/numbered groups everyone should know.
  • RegexLib Testing Tool - The new Options Panel

    Yesterday Justin blog'ged about the new features that are included in the new testing tool on RegexLib.com:

         http://weblogs.asp.net/justin_rogers/archive/2004/09/04/225692.aspx

    ...but I just wanted to expand on that a little. In that article Justin mentioned the new "options" panel and alluded to the fact that we'll be adding all future functionality through that widget. Today I just wanted to highlight some of the functionality that I've added to it for this initial release.

    The new Options Panel on RegexLib.com

    Firstly, you can see that you can choose whether you are testing against the .NET engine or against client-side engines (VBScript or Javascript). Depending on which option you choose you will see different mode modifier options on the left hand side. The mode modifiers in the above image reflect the modifiers for the .NET engine. The mode modifier settings that you choose will be persisted between sessions; this means that, if you configure the settings to use ExplicitCapture and IgnoreCase then, next time you use the tool, those settings will be pre-selected for you.

    The widget can be minimized by toggling the state with the arrow in the top right hand corner of the panel. Again, the state of the panel will be persisted so, if you minimize it, it will be minimized when you re-visit. It's easier to minimize the panel once you have made your initial configuration settings so that you have more space to play with to view matches.

    Clicking on the text of any of the radio choices displays help text about that setting. In the future many more Help features will be added into the panel - most likely in the form of "flyouts" and, for each of them the text will be displayed in the center area of the panel.

    I have to thank Justin Rogers and Thomas 'Aylar' Johansen for their help in getting this new testing tool to its current state. It was Justin's idea to use the panel as an extensible help provider and configuration widget and Thomas used his design skills to bring the panel to life.


         The testing tool

  • New Regex Tester

    Currently I'm in the middle of replacing the testing tool on RegexLib.com.  You can play with the changes that I've made so far at:

        http://www.regexlib.com/RETester.aspx

    The new features include:

    • Ability to grab the source from a Url
    • Ability to grab the source from a file on your PC
    • Able to test using server-side .NET or client-side Javascript or VBScript
    • Able to display .NET matches as either a Grid (group view) or a Tree (capture view)
    • Settings that you make in the new "Options Panel" are persisted between sessions

    There's more to come, but I wanted to get this stuff live because it's already head-and-shoulders ahead of what I had previously!

    Enjoy! :-)

  • Another interesting timeout pattern

    I haven't had time to fully investigate the evilness of this pattern but it certainly looked interesting:

     

    Pattern:

    ^(([^\\/:*?<>"|]*(?<![ ]))*~+)(\.[xsl]{3})$

    Input string:

    $d_ d}ddddddddddddFoo2.xsl.xsl
  • Invalid patterns

    I've spent some time this week looking for interesting ways to create invalid patterns.  2 of my favourites are:

     

    [\_]

     

    and...
    [a-Z]
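
    One way to check patterns like these yourself is to hand them to the Regex constructor and see whether it throws. The following is only a sketch, assuming the .NET engine; the PatternCheck class and IsValidPattern method are hypothetical names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Hypothetical helper: true when the .NET engine accepts the pattern.
    // The Regex constructor throws ArgumentException (or a subclass of it)
    // when it cannot parse the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // The two patterns from this post, plus a well-formed one for comparison.
        string[] patterns = { @"[\_]", "[a-Z]", "[a-z]" };

        foreach (string pattern in patterns)
            Console.WriteLine("{0} -> valid: {1}", pattern, IsValidPattern(pattern));
    }
}
```

    The [a-Z] case fails because "a" sorts after "Z", which makes the range run backwards.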

     

  • Patterns which caused timeouts to occur

    It's been 2 or 3 weeks since I added timeout protection to the RegexLib.com site. During that period I've averaged 1 or 2 timed out patterns per day. I thought that I'd post a small selection of some which have caused timeout exceptions.

    The following 9 samples show the pattern and the input string which caused the failure; for the last 3 I've just shown "{LARGE BODY OF TEXT}" instead of the actual text which, in each case was a large amount of Html source.

    Pattern:      ^[a-zA-Z0-9]+((.?|\-*)[a-zA-Z0-9]+)*$ 
    Input string: 
    asdf.host-name.asd-f.
    

     

    Pattern 
    ^(([a-zA-Z\d!#$%&'*+-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?
    ((?!\.)(\.?[a-zA-Z\d!#$%&'*+-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")
    @
    (((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}
    |\[
    (((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}
    |[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)
    \])
    (?(angle)>)$
    Input string: 
    www42af43ds.afsd.fds.ds
    

     

    Pattern 
    ^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*\.)+[a-zA-Z]{2,9})$
    Input string: 
    hello23423423423424n@aol.c
    

     

    Pattern 
    ^((?:[a-zA-Z]:)|(?:\\{2}\w+)\$?)\\([^\\/:*?<>"| ](?:[^\\/:*?<>"|]*[^\\/:*?<>"| ])*\\)*((?:[^\\/:*?<>"|]*[^\\/:*?<>"| ])+\.(?:[^\\/:*?<>"|]*[^\\/:*?<>"| ]){2,15})?
    Input string: 
    C:\RECYCLER\S-1-5-21-878277593-965038143-782984527-1002
    

     

    Pattern 
    ^\w{1}[a-zA-Z]+(([\'\,\.\- ][a-zA-Z ])?[a-zA-Z]*)*$
    Input string: 
    gjlglujhL'KJK'LKJL'KLJKL-JKHKJHGK
    

     

    Pattern 
    ^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$
    Input string: 
    michael.sumeranoscranton.edu
    

     

    Pattern 
    (<table.*align=\"right\">)((\s*.*\s*)*?)(<table)((.*|\s*)*?)(</table>)
    Input string: 
    {LARGE BODY OF TEXT}
    

     

    Pattern 
    (<table.*align=\"right\">)((\s*.*\s*)*?)((\s*.*\s*)*?)(<table)((.*\s*)*?)(</table>)((\s*.*\s*)*?)(</table>)
    Input string: 
    {LARGE BODY OF TEXT}
    

     

    Pattern 
    (<table)((\s*.*\s*)*?).*(<FONT\sFACE=\"arial).*(Related\s Articles).+((\s*.*\s*)*?)(</table>)((\s*.*\s*)*?)(</table>) 
    Input string: 
    {LARGE BODY OF TEXT}
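
    For anyone wanting similar timeout protection in their own code, here is a minimal sketch. It is not the mechanism used on RegexLib.com, which this post doesn't describe; it assumes .NET Framework 4.5 or later, where the Regex constructor accepts a match timeout. The TimeoutSketch class and IsMatchWithin method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutSketch
{
    // Hypothetical helper: returns true or false when the match attempt
    // completes, or null when the engine gives up after the time budget.
    public static bool? IsMatchWithin(string pattern, string input, TimeSpan budget)
    {
        try
        {
            // The (pattern, options, matchTimeout) constructor is available
            // from .NET Framework 4.5 onwards.
            var regex = new Regex(pattern, RegexOptions.None, budget);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // First sample from the list above.
        string pattern = @"^[a-zA-Z0-9]+((.?|\-*)[a-zA-Z0-9]+)*$";
        string input = "asdf.host-name.asd-f.";

        bool? result = IsMatchWithin(pattern, input, TimeSpan.FromSeconds(2));
        Console.WriteLine(result.HasValue ? "Completed: " + result.Value : "Timed out");
    }
}
```

    Whether a given sample actually hits the limit depends on the machine, the runtime version and the budget you choose, so the sketch reports the outcome rather than assuming one.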
    

     

  • New testing tool for RegexLib.com

    I'm playing around with some ideas for a new testing tool for RegexLib.com:

        http://regexlib.com/BetaTester/TestPage.aspx

    This tool will allow you to test patterns using clientside (VBScript or Javascript) or, serverside (.NET) engines.  When using the .NET engine you can also choose to display results in a Grid (group view) or a Tree (captures view).

    Feedback
    I should have it tested and integrated in with the rest of the site within a week (hopefully).  If you have any feedback, or, requests for additional features, please either mail me via my contact form or just leave them as a comment below.

    Thanks to Moshe Salomon for the client-side bits :-)

     

    Known bugs:
    .NET Named groups throw a syntax error when using the angle bracket version, use the single quote version for the time being: i.e.: use (?'nameOfGroup'...) instead of (?<nameOfGroup>...)
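
    In the .NET engine itself the two spellings are interchangeable, so switching to the single quote version should not change what a pattern matches. A small sketch to illustrate (the NamedGroupSyntaxSketch class, the FirstWord method and the group name "word" are made up for this example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSyntaxSketch
{
    // Hypothetical helper: value of the group named "word", or null if no match.
    public static string FirstWord(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["word"].Value : null;
    }

    public static void Main()
    {
        // Angle bracket version and single quote version of the same named group.
        Console.WriteLine(FirstWord(@"(?<word>\w+)", "hello world"));
        Console.WriteLine(FirstWord(@"(?'word'\w+)", "hello world"));
    }
}
```

    Both calls print "hello".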
