JavaScript Regular Expressions on Steroids

If you're a JavaScript programmer and regular expression master (or want to be), check out the open source XRegExp library. Version 2 hit the release candidate milestone yesterday and brings a truckload of new features and improvements. Plus, if you want to match Unicode text in JavaScript, XRegExp is the only game in town. It works in Node and even Internet Explorer 5.5.

Check out XRegExp now, or track the latest development progress and contribute on GitHub.

Major upgrades for XRegExp, the JavaScript regex library

I've just released XRegExp 1.0, the next generation of my JavaScript regular expression library. This version fixes a couple of bugs, corrects even more cross-browser regex inconsistencies, and adds a suite of new regular expression functions and methods that make writing regex-intensive JavaScript applications easier than ever. One of these new functions, XRegExp.addToken, fundamentally changes XRegExp's implementation and allows you to easily create your own XRegExp plugins.

Here's XRegExp's abbreviated feature list from the brand new xregexp.com (which includes extensive documentation and code examples):

The full list of changes can be seen in the changelog.

New: Regular Expressions Cookbook

Regular Expressions Cookbook (written by Jan Goyvaerts and me, and published by O'Reilly Media) is now available at Amazon.com and other fine bookstores. The book covers eight programming languages equally (C#, Java, JavaScript, Perl, PHP, Python, Ruby, and VB.NET), but it's also useful for non-programmers. The majority of the content covers regular expressions more generally, and most of the regexes will work fine in your favorite regular-expression-enabled text editor or other tool. The book is targeted at people with regex skills from beginner to upper intermediate, and there's a fair amount of information in there even for people who already consider themselves regex experts. Here is O'Reilly's press release for the book.

Don't forget to pick up a copy of your very own.

XRegExp 0.5: JavaScript regex library

XRegExp is a JavaScript library that provides an augmented, cross-browser implementation of regular expressions, including support for additional modifiers and syntax. Several convenience methods and a new, powerful recursive-construct parser that uses regex delimiters are also included.

Here's the feature list:

  • Added regex syntax:
    • Comprehensive named capture support.
    • Comment patterns: (?#…).
  • Added regex modifiers (flags):
    • s (singleline), to make dot match all characters including newlines.
    • x (extended), for free-spacing and comments.
  • Added awesome:
    • Reduced cross-browser inconsistencies.
    • Recursive-construct parser with regex delimiters.
    • An easy way to cache and reuse regex objects.
    • The ability to safely embed literal text in your regex patterns.
    • A method to add modifiers to existing regex objects.
    • Regex call and apply methods, which make generically working with functions and regexes easier.
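As a rough illustration of two items in that list (embedding literal text safely and caching regex objects), here is a minimal sketch in plain JavaScript. The names escapeLiteral and cachedRegex are hypothetical and chosen only for this example; this is not XRegExp's actual API or implementation.

```javascript
// Hypothetical helpers for illustration only; not XRegExp's API.

// Escape regex metacharacters so arbitrary text is matched literally.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compile each pattern/flags pair once and hand back the same object afterwards.
const regexCache = new Map();
function cachedRegex(pattern, flags = "") {
  const key = "/" + pattern + "/" + flags;
  let re = regexCache.get(key);
  if (!re) {
    re = new RegExp(pattern, flags);
    regexCache.set(key, re);
  }
  return re;
}

// Usage: "$", "." and the parentheses are treated as plain characters.
const price = cachedRegex("^" + escapeLiteral("$9.99 (USD)") + "$");
console.log(price.test("$9.99 (USD)")); // true
console.log(price.test("$9x99 (USD)")); // false
```

One thing to watch with any cache like this: a shared regex object that uses the g flag also shares its lastIndex state between callers.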

All of this can be yours for the low, low price of 2.4 KB. :-) Version 0.5 also introduces extensive documentation and code examples. Get it while it's hot! XRegExp: JavaScript regular expression library.

RegexPal: Web-Based Regex Testing Reinvented

Yes I know, there are many other JavaScript regex testers available. Why did I create yet another? RegexPal brings several new things to the table for such web-based apps, and in my (biased) opinion it's easier to use and more helpful towards learning regular expressions than the others currently available. Additionally, most other such tools are very slow for the kind of data I often work with. They might appear fast when displaying 10 matches, but what about 100 matches, 1000, or 5000? Try generating 5,000 matches (which is easy to do with a regex consisting of a single dot, or similar) in your favorite existing web-based tool and see if your browser ever recovers (doubtful). The same task takes RegexPal less than half a second, and what's more, results are overlaid on the actual text you're typing.

At the moment, RegexPal is fairly short on features, but here are the highlights:

  • Real-time regex syntax highlighting with backwards and forwards context awareness.
  • Lightning-fast match highlighting with alternating styles.
  • Inverted matches (match any text not matched by the regex).
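To make the idea of inverted matches concrete, here is a small sketch that returns the stretches of text a regex does not match. The function name invertedMatches is made up for this example, and this is not RegexPal's actual code; it only shows the concept, assuming an engine that supports String.prototype.matchAll.

```javascript
// Hypothetical helper: returns the pieces of `text` that `regex` leaves unmatched.
function invertedMatches(text, regex) {
  // matchAll requires a global regex, so add the g flag if it is missing.
  const flags = regex.flags.includes("g") ? regex.flags : regex.flags + "g";
  const globalRegex = new RegExp(regex.source, flags);
  const gaps = [];
  let last = 0; // end of the previous match
  for (const match of text.matchAll(globalRegex)) {
    if (match.index > last) {
      gaps.push(text.slice(last, match.index));
    }
    last = Math.max(last, match.index + match[0].length);
  }
  if (last < text.length) {
    gaps.push(text.slice(last)); // trailing text after the final match
  }
  return gaps;
}

console.log(invertedMatches("a1b22c", /\d+/)); // [ 'a', 'b', 'c' ]
```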

A few things to be aware of:

  • The approach I've used for scrollable rich-text editing (which I haven't seen elsewhere on the web) is a bit buggy (but it's fast). Firefox and IE7 have the fewest issues, but it more or less works in other browsers as well.
  • With the syntax highlighting, I generally mark corner-case issues which create cross-browser inconsistencies as errors even if they are the result of browser bugs or missing behavior documentation in ECMA-262v3.
  • There are different forms of line breaks cross-platform/browser. E.g., Firefox uses \n even on Windows where nearly all programs use \r\n. This can affect the results of certain regexes.
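The line break caveat is easy to reproduce. The sketch below runs the same regexes against text that differs only in its line breaks, then shows one possible workaround; the helper name normalizeLineBreaks is hypothetical and is not something RegexPal is known to do.

```javascript
const unixText = "a\nb";
const windowsText = "a\r\nb";

// A literal \n in the pattern does not span a \r\n pair.
console.log(/a\nb/.test(unixText));    // true
console.log(/a\nb/.test(windowsText)); // false

// With the m flag, JavaScript treats \r and \n as separate line terminators,
// so ^$ finds an "empty line" between the \r and the \n.
const countEmptyLines = (text) => (text.match(/^$/gm) || []).length;
console.log(countEmptyLines(unixText));    // 0
console.log(countEmptyLines(windowsText)); // 1

// Hypothetical workaround: convert \r\n and lone \r to \n before matching.
function normalizeLineBreaks(text) {
  return text.replace(/\r\n?/g, "\n");
}
console.log(countEmptyLines(normalizeLineBreaks(windowsText))); // 0
```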

RegexPal, at least for me, is lots of fun to play with and helps to make learning regular expressions easy through its instant feedback. Check it out at regexpal.com.

Posted 06 August 07 03:10 by Stevezilla00 | 1 Comments   
Mimicking Conditionals

Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the regex libraries I'm familiar with, conditionals are only supported by .NET, Perl, PCRE (and hence, PHP's preg functions), and JGSoft products (including RegexBuddy).

There are two common types of regex conditionals in those libraries: lookaround-based and capturing-group-based. I'll get to the former type in a bit, but first, I'll address capturing-group-based conditionals, which are able to base logic on whether an optional capturing group has participated in the match so far. Here's an example:

(a)?b(?(1)c|d)

That matches only "bd" and "abc". The pattern can be expressed as follows:

(if_matched)?inner_pattern(?(1)then|else)

Here's a comparable pattern I created which doesn't require support for conditionals:

(?=(a)()|())\1?b(?:\2c|\3d)

Note that to use it without an "else" part, you still need to include the second empty backreference (in this case, "\3") at the end, like this:

(?=(a)()|())\1?b(?:\2c|\3)

Briefly, here's how that works: the lookahead at the beginning contains an empty alternation option, which guarantees that the lookahead always succeeds and so cancels its effect as an assertion. At the same time, the intentionally empty capturing groups within the alternation record which option in the lookahead matched, and the then/else part is based on that. However, there are a couple of issues:

  • This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups. For example, this does not work in Firefox, which treats non-participating capturing groups as if they matched an empty string.
  • It interacts with backtracking differently than a real conditional (the "a" part is treated as if it were within an optional, atomic group, e.g., (?>(a))? instead of (a)?), so it's best to think of this as a new operator which is similar to a conditional.
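The first issue can be observed directly in JavaScript, where the ECMAScript specification says that a backreference to a group that has not participated in the match succeeds and matches the empty string. The sketch below anchors the mimicked pattern with ^ and $ (an addition for this test only) and runs it against four strings; a real conditional would accept only "abc" and "bd".

```javascript
// The mimicked conditional from above, anchored so partial matches don't count.
const mimic = /^(?=(a)()|())\1?b(?:\2c|\3d)$/;

// "abc" and "bd" should match; "abd" and "bc" should not.
const inputs = ["abc", "bd", "abd", "bc"];
const results = inputs.map((input) => mimic.test(input));

// In JavaScript every input matches, because \2 and \3 succeed
// whether or not their groups participated in the match.
console.log(results); // [ true, true, true, true ]
```

In other words, the "then" and "else" backreferences stop acting as gates, which is why the trick needs an engine where a backreference to a non-participating group fails to match.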

As for lookaround-based conditionals, we can mimic them using the same concepts. Here's what real lookaround-based conditionals look like (this example uses a positive lookahead for the assertion):

(?(?=if_assertion)then|else)

And here's how you can mimic it:

(?:(?=if_assertion()|())\1then|\2else)

Again, to use it without an "else" part, you still need to include the second empty backreference (in this case, "\2") at the end, like this:

(?:(?=if_assertion()|())\1then|\2)

Notes:

  • Backtracking does not come into play with lookaround-based conditionals in the same way as with capturing-group-based conditionals. As a result, mimicked lookaround-based conditionals are functionally identical to their "real" counterparts.
  • (?:(?=if_assertion()|())\1then|\2else) is functionally equivalent to (?=if_assertion()|())(?:\1then|\2else)
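For engines where the empty-group trick is unavailable (JavaScript among them), a different and well-known formulation gives the same result for lookahead-based conditionals: repeat the assertion, negated, in front of the else part, i.e. (?:(?=if_assertion)then|(?!if_assertion)else). This is not the technique described above; it evaluates the assertion twice instead of remembering its outcome. A small sketch, using a made-up "email or digits" rule as the example:

```javascript
// If the input contains "@", require an email-like shape; otherwise require digits only.
// Conceptually: (?(?=.*@)\S+@\S+|\d+), anchored with ^ and $.
const contact = /^(?:(?=.*@)\S+@\S+|(?!.*@)\d+)$/;

console.log(contact.test("me@example.com")); // true  (then part)
console.log(contact.test("5551234"));        // true  (else part)
console.log(contact.test("12@"));            // false (then part chosen, but it fails)
console.log(contact.test("abc"));            // false (else part chosen, but it fails)
```

Because exactly one of (?=X) and (?!X) succeeds at any given position, only one branch can ever be taken, which is the behavior a real conditional provides.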

For a compatibility table detailing support for these constructs with all the regex engines I've tested them with, see StevenLevithan.com: Mimicking Regular Expression Conditionals.

XRegExp: An Extended JavaScript Regex Constructor

I have recently written an extended JavaScript regular expression constructor which I’ve called XRegExp. This script is very small (the minified version weighs in at 937 bytes), and it adds support for two simple but powerful additional flags beyond those JavaScript supports natively:

  • s – Dot matches all (a.k.a., single-line) mode.
  • x – Free-spacing and comments mode.
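To show what these two modes mean in practice, here is a simplified sketch in plain JavaScript. The freeSpacing function is a hypothetical helper written for this example and is not XRegExp's implementation; it only handles escapes, character classes, whitespace, and # comments.

```javascript
// Hypothetical helper: removes whitespace and "#" comments from a pattern,
// leaving escaped characters and character classes untouched.
function freeSpacing(pattern) {
  let out = "";
  let inClass = false;
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\") {
      out += ch + (pattern[i + 1] || ""); // keep the escaped pair intact
      i++;
    } else if (inClass) {
      if (ch === "]") inClass = false;
      out += ch;
    } else if (ch === "[") {
      inClass = true;
      out += ch;
    } else if (ch === "#") {
      // A comment runs to the end of the line.
      while (i < pattern.length && pattern[i] !== "\n") i++;
    } else if (!/\s/.test(ch)) {
      out += ch;
    }
  }
  return out;
}

const source = freeSpacing(String.raw`
  (\d{4}) - # year
  (\d{2}) - # month
  (\d{2})   # day
`);
console.log(source); // (\d{4})-(\d{2})-(\d{2})
const date = new RegExp("^" + source + "$");
console.log(date.test("2007-05-30")); // true

// Dot-matches-all: natively, a dot skips line breaks, while [\s\S] does not.
console.log(/a.b/.test("a\nb"));      // false
console.log(/a[\s\S]b/.test("a\nb")); // true
// Engines that implement ES2018 also accept a native s flag.
console.log(/a.b/s.test("a\nb"));     // true
```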

It also allows you to use these flags with the RegExp constructor itself after running one line of code. Additionally, XRegExp fixes some minor cross-browser regex syntax inconsistencies.

Full deets and extensive documentation are available at XRegExp: JavaScript Regular Expression Constructor, or you can just download the script (with comments, or minified).

Regexes built using XRegExp are identical in speed to those built using the native RegExp constructor, support all of JavaScript’s regex syntax and regex-related methods, and should work in JavaScript 1.5+ browsers (tested in IE 5.5–7, Firefox 2, and Opera 9).

Posted 30 May 07 08:19 by Stevezilla00 | 1 Comments   