Last week I received an e-mail asking for help with a specific problem: the writer wanted a pattern to help filter out spam. They wanted to be able to filter out words which contained more than 2 instances of any "bad characters". Here is the actual request:
> Lets say I have a word like this...
> p@e_R$cr!ption
> Now, how can I write a regular expression that basically says
> 1. I don't care how long the word is, as long as it does not have spaces (loose definition of a word).
> 2. I don't care what alphabetic characters a-zA-Z are in the word.
> 3. I ONLY CARE that the word ( I use that term loosely) contains >2 of the following characters,
> @ $ _ ! IN ANY POSITION OR ORDER IN THE WORD.
Normally, when looking for "words", I would use the \b metacharacter to anchor the match at word boundaries, like so: \b\w+\b
Redefining Boundaries
\b is a zero-width assertion: it matches the position between a word character (a number, a letter or the "_" character) and a non-word character, without consuming anything. With the current problem that won't work, because I need to keep each run of non-space characters together so that I can count how many bad characters are found in an individual "word". \b won't let me do that because it breaks the "words" apart at every bad character before I even get to look at them.
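To make that concrete, here is a minimal sketch of what \b does to the word from the request. Python is my own choice for illustration (the request did not name a language), and the variable names are hypothetical.

```python
import re

word = "p@e_R$cr!ption"

# \b is zero-width: each match is a position, not a character.
boundaries = [m.start() for m in re.finditer(r"\b", word)]
print(boundaries)   # [0, 1, 2, 5, 6, 8, 9, 14]

# \b\w+\b therefore returns the fragments between the bad characters.
# Note that "_" counts as a word character, so "e_R" stays in one piece.
pieces = re.findall(r"\b\w+\b", word)
print(pieces)       # ['p', 'e_R', 'cr', 'ption']
```

By the time the word has been cut into four fragments, there is no single match left in which to count the bad characters.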
So, given the rule "I don't care how long the word is, as long as it does not have spaces", my first task is to replace \b with my own beginning and ending word markers, built from character classes and enforced with lookaround. Note that the opening marker is a variable-length lookbehind, which .NET supports but many other regex engines do not. Here are my \b substitutions:
Opening \b
(?<=^|[\s ]+)
Closing \b
(?=[\s ]+|$)
...now consider them applied thusly:
Matching on the string: h*lp
Using \b\w+\b
Matches: 2 (h and lp)
Using (?<=^|[\s ]+).+?(?=[\s ]+|$)
Matches: 1 (h*lp)
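The same comparison can be sketched in Python, with one substitution that I want to flag clearly: Python's re module rejects the variable-width lookbehind (?<=^|[\s ]+), so this sketch uses (?<!\S) and (?!\S) instead. Those are my stand-ins, not the markers above, although they express the same idea: nothing but whitespace, or the edge of the text, on either side.

```python
import re

text = "h*lp"

# The default word boundary splits at the "*".
default_words = re.findall(r"\b\w+\b", text)
print(default_words)   # ['h', 'lp']

# Assumption: (?<!\S) / (?!\S) replace the opening and closing markers,
# because Python only accepts fixed-width lookbehind.
custom_word = re.compile(r"(?<!\S).+?(?!\S)")
custom_words = custom_word.findall(text)
print(custom_words)    # ['h*lp']
```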
Normal, (Special, Normal)
One technique that I learnt from Mastering Regular Expressions (page 262) is the "unrolling the loop" technique for lexing through strings. Unrolling the loop is used when the data you are searching for follows a repeatable pattern of normal and special characters. In my case, the repeatable pattern is made of the good characters and bad characters that I'm searching for:
Good [^!@%$\s ]
Bad [!@%$]
I also know that those characters will always be in the following pattern:
my new opening word boundary character
zero or more good chars
( a bad char + zero or more good chars ) REPEATED AT LEAST TWICE
my new closing word boundary character
So, you can mentally play that through with the following examples:
H@%p Should match
@$ Should match
H@lp Should not match
@ Should not match
So, putting those normal (special, normal) sequences together gives me the following pattern:
(?<=^|[\s ]+)[^!@%$\s ]*([!@%$][^!@%$\s ]*){2,}(?=[\s ]+|$)
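If you want to try the four examples without a .NET engine, here is a rough Python sketch of the same normal (special, normal) structure. It is an adaptation rather than the exact pattern: Python's re module rejects the variable-width lookbehind, so I have assumed (?<!\S) and (?!\S) for the two word markers, used a non-capturing group, and treated whitespace in the good class as \s. SPAM_WORD and find_spam_words are hypothetical names of my own.

```python
import re

# Assumption: (?<!\S) and (?!\S) stand in for the opening and closing
# markers, since Python only accepts fixed-width lookbehind.
SPAM_WORD = re.compile(
    r"""
    (?<!\S)                        # opening marker: start of text or whitespace before
    [^!@%$\s]*                     # normal: zero or more good characters
    (?: [!@%$] [^!@%$\s]* ){2,}    # (special, normal): one bad char, then good chars, at least twice
    (?!\S)                         # closing marker: whitespace or end of text after
    """,
    re.VERBOSE,
)


def find_spam_words(text):
    """Return every whitespace-delimited word holding two or more bad characters."""
    return [m.group() for m in SPAM_WORD.finditer(text)]


examples = ["H@%p", "@$", "H@lp", "@"]
results = {word: bool(SPAM_WORD.search(word)) for word in examples}
print(results)
# {'H@%p': True, '@$': True, 'H@lp': False, '@': False}

print(find_spam_words("cheap p@e_R$cr!ption today"))
# ['p@e_R$cr!ption']
```

Because the good and bad classes never overlap, each character can only be consumed one way, which is what keeps the unrolled loop from backtracking badly on long words.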
I've uploaded that pattern to RegexLib.com ( http://www.regexlib.com/REDetails.aspx?regexp_id=614 )