Hi Torsten,
(I have only just seen this thread, so I hope this reply is not too late...)
As Michael says, my regexp uses .NET-specific features and would need to be modified to work in Java or Perl.
Here are some links that describe the features used:
http://msdn2.microsoft.com/en-us/library/bs2twtah.aspx
http://msdn2.microsoft.com/en-us/library/36xybswe.aspx
(?<angle><) is a named capture group.
It is the same as (<), except that the result is captured in the named group 'angle' (the name can be anything).
(?(angle)>) is a named alternation (a conditional test on a named group).
It is the same as this pseudo-code:
if (angle) then match(">")
(angle) will be true if the named capture group above took part in the match, that is, if it actually captured a '<'.
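
In case it helps to see the two constructs working together, here is a minimal sketch in Python rather than .NET. Python supports the same conditional, but spells the named group (?P<angle>...) instead of (?<angle>...). The \w+@\w+ address part and the is_balanced helper are made up for illustration only; they are not part of my regexp.

```python
import re

# Python writes the named group as (?P<angle>...); .NET uses (?<angle>...).
# The address part is deliberately simplified to \w+@\w+ for this sketch.
BRACKETED = re.compile(r"(?P<angle><)?\w+@\w+(?(angle)>)")

def is_balanced(text):
    """True if text is name@host with both angle brackets or with neither."""
    return BRACKETED.fullmatch(text) is not None

print(is_balanced("<mark@domain>"))  # True: '<' was captured, so '>' is required and present
print(is_balanced("mark@domain"))    # True: no '<' captured, so no '>' is required
print(is_balanced("<mark@domain"))   # False: '<' was captured but the '>' is missing
```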
(?(?<!\[)\.) is an expression alternation.
It is the same as this pseudo-code:
if (?<!\[) then match("\.")
The nested (?<!\[) is a zero-width negative lookbehind assertion.
Putting it all together, it is:
if (!lookbehind("\[")) then match("\.")
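
For engines without expression alternation, one possible way to get the same effect is a plain alternation of two lookbehinds: "right after '[' match nothing, anywhere else require a dot". The sketch below tries that in Python, which does not accept a lookaround as a condition. The names OCTET, SEPARATOR, IP_LITERAL and is_ip_literal are only for this example, and this is just one equivalent I would expect to work, not the only one.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d?\d)"

# Stand-in for the .NET conditional (?(?<!\[)\.):
#   (?<=\[)     directly after '[' : match the empty string
#   (?<!\[)\.   anywhere else      : a literal dot is required
SEPARATOR = r"(?:(?<=\[)|(?<!\[)\.)"

IP_LITERAL = re.compile(r"\[(?:" + SEPARATOR + OCTET + r"){4}\]")

def is_ip_literal(text):
    """True if text is a bracketed dotted quad such as [192.168.0.1]."""
    return IP_LITERAL.fullmatch(text) is not None

print(is_ip_literal("[192.168.0.1]"))   # True
print(is_ip_literal("[.192.168.0.1]"))  # False: no dot is allowed right after '['
print(is_ip_literal("[192.168.0]"))     # False: only three octets
```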
Of course all the above might not help!
OK, what do you want to match?
My regexp matches an RFC 2822 'mailbox' (excluding comments and whitespace).
That includes these strings:
mark@domain.com
<mark@domain.com>
Mark Cranness <mark@domain.com>
Do you want/need to match an RFC 2822 'mailbox'?
Maybe it would be better/simpler to match an 'addr-spec' instead, and store the 'display-name' separately?
An 'addr-spec' example is:
mark@domain.com
Assuming that you want to STORE an email address to send an email to, then if you stored fields like so:
FIRST_NAME: Mark
LAST_NAME: Cranness
EMAIL: mark@domain.com
... then you can build the full 'mailbox' address like so:
FIRST_NAME + ' ' + LAST_NAME + ' <' + EMAIL + '>'
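
If you happen to be doing this in Python, a rough sketch of the same idea could use the standard library's email.utils.formataddr, which also quotes the display name when it contains special characters. The build_mailbox helper and its parameter names are hypothetical and simply mirror the fields above.

```python
from email.utils import formataddr

def build_mailbox(first_name, last_name, email):
    """Join separately stored name fields and an addr-spec into a 'mailbox'."""
    display_name = f"{first_name} {last_name}".strip()
    # formataddr returns the bare address when the display name is empty.
    return formataddr((display_name, email))

print(build_mailbox("Mark", "Cranness", "mark@domain.com"))
# Mark Cranness <mark@domain.com>
```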
The 'addr-spec' part of my regexp is this:
((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])
That is simpler than the entire 'mailbox', and avoids some of the problem expressions.
But it still contains the problematic (?(?<!\[)\.) expression...
That problem expression (and some of the following) is a result of me being 'too clever' and trying to use the minimum number of characters to match an IP address. It can be modified to work for Java/Perl like so:
(25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}
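
A quick way to sanity-check that dotted-quad pattern on its own is sketched below in Python. It uses no conditionals or atomic groups, so I would expect it to behave the same in Java and Perl, though I have only reasoned it through for Python's re module. The names OCTET, IPV4 and is_ipv4 are made up for the example.

```python
import re

OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d?\d)"
# One octet, then exactly three more octets, each preceded by a dot.
IPV4 = re.compile(OCTET + r"(\." + OCTET + r"){3}")

def is_ipv4(text):
    """True if the whole string is four dot-separated octets in the range 0-255."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4("127.0.0.1"))  # True
print(is_ipv4("256.1.1.1"))  # False: 256 is out of range
print(is_ipv4("1.2.3"))      # False: only three octets
```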
The final 'addr-spec' pattern looks like so (which should work in Java/Perl):
^((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[((25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])$
If you MUST have the full 'mailbox', you could do it the hard way like so:
^(?:
((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*<((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[((25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])>)
|
((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[((25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])
)$
(Remove the line breaks in the above. The outer (?: ... ) group makes ^ and $ apply to both alternatives, and the first alternative ends with the closing '>' of the angle-bracketed form.)
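
One thing worth checking once the pieces are joined: alternation binds more loosely than the anchors, so ^A|B$ is read as (^A)|(B$) unless the alternatives are grouped. A tiny Python sketch with the placeholder patterns ab and cd (nothing to do with email, just short stand-ins) shows the difference:

```python
import re

loose = re.compile(r"^ab|cd$")       # parsed as (^ab)|(cd$)
strict = re.compile(r"^(?:ab|cd)$")  # both anchors apply to each alternative

print(bool(loose.search("abXYZ")))   # True: only the start is anchored for 'ab'
print(bool(strict.search("abXYZ")))  # False: the whole string must be 'ab' or 'cd'
```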
Michael says you can't tell if an email is valid using a regex...
Well, regular expressions are pretty powerful.
I think I've done a pretty good job of matching an RFC 2822 'mailbox'.
But I don't have enough real-world experience to know how my regex fares with real-world email addresses.
One issue I am aware of is that some people tell me that mark@127.0.0.1 is a valid email address. But it is not according to RFC 2822...