
Mark Cranness regEx for e-mails

Last post 01-28-2007, 11:16 PM by mash. 5 replies.
  •  08-19-2006, 5:57 AM 20945

    Mark Cranness regEx for e-mails

    Hi,

    I want to use the regular expression of Mark Cranness to validate an e-mail address. The expression is:

    ^((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$

    I am using Java. I don't know what (?<angle><) and (?(angle)>) mean. I am not new to regular expressions and I know about positive and negative lookaheads and the like, but I don't know what these constructs mean. I can't find anything about them (Google, forums, ...). Even the specifications of regular expressions for Java and Perl that I found are not helpful.
    Can anybody help me, please?

    Thank you

    Torsten Graf
  •  08-20-2006, 3:58 AM 20968 in reply to 20945

    Re: Mark Cranness regEx for e-mails

    First off, just so you know, you can't validate e-mail addresses 100% with a regex. So if this is for something mission-critical, you can stop now. Only use a regex if you are OK with it rejecting at least 20% of good e-mail addresses. What you have will probably validate 60% of valid e-mail addresses, as will most patterns you'll find on the web.

    Second, to answer your question: the (?<name>...) syntax is a named group. It is an advanced feature not (yet) supported in Java. You'll have to remove it from your pattern to get it to work: (?<angle><) would become (<).

    The other construct is another advanced feature that isn't supported by Java (yet) either. (?(foo)...) is a conditional group, which you can't simply remove and have your pattern work in the same manner.
    Together, the two constructs check for angle brackets around the e-mail address, with the closing bracket allowed only if there is an opening one.

    The pattern you are trying to use was written for .NET and isn't universally translatable.

    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  08-20-2006, 8:22 AM 20975 in reply to 20968

    Re: Mark Cranness regEx for e-mails

    Hi,

    Thank you for the fast answer. But if I remove the two constructs, I can't compile the pattern.
    Java says:
    Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown inline modifier at the line
    I marked it in red inside the pattern below. It is the ?. If I remove it, the pattern doesn't work correctly. I searched the MSDN library, but this construct is not there. What is this single ? good for at this position? Any idea?

    ^((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*)?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")
    @(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])$

    Thank you again

    Torsten Graf

    P.S. If I catch most valid existing e-mail addresses, that's fine. My aim is to accept only valid addresses. If an address is not accepted by my application, it is either an unusual kind of address or it is not valid.

  •  08-21-2006, 12:50 AM 21020 in reply to 20975

    Re: Mark Cranness regEx for e-mails

    I'm sorry if I wasn't clear before. You can't translate all of the functionality of this .NET-flavored regex to Java. It uses two features not supported by Java. One of them can simply be removed; the other cannot. Removing it would alter the functionality of the pattern, and you can't replicate its functionality with Java's regex engine.

    You'd have to rewrite the entire pattern to get the same results, and it wouldn't be a simple rewrite.

    As far as matching e-mail addresses with a regex goes, like I said before, you can't tell whether an address is valid using a regex (one that isn't several thousand, yes thousand, characters long, and even that's not perfect). Follow this link for more detail: http://regexadvice.com/blogs/mash/archive/2004/07/15/314.aspx

    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  01-28-2007, 4:56 AM 26742 in reply to 20945

    Re: Mark Cranness regEx for e-mails

    Hi Torsten,

    (I have only just seen this thread, so I hope this reply is not too late...)

    As Michael says, my regexp uses .NET-specific features and would need to be modified to work in Java or Perl.
    Here are some links that describe the features used:
    http://msdn2.microsoft.com/en-us/library/bs2twtah.aspx
    http://msdn2.microsoft.com/en-us/library/36xybswe.aspx

    (?<angle><) is a named capture group.
    It is the same as (<), except that the result is captured in the named group 'angle' (the name can be anything).

    (?(angle)>) is a name alternation.
    It is the same as this pseudo-code:
     if (angle) then match(">")
    (angle) will be true if the named capture group above captured a non-empty pattern.
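
    A minimal Java sketch of the same "closing bracket only if there was an opening one" idea. It assumes Java 7 or later, where java.util.regex accepts (?<name>...) named groups but still has no (?(name)...) conditional, so the condition is checked in code after the match. The class name, the group names "addr" and "close", and the deliberately simplified ADDR stand-in are my own, not part of the original pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalAngleDemo {
    // Simplified stand-in for the real addr-spec pattern (assumption, for illustration only).
    static final String ADDR = "[a-z]+@[a-z]+\\.[a-z]{2,}";

    // Both brackets are optional in the regex; the pairing rule is enforced in code.
    static final Pattern P =
        Pattern.compile("^(?<angle><)?(?<addr>" + ADDR + ")(?<close>>)?$");

    static boolean isBalanced(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) {
            return false;
        }
        // group(name) returns null when the group did not take part in the match.
        boolean opened = m.group("angle") != null;
        boolean closed = m.group("close") != null;
        return opened == closed;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("mark@domain.com"));   // true
        System.out.println(isBalanced("<mark@domain.com>")); // true
        System.out.println(isBalanced("<mark@domain.com"));  // false
        System.out.println(isBalanced("mark@domain.com>"));  // false
    }
}
```

    Checking the pairing in code keeps the regex readable; the cost is that the rule no longer lives in the pattern alone.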

    (?(?<!\[)\.) is an expression alternation.
    It is the same as this pseudo-code:
     if (?<!\[) then match("\.")

    The nested (?<!\[) is a Zero-width negative lookbehind assertion.
    Putting it all together, it is:
     if (!lookbehind("\[")) then match("\.")
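
    For Java, one possible way to emulate this expression alternation is an ordinary alternation of two lookbehind cases: "not preceded by [ and then a dot" or "preceded by [ and then nothing". The sketch below is only an illustration; the class and constant names are hypothetical:

```java
import java.util.regex.Pattern;

class DotUnlessFirstDemo {
    // One decimal octet, 0 to 255 (same alternatives as in the original pattern).
    static final String OCTET = "(?:25[0-5]|2[0-4]\\d|[01]?\\d?\\d)";

    // Emulates (?(?<!\[)\.): a dot is required unless we are directly after '['.
    static final String DOT_UNLESS_FIRST = "(?:(?<!\\[)\\.|(?<=\\[))";

    static final Pattern IP_LITERAL =
        Pattern.compile("\\[(?:" + DOT_UNLESS_FIRST + OCTET + "){4}\\]");

    static boolean isIpLiteral(String s) {
        return IP_LITERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIpLiteral("[192.168.0.1]"));  // true
        System.out.println(isIpLiteral("[.192.168.0.1]")); // false
        System.out.println(isIpLiteral("[192.168.0]"));    // false
    }
}
```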

    Of course all the above might not help!

    OK, what do you want to match?

    My regexp matches an RFC 2822 'mailbox' (excluding comments and whitespace).
    That includes these strings:

     mark@domain.com
     <mark@domain.com>
     Mark Cranness <mark@domain.com>

    Do you want/need to match an RFC 2822 'mailbox'?
    Maybe it would be better/simpler to match an 'addr-spec' instead, and store the 'display-name' separately?
    An 'addr-spec' example is:
     mark@domain.com

    Assuming that you want to STORE an email address to send an email to, then if you stored fields like so:
     FIRST_NAME: Mark
     LAST_NAME: Cranness
     EMAIL:  mark@domain.com
    ... then you can build the full 'mailbox' address like so:
     FIRST_NAME + ' ' + LAST_NAME + ' <' + EMAIL + '>'

    The 'addr-spec' part of my regexp is this:

    ((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])

    That is simpler than the entire 'mailbox', and avoids some of the problem expressions.

    But it still has the problem (?(?<!\[)\.) expression...

    That problem expression (and some of the following) is a result of me being 'too clever' and trying to use the minimum number of characters to match an IP address.  It can be modified to work for Java/Perl like so:
    (25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}

    The final 'addr-spec' pattern looks like so (which should work in Java/Perl):
    ^((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[((25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])$
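
    In Java source every backslash in the pattern has to be doubled, which is easy to get wrong in a pattern this long. Here is a sketch that assembles the same 'addr-spec' pattern from smaller pieces; the class name, the constant names, and the isValid helper are mine and only illustrative:

```java
import java.util.regex.Pattern;

class AddrSpecDemo {
    // Characters allowed in an unquoted local part (copied from the pattern above).
    static final String ATEXT = "[a-zA-Z\\d!#$%&'*+\\-/=?^_`{|}~]";
    // Dot-separated atoms: no leading dot, no doubled dots.
    static final String DOT_ATOM = "(?!\\.)(?>\\.?" + ATEXT + "+)+";
    // Quoted string of ASCII characters with backslash escapes.
    static final String QUOTED =
        "\"((?=[\\x01-\\x7f])[^\"\\\\]|\\\\[\\x01-\\x7f])*\"";
    // Host name: labels without a leading or trailing hyphen, then an alphabetic TLD.
    static final String HOST = "((?!-)[a-zA-Z\\d\\-]+(?<!-)\\.)+[a-zA-Z]{2,}";
    static final String OCTET = "(25[0-5]|2[0-4]\\d|[01]?\\d?\\d)";
    static final String IPV4 = OCTET + "(\\." + OCTET + "){3}";
    // General address literal of the form tag:content.
    static final String GENERAL =
        "[a-zA-Z\\d\\-]*[a-zA-Z\\d]:((?=[\\x01-\\x7f])[^\\\\\\[\\]]|\\\\[\\x01-\\x7f])+";

    static final Pattern ADDR_SPEC = Pattern.compile(
        "^(" + DOT_ATOM + "|" + QUOTED + ")@(" + HOST
        + "|\\[(" + IPV4 + "|" + GENERAL + ")\\])$");

    static boolean isValid(String s) {
        return ADDR_SPEC.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("mark@domain.com"));     // true
        System.out.println(isValid("mark@[192.168.0.1]"));  // true
        System.out.println(isValid("mark@127.0.0.1"));      // false
        System.out.println(isValid(".mark@domain.com"));    // false
    }
}
```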

    If you MUST have the full 'mailbox', you could do it the hard way like so:

    ^(
    ((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*<((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[((25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])>)
    |
    ((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[((25[0-5]|2[0-4]\d|[01]?\d?\d)(\.(25[0-5]|2[0-4]\d|[01]?\d?\d)){3}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])
    )$
    (Remove the line breaks in the above. The outer parentheses make ^ and $ apply to both alternatives, and the first alternative ends with the closing >.)
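
    Rather than pasting the 'addr-spec' part twice, Java code can build the 'mailbox' pattern by string concatenation. The sketch below is only an assumption about how one might do that: the class, the mailboxPattern helper, and the simplified SIMPLE_ADDR stand-in are hypothetical, and the real 'addr-spec' string would be passed in instead of the stand-in:

```java
import java.util.regex.Pattern;

class MailboxDemo {
    static final String ATEXT = "[a-zA-Z\\d!#$%&'*+\\-/=?^_`{|}~]";
    static final String QUOTED =
        "\"((?=[\\x01-\\x7f])[^\"\\\\]|\\\\[\\x01-\\x7f])*\"";
    // Display name: atoms or quoted strings, each optionally followed by spaces.
    static final String DISPLAY = "(?>" + ATEXT + "+\\x20*|" + QUOTED + "\\x20*)*";

    // Builds: ^(?:display-name <addr-spec> | addr-spec)$ from one addr-spec string.
    static Pattern mailboxPattern(String addrSpec) {
        String a = "(?:" + addrSpec + ")"; // guard against a top-level | in addrSpec
        return Pattern.compile("^(?:" + DISPLAY + "<" + a + ">|" + a + ")$");
    }

    // Simplified stand-in for the real addr-spec (assumption, for illustration only).
    static final String SIMPLE_ADDR = "[a-z]+@[a-z]+\\.[a-z]{2,}";
    static final Pattern MAILBOX = mailboxPattern(SIMPLE_ADDR);

    static boolean isMailbox(String s) {
        return MAILBOX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMailbox("mark@domain.com"));                 // true
        System.out.println(isMailbox("<mark@domain.com>"));               // true
        System.out.println(isMailbox("Mark Cranness <mark@domain.com>")); // true
        System.out.println(isMailbox("Mark Cranness <mark@domain.com"));  // false
    }
}
```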

    Michael says you can't tell if an email is valid using a regex...
    Well, regular expressions are pretty powerful.
    I think I've done a pretty good job of matching an RFC 2822 'mailbox'.
    But I don't have enough real-world experience to know how my regex fares with real-world email addresses.
    One issue I am aware of is that some people tell me that mark@127.0.0.1 is a valid email address.  But it is not according to RFC 2822...

  •  01-28-2007, 11:16 PM 26761 in reply to 26742

    Re: Mark Cranness regEx for e-mails

    Mark, here's a blog entry on why I say it can't be done 100% with a regex, but that's about validating everything. Leaving out comments would make it a lot easier. Regexes that validate as much of RFC 2822 as possible run to a few thousand characters.

    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein