Preparing text for display handling word break and linking urls - outside tags

Last post 07-24-2008, 10:42 AM by ddrudik. 3 replies.
  •  07-24-2008, 3:25 AM 44517

    Preparing text for display handling word break and linking urls - outside tags

    Hi All,

    I am trying to run some regexps over long text (~4K).

    The text may include plain text and also HTML tags.

    Example:

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce et lorem. Donec leo augue, feugiat eu, vulputate sed, porttitor sed, elit. Sed mollis tellus vitae nunc. Etiam metus. Suspendisse ligula erat, ultrices vel, euismod ut, egestas <img src="http://www.SomeLongUrlHere.com" /> eget, justo. Vestibulum pretium, nisl eu pellentesque semper, tellus urna suscipit sem, quis http://www.AnyUrlHere.com volutpat augue odio et magna. Curabitur rhoncus arcu at leo. Nam lobortis, orci ac condimentum rutrum, sapien purus lacinia sapien, ut semper nulla eros a nisl. Morbi quis libero. Aliquam cursus elit at purus. Pellentesque cursus leo vitae felis.

    The processing flow:

    1. Link all URLs not inside HTML tags (between the < and >)

    e.g: text text www.MatchThisUrl.com text text text <a href="http://www.DoNotMatchThisUrl.com">http://www.MatchThisUrl.com</a>

    2. Insert a word-break delimiter (<wbr/>) into long words that are not inside HTML tags (between the < and >)

     e.g: text text aVeryLongWordThatTheRegExpShouldMatch text text text <object type="aVeryLongWordThatTheRegExpShould-NOT-Match ">aVeryLongWordThatTheRegExpShouldMatch</object>

    I am using this pattern to match the URLs:

    ((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@)(=.+?,#%&~-]*[^.|'|\#|!|\(|?|,| |>|<|;|\)])

    The problem is how to match only URLs that are not inside <...>.

     

    Thanks a lot,

    Rami

  •  07-24-2008, 6:17 AM 44519 in reply to 44517

    Re: Preparing text for display handling word break and linking urls - outside tags

    What platform are you using?  On some platforms (Classic ASP, etc.), if this were a replacement operation, I would first match the <[^>]*> blocks into an array and replace them in the source text with numbered placeholders, then match whatever I consider a valid URL:

    (?:www\.|(?:https?|ftp|news|file)://)\S+

    replacing each match with whatever link markup should wrap it.

    Then reinsert the previously matched <[^>]*> blocks from the array into the source text.  This is not very efficient, but some platforms are more limited in their regex abilities.
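The placeholder approach described above can be sketched in Java roughly as follows. This is only an illustration under a few assumptions: the class and method names are hypothetical, the placeholder delimiter (the private-use character U+E000) is assumed never to occur in the input, and the URL pattern is a simplified one modeled on the reply, requiring `://` after the scheme and stopping at whitespace or at a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderLinker {
    // Private-use character that delimits placeholders (assumed absent from the input).
    private static final char MARK = '\uE000';

    // Whole tags, as in the reply: <[^>]*>
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Simplified, illustrative URL pattern; stops at whitespace or at a placeholder.
    private static final Pattern URL =
            Pattern.compile("(?:www\\.|(?:https?|ftp|news|file)://)[^\\s\\uE000]+");

    // A numbered placeholder: MARK, digits, MARK.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\uE000(\\d+)\\uE000");

    static String linkUrls(String text) {
        // Step 1: move every tag into a list and leave a numbered placeholder behind.
        List<String> tags = new ArrayList<>();
        StringBuffer stripped = new StringBuffer();
        Matcher tagMatcher = TAG.matcher(text);
        while (tagMatcher.find()) {
            String placeholder = "" + MARK + tags.size() + MARK;
            tags.add(tagMatcher.group());
            tagMatcher.appendReplacement(stripped, placeholder);
        }
        tagMatcher.appendTail(stripped);

        // Step 2: wrap each URL in the tag-free text with a link ($0 is the whole match).
        String linked = URL.matcher(stripped).replaceAll("<a href=\"$0\">$0</a>");

        // Step 3: put the original tags back in place of their placeholders.
        StringBuffer result = new StringBuffer();
        Matcher placeholderMatcher = PLACEHOLDER.matcher(linked);
        while (placeholderMatcher.find()) {
            String tag = tags.get(Integer.parseInt(placeholderMatcher.group(1)));
            // quoteReplacement keeps any '$' or '\' inside the tag literal.
            placeholderMatcher.appendReplacement(result, Matcher.quoteReplacement(tag));
        }
        placeholderMatcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String text = "quis http://www.AnyUrlHere.com volutpat "
                + "<img src=\"http://www.SomeLongUrlHere.com\" /> eget";
        System.out.println(linkUrls(text));
    }
}
```

With the sample input in `main`, the URL in the plain text is wrapped in a link while the URL inside the `<img>` tag is left alone. Note that a bare `www.` match ends up in `href` without a scheme; real code would probably want to prepend one.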

    On other platforms such as PHP and .NET you can avoid this extra replacement/reinsertion step.

    As for the long non-tag \S+ sections, again it would depend on your platform.


  •  07-24-2008, 9:32 AM 44524 in reply to 44519

    Re: Preparing text for display handling word break and linking urls - outside tags

    I'm using Java.

     

    Now this pattern works for the URLs:

    (?![^<]+>)((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@)(=.+?,#%&~-]*[^.|'|\#|!|\(|?|,| |>|<|;|\)])(?![^<]+>)

     

    But for word break I use:

     ((?![^<]+>)\S){20,}(?![^<]+>)

     

    and it still doesn't match long words at all.

  •  07-24-2008, 10:42 AM 44528 in reply to 44524

    Re: Preparing text for display handling word break and linking urls - outside tags

    This is the match I get with your first pattern; it must work differently on your platform:

    $matches Array:
    (
        [0] => Array
            (
                [0] => http://www.A
            )

    )
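One possible explanation for the truncated match above is case sensitivity: the character classes in the pattern only list lowercase `a-z`, so without a case-insensitive flag the match stops right after the first uppercase letter. The thread does not say which flags were used on either platform, so this is an assumption. The Java sketch below (class and method names are hypothetical) runs the same pattern with and without `Pattern.CASE_INSENSITIVE`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseFlagDemo {
    // The URL pattern from the post above, copied with Java string escaping.
    static final String URL_REGEX =
            "(?![^<]+>)((www\\.|(http|https|ftp|news|file)+\\:\\/\\/)[_.a-z0-9-]+\\."
            + "[a-z0-9\\/_:@)(=.+?,#%&~-]*[^.|'|\\#|!|\\(|?|,| |>|<|;|\\)])(?![^<]+>)";

    // Returns the first match of URL_REGEX in text, or null if there is none.
    static String firstMatch(String text, int flags) {
        Matcher m = Pattern.compile(URL_REGEX, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "quis http://www.AnyUrlHere.com volutpat";
        // Case-sensitive: prints http://www.A
        System.out.println(firstMatch(text, 0));
        // Case-insensitive: prints http://www.AnyUrlHere.com
        System.out.println(firstMatch(text, Pattern.CASE_INSENSITIVE));
    }
}
```

In PHP the equivalent would be the `i` modifier on the pattern.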

    This worked for your second task in PHP:
    Raw Match Pattern:
    (?:(?![^<]+>)\S){20}

    Raw Replace Pattern:
    $0<wbr>
     

    This assumes you want to break long words (runs of \S) at 20-character increments.
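Since the original poster is on Java, here is a rough port of the PHP pattern and replacement above. It is a sketch only: the class and method names are hypothetical, and it keeps the `<wbr>` replacement text from the reply rather than the `<wbr/>` form mentioned in the question. Java's `replaceAll` also understands `$0` as the whole match.

```java
import java.util.regex.Pattern;

class WordBreaker {
    // 20 consecutive non-whitespace characters, none of which is followed by
    // "non-'<' characters and then '>'" (that is, none of which sits inside a tag).
    private static final Pattern LONG_RUN =
            Pattern.compile("(?:(?![^<]+>)\\S){20}");

    static String insertWordBreaks(String text) {
        // $0 is the whole 20-character run; <wbr> is appended after it.
        return LONG_RUN.matcher(text).replaceAll("$0<wbr>");
    }

    public static void main(String[] args) {
        String text = "aVeryLongWordThatTheRegExpShouldMatch "
                + "<img src=\"http://www.SomeLongUrlHere.com\" /> ok";
        // Prints:
        // aVeryLongWordThatThe<wbr>RegExpShouldMatch <img src="http://www.SomeLongUrlHere.com" /> ok
        System.out.println(insertWordBreaks(text));
    }
}
```

One quirk worth knowing: the lookahead does not exclude the `<` and `>` characters themselves, so a `>` that closes a tag counts as the first character of the run that follows it. Using `[^\s<>]` in place of `\S` would be one way to avoid that, if it matters.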

