Hi All,
I am trying to apply some regexps to long text (~4K).
The text may contain both plain text and HTML tags.
Example:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce et
lorem. Donec leo augue, feugiat eu, vulputate sed, porttitor sed, elit.
Sed mollis tellus vitae nunc. Etiam metus. Suspendisse ligula erat,
ultrices vel, euismod ut, egestas <img src="http://www.SomeLongUrlHere.com" /> eget, justo. Vestibulum pretium, nisl
eu pellentesque semper, tellus urna suscipit sem, quis http://www.AnyUrlHere.com volutpat augue
odio et magna. Curabitur rhoncus arcu at leo. Nam lobortis, orci ac
condimentum rutrum, sapien purus lacinia sapien, ut semper nulla eros a
nisl. Morbi quis libero. Aliquam cursus elit at purus. Pellentesque
cursus leo vitae felis.
The processing flow:
1. Link all URLs that are not inside HTML tags (i.e. not between a < and a >)
e.g.: text text www.MatchThisUrl.com text text text <a href="http://www.DoNotMatchThisUrl.com">http://www.MatchThisUrl.com</a>
2. Insert a word-break delimiter (<wbr/>) into long words that are not inside HTML tags (i.e. not between a < and a >)
e.g.: text text aVeryLongWordThatTheRegExpShouldMatch text text text <object type="aVeryLongWordThatTheRegExpShould-NOT-Match ">aVeryLongWordThatTheRegExpShouldMatch</object>
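To make step 2 concrete, here is a minimal sketch of one possible approach: split the text on tags and touch only the pieces between them. It is written in Python, which is an assumption since no language is named here. The names insert_word_breaks and MAX_WORD and the threshold of 20 characters are hypothetical, and the tag pattern assumes no attribute value contains a literal ">".

```python
import re

# Splitting on a capturing group keeps the tags in the result, so the list
# alternates: text, tag, text, tag, ..., text.
TAG_RE = re.compile(r'(<[^>]*>)')

MAX_WORD = 20  # hypothetical threshold: longest run allowed without a break

# MAX_WORD non-space characters that are followed by at least one more.
LONG_RUN_RE = re.compile(r'\S{%d}(?=\S)' % MAX_WORD)


def insert_word_breaks(text):
    """Insert <wbr/> into long words, leaving everything inside <...> alone."""
    parts = TAG_RE.split(text)
    # Even indexes hold the text outside tags; odd indexes hold the tags.
    for i in range(0, len(parts), 2):
        parts[i] = LONG_RUN_RE.sub(lambda m: m.group(0) + '<wbr/>', parts[i])
    return ''.join(parts)


sample = ('text aVeryLongWordThatTheRegExpShouldMatch text '
          '<object type="aVeryLongWordThatTheRegExpShould-NOT-Match ">'
          'aVeryLongWordThatTheRegExpShouldMatch</object>')
print(insert_word_breaks(sample))
```

With these assumptions the long word is broken both in the plain text and between the object tags, while the type attribute stays untouched.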
I am using this pattern to match the URLs:
((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@)(=.+?,#%&~-]*[^.|'|\#|!|\(|?|,| |>|<|;|\)])
The problem is how to match only the URLs that are not inside <...>.
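One idea I have seen for this kind of problem, sketched here in Python (an assumption, any engine with replacement callbacks should do): match either a whole tag or a URL in a single alternation, and rewrite only the URL matches. Because the tag branch comes first, a URL inside <...> is consumed together with its tag and never reaches the URL branch. The URL pattern below is a simplified stand-in for illustration, not the pattern above, and link_urls is a hypothetical name. It also assumes no attribute value contains a literal ">".

```python
import re

# Simplified URL pattern for illustration only; a real one can be swapped in.
# The last character class keeps trailing punctuation out of the match.
URL = r'(?:www\.|(?:https?|ftp|news|file)://)[^\s<>"]*[^\s<>".,;:!?\'()]'

# Tag branch first: a tag (and any URL inside it) is matched as one unit.
TAG_OR_URL_RE = re.compile(r'(?P<tag><[^>]*>)|(?P<url>%s)' % URL,
                           re.IGNORECASE)


def link_urls(text):
    """Wrap URLs in <a> elements, skipping anything between < and >."""
    def repl(m):
        if m.group('tag'):
            return m.group(0)  # leave tags exactly as they are
        url = m.group('url')
        # Bare "www." links need a scheme to work as an href.
        href = url if '://' in url else 'http://' + url
        return '<a href="%s">%s</a>' % (href, url)
    return TAG_OR_URL_RE.sub(repl, text)


sample = ('egestas <img src="http://www.SomeLongUrlHere.com" /> eget, '
          'quis http://www.AnyUrlHere.com volutpat')
print(link_urls(sample))
```

Note that a URL used as the text of an existing <a> element is still matched, as in the example in step 1, so that case would need extra handling if nested links are not wanted.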
Thanks a lot,
Rami