
linkify with a twist

Last post 02-06-2013, 2:32 PM by SectionI. 6 replies.
  •  02-04-2013, 6:29 AM 87276

    linkify with a twist

    Hi,

    I would like to replace all links that point to a page on facebook to have the text: "facebook page" and all other links should have the text: "website"

    this is what i currently use to linkify:

    Regex.Replace(pic.PictureDesc, @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)","<a href=\"$1\" target=\"blank\">$1</a>")

    if you could help me adjust for both rules.

    thank you

  •  02-04-2013, 5:25 PM 87277 in reply to 87276

    Re: linkify with a twist

    What you want to do should be fairly straightforward. However, I have little knowledge of how to detect a facebook URL - can you please let me know what the URLs look like? (I'm guessing that they contain xxx.facebook.com but I am not sure.)

    Also I am not clear what you are wanting to end up with at the end. Perhaps you could provide me with some sample "before" and "after" text.

    However, this will almost certainly require 2 passes over the text. While most regex variants provide a set of very powerful pattern operators, they are generally very simplistic in how you can create the replacement strings. Therefore you will probably need to have one pattern that replaces the facebook URLs (as this is the more specific of the 2 patterns) and then run over the text again with the pattern that replaces all of the other URLs.

    There are a number of suggestions I could make about your pattern but they are secondary to getting the basic operation going for you.

    Susan 

  •  02-04-2013, 5:41 PM 87278 in reply to 87277

    Re: linkify with a twist

    you were correct. i want to detect urls containing facebook.com in them.

     

    if that domain exists the link should appear like so;

    <a href="http://www.facebook.com/somepage">facebook page</a>

    else

     <a href="http://www.somewebsite.com/somepage">website</a>
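
    For reference, both labels can also be produced in a single pass when the regex engine accepts a function as the replacement, as .NET's Regex.Replace does with a MatchEvaluator and Python's re.sub does with a callable. The following is only a minimal sketch in Python rather than .NET. It assumes that a "facebook" link means the host facebook.com or any subdomain of it, and the names linkify, URL and FACEBOOK_HOST are made up for the illustration:

```python
import html
import re

# Loose URL matcher based on the pattern in the first post (scheme, then URL characters).
URL = re.compile(r"(?:https?|ftp)://[\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#]", re.IGNORECASE)

# Assumption: a Facebook link is facebook.com itself or any subdomain of it.
FACEBOOK_HOST = re.compile(r"[a-z]+://(?:[\w\-]+\.)*facebook\.com(?![\w\-.])", re.IGNORECASE)


def linkify(text):
    """Wrap every URL in an anchor whose label depends on the host."""
    def make_anchor(match):
        url = match.group(0)
        label = "facebook page" if FACEBOOK_HOST.match(url) else "website"
        # Escape the URL so characters such as & are safe inside the attribute.
        return '<a href="{}" target="blank">{}</a>'.format(html.escape(url), label)

    return URL.sub(make_anchor, text)


print(linkify("http://www.facebook.com/somepage and http://www.somewebsite.com/somepage"))
```

    Because the label is chosen inside the callback, the text is scanned only once and a link that has already been rewritten is never matched a second time.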

  •  02-04-2013, 9:50 PM 87279 in reply to 87278

    Re: linkify with a twist

    OK, for the first pass to look for the facebook URLs, try:

     

    (http|ftp|https):\/\/(?![\w.]*facebook.com)[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?

     

    with the replacement text:

    <a href=\"$0\" target=\"blank\">facebook</a>

    and the "ignore case" option set.

    With the test strings:

     

    http://wombat.com.au/first/second
    http://my.friend.on.facebook.com/whatever?comes=after
    http://facebook.com/whatever?comes=after
    http://www.facebook.com/whatever?comes=after
    ftp://facebook.is.silly.com/test.message
    https://silly

     

    the result is:

     

    http://wombat.com.au/first/second
    <a href=\"http://my.friend.on.facebook.com/whatever?comes=after\" target=\"blank\">facebook</a>
    <a href=\"http://facebook.com/whatever?comes=after\" target=\"blank\">facebook</a>
    <a href=\"http://www.facebook.com/whatever?comes=after\" target=\"blank\">facebook</a>
    ftp://facebook.is.silly.com/test.message
    https://silly

    where you can see that the 2nd, 3rd and 4th lines are substituted but the others are not.

     

    Then using:

     

    (http|ftp|https):\/\/(?![\w.]*facebook.com)[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?

     

    and the replacement text:

    <a href=\"$0\" target=\"blank\">website</a>

    and the same option settings, run over the text that resulted from the first pass, you get:

    <a href=\"http://wombat.com.au/first/second\" target=\"blank\">website</a>
    <a href=\"http://my.friend.on.facebook.com/whatever?comes=after\" target=\"blank\">facebook</a>
    <a href=\"http://facebook.com/whatever?comes=after\" target=\"blank\">facebook</a>
    <a href=\"http://www.facebook.com/whatever?comes=after\" target=\"blank\">facebook</a>
    <a href=\"ftp://facebook.is.silly.com/test.message\" target=\"blank\">website</a>
    https://silly

    I would suggest that you look closely at the part of the pattern that matches the URL - URLs are notorious for not having any structure that a regex pattern can verify. Generally, if you need to locate a URL, you can start with the "http" or "ftp" etc., and then accept anything until you get to something that does not make sense in the context of the source file - whitespace or "<" or ">" characters for example (even though they can form parts of valid URLs in some cases).

    As an example, if you use:

    (http|ftp|https):\/\/[\w\-.,@?^=%&:/~+#]*

    which is your pattern with all but the first and last parts taken out, then it will match ALL of the URLs used above, and so does:

    (http|ftp|https):\/\/[\w=?.%/]*

    You can also get fancy with factoring the first part to make it:

    (https?|ftp):\/\/[\w=?.%/]*
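
    The claim that all three simplified patterns cover the test URLs can be checked with a short script. This is a sketch using Python's re module rather than .NET, and the names TEST_URLS, SIMPLIFIED_PATTERNS and matches_whole are made up for the illustration:

```python
import re

# The six test strings used earlier in this post.
TEST_URLS = [
    "http://wombat.com.au/first/second",
    "http://my.friend.on.facebook.com/whatever?comes=after",
    "http://facebook.com/whatever?comes=after",
    "http://www.facebook.com/whatever?comes=after",
    "ftp://facebook.is.silly.com/test.message",
    "https://silly",
]

# The three simplified patterns, copied as written.
SIMPLIFIED_PATTERNS = [
    r"(http|ftp|https):\/\/[\w\-.,@?^=%&:/~+#]*",
    r"(http|ftp|https):\/\/[\w=?.%/]*",
    r"(https?|ftp):\/\/[\w=?.%/]*",
]


def matches_whole(pattern, url):
    """True when the pattern consumes the entire URL."""
    return re.fullmatch(pattern, url, re.IGNORECASE) is not None


for pattern in SIMPLIFIED_PATTERNS:
    print(pattern, all(matches_whole(pattern, url) for url in TEST_URLS))
```

    One thing to keep in mind with the two shorter character sets is that they leave out "-", so a made-up address such as "http://my-site.com" would only be matched as far as "http://my".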

    Susan

  •  02-05-2013, 7:45 AM 87280 in reply to 87279

    Re: linkify with a twist

    Well, the first and second rules you wrote are exactly the same.

    and they work only on non-facebook urls.

    i tried removing the ! from one of the rules but i received an error

  •  02-05-2013, 5:16 PM 87282 in reply to 87280

    Re: linkify with a twist

    Sorry - my mistake. I was testing out both patterns and I must have copied and pasted the wrong one. It should be:

    (http|ftp|https):\/\/[\w.]*?facebook.com([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?
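
    With this pattern as the first pass and the negative-lookahead pattern as the second, the two passes can be sketched as follows. This uses Python's re module rather than .NET, so the whole match is written \g<0> where the replacement text in this thread uses $0. The character sets are written without embedded spaces, and the names FACEBOOK, OTHER and two_pass are made up for the illustration:

```python
import re

FLAGS = re.IGNORECASE

# First pass: URLs whose host contains "facebook.com".
FACEBOOK = re.compile(
    r"(http|ftp|https):\/\/[\w.]*?facebook.com([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?",
    FLAGS)

# Second pass: every other URL; the negative lookahead skips Facebook hosts,
# including the ones already wrapped in an anchor by the first pass.
OTHER = re.compile(
    r"(http|ftp|https):\/\/(?![\w.]*facebook.com)[\w\-_]+(\.[\w\-_]+)+"
    r"([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?",
    FLAGS)


def two_pass(text):
    """Label Facebook links first, then label whatever URLs remain."""
    text = FACEBOOK.sub(r'<a href="\g<0>" target="blank">facebook page</a>', text)
    return OTHER.sub(r'<a href="\g<0>" target="blank">website</a>', text)


sample = "\n".join([
    "http://wombat.com.au/first/second",
    "http://www.facebook.com/whatever?comes=after",
    "ftp://facebook.is.silly.com/test.message",
    "https://silly",
])
print(two_pass(sample))
```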

    Also, if you tried to simply remove the '!' then 1) it would not have worked but 2) it shows that you probably have little or no understanding of what the pattern is actually doing. Therefore I suspect that you will have problems in the future if you want to change anything.

    By way of an explanation:

    (http|ftp|https):\/\/      - this is an alternation that matches the first of "http://", "ftp://" or "https://" that it finds (see #1 below)
    [\w.]*?facebook.com        - the first part is a character set definition that matches any alphanumeric, underscore or period character, 0 or more times;
                                 it will then only match "facebook.com" (see #2 below)
    (                          - the start of a matching group
    [\w\-.,@?^=%&:/~+#]*       - a character set definition that matches an alphanumeric or one of the special characters listed, 0 or more times
    ([\w\-@?^=%&/~+#])         - basically the same as before but with the period, comma and colon missing, and it matches only a single character (see #3 below)
    )?                         - the end of the matching group started above; the "?" also makes a match of the whole group optional
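
    The same breakdown can be kept next to the pattern itself when the engine has a "verbose" or "ignore pattern whitespace" mode. Here is a sketch using Python's re.VERBOSE flag rather than .NET; the name FACEBOOK_URL and the sample sentence are made up for the illustration:

```python
import re

FACEBOOK_URL = re.compile(r"""
    (http|ftp|https)://        # group 1: the scheme, then a literal "://"
    [\w.]*?                    # lazily step over host labels such as "www."
    facebook.com               # the literal text (the unescaped "." matches any character)
    (                          # group 2 starts: everything after the host
      [\w\-.,@?^=%&:/~+#]*     #   zero or more URL characters
      ([\w\-@?^=%&/~+#])       #   group 3: one final character, not . , or :
    )?                         # group 2 ends and is optional
""", re.VERBOSE | re.IGNORECASE)

match = FACEBOOK_URL.search("see http://www.facebook.com/whatever?comes=after today")
print(match.group(0))  # the whole URL
print(match.group(1))  # the scheme
print(match.group(2))  # the part after the host
```

    Since the "." in "facebook.com" is not escaped, the pattern would also accept made-up text such as "http://facebookXcom"; writing "facebook\.com" would rule that out.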

    Note #1: while this does match the start of the URL, it does so in a slightly inefficient manner. Consider when the text is "https://". The first alternative will match the first 4 letters and then fail as the "s" in the text won't match the ':' in the pattern. Therefore it will try the 2nd alternative and fail on the first character. The 3rd alternative will match the first 5 characters and then carry on with the rest of the pattern.

    However, if you used one of my suggestions: 'https?' then it would avoid the backtracking and attempts for the other alternatives because the "http" will match and then, IF the next character is "s", it will be matched; but if it is not then no problem and the rest of the pattern will be attempted.

    Admittedly not much of a saving in this case but as a general rule, if you can avoid backtracking then it is a good idea.
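
    The backtracking in Note #1 can be made visible by looking at what the first group captures. This is a small sketch using Python's re module rather than .NET; the helper name scheme and the sample address are made up for the illustration:

```python
import re


def scheme(pattern, text):
    """Return what group 1 captured when the pattern matches at the start of text."""
    return re.match(pattern, text).group(1)


text = "https://example.com"  # made-up sample input

# On its own, the alternation is satisfied by its first alternative.
print(scheme(r"(http|ftp|https)", text))
# Followed by "://", "http" fails at the "s", so the engine backtracks to "https".
print(scheme(r"(http|ftp|https)://", text))
# The factored form reaches "https" without trying any other alternative.
print(scheme(r"(https?|ftp)://", text))
```

    The first call reports "http" and the other two report "https", which shows that the alternation only reaches its third alternative after the first one has been tried and abandoned.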

    Note #2: This part is attempting to match the "facebook.com" part of the URL but is actually written in a fairly "sloppy" manner. The '[\w.]*?' part at the beginning will match any part of the address that comes before the "facebook.com" literal (as in "this.is.my.facebook.com") but will only allow alphanumerics, the underscore character and the period. This is a rough attempt to stop it from going beyond the address part, and possibly matching all of (as a made up example):

    http://fred.com.au/target=Facebook.com/fred

    The quantifier is '*?' which is called a lazy quantifier. Normally the '*' quantifier will try to match as many characters as possible and, in this case, it would match right to the end of the ".com" part. Then when it tried to match the 'facebook' literal in the pattern it would need to start backtracking to the start of the "facebook" text and then move forward. By making the quantifier lazy, the regex engine will actually try to match the "facebook" part and, when it fails, then use the '[\w.]' part to step forward over the intervening characters (if it can). Therefore it steps forward 1 character at a time until it gets to the "facebook" part then takes off from there.
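
    For an ordinary address the lazy and greedy forms end up with the same match and differ only in the route taken. The difference becomes visible on a contrived address that contains the literal twice. This is a sketch using Python's re module rather than .NET; the sample addresses are made up, and the dot is escaped here as "facebook\.com" so that it only matches a period:

```python
import re

LAZY = re.compile(r"https?://[\w.]*?facebook\.com")
GREEDY = re.compile(r"https?://[\w.]*facebook\.com")

ordinary = "http://www.facebook.com/page"
doubled = "http://a.facebook.com.facebook.com/page"  # contrived on purpose

# Same result on an ordinary address.
print(LAZY.search(ordinary).group(0), GREEDY.search(ordinary).group(0))

# Lazy stops at the first "facebook.com"; greedy backs off from the end
# of the host and therefore stops at the last one.
print(LAZY.search(doubled).group(0))
print(GREEDY.search(doubled).group(0))
```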

    There are some dangers in simply using "facebook.com" in that it will also match the middle of "myfacebook.commute". In this case it is relying on what is before and after the literal in the pattern to protect it from such a mistake. Unfortunately the rest of the pattern does not play along. Therefore, if you are finding that you are picking up too many false references, then try

    [\w.]*?\bfacebook.com\b

    instead. The '\b' is called a 'zero-width anchor': the "zero-width" comes from the fact that it must match but does NOT advance the text pointer when it does - zero characters are "consumed" by this operator. The "anchor" comes from the fact that it forces some characteristic to be true at the point in the text. In this case it requires that there be a "word" character (generally defined to be the same as the '\w' short-cut: alphanumeric or underscore character) on one side and a non-word character (anything OTHER than an alphanumeric or underscore) on the other. Some texts refer to this as marking the beginning or end of a word.
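
    The effect of the two '\b' anchors can be seen by running both versions over a good address and over the "myfacebook.commute" example. This is a sketch using Python's re module rather than .NET; the names PLAIN and BOUNDED are made up for the illustration:

```python
import re

PLAIN = re.compile(r"[\w.]*?facebook.com")
BOUNDED = re.compile(r"[\w.]*?\bfacebook.com\b")

for text in ("www.facebook.com/page", "myfacebook.commute"):
    # Show whether each version finds a match anywhere in the text.
    print(text, bool(PLAIN.search(text)), bool(BOUNDED.search(text)))
```

    Both versions accept "www.facebook.com/page", but only the version without the anchors accepts "myfacebook.commute": there is no word boundary between "my" and "facebook", nor between "com" and "mute".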

    Note #3: This whole section of the pattern is the basis of my previous comment about possibly cleaning up how a URL is being matched. What this appears to be doing is to match any of the characters listed in the first character set definition which, I assume, should match anything that can come after the address and still be part of the URL. Once it has matched as many of those characters as it can, then it tries to match one character from the 2nd character set which is basically the same except for the component separator characters (period, comma and colon). I think this is an attempt to stop matches with

    http://facebook.com/whatever=fred:

    However, what it will do is to match everything up to but not including the colon.
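
    This behaviour is easy to confirm. The sketch below uses Python's re module rather than .NET, writes the character sets without embedded spaces, and uses the made-up name FACEBOOK for the compiled pattern:

```python
import re

FACEBOOK = re.compile(
    r"(http|ftp|https):\/\/[\w.]*?facebook.com([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?",
    re.IGNORECASE)

# The greedy first set swallows the trailing colon, then gives it back so
# that the second set can match the final "d".
print(FACEBOOK.search("http://facebook.com/whatever=fred:").group(0))

# The same happens with a full stop at the end of a sentence.
print(FACEBOOK.search("See http://facebook.com/page.").group(0))
```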

    As for why you cannot simply remove the '!' in my erroneous pattern: it is part of a negative lookahead, '(?!...)', which in that pattern stops a match when the "facebook.com" text follows. Deleting the '!' outright leaves '(?[', which is not a valid construct, and that is most likely the error you received. Replacing the '!' with '=' would turn the negative lookahead into a positive lookahead, which requires that the "facebook.com" text appears. However, a lookahead is a "zero width" operator - therefore it will check that the "facebook.com" text appears, but then carry on with the rest of the pattern regardless.
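
    The three variants can be compared side by side. This is a sketch using Python's re module rather than .NET, so the exact error message will differ, and the names REST, NEGATIVE, POSITIVE and is_valid are made up for the illustration. In Python, at least, deleting the '!' outright leaves a group that the engine refuses to compile:

```python
import re

REST = r"[\w\-_]+(\.[\w\-_]+)+"  # the host part that follows the lookahead
NEGATIVE = re.compile(r"(http|ftp|https)://(?![\w.]*facebook.com)" + REST, re.IGNORECASE)
POSITIVE = re.compile(r"(http|ftp|https)://(?=[\w.]*facebook.com)" + REST, re.IGNORECASE)


def is_valid(pattern):
    """True when the pattern compiles."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# "(?[" is not a recognised group, so this prints False.
print(is_valid(r"(http|ftp|https)://(?[\w.]*facebook.com)" + REST))

for url in ("http://www.facebook.com/x", "http://wombat.com.au/x"):
    # The lookahead consumes nothing; REST still has to match the host itself.
    print(url, bool(NEGATIVE.search(url)), bool(POSITIVE.search(url)))
```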

    Susan 

     

  •  02-06-2013, 2:32 PM 87283 in reply to 87282

    Re: linkify with a twist

    thanks a lot for your explanations.

     

    there comes a point in a programmer's life when he has to admit that he's regex-incompetent and seek professional help...
