Sorry - my mistake. I was testing both patterns and must have copied and pasted the wrong one. It should be:
(http|ftp|https):\/\/[\w.]*?facebook.com([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?
Also, if you tried to simply remove the '!' then 1) it would not have worked but 2) it shows that you probably have little or no understanding of what the pattern is actually doing. Therefore I suspect that you will have problems in the future if you want to change anything.
By way of an explanation:
(http|ftp|https):\/\/ - this is an alternation that matches the first alternative that succeeds out of "http://", "ftp://" or "https://" (see #1 below)
[\w.]*?facebook.com - the first part is a character set definition that matches any alphanumeric, underscore or period character and does so 0 or more times;
it must then be followed by "facebook.com" (see #2 below)
( - the start of a matching group
[\w\-.,@?^=%&:/~+#]* - this is a character set definition that will match an alphanumeric or one of the special characters that are also listed, and does so 0 or more times
([\w\-@?^=%&/~+#]) - basically the same thing as before but with the period, comma and colon missing, and it will only match a single character (see #3 below)
)? - the end of the matching group started above but also makes a match of the whole group optional
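To see the pieces working together, here is a rough sketch in Python (my choice of language; the original question may well be using a different regex engine). The pattern is assembled from the parts listed above, and the function name and sample URLs are made up for illustration.

```python
import re

# Pattern assembled from the pieces described above. Note that the dot in
# "facebook.com" is left unescaped, as in the post, so it matches any character.
URL_PATTERN = re.compile(
    r"(http|ftp|https):\/\/"      # scheme alternation plus "://"
    r"[\w.]*?facebook.com"        # lazy prefix, then the literal
    r"("                          # start of the optional group
    r"[\w\-.,@?^=%&:/~+#]*"       # anything allowed after the address
    r"([\w\-@?^=%&/~+#])"         # one final character (no . , or :)
    r")?"                         # the whole group is optional
)

def describe(text):
    """Hypothetical helper: return the parts of the first match, or None."""
    m = URL_PATTERN.search(text)
    if m is None:
        return None
    return {
        "match": m.group(0),
        "scheme": m.group(1),
        "rest": m.group(2),
        "last_char": m.group(3),
    }

print(describe("http://www.facebook.com/some.page?id=1"))
print(describe("ftp://facebook.com"))
print(describe("http://example.com/"))
```

When nothing follows the address, the optional group simply does not take part in the match, so its captures come back as None.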
Note #1: while this does match the start of the URL, it does so in a slightly inefficient manner. Consider when the text is "https://". The first alternative will match the first 4 letters and then fail as the "s" in the text won't match the ':' in the pattern. Therefore it will try the 2nd alternative and fail on the first character. The 3rd alternative will match the first 5 characters and then carry on with the rest of the pattern.
However, if you used one of my suggestions, 'https?', then it would avoid the backtracking and the attempts at the other alternatives, because the "http" will match and then IF the next character is "s" it will be matched; but if it is not then no problem and the rest of the pattern will be attempted.
Admittedly not much of a saving in this case but as a general rule, if you can avoid backtracking then it is a good idea.
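As a quick check that the two forms accept the same text, here is a small Python sketch. I have kept "ftp" as a second alternative so that the comparison is like for like; the variable names and test strings are my own.

```python
import re

# Original form: three alternatives tried in order.
alternation = re.compile(r"(http|ftp|https)://")

# Suggested form: "http" with an optional "s", plus "ftp" kept for parity.
optional_s = re.compile(r"(https?|ftp)://")

samples = ["http://", "https://", "ftp://", "htt://", "httpss://"]

# Both patterns should agree on whether each sample is a valid scheme.
results = {
    s: (bool(alternation.fullmatch(s)), bool(optional_s.fullmatch(s)))
    for s in samples
}

for sample, (a, b) in results.items():
    print(f"{sample!r}: alternation={a} optional_s={b}")
```

This only shows that the results are the same; it does not measure the saving in backtracking, which is tiny for a pattern this short.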
Note #2: This part is attempting to match the "facebook.com" part of the URL but is actually written in a fairly "sloppy" manner. The '[\w.]*?' part at the beginning will match any part of the address that comes before the "facebook.com" literal (as in "this.is.my.facebook.com") but will only allow alphanumerics, the underscore character and the period. This is a rough attempt to stop it from going beyond the address part, and possibly matching all of (as a made up example):
The quantifier is '*?' which is called a lazy quantifier. Normally the '*' quantifier will try to match as many characters as possible and, in this case, it would match right to the end of the ".com" part. Then when it tried to match the 'facebook' literal in the pattern it would need to backtrack to the start of the "facebook" text and then move forward. By making the quantifier lazy, the regex engine will actually try to match the "facebook" part first and, when that fails, use the '[\w.]' part to step forward over the intervening characters (if it can). Therefore it steps forward 1 character at a time until it gets to the "facebook" part and then takes off from there.
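The backtracking itself is not visible from outside the engine, but the difference between the greedy and lazy forms shows up when "facebook" appears more than once. A small Python sketch, using a made-up host name and a capture group added only so the prefix can be inspected:

```python
import re

# Made-up host with "facebook" appearing twice.
host = "a.facebook.b.facebook.com"

# Greedy: grabs as much as possible, then backs off to the LAST "facebook".
greedy = re.match(r"([\w.]*)facebook", host)

# Lazy: grows one character at a time and stops at the FIRST "facebook".
lazy = re.match(r"([\w.]*?)facebook", host)

greedy_prefix = greedy.group(1)
lazy_prefix = lazy.group(1)

print("greedy prefix:", repr(greedy_prefix))
print("lazy prefix:  ", repr(lazy_prefix))
```

With a single "facebook" in the text both forms end up with the same match; only the amount of work the engine does differs.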
There are some dangers in simply using "facebook.com" in that it will also match the middle of "myfacebook.commute". In this case it is relying on what is before and after the literal in the pattern to protect it from such a mistake. Unfortunately the rest of the pattern does not play along. Therefore, if you are finding that you are picking up too many false references, then try
instead. The '\b' is called a "zero-width anchor": the "zero-width" comes from the fact that it must match but does NOT advance the text pointer when it does - zero characters are "consumed" by this operator. The "anchor" comes from the fact that it forces some characteristic to be true at that point in the text. In this case it requires that there be a "word" character (generally defined to be the same as the '\w' short-cut: alphanumeric or underscore character) on one side and a non-word character (anything OTHER than an alphanumeric or underscore) on the other; the very start or end of the text counts as the non-word side. Some texts refer to this as marking the beginning or end of a word.
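To illustrate how '\b' behaves, here is a Python sketch comparing a bare literal with one wrapped in word boundaries. Both patterns below are my own illustration (with the dot escaped), not necessarily the exact pattern suggested above.

```python
import re

# Illustrative patterns only (assumed, not taken from the post).
plain = re.compile(r"facebook\.com")
bounded = re.compile(r"\bfacebook\.com\b")

false_hit = "myfacebook.commute"
real_hit = "www.facebook.com/page"

# The bare literal happily matches inside the longer words.
plain_false = plain.search(false_hit) is not None

# With \b there is no boundary between "y" and "f" (both are word
# characters), so the false hit is rejected.
bounded_false = bounded.search(false_hit) is not None

# In the real address "." sits before "f" and "/" after "m", so both
# boundaries are satisfied.
bounded_real = bounded.search(real_hit) is not None

print(plain_false, bounded_false, bounded_real)
```

Note that a '\b' placed directly after '[\w.]*?' in the full pattern would also restrict what that prefix can end with, so it is worth testing against your real data.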
Note #3: This whole section of the pattern is the basis of my previous comment about possibly cleaning up how a URL is being matched. What this appears to be doing is to match any of the characters listed in the first character set definition which, I assume, should match anything that can come after the address and still be part of the URL. Once it has matched as many of those characters as it can, then it tries to match one character from the 2nd character set which is basically the same except for the component separator characters (period, comma and colon). I think this is an attempt to stop matches with
However, what it will do is to match everything up to but not including the colon.
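The effect of the second character set can be seen with a trailing colon or period. This is a Python sketch using only the tail of the pattern plus a fixed "facebook.com" prefix; the sample strings are invented.

```python
import re

# Tail of the pattern from the post, after a simplified fixed prefix.
tail = re.compile(
    r"https?://facebook\.com"
    r"([\w\-.,@?^=%&:/~+#]*([\w\-@?^=%&/~+#]))?"
)

def first_match(text):
    """Hypothetical helper: return the matched text, or None."""
    m = tail.search(text)
    return m.group(0) if m else None

# A colon at the very end is given back, so the match stops before it.
trailing_colon = first_match("See https://facebook.com/page: it is new")

# The same happens with a sentence-ending period.
trailing_period = first_match("Visit https://facebook.com/page.")

# A colon in the MIDDLE of the URL is kept, since only the last
# character is restricted.
inner_colon = first_match("http://facebook.com:8080/x")

print(trailing_colon, trailing_period, inner_colon, sep="\n")
```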
As for why you cannot simply remove the '!' in my erroneous pattern, that is actually part of a negative lookahead that, in the proper pattern, stops a match with the "facebook.com" text. By removing the '!' you turn the negative lookahead into a positive lookahead which requires that the "facebook.com" text appears. However, a lookahead is a "zero width" operator - therefore it will check that the "facebook.com" text appears at that point, but then carry on with the rest of the pattern from the same position regardless.
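Here is a Python sketch of the negative and positive lookahead side by side. The two patterns are simplified stand-ins that I made up for illustration; they are not the earlier erroneous pattern.

```python
import re

# Simplified stand-in patterns (assumed for illustration only).
negative = re.compile(r"https?://(?!facebook\.com)[\w.]+")
positive = re.compile(r"https?://(?=facebook\.com)[\w.]+")

fb = "http://facebook.com"
other = "http://example.com"

# Negative lookahead: rejects facebook.com, accepts anything else.
neg_fb = negative.search(fb) is not None
neg_other = negative.search(other) is not None

# Positive lookahead: requires facebook.com, rejects anything else.
pos_fb = positive.search(fb)
pos_other = positive.search(other) is not None

# The lookahead consumes nothing: a pattern made only of a lookahead
# succeeds with an empty match that ends at position 0.
width = re.match(r"(?=facebook)", "facebook.com").end()

print(neg_fb, neg_other, pos_fb.group(0), pos_other, width)
```

Because the lookahead is zero-width, the '[\w.]+' that follows it still has to match "facebook.com" itself; the lookahead only checked that it was there.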