Strip HTTP links off of text - how?

Last post 10-09-2012, 6:32 PM by Aussie Susan. 9 replies.
  •  10-01-2012, 5:56 PM 86702

    Strip HTTP links off of text - how?

    Hi, people.

    Using EditPad Pro, how do I extract http links, using regex?

    I need to preserve those links, everything else must go.

    '€' or '€' can be used as delimiters.

    Thanks in advance...

  •  10-01-2012, 7:48 PM 86703 in reply to 86702

    Re: Strip HTTP links off of text - how?

    There are a couple of issues that you will need to deal with first.

    The first is that there is no such thing as a "standard" HTTP link, at least not in the sense that you can create a regular expression pattern that will always find 100% of them accurately. Therefore, you need to look at the overall structure of the links that you have to see what common factors they have that you can use.

    The second is how you expect the search to work. You say "everything else must go" which, at least to me, implies that you want to extract the matches into some other file or window. In general (and I must admit I don't know much about EditPad Pro, other than that it is supposed to have a fairly powerful regex engine, though I don't know which one) the regex engine will locate a matching section of text and either let you do something with it or replace it with something else. It sounds like you want to actually match whatever is NOT an HTTP string and delete that (i.e. replace it with a null string).

    One approach might be to start by looking for the 'https?://' and then using the fact that most URLs use alphanumeric characters plus a few others as in (untested):

    https?://[\w./?=]+

    (with the "ignore case" option set if necessary) but there may be other characters you need to include in the character set definition. If that reliably finds the links you are after, you can then start investigating ways of extracting the links or deleting the non-matched text.
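As a rough way to try the "does it find the links reliably" step outside the editor, here is a minimal sketch that assumes Python's `re` module as a stand-in for EditPad Pro's engine. The sample text and the names `LINK`, `sample` and `links` are made up for illustration.

```python
import re

# Pattern from the post above. The character class may need extending
# (for example with '&', '-' or '%') to suit the real data.
LINK = re.compile(r"https?://[\w./?=]+", re.IGNORECASE)

# Hypothetical sample text, invented for this sketch.
sample = "See http://example.com/a/b?x=1 and HTTPS://example.org/page then stop"

# findall returns every non-overlapping match, left to right.
links = LINK.findall(sample)
print(links)
```

If a real link comes back truncated, the character at which it stops is usually the one missing from the character set.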

    Susan

  •  10-02-2012, 10:52 AM 86716 in reply to 86703

    Re: Strip HTTP links off of text - how?

    "...deleting the non-matched text."

    That's precisely the problem I'm facing with EditPad.

    =\

     "It sounds like you want to actually match whatever is NOT an HTTP string and delete that (i.e. replace it with a null string)."

    Yes, that's it.

     

    Thanks for all considerations, Susan.

    Hopefully an EditPad user can show up for rescue...

  •  10-02-2012, 6:51 PM 86718 in reply to 86716

    Re: Strip HTTP links off of text - how?

    OK, if EditPad Pro does have a "fairly modern" regex engine, then try the following on some sample text.

    Firstly, use the pattern I suggested to see if it finds the links reliably (i.e. finds all of the ones that are there and doesn't "find" other text that is not a link). You may want to play with the pattern until this step works correctly.

    Second, create a pattern such as

    (working-pattern)|.

    and, with a replacement string of $1, do a "find and replace all" (preferably with the "singleline" option, i.e. "dot matches newline", turned off). Using my suggested pattern, that would look like

    (https?://[\w./?=]+)|.

    What this will do is match either the link (and capture the text in match group #1) or match any single character (in which case match group #1 will contain the null string). The "replace" step will start by deleting the matched text and then construct the replacement string which will be inserted back into the text. If it matches the link, then the link text will be inserted back; if it matches the "any single character", then the null string will be inserted back which has the overall effect of just deleting the original character.

    There is a side-effect or 2: multiple "links" (especially in the same line) will end up next to each other with no gaps in between; and line breaks in the original text may also be deleted (depending on whether the '.' pattern operator matches the line terminator - hence the suggestion to turn off the "singleline" option that normally controls this) in which case all links will end up in a single line in the text.

    (By the way, you can use a replacement string of "$1\n" or similar to add a line break after each replacement, but this would replace each "deleted" character with a line break as well - that *may* be preferable to everything running together, in that each link would be on a line of its own and you can then make a second pass to delete all blank lines.)
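The "keep the link, delete everything else" replacement can be sketched like this, assuming Python 3.5 or later (where an unmatched group is substituted as an empty string) rather than EditPad Pro. Python writes the first group as `\1` where the post writes `$1`; the sample text and variable names are hypothetical.

```python
import re

# (link)|.  Group 1 holds the link, or nothing when '.' matched instead.
# re.DOTALL makes '.' consume line breaks too, so they are deleted as well.
KEEP_LINKS = re.compile(r"(https?://[\w./?=]+)|.", re.DOTALL)

# Hypothetical sample text, invented for this sketch.
sample = "intro http://example.com/a junk\nmore http://example.org/b?x=1 end"

# Replacement of just group 1: the links end up run together.
joined = KEEP_LINKS.sub(r"\1", sample)

# Variant with a newline after every replacement, as in "$1\n".
one_per_line = KEEP_LINKS.sub(r"\1\n", sample)

# Skipping the empty lines leaves one link per entry.
urls = [line for line in one_per_line.split("\n") if line]
print(joined)
print(urls)
```

Dropping `re.DOTALL` leaves the original line breaks in place, which corresponds to the option setting recommended in the post.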

    Give that a go and see how far you get.

    Susan

  •  10-04-2012, 1:56 PM 86733 in reply to 86718

    Re: Strip HTTP links off of text - how?

    "form" argument "action" http://samplesite.com/base/next/magla?tab=getconnectiontab&connectionsconnshow_limit=12&connectionsconnshow_limitstart=12€€€€€€€€€€€€€

     Hi, Susan.

    I see that using '$1' as a replacement will make a difference, but I'm not sure how...

    Using '$2' deletes the http links, so it's half way done.

    As you can see from the sample above, '(https?://[\w./?=]+)|.' should work with a variation like '(https?://[\w.€?=]+)|.' as the link's delimiter is '€'. I have tried some combinations but got nothing.

    Assuming '$1' works, your suggestion of "$1\n" will certainly do. Thanks for that!

    EDIT: is there a way to prevent the forum from converting my example into html code...?

    EDIT 2: got it...

  •  10-04-2012, 7:26 PM 86734 in reply to 86733

    Re: Strip HTTP links off of text - how?

    Glad to hear it.

    The "$2" will always be a null string as the pattern (at least as I specified it) does not have a 2nd capture group defined.

    I would not include the trailing delimiter in the character set UNLESS you want to include the delimiters in the captured text. If you know that the "€" character will never be used within the URL and will ALWAYS be there as a delimiter, then you could also use patterns such as:

    (https?://[^€]*)|.

    Also you suggested '(https?://[\w.€?=]+)|.' but this will only match "http://samplesite.com" of your example text as it does not include the "/" character in the set. This is the down-side of using my suggestion - as I said, you need to include all of the characters that can be in a valid URL. On the other hand, the suggestion just above gets around this by accepting everything (actually including line breaks etc.) until it either finds the delimiter or the end of the text (which itself could be a down-side).
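To see the difference between the three character sets side by side, here is a small sketch using Python's `re` module as an assumed stand-in for the editor. The sample line is a shortened, made-up variant of the one posted earlier.

```python
import re

# Hypothetical sample, loosely modelled on the line posted earlier.
sample = "action http://samplesite.com/base/next/magla?tab=getconnectiontab&limit=12€€€ tail"

# Set without '/': stops at the first slash after the host name.
narrow = re.search(r"https?://[\w.€?=]+", sample).group(0)

# Original set: gets further, but stops at '&', which is not in the set.
original = re.search(r"https?://[\w./?=]+", sample).group(0)

# Negated set: takes everything up to the first '€' delimiter.
wide = re.search(r"https?://[^€]*", sample).group(0)

print(narrow)
print(original)
print(wide)
```

The negated set only behaves well when the delimiter really does follow every link; otherwise it runs on to the end of the text.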

    Susan

  •  10-08-2012, 11:43 AM 86776 in reply to 86734

    Re: Strip HTTP links off of text - how?

    Ahem...

    IT WORKED, Susan! Thanks a bunch!

    If this is not bothering you too much, would you mind explaining the pattern, now?

    -the remaining problem is to remove excessive '\n' in the output. Do you know of a method for this?

  •  10-08-2012, 7:09 PM 86777 in reply to 86776

    Re: Strip HTTP links off of text - how?

    I assume you mean the pattern '(https?://[\w./?=]+)|.'

    (         - start a capture group (as this is the first, it will be numbered 1)
    http    - match the literal characters
    s?      - match the literal "s" character "0 or 1 times" (i.e. the '?' quantifier means the previous item is optional)
    ://      - more literal characters to be matched
    [         - the start of a character set definition
    \w       - a short-cut way to specify the character set definition of all alphanumeric characters plus the "_" character
    .         - as this is defined inside a character set definition, this is a 'literal' period character
    /?=     - more literal characters that make up the character set
    ]+       - the ']' ends the character set definition; the '+' is a quantifier that means "repeat the previous item 1 or more times, matching as many times as possible"
    )         - closes the capture group
    |         - the alternation operator
    .         - as this is used outside a character set definition, this is the short-cut operator to match any character (except the '\n' character unless the "singleline" option is set)
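The same item-by-item breakdown can be written into the pattern itself in engines that support a "free-spacing" mode. This sketch assumes Python's `re.VERBOSE` flag; whether EditPad Pro is used the same way is not covered here, and the name `ANNOTATED` is made up.

```python
import re

# In verbose mode, whitespace and '#' comments outside a character set
# are ignored, so each piece of the pattern can carry its own note.
ANNOTATED = re.compile(r"""
    (               # start capture group 1
      http          # literal characters
      s?            # optional "s"
      ://           # more literal characters
      [\w./?=]+     # one or more characters from the set
    )               # close capture group 1
    |               # alternation: otherwise try the right-hand side
    .               # any single character (not a newline by default)
""", re.VERBOSE)

link_match = ANNOTATED.match("http://a.b/c x")
char_match = ANNOTATED.match("x")
print(link_match.group(1), char_match.group(1))
```

Python reports a group that took no part in the match as `None`, which plays the role of the "null string" in the explanation.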

    So much for the item by item breakdown - what does all of this *mean*?

    Let's start with the alternation operator: this is a very low precedence operator in regex patterns and so its effect is very far reaching (and is often the source of problems until you get the hang of how it works). What it does is to take everything on the left of the operator and try to match that sub-pattern to the text. If it succeeds, then it will ignore whatever is on the right and carry on. If the left-hand part fails to match, then the right-hand sub-pattern will be attempted.

    Therefore what we are trying to do here is to match your URL (the left-hand pattern) and if that doesn't work, then simply match any character. As you can see, this will always "succeed", but we can tell the difference because we have surrounded the URL part with parentheses - if the first part matches a URL then the matched text will be captured in match group #1; if the second part matches, then match group #1 will contain a null string. As it turns out, this is exactly what we want for the replacement operation.

    The way the replacement operation works is to delete ALL of the matched text and then construct a replacement string which is inserted back into the text. Therefore we will either start by deleting the URL or the matched single character. The replacement string is made up of the text captured in match group #1 and a newline character: if we matched the URL then we are going to put the URL characters (plus the newline) back into the text; if it was a single character then we will insert the null string plus the newline into the text.

    This means that we will end up with a series of blank lines with the occasional line with a URL on it.
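The match-by-match behaviour described above can be watched directly. This is a sketch only, assuming Python's `re` module (3.5 or later) in place of the editor; the sample text is invented.

```python
import re

PATTERN = re.compile(r"(https?://[\w./?=]+)|.")

# Hypothetical sample text, invented for this sketch.
sample = "ab http://example.com/x cd"

# Walk over the matches and report which side of the '|' succeeded.
for m in PATTERN.finditer(sample):
    kind = "link" if m.group(1) is not None else "char"
    print(kind, repr(m.group(0)))

# Replacement of group 1 plus a newline: each single character turns
# into a bare newline, each link is put back followed by a newline.
result = PATTERN.sub(r"\1\n", sample)
print(repr(result))
```

The three characters before the link and the three after it each leave a blank line behind, which is where the excess newlines come from.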

    To remove the excess blank lines, you need to look at how the blank lines appear in the text. A line either starts after the beginning of the text or after a newline character. A line ends either with the end of the text or a newline character. Therefore a blank line will be seen as two consecutive newline characters.

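    To delete the blank lines you need a search pattern of '\n\n+' (two or more consecutive newlines) and a replacement string of a single '\n'. (Replacing '\n\n' with a null string would also swallow the newline that ends a URL line, so neighbouring URLs could end up joined together.)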

    If the line contains a URL, then the newline characters will have text in between them and so will not be found by this pattern and so those lines will be left unaltered.

    Of course life is never quite that easy. For a start, some editors use "\r\n" as the line terminator so you will need to alter the pattern accordingly. Also, you may end up with some blank lines (especially at the start and end of the text) - you can either delete these by hand if there are only a few or you can make another pass over the text with the replacement pattern.
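A minimal sketch of the blank-line cleanup, again assuming Python's `re` module rather than EditPad Pro, with made-up sample data. It collapses each run of line terminators into one newline instead of deleting pairs, because replacing every '\n\n' pair with nothing can join two URLs when an even number of newlines sits between them.

```python
import re

# Hypothetical output of the "$1\n" pass: links buried in blank lines.
messy = "\n\n\nhttp://example.com/x\n\n\n\nhttp://example.org/y\n\n"

# Deleting pairs outright: the four newlines between the links vanish.
naive = re.sub(r"\n\n", "", messy)

# Collapse each run of terminators (\n or \r\n) into a single \n.
collapsed = re.sub(r"(?:\r?\n)+", "\n", messy)

# Drop the one blank line left at the very start of the text.
cleaned = collapsed.lstrip("\n")

print(repr(naive))
print(repr(cleaned))
```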

    Susan

  •  10-09-2012, 12:27 PM 86794 in reply to 86777

    Re: Strip HTTP links off of text - how?

    Fantastic, Susan. I needed extra cleanup for stuff like '?start=130' and per your comments came up with this, which I assume is right...

    ([?]start=[0-9]+)

    ([?]showall=[0-9]+)

    I understand your comment regarding '\r\n' vs '\n\n' as I came across those variations previously.

    So thank you very much!

  •  10-09-2012, 6:32 PM 86795 in reply to 86794

    Re: Strip HTTP links off of text - how?

    The "?" as a literal is quite hard to handle and your approach would certainly work. The alternative is '\?' where the '\' at the start tells the pattern parser to treat the next character as a literal (this is generally true - you will often see '\\' and '\+' and the like).
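Both ways of writing a literal question mark can be checked against each other. The sketch below assumes Python's `re` module and a made-up URL.

```python
import re

# Hypothetical URL carrying the kind of suffix to be cleaned up.
url = "http://example.com/list?start=130"

# '?' inside a character set is a literal question mark.
with_class = re.sub(r"[?]start=[0-9]+", "", url)

# '\?' escapes the quantifier so that it is treated as a literal.
with_escape = re.sub(r"\?start=[0-9]+", "", url)

print(with_class, with_escape)
```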

    There is a subtle difference between the two approaches (which won't impact on you but I'm mentioning it for some possible future). A character set in the pattern requires that the text character being tested is converted to a bit-map which can then be compared to the bit-map of the character set. When the character set has multiple options in it (as in '\w' or '\d' or even '.') then this is often considered "worth it".

    However, when there are only one or two characters in the set, then it may be better to use actual character comparisons. Therefore, instead of '[12]' and converting the text to a bitmap for comparison, it may well be better to use '(1|2)' which makes up to 2 character comparisons with the text character.

    The trade-off is between the time taken to convert the text character to a bit-map at "runtime" plus the bit-map comparison vs the number of direct character comparisons required. The time when this sort of thing becomes important is when the character set is buried deep in the pattern that has lots of repetitions and/or backtracking - as I say, not in your situation.
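Which form is quicker depends on the engine, so it is worth measuring rather than assuming. Here is a rough sketch using Python's `re` and `timeit` modules; the results say nothing about EditPad Pro's engine, and the test text and repeat count are arbitrary choices.

```python
import re
import timeit

CLASS = re.compile(r"[12]")
ALTERNATION = re.compile(r"(?:1|2)")  # non-capturing form of '(1|2)'

# Arbitrary test text: the ten digits repeated 100 times.
text = "0123456789" * 100

# Both patterns should find exactly the same characters.
same = CLASS.findall(text) == ALTERNATION.findall(text)

# Time each pattern over the same text; the numbers vary by machine.
class_time = timeit.timeit(lambda: CLASS.findall(text), number=200)
alt_time = timeit.timeit(lambda: ALTERNATION.findall(text), number=200)

print(same, class_time, alt_time)
```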

    One of my personal preferences is to use the "short-cut" operators where possible. Therefore, instead of '[0-9]' I prefer to use '\d' as it means I can scan the pattern and immediately say 'any digit', whereas with the character set I have to stop and check for the exact range and any possible gaps in the range.

    Susan
