
My regexp is getting broken down

Last post 04-19-2011, 8:03 AM by jiri2. 7 replies.
  •  04-17-2011, 8:32 PM 80906

    My regexp is getting broken down

     In this example 
    (?<=\'\)\" href\=\").*?(?=\"\>\<EM\>)
     I believe .*? is supposed to cover my input values. The rest is the regexp that goes before and after my values. The problem is that when I insert non-alphanumeric values, it treats the regexp as finished and breaks down. Characters like ':' or '/' break my script.
     
    How can I isolate the input value to be parsed without it affecting the rest of the regexp? thanks
     
    Also happens with this expression, if it helps explain my problem:
     (?<=\" href\=\").*?(?=\<\/A\>\<\/H3\>\<BUTTON class\=vspib type\=submit\>\<\/BUTTON\>\<\/SPAN\>) 
  •  04-17-2011, 11:21 PM 80930 in reply to 80906

    Re: My regexp is getting broken down

    I'm not really sure what you mean by phrases such as "break my script down"; in fact I'm having trouble understanding what your question is all about.

    I would suggest that you look at the posting guidelines in the sticky note at the beginning of this forum as to how to ask questions and what information you need to provide.

    (By the way, in what follows I've taken out all of the extraneous back-slash characters. They are not necessary in the pattern; they may be needed to express the pattern in your programming language, but you have not mentioned what that is so I can't tell. They simply make it hard to understand your pattern.)

    Anyway, the way your first regex pattern is supposed to work is:

    (?<='\)" href=")  - this is a "lookbehind" and makes sure that the characters that precede the match are exactly    ')" href=" - I assume this means that what you want to match is a value for the 'href' attribute
    .*?      - this will match any character zero or more times, matching as few characters as possible
    (?="><EM>)   - this is a lookahead and makes sure that the characters   "><EM>  occur immediately after any matched characters

    From this I'm assuming that you are wanting to match something like:

    <sometag attr1="(abc')" href="myURLHere"><EM> more text

    and you are wanting the myURLHere text to be matched.
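A minimal sketch of the breakdown above, assuming Python's `re` module (the thread never says which regex engine Zennoposter uses, so behaviour there may differ). The sample tag and URL are made up for illustration; the URL deliberately contains ':', '/', '?' and digits.

```python
import re

# The first pattern with the unnecessary backslashes removed.
# Only the ')' still needs escaping, because it is a regex metacharacter.
pattern = re.compile(r"""(?<='\)" href=").*?(?="><EM>)""")

# Hypothetical input text, modelled on the example line above.
sample = """<sometag attr1="(abc')" href="http://example.com/pages?id=42"><EM> more text"""

match = pattern.search(sample)
url = match.group(0) if match else None
print(url)
```

In this engine the lazy `.*?` stops at the first place where the lookahead `"><EM>` succeeds, so only the attribute value is returned and the surrounding text is never part of the match.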

    The '.' operator should match any character (with the possible exception of the "newline" character and this is controlled by the "singleline" or "dot matches newline" modifier setting - again something that you have not told us). Therefore, without seeing the text you are trying to scan (both what is working and what is not) it is very hard to see what could be going wrong.
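To illustrate the newline caveat, here is a small sketch, again assuming Python's `re` module, where the "dot matches newline" setting is the `re.DOTALL` flag. The input string and the shortened pattern are invented for the demonstration.

```python
import re

# Hypothetical attribute value that is split across two lines.
text = 'href="first\nsecond"'

# Shortened pattern: lookbehind for href=", lazy dot, lookahead for the closing quote.
pattern = r'(?<=href=").*?(?=")'

# Without the flag, '.' refuses to cross the newline, so nothing matches.
default_match = re.search(pattern, text)

# With DOTALL, '.' also matches '\n' and the whole value is found.
dotall_match = re.search(pattern, text, re.DOTALL)

print(default_match)
print(repr(dotall_match.group(0)))
```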

    Perhaps you could also tell us what you are trying to do?

    Susan

     

    PS, the second pattern can be written as:

     (?<=" href=").*?(?=</A></H3><BUTTON class=vspib type=submit></BUTTON></SPAN>) 

    without all of the unnecessary backslashes.
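A quick way to check that the two spellings are equivalent is to run both against the same text. This sketch assumes Python's `re` module, where a backslash before punctuation simply means the literal character; that is not universal (for example, `\<` is a word boundary in GNU-style regexes), so the result may not carry over to other engines. The sample HTML is hypothetical.

```python
import re

# The second pattern as originally posted, with every symbol escaped.
escaped = r"""(?<=\" href\=\").*?(?=\<\/A\>\<\/H3\>\<BUTTON class\=vspib type\=submit\>\<\/BUTTON\>\<\/SPAN\>)"""

# The same pattern with the unnecessary backslashes removed.
plain = r"""(?<=" href=").*?(?=</A></H3><BUTTON class=vspib type=submit></BUTTON></SPAN>)"""

# Hypothetical text shaped so that both the lookbehind and the lookahead can succeed.
sample = ('<SPAN><H3><A id="r1" href="http://example.com/a:b/c">Title</A></H3>'
          '<BUTTON class=vspib type=submit></BUTTON></SPAN>')

escaped_result = re.search(escaped, sample).group(0)
plain_result = re.search(plain, sample).group(0)

# The match runs from the href value up to </A>, so it includes '">Title'.
print(plain_result)
print(escaped_result == plain_result)
```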

  •  04-18-2011, 3:38 AM 80934 in reply to 80930

    Re: My regexp is getting broken down

    thanks for your reply.

     

    Well, I'm trying to data-mine Google using a tool called Zennoposter, which applies regexps to the text of a web page. I look at the page and try to work out which HTML text on the page corresponds, in the real interface, to the search box, so that I can insert the text I'm looking for. Let's say I use:

     

    (?<=\'\)\" href\=\").*?(?=\"\>\<EM\>)

     

    My problem is this: I've inserted the text that goes before and after the .*? (please correct me if I'm wrong), but the .*? only seems to cover letters of the alphabet. Whenever I add a number or any other character on the keyboard, it doesn't work. So I guess my question is: what regexp covers things like :, / or numbers?

     

    Now for example, I try to scrape 

     

    ''inurl: mysite/pages''

     

    I wouldn't know what regexp to put between my lookbehind and lookahead to cover the characters I'm actually looking for.

     

    Something like this:

     

    [a-z, A-Z ,0-9 ]*/.*

     

    Only this isn't working, I believe because I still haven't properly understood or learnt regexp.

  •  04-18-2011, 3:45 AM 80935 in reply to 80934

    Re: My regexp is getting broken down

    Sorry, that should have been: 'what in real life would equate to the websites that Google returns in its results', as I'm looking for the links in the results, not the values in the search box.

     

    This is what intrigues me, though: the attribute value I input into the search box shouldn't affect the results I'm scraping, and yet my template fails when I do it.

     

    So I believe I put the question wrongly. The issue is indeed in the search box, which makes it a lot more inexplicable to me than I thought.

     

     

  •  04-18-2011, 4:10 AM 80936 in reply to 80935

    Re: My regexp is getting broken down

    Yes, just to summarise: it was, and is, an HTML tag issue. Thanks anyway. I've changed the tag from input:text to input:id or input:name and it now covers the numbers, but I haven't found anything to cover things like : or / yet. If you know a bit about HTML tags, maybe you know? :-) http://www.w3schools.com/tags/tag_input.asp
  •  04-18-2011, 6:57 PM 80951 in reply to 80936

    Re: My regexp is getting broken down

    In a "normal" regex, the '.' will match all characters (with the possible exception of the newline character as described earlier). Therefore this includes any of the special characters that you mention.
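As a rough check of that claim, the sketch below tests '.' against every printable ASCII character. It assumes Python's `re` module with no flags set; other regex variants may behave differently, which is exactly the caveat raised in the next paragraph.

```python
import re
import string

# Collect every printable ASCII character that a bare '.' fails to match.
not_matched = [ch for ch in string.printable if re.fullmatch(".", ch) is None]

# Digits, ':', '/' and all other punctuation match; only the newline is left over.
print(not_matched)
```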

    However, one of the biggest "traps" in regex patterns is that not all regex variants are the same - some even define their own pattern syntax. If the '.' operator is not matching the colon or slash characters for you, then it is possible that the regex variant is making its own interpretation of the '.' operator, or that there are some other settings somewhere that are altering the way the application processes your pattern. I tried a quick "Google" but I can't see the regex syntax for your program listed anywhere, nor can I see a reference to a "standard" regex library or engine, so I have no idea what is actually happening here. (On the other hand I did find your other post on the CodeCall forum).

    As the underlying text is actually HTML, the recommended way to handle this is to use the HTML DOM libraries to parse the text and then use an XPath query to pick out what you need. Regexes are notoriously bad at handling HTML (and XML), but the HTML (or XML) DOM is specifically designed for this type of work and handles it with ease. I would recommend Michael Ash's blog entry at http://regexadvice.com/blogs/mash/archive/2011/01/23/Regex-and-HTML.aspx that will explain this in detail.
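For a flavour of the parser-based approach, here is a minimal sketch using Python's standard-library `html.parser` module. Note that this is an event-based HTML parser rather than a DOM with XPath, swapped in because it needs no third-party packages. The `HrefCollector` class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical fragment resembling a search-result link.
page = '<H3><A onclick="f(\'x\')" href="http://example.com/a:b/c"><EM>title</EM></A></H3>'

collector = HrefCollector()
collector.feed(page)
print(collector.hrefs)
```

Because the parser understands attributes, the link is found regardless of which characters it contains or which tags happen to follow it, so no lookbehind or lookahead text has to be kept in sync with the page layout.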

    Susan

  •  04-19-2011, 7:54 AM 80966 in reply to 80951

    Re: My regexp is getting broken down

    Thanks. Yes, that's me at CodeCall and other forums...

     

    I think I'm going nuts wondering whether an issue is an HTML attribute value problem or a regex issue lol...

     

    What kinds of regexes are there? Then I can ask at the Zennoposter forum which one they're using. Cheers

     

    Just to add: thanks for the link to that entry, but I believe I'm already using the HTML DOM of the webpage I'm trying to parse, and not just plain HTML. For the moment the problem lies in my own incompetence instead :-)

     

    Indeed, my webpages are available as DOM HTML, source HTML and text, but as I'm looking to parse links I'm pretty stuck and pretty much need to get into the DOM.

     

  •  04-19-2011, 8:03 AM 80968 in reply to 80951

    Re: My regexp is getting broken down

    cant delete but want to