
Regex help

Last post 03-16-2010, 9:07 PM by Peadarin. 5 replies.
  •  03-14-2010, 12:33 PM 60897

    Regex help

    Hi,

       I am writing a regex in Java using Eclipse and need a bit of help. This is part of the web crawler code for a domain. From all the web pages of a domain, I need to fetch pages with addresses like http://www.abc.com/~<only characters A-Z a-z after the tilde>

    Here is what I have written, but it does not restrict the results to addresses containing "~". It accepts everything after ".com/":

    http://www.abc.com/+~*[a-zA-Z]+

    Thanks in advance,

    Stan 

     

     

  •  03-14-2010, 8:10 PM 60920 in reply to 60897

    Re: Regex help

    Turn the last '+' quantifier into '*'. The '+' means "match 1 or more, matching as many as possible", which means that it must match at least one alphabetic character. The '*' means "match 0 or more, matching as many as possible", which allows there to be no alphabetic characters after the tilde.

    Also check if you really need the '+' between the '/' and the '~' in the pattern - do you really want to allow for multiple slashes as in "http://www.abc.com////////////////~abs"?
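
    To make the difference between the two quantifiers concrete, here is a minimal sketch that runs the pattern from the original post next to a variant with the last '+' turned into '*'. The class name, the helper fullMatch and the test URLs are made up for illustration, and whole-string matching via matches() is an assumption about how the crawler applies the pattern.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Pattern exactly as posted (dots left unescaped, as in the post).
    static final Pattern ORIGINAL = Pattern.compile("http://www.abc.com/+~*[a-zA-Z]+");
    // Same pattern with the final '+' changed to '*'.
    static final Pattern STARRED = Pattern.compile("http://www.abc.com/+~*[a-zA-Z]*");

    // Hypothetical helper: true only if the whole URL matches the pattern.
    static boolean fullMatch(Pattern p, String url) {
        return p.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.abc.com/~",        // tilde with no letters after it
            "http://www.abc.com/~abs",     // tilde followed by letters
            "http://www.abc.com/index",    // no tilde at all
            "http://www.abc.com////~abs"   // several slashes before the tilde
        };
        for (String url : urls) {
            System.out.println(url
                + "  original=" + fullMatch(ORIGINAL, url)
                + "  starred=" + fullMatch(STARRED, url));
        }
    }
}
```

    Note that "~*" makes the tilde itself optional, which is why the URL without a tilde is still accepted by both variants.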

    Susan

  •  03-16-2010, 12:26 AM 61068 in reply to 60920

    Re: Regex help

    Thanks for the reply, Susan. But how do I enforce that I only get webpages which have a tilde "~" after the "\"? I want all webpages of the format www.abc.com\~xyz

    My crawler is returning the webpages, but regex pattern matching is not able to filter results as expected above. Here is my expression:

    String  regexp  = "http://www.utdallas.edu/\\~+[a-zA-Z]*"; 
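
    As a rough check of what this string literal actually asks the regex engine for, the sketch below compiles it and tries a few URLs. In a Java string literal "\\~" reaches the regex engine as \~, which is just a literal tilde, so no backslash is required in the URL. The class name, the helpers and the test URLs are assumptions for illustration; it is also an assumption that the crawler might use either matches() or find().

```java
import java.util.regex.Pattern;

class TildeEscapeDemo {
    // The string literal exactly as posted.
    static final Pattern POSTED = Pattern.compile("http://www.utdallas.edu/\\~+[a-zA-Z]*");

    // Hypothetical helper: the whole URL must match.
    static boolean wholeUrl(String url) {
        return POSTED.matcher(url).matches();
    }

    // Hypothetical helper: a match anywhere inside the URL is enough.
    static boolean anywhere(String url) {
        return POSTED.matcher(url).find();
    }

    public static void main(String[] args) {
        // Shows the regex the engine sees after Java string unescaping.
        System.out.println("regex seen by the engine: " + POSTED.pattern());

        String[] urls = {
            "http://www.utdallas.edu/~abc",     // tilde plus letters
            "http://www.utdallas.edu/~",        // tilde only ('*' allows zero letters)
            "http://www.utdallas.edu/\\~abc",   // URL containing a real backslash
            "http://www.utdallas.edu/~abc123"   // letters followed by digits
        };
        for (String url : urls) {
            System.out.println(url
                + "  matches()=" + wholeUrl(url)
                + "  find()=" + anywhere(url));
        }
    }
}
```

    If the crawler uses find() rather than matches(), a URL such as the last one is accepted because only its leading part has to match, which may explain results that look unfiltered.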

  •  03-16-2010, 6:18 PM 61239 in reply to 61068

    Re: Regex help

    I must be having a "blonde day" today but I'm not really sure what you are trying to achieve.

    I understood that you were not able to match URLs of the form "http://www.abc.com/~" but from your last post it would seem that you don't want those - I'm confused.

    Can you please provide a list of sample URLs that include those you want to match and those you don't, telling us which is which.

    Also, in your OP you explicitly use the "www.abc.com" as part of your pattern, but in your last post you also use "www.utdallas.edu". Is this part fixed or variable or what?

    Susan

  •  03-16-2010, 7:56 PM 61247 in reply to 61239

    Re: Regex help

    Sorry for the confusion. I want all URLs of the type http://www.abc.com/~xxx

    The only variable part here is what follows "~", i.e. xxx, which must consist only of alphabetic characters. Here are examples:

    1) http://www.abc.com/~employee -->Valid

    2) http://www.abc.com/~role -->Valid

    3) http://www.abc.com/~05dept -->Invalid

    4) http://www.abc.com/~English001 -->Invalid
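
    The four examples above can be checked with a small sketch like the following. The pattern is one possible reading of the requirement (fixed prefix, one tilde, then letters only), not a pattern taken from the thread. The class and method names are made up, and the dots are escaped here on the assumption that only a literal "." should match.

```java
import java.util.regex.Pattern;

class SampleUrlCheck {
    // Assumed pattern: fixed host, a single '/', a tilde, then one or more letters.
    static final Pattern TILDE_PAGE = Pattern.compile("http://www\\.abc\\.com/~[a-zA-Z]+");

    // Hypothetical helper: matches() requires the entire URL to fit the pattern,
    // so trailing digits make the URL invalid.
    static boolean isValid(String url) {
        return TILDE_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.abc.com/~employee",
            "http://www.abc.com/~role",
            "http://www.abc.com/~05dept",
            "http://www.abc.com/~English001"
        };
        for (String url : urls) {
            System.out.println(url + " --> " + (isValid(url) ? "Valid" : "Invalid"));
        }
    }
}
```

    Using matches() rather than find() matters for example 4: find() would accept it because the leading part, up to "~English", fits the pattern.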

    Hope this clarifies.

    Appreciate your help.

     

    Thanks,

    Stan 

  •  03-16-2010, 9:07 PM 61249 in reply to 61247

    Re: Regex help

    Nice URI/URL with \ instead of /. Such URLs do exist, but they are not W3C standard. Do not forget to read the posting guidelines about providing the complete examples you want to use. And yes, some web servers on some \\-friendly operating systems use UNC names after the protocol. Using absolute file names is an open door to hackers.

    The following class matches either character:

    [/\\]

    You can also use (/|\\)

    Read as a raw regex, what you typed was /\\, which always matches two characters: a / followed by a \

    I am not sure which one of the 2 following expressions will provide the filter you want.

    http://www.abc.com/\\*~[a-zA-Z]+

    http://www.abc.com[/\\]+~[a-zA-Z]+
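
    One way to compare the two expressions is to run both against a few URLs, as in the sketch below. It assumes the expressions above are written as raw regexes, so each backslash is doubled once more inside the Java string literals. The class name, the helper and the test URLs are made up for illustration.

```java
import java.util.regex.Pattern;

class SlashVariantsDemo {
    // Raw regex: http://www.abc.com/\\*~[a-zA-Z]+
    // One '/', then zero or more backslashes, then the tilde.
    static final Pattern FIRST = Pattern.compile("http://www.abc.com/\\\\*~[a-zA-Z]+");

    // Raw regex: http://www.abc.com[/\\]+~[a-zA-Z]+
    // One or more characters that are each '/' or '\', then the tilde.
    static final Pattern SECOND = Pattern.compile("http://www.abc.com[/\\\\]+~[a-zA-Z]+");

    // Hypothetical helper: true only if the whole URL matches the pattern.
    static boolean fullMatch(Pattern p, String url) {
        return p.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.abc.com/~employee",     // forward slash only
            "http://www.abc.com/\\~employee",   // slash followed by a backslash
            "http://www.abc.com\\~employee",    // backslash only
            "http://www.abc.com/~05dept"        // digits after the tilde
        };
        for (String url : urls) {
            System.out.println(url
                + "  first=" + fullMatch(FIRST, url)
                + "  second=" + fullMatch(SECOND, url));
        }
    }
}
```

    The two differ only on the backslash-only URL: the first expression still requires a forward slash, while the second accepts any mix of the two separators.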

     
