
First two characters in each string?

Last post 06-06-2012, 7:31 PM by Aussie Susan. 8 replies.
  •  05-29-2012, 12:02 PM 85324

    First two characters in each string?

    delicate
    private
    solid

    company

     

    When I try \w{2} I get too many matches, and when I try \w{2}(?=\s) I get only the last two characters of the first three strings. How can I get the first two characters of each string? Thanks.

     

     

  •  05-29-2012, 12:05 PM 85325 in reply to 85324

    Re: First two characters in each string?

    try

    ^\w{2}

    If you use the .NET flavour of regex, turn on multiline mode so that ^ matches at the start of every line:

    (?m)^\w{2}
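
A minimal sketch of the difference the (?m) flag makes, written in Python's re module as a stand-in for .NET (the inline flag and the anchor behave the same way in this case, as far as I know). The sample text is the four strings from the question; the variable names are only for illustration.

```python
import re

# The four strings from the question, one per line (with the blank line kept).
text = "delicate\nprivate\nsolid\n\ncompany"

# Without multiline mode, ^ anchors only at the very start of the input.
only_first = re.findall(r"^\w{2}", text)

# With (?m), ^ also anchors right after every line break.
first_two = re.findall(r"(?m)^\w{2}", text)

print(only_first)  # ['de']
print(first_two)   # ['de', 'pr', 'so', 'co']
```

The blank line contributes nothing because \w{2} needs two word characters to match.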

  •  05-29-2012, 9:05 PM 85327 in reply to 85325

    Re: First two characters in each string?

    Thanks, yes I'm using the .NET and that worked fine.

     

    While I'm here, how can I erase all the spaces in the strings below to make those strings without spaces?

     

     

    camera bags for women
    theory
    theory of a deadman

    theory of relativity

     

     

  •  05-29-2012, 11:08 PM 85328 in reply to 85327

    Re: First two characters in each string?

    Try a pattern that is a single space character (rather hard to show in this forum, but it is ' ') or a pattern such as '\x20', and use the "replace" function with an empty string as the replacement string.

    I must admit that both of your questions are probably better handled with the .NET string functions Substring() and Replace(). The regex functions have quite a memory footprint and also quite a processing overhead (you need to parse the pattern string, build the internal state machine, and then process the text using the state machine).
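
To make the comparison concrete, here is a rough sketch in Python rather than .NET, so str.replace and slicing stand in for Replace() and Substring(). It shows the regex route and the plain string route producing the same result on the strings from the question.

```python
import re

lines = [
    "camera bags for women",
    "theory",
    "theory of a deadman",
    "theory of relativity",
]

# Regex route: replace every space (written as \x20) with an empty string.
via_regex = [re.sub(r"\x20", "", s) for s in lines]

# Plain string route: no pattern to parse or compile.
via_replace = [s.replace(" ", "") for s in lines]

# The earlier "first two characters" question, done by slicing instead of regex.
first_two = [s[:2] for s in lines]

print(via_replace)
# ['camerabagsforwomen', 'theory', 'theoryofadeadman', 'theoryofrelativity']
```

Both routes agree here; the string functions simply skip the pattern-compilation step described above.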

    Susan

  •  06-05-2012, 1:31 AM 85377 in reply to 85328

    Re: First two characters in each string?

    Thanks, yes you're right. It is best done with the replace macro.

     

    While I'm here though, I came across another issue this morning. How could I erase a string that's more than 30 characters long? If the characters were all digits or all word characters, I'd know what to do: it would be \d{30} or \w{30}. But my problem is that the characters are of all kinds and types, alphanumeric and weirdo characters like _ or /. I'm still on .NET.

  •  06-05-2012, 12:19 PM 85383 in reply to 85377

    Re: First two characters in each string?


    As always, a picture is better than 1000 words: can you send an example of test data that should be cleaned of those 30+ character strings? Even better, send us
    a sample that has *bad* and *good* strings mixed up, and show in bold which ones need to be deleted.
  •  06-05-2012, 12:54 PM 85384 in reply to 85383

    Re: First two characters in each string?

     

    sure, thanks.

     

    It's gotten more complicated. I need to erase all the strings that start with <IMG alt=, but I also need to do the following. The string that commences with ''Bars'' ends at ''Barstools''; it is one string, by the way, even though it looks like two in this format. I need to keep the ''Bars'' and the ''Barstools''. In the one that begins with ''Wreaths'', I need to keep the ''Mirrors'', and the same goes for Sconces and Pendants. Because I need ''Mirrors'' and ''Barstools'' in new strings on their own, I'm thinking I need to isolate the part of the string that starts at </A>, and then isolate the ''Barstools'', ''Mirrors'' and ''Pendants'' with another regex.

     


    Buffets
    Bars</A>, <A href="/Barstools-Home-Bar-Furniture-D%C3%A9cor/b/ref=amb_link_84791371_24?ie=UTF8&amp;node=3733851&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_s=gp-center-5&amp;pf_rd_r=0BCEBJVB7CD59RFDJKBS&amp;pf_rd_t=101&amp;pf_rd_p=1289954142&amp;pf_rd_i=1057794">Barstools
    All dining room furniture

    Home Improvement : Tools &amp; Home Improvement Value Center


    <IMG alt="Dewalt DPG541C" src="http://ecx.images-amazon.com/images/I/31Qr4wkk4HL._SL75_.jpg" 

    Wreaths</A>, <A href="/b/ref=amb_link_85759371_32?ie=UTF8&amp;node=3736371&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_s=gp-center-5&amp;pf_rd_r=111YJ7178MDNASBF4S8B&amp;pf_rd_t=101&amp;pf_rd_p=1343281822&amp;pf_rd_i=1057794">Mirrors

    Sconces</A> &amp; <A href="/b/ref=amb_link_85759371_57?ie=UTF8&amp;node=3736681&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_s=gp-center-5&amp;pf_rd_r=111YJ7178MDNASBF4S8B&amp;pf_rd_t=101&amp;pf_rd_p=1343281822&amp;pf_rd_i=1057794">Pendants

     

    --------------------

     

    Then there are further variations like this one

    5 x 8</A>, <A href="/s/ref=amb_link_85759371_65?ie=UTF8&amp;page=1&amp;rh=n%3A1063298%2Cp_n_size_browse-bin%3A369538011&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_s=gp-center-5&amp;pf_rd_r=111YJ7178MDNASBF4S8B&amp;pf_rd_t=101&amp;pf_rd_p=1343281822&amp;pf_rd_i=1057794">6 x 9

     

    Even though I'm not that keen on the 5 x 8 and 6 x 9, I think the only way to handle it consistently with the Bars and Barstools kind of string that I do want is to apply a similar regex here too, erasing what's between the </A> and the >. So basically, between the </A> and the >, there are all sorts of variations going on.
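
One possible way to keep only the two labels on either side of the link markup is a pair of lookarounds, sketched here in Python's re as a stand-in for .NET. The href values are shortened, made-up versions of the ones above, and the function name keep_labels is hypothetical. This assumes each line holds at most one </A> ... "> section, which is what the samples show.

```python
import re

# Text before the first </A>, and text after the final "> on the line.
BEFORE = re.compile(r'^.*?(?=</A>)')
AFTER = re.compile(r'(?<=">)[^<>]*$')

def keep_labels(line):
    """Return the label(s) worth keeping from one line of the sample data."""
    if line.lstrip().startswith("<IMG alt="):
        return []                       # image lines are dropped entirely
    before = BEFORE.search(line)
    after = AFTER.search(line)
    if before is None or after is None:
        return [line.strip()]           # plain line such as "Buffets"
    return [before.group().strip(), after.group().strip()]

bars = keep_labels('Bars</A>, <A href="/barstools/ref=example?ie=UTF8&amp;node=1">Barstools')
sizes = keep_labels('5 x 8</A>, <A href="/s/ref=example?ie=UTF8&amp;page=1">6 x 9')
plain = keep_labels("Buffets")
image = keep_labels('<IMG alt="Dewalt DPG541C" src="http://example.invalid/x.jpg" ')

print(bars)   # ['Bars', 'Barstools']
print(sizes)  # ['5 x 8', '6 x 9']
```

Any line that departs from this shape (two links, a missing quote) would need a different pattern.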

  •  06-05-2012, 2:51 PM 85385 in reply to 85384

    Re: First two characters in each string?

    The parse builder I'm using lets me extract a string from a file if it contains a certain regular expression, but it doesn't let me extract only what's within my regular expression. This means (?<=\<\/A\>).*(?=\"\>) erases what I want to keep. However, .*(?=\<\/A\>) and (?<=\"\>).* will probably do what I want, so I don't have an issue right now. But just so I can learn something, I would still like to know the solution to what I initially thought the problem was: how would I erase each string that is longer than 20 characters, whether alphanumeric or of any other kind, and keep the shorter ones? Thanks.

     

    An example would be

     

    Wreaths</A>, <A href="/b/ref=amb_link_85759371_32?ie=UTF8&amp;node=3736371&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_s=gp-center-5&amp;pf_rd_r=111YJ7178MDNASBF4S8B&amp;pf_rd_t=101&a

     

    Home Improvement : Tools &amp; Home Improvement Value Center

    <IMG alt="Dewalt DPG541C" src="http://ecx.images-amazon.com/images/I/31Qr4wkk4HL._SL75_.jpg" 

    Buffets
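
For the "longer than 20 characters of any kind" question, the dot matches any character except a line break, so a bounded quantifier on it covers letters, digits and symbols alike. A small sketch in Python's re follows (the {n,} quantifier is also part of .NET's syntax, as far as I know). The LIMIT name and the extra "Barstools" entry are assumptions for illustration.

```python
import re

LIMIT = 20  # keep strings of at most this many characters

lines = [
    "Buffets",
    "Home Improvement : Tools &amp; Home Improvement Value Center",
    '<IMG alt="Dewalt DPG541C" src="http://ecx.images-amazon.com/images/I/31Qr4wkk4HL._SL75_.jpg" ',
    "All dining room furniture",
    "Barstools",
]

# ^.{21,}$ : a whole line of 21 or more characters of any kind.
too_long = re.compile(r"^.{%d,}$" % (LIMIT + 1))
kept = [s for s in lines if not too_long.match(s)]

# The same filter without a regex, for comparison.
kept_by_len = [s for s in lines if len(s) <= LIMIT]

print(kept)  # ['Buffets', 'Barstools']
```

The plain length check gives the same result and avoids the pattern altogether.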

  •  06-06-2012, 7:31 PM 85394 in reply to 85385

    Re: First two characters in each string?

    Please see my response to your other question (which I saw first) at http://regexadvice.com/forums/thread/85386.aspx

    Basically you can spend an awful lot of time trying to create a regex pattern to do what you want and, if you succeed, then I can almost guarantee you it will be long and complex, hard to maintain and fragile (in that it will be easy to find a slight variation in the text that will not be matched by the pattern, or that matches when it should not).

    On the other hand, using an HTML (or XML) DOM library is the way to go. For a start the parser (especially in the HTML case) can handle many of the idiosyncrasies (and down-right errors) that occur in "valid" HTML code and still build a valid parse tree - on the other hand a regex pattern will almost certainly break on many of these same text files.

    The XPath query capabilities (especially of the .NET library) are very powerful and you can either create a small application to step your way through the various layers or perhaps create a single query that will return exactly what you want in a single step - it all depends on what you are actually trying to do (you have told us the end step of your problem but not what you are trying to achieve overall).

    Finally the DOM lets you edit the parse tree and then write out a new document - this is something that it would seem you are trying to do.
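
As a rough illustration of the parser-based approach, here is a sketch using Python's standard html.parser. Note that this is an event-based parser rather than a full DOM with XPath, so it only stands in for the .NET libraries mentioned above. The class name LinkTextCollector and the shortened fragment are made up for the example.

```python
from html.parser import HTMLParser

class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element; ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <A> and <a> are treated alike.
        if tag == "a":
            self.in_link = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.labels[-1] += data

# Hypothetical, shortened fragment in the style of the sample data.
fragment = (
    '<A href="/bars?ie=UTF8&amp;node=1">Bars</A>, '
    '<A href="/barstools?ie=UTF8&amp;node=2">Barstools</A> '
    '<IMG alt="Dewalt DPG541C" src="http://example.invalid/x.jpg">'
)

collector = LinkTextCollector()
collector.feed(fragment)
collector.close()
labels = collector.labels

print(labels)  # ['Bars', 'Barstools']
```

The long href values and the IMG tag never need to be matched at all; the parser steps over them.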

    Overall, I would suggest that you are using the wrong tool for your job.

    Also, for this type of problem, if you really must follow the regex path, then I would recommend that you learn more about how to create regex patterns and throw away any pattern builders you are trying to use. In general (and I'm making a sweeping generalisation here) they can be used for very simple situations that are well structured and follow the pattern that the parser is built to handle. My experience is that HTML and XML patterns are simply too complex for these tools to be useful.

    Instead, you really need to carefully define what you are trying to do and then manually craft a solution that fits.

    Susan
