
Problems with line breaks

Last post 11-30-2012, 1:45 AM by jiri2. 4 replies.
  •  11-20-2012, 8:16 PM 87110

    Problems with line breaks

    Hello. I'm having quite a few problems expressing line breaks in regex. The only way I know to capture my text also picks up a bunch of newlines, which break the CSV or TSV I try to create from it. When trying to capture:

    -------------------------------------------------------------- 

     

     

    <b><span class="h3color tiny">This review is from: </span><a href="http://www.amazon.com/Schindlers-List-VHS-Liam-Neeson/dp/6303168507/ref=cm_cr_pr_orig_subj/183-1056400-8295663">Schindler's List [VHS] (VHS Tape)</a></b>
    </div>

    Schindler's List is my favorite historical drama of all-time for a number of reasons. Not only is it a masterpiece from a cinematic point of view, but it is priceless for the story it tells to the world. <p>First of all, the acting is superb. Liam Neeson does well as Oskar Schindler, but in particular I liked Ben Kingsley (as Istak Stern, Schindler's accountant) and Ralph Fiennes (as Amon Goeth, the camp commandant). All of the performances were very convincing and reflect the good casting.</p><p></p><p>Another great feature of this film is the soundtrack. Slow, soaring music tells of the painful circumstances of the Jews and of their conflict with the Nazi regime. Mixed in with the instrumental pieces are Jewish melodies which also gave me a sense of the cultural traditions of the Jewish people. </p><p>From a technical point of view, the decision by director Spielberg to shoot the movie is black-and-white was a good one. In fact, I think it makes the movie better than it would have been in color. The few color segments throughout the movie are aptly placed and help to focus the viewer's attention on particular details through the eyes of Schindler. The scenery and photography were excellent compared to other movies I have seen and contribute to the whole atmosphere of the 1940s. Some people may be put off a bit by the length (over 3 hours) but believe me, every minute is worthwhile. Unlike other long movies, there are no lulls or useless scenes -- everything counts. </p><p>The best part of the movie without any doubt is the story itself, the tale of Oskar Schindler and how he was able to save 1100 Jews from the Auschwitz gas chambers by employing them in his enamelware factory and eventually his shelling factory. Schindler's ambition and personal success shines through amidst the Jewish tragedy and shows how one man, if he has the willpower, can accomplish what appears to be impossible. 
Based on the novel by Thomas Keneally (which I have not yet had the opportunity to read), this movie digs deep into the human soul and shows how different people are able to survive. </p><p>There are many touching moments in this film; in particular, near the end when the war has been declared over and the *** must flee from the Soviet army. This part and the modern-day segment that follows are both truly heart-warming tributes. I finished watching this movie for about the fourth time yesterday, and even though I didn't cry, tears welled up in my eyes (and this rarely happens when I watch movies). </p><p>This movie is a must-see not only for its excellence in the film genre but for the story it presents to the viewer. Although it is not suitable for young children (due to its violence and mature content), any mature individual should see it so they can understand that a spark of good can still exist in a fire of evil. This movie deserved all of the Academy Awards that it received and will likely remain in top ten lists for at least the next fifty years. Highly recommended.
    </p><div style='padding-top: 10px; clear: both; width: 100%;'>

     

     

    ------------------------------------------

     

    using 

     

    (?<=\ \ \ \ \ \ \ <b><span\ class="h3color\ tiny">This\ review\ is\ from:\ </span><a\ href="http://www\.amazon\.com/Schindlers-List-VHS-Liam-Neeson/dp/6303168507/ref=cm_cr_pr_orig_subj/183-1056400-8295663">Schindler's\ List\ \[VHS]\ \(VHS\ Tape\)</a></b>)[\w\W]*?(?=\ \ \ \ \ \ </p><div\ style='padding-top:\ 10px;\ clear:\ both;\ width:\ 100%;'>)

     

    ------------------------

     

    I get three annoying lines above my article. Two of them are empty and one contains the </div> tag. Below the article I also get another line break that further breaks up my output. These line breaks are then interpreted by my script as line breaks in the CSV or text file I try to create.

     

    I'm using .NET regex, and a regex builder to come up with the expressions. I would like the expression to target articles that are in this format, not this specific article only. I always lose myself in regex line breaks and expressions in general due to the amount of syntax involved, so would greatly appreciate any help, thanks.

     

    The document is DOM HTML and it has spaces in front of some of its rows, though not in front of the article's row. Below is my understanding of what the three rows coming before the article should look like, but I clearly must be doing something wrong:

     

    (?<=\ \ \ \ \ \ \ <b><span\ class="h3color\ tiny\n\<\/\w+\ >\n">This\ review\is\ from:\ </span><a\ href="http://www\.amazon\.com/Schindlers-List-VHS-Liam-Neeson/dp/6303168507/ref=cm_cr_pr_orig_subj/183-1056400-8295663">Schindler's\ List\ \[VHS]\ \(VHS\ Tape\)</a></b>)[\w\W]*?(?=\ \ \ \ \ \ </p><div\ style='padding-top:\ 10px;\ clear:\ both;\ width:\ 100%;'>)

     

    ---------------------

     

    Is  \n\<\/\w+\ >\n wrong? 

  •  11-20-2012, 10:16 PM 87111 in reply to 87110

    Re: Problems with line breaks

    In the first instance, I suggest that you drop the regex builder as it will end up causing you more pain than anything else (in my opinion).

    Secondly, regex patterns are notoriously hard to write for HTML. I generally recommend that you use an HTML DOM library and then write suitable XQuery queries to get what you want; these are the tools that are intended to handle (even badly formatted) HTML code. All it takes is an extra space or unexpected character and the regex pattern will fail, whereas the DOM will generally handle these errors (and a whole lot of others) very easily.

    Now, as to why you are getting the extra blank lines: it is because they are in the original text. Your pattern starts the capture for the displayed text just before the line break in the line that ends "</a></b>" - that is your first line. Then you have the line with the "</div>" on it, plus its line break - your 2nd line. Then there is the blank line between the "</div>" and the start of the text - line #3!
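    The three lines can be reproduced with a small sketch. It uses Python's `re` module as a stand-in for .NET, and the `html` string is a hypothetical, shortened version of the page source in the question, so treat it as an illustration rather than the exact behaviour of the original tool:

```python
import re

# Hypothetical, shortened stand-in for the page source shown in the question.
html = (
    '<b><a href="#">Schindler\'s List [VHS] (VHS Tape)</a></b>\n'
    '</div>\n'
    '\n'
    'Schindler\'s List is my favorite historical drama.\n'
    "</p><div style='padding-top: 10px;'>"
)

# Same shape as the original pattern: the lookbehind ends at </a></b>,
# the lookahead starts at </p><div, and [\w\W]*? captures everything between.
pattern = re.compile(r"(?<=</a></b>)[\w\W]*?(?=</p><div)")

captured = pattern.search(html).group(0)
lines = captured.split("\n")

# The first three pieces are the rest of the </a></b> line (empty),
# the </div> line, and the blank line before the review text.
print(lines[:3])
```

    The final element of `lines` is also empty, which corresponds to the extra line break reported below the article.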

    Finally (in terms of what is going wrong with your pattern), you say that you want to match other articles that are in the correct format, but your pattern explicitly looks for text that includes (for example) "Schindlers-List-VHS" and other literal text. (This is a by-product of using the regex builder when it has no idea of what is literal text and what is the variable text you are trying to capture.) Also, the regex pattern that is created looks for a sequence of 7 space characters before the text and 6 afterwards - if the source does not contain exactly that many spaces, or they are actually other whitespace (as would be indicated by the blank lines at the start of your text and possibly stripped from the end of the example text you have in your posting), then the complete match will fail. This makes for an extremely fragile pattern.

    Now, what to do about it. The first step is to be clear as to exactly what is the overall structure of the entries you are looking for. Without seeing the entire text, I have to guess that you could use text such as "This review is from" and skip forward to the next "</div>" text - but only you know the overall structure and can tell us how to locate the start of the entry of interest.

    We could use something like:

    this\sreview\sis\sfrom.*?</div>\s*

    with the "singleline" and "ignore case" options set to find the start of the entry. You will note that I'm trying to use as few literal characters as possible that will reliably locate the start of each entry (i.e. they will be repeated in every entry) - I'm making a guess here that "This review is from" will meet that criterion. This also uses '.*?' to skip forward only as many characters as are needed to get to the "</div>" text, which I'm guessing helps to mark the start of the bit you want. Also, I use the '\s*' part to skip over all whitespace, including carriage-return and new-line characters - this helps to remove any leading blank lines.

    Next we want to capture the text of the review. In itself, that is easy, but the trick is to know when the review is complete. In this case I'm going to use the "</p><div" part - again this may be right or may need expanding. Therefore we can add the following:

    (.*?)
    </p><div

    to our pattern.

    This does capture all of the entry, but it also captures the HTML codes at the start and end. You could use lookbehind and lookahead constructs as your pattern does, but an alternative is to simply use a capture block around the middle line (as I've done above). Then, when you get your pattern match, you just look at the text in  match group #1 for the main part of the review. (Note that the sample text contains "<p>" and "</p>" tags - these WILL be included in the captured text - again if you use the HTML DOM then you will be able to use the "innerText" that will not have those)
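    As a rough illustration of the "innerText" idea, the sketch below strips the tags from an already captured review using Python's standard `html.parser` module. The class name `TextOnly`, the helper `inner_text` and the sample string are made up for this example, and a real DOM library would do considerably more than this:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes, roughly like a DOM innerText."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat each <p> as a word boundary so paragraphs do not run together.
        if tag == "p":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def inner_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    # Collapse runs of whitespace (including line breaks) to single spaces.
    return " ".join("".join(parser.parts).split())


# Hypothetical captured text, shaped like match group #1 from the sample.
captured = (
    "Not only is it a masterpiece. <p>First of all, the acting is superb."
    "</p><p></p><p>Another great feature is the soundtrack. </p>"
)
text = inner_text(captured)
print(text)
```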

    Therefore, my recommended pattern would be (all on 1 line unless you want to use the "ignore whitespace" option):

    this\sreview\sis\sfrom.*?</div>\s*(.*?)</p><div
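    For reference, here is a minimal sketch of that pattern in use. It is written in Python rather than .NET, with `re.DOTALL` and `re.IGNORECASE` standing in for the "singleline" and "ignore case" options; the `html` sample is a hypothetical cut-down entry, not the real page:

```python
import re

# Hypothetical cut-down entry in the same layout as the sample in the question.
html = (
    '<b><span class="h3color tiny">This review is from: </span>'
    '<a href="#">Some Title (VHS Tape)</a></b>\n'
    '</div>\n'
    '\n'
    'First paragraph. <p>Second paragraph.</p><p>Third paragraph.\n'
    "</p><div style='padding-top: 10px; clear: both; width: 100%;'>"
)

pattern = re.compile(
    r"this\sreview\sis\sfrom.*?</div>\s*(.*?)</p><div",
    re.DOTALL | re.IGNORECASE,  # "singleline" and "ignore case"
)

# Group #1 holds the review body; the leading blank lines were consumed by \s*.
reviews = [m.group(1) for m in pattern.finditer(html)]
print(reviews)
```

    Note that the inner "<p>" and "</p>" tags remain in the captured text, and a line break sitting just before the closing "</p><div" is still captured, so a final trim may be wanted.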

    Depending on how you are using this pattern, you may want to only use the default text (technically the text captured in match group #0), in which case the lookarounds may be needed - therefore the pattern would be:

    (?<=this\sreview\sis\sfrom.*?</div>\s*).*?(?=</p><div)

    however, because of the way lookbehinds actually work, the '\s*' at the end of the lookbehind is effectively ignored and so the returned text will have some leading whitespace - you can remove that with the standard "trim" string function.

    (If you want to know why this is, it is because the lookbehind starts by effectively working backwards from the starting point, trying to match the lookbehind text. Imagine that we are at the ">" of the "</div>" text. When we start looking backwards from there, the pattern will see the '\s*' and the text will be the "v" of "div" - as this is an optional match the pattern element will be skipped and it will move on to the '>' literal character. Of course the '>' in the pattern and the "v" in the text don't match and, as this is a required match, the overall lookbehind match will fail.

    Therefore, the text pointer will move on to just after the ">" in "</div>" (i.e. just before the first whitespace) and the process will begin again. Again the '\s*' won't match but that's OK as it is optional, so the next part of the pattern - working backwards - will be tried. In this case all of the rest of the pattern WILL match and so the lookbehind is successful. However the next characters are the whitespace and these will therefore be included in the returned text.

    Unfortunately there is not much that can be done to force the inclusion of the whitespace as the last item in the lookbehind.)
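    The effect, and the "trim" fix, can be sketched as follows. One caveat: Python's built-in `re` module only accepts fixed-width lookbehinds, so this sketch shortens the lookbehind to the literal "</div>" instead of the full .NET pattern above. It does not reproduce the .NET engine, only the consequence that the whitespace after "</div>" ends up in match group #0. The `html` string is a made-up sample:

```python
import re

# Made-up sample: two line breaks sit between </div> and the review text.
html = "</a></b>\n</div>\n\nReview text here.\n</p><div style='clear: both;'>"

# Fixed-width lookbehind (a Python restriction); .NET would accept the longer one.
pattern = re.compile(r"(?<=</div>).*?(?=</p><div)", re.DOTALL)

whole_match = pattern.search(html).group(0)  # match group #0
trimmed = whole_match.strip()                # same idea as .NET's String.Trim()

print(repr(whole_match))
print(repr(trimmed))
```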

    Hope this helps. Get back with some of the details about how to locate the start and end of the text you want and we might find a better pattern for you.

    Susan

  •  11-21-2012, 11:46 PM 87114 in reply to 87111

    Re: Problems with line breaks

     

    Hi Susan, thanks once again for the great input. Below is an image of what I want to scrape. It's missing some details but gives an idea of the spaces I'm having difficulty matching. The article starts on the line that says "Meet Oskar Schindler." and it's a different article from yesterday's example. The highlight gives an idea of the spaces that seem to be in the text. The screenshot also shows how my regex builder seems to think it's dealing with DOM HTML, as below there are three tabs with options for DOM, Source HTML and plain text. The regex builder and page scraper come integrated with the Zennoposter Projectmaker, which can be downloaded for free as a demo from their website, in case you're interested.

    Does my screenshot help in clarifying the possibilities? The missing part in the lookbehind is: 

     

    <b><span class="h3color tiny">This review is from: </span><a href="http://www.amazon.com/Schindlers-List-Widescreen-Edition-VHS/dp/0783211856/ref=cm_cr_pr_orig_subj/184-8970944-2269543">Schindler's List (Widescreen Edition) [VHS] (VHS Tape)</a></b>

     

    and it's all in one line.

     

     

  •  11-22-2012, 5:44 PM 87116 in reply to 87114

    Re: Problems with line breaks

    I'm not sure what you mean with "The missing part of the lookbehind is:" - it is certainly not missing from my suggested pattern and adding anything like that to a pattern would limit you to only ever matching an entry with that EXACT text. The whole point of a "regular expression" is to abstract away the actual characters (where possible) and leave only the general "pattern" of characters that let you identify the parts you are interested in that will include varying text.

    For example, if you had another entry for Schindler's List but NOT the widescreen version, then your lookbehind would miss it, but my suggestion would pick it up. Also if you were looking for "Star Wars" or "The Great Escape" or whatever, then the only part that might (perhaps) be common across them all is "This review is from".

    The image you have provided might be useful but, in reality, you will probably be scanning the original text source - and that is what you need to look at. Using a DOM to parse the text may allow it to correct any mistakes it finds and also reformat the text to make the structure more visible through indenting, and therefore what you are seeing in this display may NOT be what the regex engine would see.

    Does your application allow you to run XQuery queries against the DOM structure? If so then that would be the best way to go: find some characteristic(s) (perhaps of the "div" tags with the "id" attribute of a null string) that identify the text and then extract the "innerText" from that. In this case that would also include all of the other text before the review itself, but from that you could use the regex to extract what you are after. (I must admit that I can't see from your example any specific HTML tags that would let you identify just the review part, unless perhaps it is the div tag with the class="tiny" attribute as a starting marker and the next "div" tag as the end marker.)

    From what I can see, either of my suggested patterns would work against the text as shown above, perhaps with the need to strip the leading whitespace if you use the lookaround version.

    Susan

  •  11-30-2012, 1:45 AM 87140 in reply to 87116

    Re: Problems with line breaks

    Thanks for the suggestion. Indeed, from reading up on XQuery, it seems to be a considerable improvement on what my builder calls DOM HTML. My builder probably doesn't have that option built in, but custom code can be written and executed. Of course, that would mean having to write the code myself each time, which would make things more complicated, at least for the time being, as I try to make a living out of all this.

     

    In the end I used my original expression and accepted the empty lines that came with it. It was OK, as all I had to do was erase the empty lines (and the odd </p>) and the values would match up. I'll probably try the solutions from your first answer in the future.
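    That kind of clean-up can be sketched in a few lines. This is only an illustration in Python, not the script actually used in this thread; the `flatten` helper, the sample `raw` string and the two-column row layout are all assumptions:

```python
import csv
import io
import re


def flatten(review):
    """Drop stray <p>/<div> tags and squeeze all whitespace to single spaces."""
    no_tags = re.sub(r"</?(?:p|div)\b[^>]*>", " ", review, flags=re.IGNORECASE)
    return " ".join(no_tags.split())


# Hypothetical capture with the leading </div> line, blank lines and a stray <p>.
raw = "\n</div>\n\nGreat film.<p>Highly recommended.\n"
row = ["Schindler's List", flatten(raw)]

# Write one CSV record; the review now occupies a single physical line.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(repr(line))
```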
