In the first instance, I suggest that you drop the regex builder as it will end up causing you more pain than anything else (in my opinion).
Secondly, regex patterns are notoriously hard to write for HTML. I generally recommend that you use an HTML DOM library and then write suitable XPath queries to get what you want - these are the tools that are intended to handle (even badly formatted) HTML code. All it takes is an extra space or unexpected character and the regex pattern will fail, whereas the DOM will generally handle these errors (and a whole lot of others) very easily.
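To illustrate the parser-based approach, here is a minimal sketch in Python. It uses the standard-library `html.parser` module as a stand-in (it is not the DOM library with queries referred to above), and the class name, helper name and sample markup are all hypothetical, since the real page is not shown.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements, ignoring other markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <p> elements we are currently inside
        self.chunks = []  # text fragments seen inside <p>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <P> and <p> are treated alike
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def paragraph_text(html):
    """Return the text inside <p> elements with runs of whitespace collapsed."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical sample with odd casing and spacing
sample = '<DIV  class="x">This review is from: <b>Item</b>\n</DIV>\n\n<P>Great   film.</P><div>'
print(paragraph_text(sample))
```

The parser does not care about the exact spacing or casing of the tags, which is the robustness that a literal regex pattern lacks.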
Now, as to why you are getting the extra blank lines: they are in the original text. Your pattern starts the capture for the displayed text just before the line break in the line that ends "</a></b>" - that is your first line. Then you have the line with the "</div>" on it, plus its line break - your 2nd line. Then there is the blank line between the "</div>" and the start of the text - line #3!
Finally (in terms of what is going wrong with your pattern), you say that you want to match other articles that are in the correct format, but your pattern explicitly looks for text that includes (for example) "Schindlers-List-VHS" and other literal text. (This is a by-product of using the regex builder when it has no idea of what is literal text and what is the variable text you are trying to capture.) Also, the regex pattern that is created looks for sequences of 2 space characters before the pattern and 6 afterwards - if you don't see exactly this number of spaces, or they are actually other whitespace (as would be indicated by the blank lines at the start of your text and possibly stripped from the end of the example text you have in your posting), then the complete match will fail. This makes for an extremely fragile pattern.
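The fragility of fixed space counts can be seen in a small Python sketch. Both patterns and the sample text are made up for illustration; they are not the pattern the regex builder generated.

```python
import re

# Hypothetical snippet; the real page markup is not shown in the post.
# The whitespace here is two newlines and a tab rather than plain spaces.
text = "</div>\n\n<p>Loved it.</p>\t<div"

# Demands exactly 2 spaces before and 6 spaces after, like a builder-made pattern
fragile = re.compile(r"</div>  <p>(.*?)</p>      <div")

# Accepts any amount (including none) of any whitespace in the same places
tolerant = re.compile(r"</div>\s*<p>(.*?)</p>\s*<div", re.S)

print(fragile.search(text))             # no match object is returned
print(tolerant.search(text).group(1))   # the review text is captured
```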
Now, what to do about it. The first step is to be clear as to exactly what is the overall structure of the entries you are looking for. Without seeing the entire text, I have to guess that you could use text such as "This review is from" and skip forward to the next "</div>" text - but only you know the overall structure and can tell us how to locate the start of the entry of interest.
We could use something like:
    This review is from.*?</div>\s*
with the "singleline" and "ignore case" options set to find the start of the entry. You will note that I'm trying to use the fewest literal characters that will reliably locate the start of each entry (i.e. they will be repeated in every entry - I'm making a guess here that "This review is from" will meet that criterion). This also uses '.*?' to skip forward only as many characters as are needed to get to the "</div>" text, which I'm guessing helps to mark the start of the bit you want. Also I use the '\s*' part to skip over all whitespace, including carriage-return and new-line characters - this helps to remove any leading blank lines.
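As a rough Python sketch of this "find the start" step: the pattern below is assembled from the pieces described above ("This review is from", '.*?', "</div>", '\s*'), with `re.S` and `re.I` standing in for the "singleline" and "ignore case" options. The sample text is invented and the name `START` is only illustrative.

```python
import re

# Assumed pattern, built from the pieces named in the text above
START = re.compile(r"This review is from.*?</div>\s*", re.S | re.I)

# Hypothetical sample with a line break and a blank line after </div>
sample = "<div>this review is from: <b><a>Item</a></b>\n</div>\n\n<p>Review text</p><div>"

m = START.search(sample)
rest = sample[m.end():]   # everything after the matched start marker
print(repr(rest))
```

Because '\s*' consumes the line break and the blank line, `rest` begins directly at the review markup with no leading blank lines.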
Next we want to capture the text of the review. In itself, that is easy, but the trick is to know when the review is complete. In this case I'm going to use the "</p><div" part - again this may be right or may need expanding. Therefore we can add the following:
    (.*?)
    </p><div
to our pattern.
This does capture all of the entry, but it also captures the HTML codes at the start and end. You could use lookbehind and lookahead constructs as your pattern does, but an alternative is to simply use a capture block around the middle line (as I've done above). Then, when you get your pattern match, you just look at the text in match group #1 for the main part of the review. (Note that the sample text contains "<p>" and "</p>" tags - these WILL be included in the captured text - again if you use the HTML DOM then you will be able to use the "innerText" that will not have those)
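Here is a hedged Python sketch of the capture-group approach. The pattern is put together from the parts discussed above and may need adjusting to the real page; the function name `extract_reviews` and the sample text are hypothetical.

```python
import re

# Assumed pattern: start marker, captured review body, end marker
REVIEW = re.compile(r"This review is from.*?</div>\s*(.*?)</p><div", re.S | re.I)


def extract_reviews(html):
    """Return match group #1 (the review body) for every entry found."""
    return [m.group(1) for m in REVIEW.finditer(html)]


# Hypothetical sample containing two entries
sample = (
    "This review is from: <b><a>Item A</a></b>\n</div>\n\n<p>First review.</p><div>\n"
    "This review is from: <b><a>Item B</a></b>\n</div>\n<p>Second\nreview.</p><div>"
)

for review in extract_reviews(sample):
    print(repr(review))
```

Note that each captured string still begins with the "<p>" tag, which is the leftover markup the note above warns about.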
Therefore, my recommended pattern would be (all on 1 line unless you want to use the "ignore whitespace" option):
    This review is from.*?</div>\s*(.*?)</p><div
Depending on how you are using this pattern, you may want to only use the default text (technically the text captured in match group #0), in which case the lookarounds may be needed - therefore the pattern would be:
    (?<=This review is from.*?</div>\s*).*?(?=</p><div)
However, because of the way lookbehinds actually work, the '\s*' at the end of the lookbehind is effectively ignored and so the returned text will have some leading whitespace - you can remove that with the standard "trim" string function.
(If you want to know why this is, it is because the lookbehind starts by effectively working backwards from the starting point, trying to match the lookbehind text. Imagine that we are at the ">" of the "</div>" text. When we start looking backwards from there, the pattern will see the '\s*' and the text will be the "v" of "div" - as this is an optional match the pattern element will be skipped and it will move on to the '>' literal character. Of course the '>' in the pattern and the "v" in the text don't match and, as this is a required match, the overall lookbehind match will fail.
Therefore, the text pointer will move on to just after the ">" in "</div>" (i.e. just before the first whitespace) and the process will begin again. Again the '\s*' won't match, but that's OK as it is optional, so the next part of the pattern - working backwards - will be tried. In this case all of the rest of the pattern WILL match and so the lookbehind is successful. However, the next characters are the whitespace and these will therefore be included in the returned text.
Unfortunately there is not much that can be done to force the inclusion of the whitespace as the last item in the lookbehind.)
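The leading-whitespace effect and the "trim" fix can be sketched in Python, with one caveat: Python's `re` module only accepts fixed-width lookbehinds, so this sketch uses a shortened lookbehind containing just "</div>" rather than the variable-length one discussed above (which needs an engine that supports it, such as .NET's). The sample text is invented.

```python
import re

# Fixed-width lookbehind only: Python's re rejects '.*?' or '\s*' inside (?<=...)
PATTERN = re.compile(r"(?<=</div>).*?(?=</p><div)", re.S)

# Hypothetical sample with a line break and blank line after </div>
sample = "This review is from: <b><a>Item</a></b>\n</div>\n\n<p>Review text</p><div>"

m = PATTERN.search(sample)
raw = m.group(0)          # match group #0 starts right after "</div>"
trimmed = raw.strip()     # Python's equivalent of the "trim" function

print(repr(raw))
print(repr(trimmed))
```

The match begins at the first position where the lookbehind succeeds, which is immediately after "</div>", so the whitespace that follows ends up in the returned text until it is trimmed.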
Hope this helps. Get back with some of the details about how to locate the start and end of the text you want and we might find a better pattern for you.