Let me take a step back: let's get the basic structure sorted first and then worry about capturing the correct bits of it.
To do this, I want to avoid any confusion about what the <p1>, <p2> etc. in your pattern are actually telling us. Therefore, let's start with a slightly modified "sentence" matcher:
(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
This is basically what we had before, except that it is really composed of 2 alternatives: the first (which is actually written last) is simply '\w+' with no surrounding punctuation; the other is where we have surrounding brackets that must be at the start and end, with at least one word in between and possibly more, each separated by a single space.
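To make the two alternatives concrete, here is a minimal sketch in Python rather than .NET (my choice for illustration; this sub-pattern only uses syntax the two flavours share). The name SENTENCE and the sample strings are made up for the example.

```python
import re

# The bare "sentence" sub-pattern: bracketed words, or one plain word.
# The inner group is made non-capturing; otherwise it is the pattern above.
SENTENCE = re.compile(r'[{(]+\w+(?:\x20\w+)*[})]+|\w+')

def is_sentence(text):
    """Return True when the whole of text is a single sentence."""
    return SENTENCE.fullmatch(text) is not None

print(is_sentence('{one two}'))  # True: brackets around two words
print(is_sentence('word'))       # True: the plain-word alternative
print(is_sentence('{}'))         # False: brackets need at least one word
print(is_sentence('one two'))    # False: a space is only allowed inside brackets
```

Note that the pattern does not check that the brackets pair up, so something like '{one)' is also accepted.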
You will note that I've switched to using '\x20' instead of '\s' because, in some places, '\s' was matching the end of line incorrectly: '\x20' will only match a space, not any whitespace character.
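The difference is easy to see on a string that contains a line break. This is a small Python sketch; the variable names and the sample text are only for illustration.

```python
import re

words_s = re.compile(r'\w+(?:\s\w+)*')      # \s: any whitespace, newline included
words_x20 = re.compile(r'\w+(?:\x20\w+)*')  # \x20: the space character only

text = 'one two\nthree'
print(repr(words_s.match(text).group()))    # 'one two\nthree' (runs over the line end)
print(repr(words_x20.match(text).group()))  # 'one two' (stops at the line end)
```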
Now the named group will match as a whole or will not - and we can test the name to make sure the sentence is given (or not). Previously, the name was telling us whether or not the sentence had punctuation - if there was none, then we could not use the name to differentiate between when the sentence was not there or just the punctuation was missing. This will become important when we get to handling the required "s3" case.
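As a rough sketch of "testing the name", here is a cut-down Python version that has only the optional sentence 1 followed by a catch-all. The 'rest' group and the helper s1_of are hypothetical and exist only for this example; Python spells the named group '(?P<s1>...)' and reports a group that took no part as None (in .NET you would look at match.Groups["s1"].Success instead).

```python
import re

# Optional sentence 1 (including its trailing "="), then everything else.
pat = re.compile(
    r'^(?P<s1>(?:[{(]+\w+(?:\x20\w+)*[})]+|\w+)=)?'
    r'(?P<rest>.*)$'
)

def s1_of(text):
    """Return the text captured by s1, or None when sentence 1 is absent."""
    return pat.match(text).group('s1')

print(s1_of('{a b}=x'))  # {a b}=
print(s1_of('a=x'))      # a=
print(s1_of('x'))        # None
```

With or without punctuation, the name now answers a single question: was sentence 1 there at all?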
Therefore, the simplistic application of this sub-pattern to the whole language becomes:
^(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
(?<s2>(?>[{(]+\w+(\x20\w+)*[})]+|\w+))?
(?<s3>[{(]+\w+(\x20\w+)*[})]+|\w+)
(?<s4>[{(]+\w+(\x20\w+)*[})]+|\w+)?
\r?$
In other words, sentence 1 is optional but, when present, must have a trailing "=" character; sentence 2 is optional, sentence 3 is required and sentence 4 is optional.
By the way, in answer to your last question, the '\r?$' simply forces a match with the end of the line (or string). The reason the '\r?' is there is that some platforms end a line with '\n' and others with '\r\n'. Also, some systems have nothing at the end of the text, and the '$' anchor is happy with this situation too. This little pattern will match all of these situations.
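A quick way to convince yourself is to try the three endings. This sketch uses Python, where '$' without the multiline option also matches at the very end of the string or just before a final '\n'; the 'abc' text is made up.

```python
import re

end = re.compile(r'abc\r?$')

# No terminator, a Unix-style ending, and a Windows-style ending.
samples = ['abc', 'abc\n', 'abc\r\n']
results = [bool(end.search(s)) for s in samples]
print(results)  # [True, True, True]
```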
This actually matches all but the first 2 of your examples correctly, allocating the right text to each of the "sentences".
The reason it does not match the first 2 cases is that there is no allowance for the spaces around sentence 3. If we go back to the basic sentence structure, we only have to worry about spaces when there is no punctuation. Therefore, let's split the sentence up as follows:
(?<s3>
[{(]+\w+(\x20\w+)*[})]+
|
\w+
)
and we start work on the conditions where spaces need to be handled. The first is: if sentence 2 is present (we don't care about sentence 1, as it has trailing punctuation anyway) then we need a space:
(?<s3>
[{(]+\w+(\x20\w+)*[})]+
|
(?(s2)\x20)
\w+
)
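The conditional can be tried in isolation. In this Python sketch sentence 2 is deliberately cut down to a single bracketed word and sentence 3 to a plain word, so it shows only the '(?(s2)\x20)' test and is not the full language; the helper name parts is made up.

```python
import re

# s2 is optional; s3 demands a leading space only when s2 took part.
pat = re.compile(r'^(?P<s2>\{\w+\})?(?P<s3>(?(s2)\x20)\w+)$')

def parts(text):
    """Return (s2, s3) for a match, or None when there is no match."""
    m = pat.match(text)
    return (m.group('s2'), m.group('s3')) if m else None

print(parts('{a} b'))  # ('{a}', ' b')  space required and consumed
print(parts('b'))      # (None, 'b')    no s2, so no space needed
print(parts('{a}b'))   # None           s2 present but the space is missing
```

Notice that the space ends up inside the s3 capture, because the conditional sits inside the group.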
At this point we actually match ALL of your examples, BUT, example 2 has the "F" matching sentence 2 and the "x" matching sentence 3. Therefore we need to do something to stop this situation from occurring.
If sentence 2 is there, then we have just matched sentence 3 AND either there is a sentence four (which will need another space) or we should be at the end of the line.
Therefore we add in another part after the word match:
(?<s3>
[{(]+\w+(\x20\w+)*[})]+
|
(?(s2)\x20)
\w+
(?(s2)(?!\r?$)|(\x20|\r?$))
)
What this part says is: if sentence 2 has been given and we have just matched sentence 3, then we can't be at the end of the line; otherwise we should have matched this as sentences 3 and 4, not 2 and 3. On the other hand, if there is NO sentence 2, then we must either have a space before any sentence 4 or be at the end of the line.
Putting this all together, we get:
^(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
(?<s2>(?>[{(]+\w+(\x20\w+)*[})]+|\w+))?
(?<s3>
[{(]+\w+(\x20\w+)*[})]+
|
(?(s2)\x20)
\w+
(?(s2)(?!\r?$)|(\x20|\r?$))
)
(?<s4>[{(]+\w+(\x20\w+)*[})]+|\w+)?
\r?$
Now, let's dig further into sentence 3 and capture the main word that you are after. To do this we can simply create a named group around our lone '\w+', as in '(?<Main>\w+)'.
However the way we have written the first alternative (with the punctuation) is not quite right in this case. Elsewhere, we don't care about any specific word, but in THIS case you want to capture the 2nd word only. Therefore we need to change this part to '[{(]+\w+(?<Main>\x20\w+)(\x20\w+)*[})]+'.
You will see that I have used the same name twice. I tested this in a regex test platform that uses the .NET variant; PCRE also accepts this in my experience, although it needs duplicate names to be enabled (the '(?J)' option). When you have named groups, the regex engine is happy to find the same name used in multiple places and will simply put the captured text into the named "slot". Therefore the named capture group "Main" will receive either the first and only word, OR the second (of 2 or more) words.
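Not every engine allows the repeated name. Python's re module, for one, rejects it, so the sketch below is a workaround rather than a translation: it uses two hypothetical names (main_b and main_w) and picks whichever one took part. It covers the sentence 3 alternatives only, not the whole pattern.

```python
import re

pat = re.compile(
    r'[{(]+\w+(?P<main_b>\x20\w+)(?:\x20\w+)*[})]+'  # bracketed: second word
    r'|(?P<main_w>\w+)'                               # plain: the only word
)

def main_word(sentence):
    """Return the target word of a sentence, or None if it does not match."""
    m = pat.fullmatch(sentence)
    if m is None:
        return None
    # main_b includes the leading space matched by \x20, so trim it.
    return (m.group('main_b') or m.group('main_w')).strip()

print(main_word('{one two three}'))  # two
print(main_word('word'))             # word
print(main_word('{one}'))            # None
```

The last call shows a side effect of the change: the bracketed alternative now needs at least two words.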
Therefore the final pattern I've used is:
^(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
(?<s2>(?>[{(]+\w+(\x20\w+)*[})]+|\w+))?
(?<s3>
[{(]+\w+(?<Main>\x20\w+)(\x20\w+)*[})]+
|
(?(s2)\x20)
(?<Main>\w+)
(?(s2)(?!\r?$)|(\x20|\r?$))
)
(?<s4>[{(]+\w+(\x20\w+)*[})]+|\w+)?
\r?$
and the captures are:
s1 = whatever matched sentence 1, ditto s2, s3 and s4; "Main" will contain the target sentence 3 word.
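One thing you may want to do is to trim any extra whitespace from the captured text of sentence 3: the '\x20' parts sit inside the "s3" group (and, in the bracketed alternative, inside "Main"), so a leading or trailing space can be included in the capture.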
Susan