
Need a little help scoping my regex

Last post 04-17-2012, 11:42 PM by Aussie Susan. 1 replies.
  •  04-17-2012, 10:55 PM 84986

    Need a little help scoping my regex

    I have the following text that includes section and subsection numbers separated by a colon, followed by several words of text (I used lorem for this example):

    1:1 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed commodo, nisi non faucibus consequat, purus neque ultrices dolor, eget ornare ligula diam a velit. 1:2 Cras auctor massa eget diam ultrices rutrum. Sed convallis purus non nibh sollicitudin non pulvinar leo lacinia. Nunc luctus risus id elit accumsan quis auctor augue posuere. 1:3 Nullam urna lectus, molestie id vehicula sit amet, convallis luctus justo. Phasellus vestibulum mi non ligula pulvinar pulvinar. Nullam egestas imperdiet diam id adipiscing. 1:4 Etiam ornare tincidunt dictum. Maecenas aliquam venenatis massa, et viverra arcu tempus eu. Mauris nisl arcu, interdum vel aliquet in, malesuada eget turpis. Sed vitae sapien tortor, nec varius dolor. Integer egestas condimentum tincidunt. Mauris eget libero non ligula viverra dapibus. Mauris nunc nisi, facilisis nec volutpat at, cursus non quam. Donec ac erat quis enim vehicula auctor id quis nibh. Proin nisl lacus, viverra sed vehicula quis, convallis vel arcu. Nunc ut urna in orci pretium consequat semper in felis. 1:5 Phasellus elementum, velit eget facilisis ultricies, nibh magna lacinia mi, vitae pellentesque mauris lorem a enim. Nunc quis iaculis turpis. Ut egestas ante eu urna sagittis blandit. Aliquam venenatis diam sit amet purus egestas sollicitudin. 1:6 Phasellus nec nunc a leo commodo posuere. Nam sit amet dui a mauris tristique feugiat. Duis dui turpis, ultricies et venenatis at, imperdiet eu ipsum. Nullam et nunc massa. Aenean tellus quam, fringilla non imperdiet ac, pulvinar non augue.

     

    I plan to separate the section numbers, subsection numbers (losing the colon) and word text into separate columns in a database table.

    I know I can capture the section numbers like this:

    [0-9]+:

     and the subsection numbers like this:

    :[0-9]+

     What I'm needing help with is a regex that will capture all the words between each section/subsection number.

    I tried this:

     ([a-zA-Z]+.)

    but it only captured one word at a time, and I need to capture all the words between any two section numbers.

    Can anyone help?

    I am probably going to have to pass over my document three times with a separate regex each time and store the matches in an array.


    Randy H. Johnson
  •  04-17-2012, 11:42 PM 84988 in reply to 84986

    Re: Need a little help scoping my regex

    Try:

    \d+(:\d+)?((?!\d+(:\d+)?).)+

    on the assumption that your regex engine supports the lookahead operator. You may also need the "singleline" option to let the '.' match any line terminator characters that may be in the text block (otherwise each match will stop when it gets to a line break).
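To make the "singleline" point concrete, here is a minimal sketch that runs the pattern above with and without that option. Python's `re` module is an assumption (the original post does not name an engine), and the sample text is a shortened stand-in with a line break added inside section 1:1.

```python
import re

# Pattern from the reply, unchanged. Using Python's `re` is an assumption;
# the original poster did not say which regex engine is in use.
SECTION = r"\d+(:\d+)?((?!\d+(:\d+)?).)+"

# Shortened sample with a line break inside section 1:1.
text = (
    "1:1 Lorem ipsum dolor sit amet,\n"
    "consectetur adipiscing elit. "
    "1:2 Cras auctor massa. "
    "1:3 Nullam urna lectus."
)

# re.DOTALL is Python's name for the "singleline" option.
with_dotall = [m.group(0) for m in re.finditer(SECTION, text, re.DOTALL)]
without_dotall = [m.group(0) for m in re.finditer(SECTION, text)]

for chunk in with_dotall:
    print(repr(chunk))
```

With `re.DOTALL` the first match runs through the line break up to the start of `1:2`; without it the first match stops at the end of the first line and the remainder of that section is not captured by any match.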

    I've combined your two patterns to match the section numbers into one, with the sub-section part being optional - this may not be correct but it should be easy to change.

    The way this works is to start with the section number match just as you have defined it. The pattern then tests (using the lookahead) whether the next characters also form a section header; if they don't, the '.' matches whatever character does come next and the test repeats. If the next characters are a section header, the lookahead stops the matching process and you are done.
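The same header-then-lookahead idea can also feed the three database columns in a single pass. The sketch below is a variation, not the pattern from this reply: it assumes Python's `re`, makes the sub-section mandatory, and uses hypothetical group names (`section`, `subsection`, `body`). Requiring the colon in the lookahead means a bare number inside the text would not end a section, though something shaped like `10:30` still would.

```python
import re

# Variation on the pattern above (an assumption, not the original):
# named groups capture each column, and the colon is matched but not captured.
ROW = re.compile(
    r"(?P<section>\d+):(?P<subsection>\d+)\s*"  # header such as "1:2"
    r"(?P<body>(?:(?!\d+:\d+).)+)",             # any text up to the next header
    re.DOTALL,
)

text = "1:1 Lorem ipsum. 1:2 Cras auctor massa. 1:3 Nullam urna."

# One tuple per section: (section number, subsection number, text).
rows = [
    (int(m["section"]), int(m["subsection"]), m["body"].strip())
    for m in ROW.finditer(text)
]

for row in rows:
    print(row)
```

Each tuple maps directly onto one table row, so three separate passes over the document should not be needed.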

    I have taken a couple of liberties as well. For a start, I've used '\d' instead of '[0-9]' - the effect should be the same. However, I have used '.' instead of '[a-zA-Z]', which will match a lot more characters (e.g. punctuation characters such as "." and ",", which you have in your example text). This may or may not be what you are expecting but I suspect it might be (i.e. include ALL text following a section header).
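As a small illustration of why the letters-only attempt from the question returns one word at a time, here is a sketch in Python (again an assumed engine) on a short made-up sample:

```python
import re

sample = "Lorem ipsum dolor sit amet, consectetur."

# The attempt from the question: a run of letters plus exactly one more
# character. Each match ends right after the first non-letter it meets.
words = re.findall(r"([a-zA-Z]+.)", sample)

print(words)
```

The space or punctuation mark after each word is consumed by the single '.', so every word ends its own match. Repeating '.' under the control of the lookahead is what lets a match run on until the next section header.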

    Finally it is always a good idea to tell us the regex variant that you are using. I've assumed that you are using one that is relatively "modern" in that the '\d' and similar operators are recognised.

    Susan
