
How can I select the first 300 words of a value with regexp?

Last post 12-03-2012, 4:48 PM by Aussie Susan. 3 replies.
  •  12-01-2012, 3:40 AM 87142

    How can I select the first 300 words of a value with regexp?

    Can regexp do this? It's .NET regexp. I had an expression last night from stackexchange but seem to have lost it, and anyways it wasn't working. Imagine you have a value inside a table and you're extracting it and applying regexp to the extraction.  And you want to select the first three hundred characters, in which there can be all sorts of things like line breaks, spaces, new paragraphs, non alphanumeric, and numbers, you name it, it's there, which is why extracting by line break is a bad idea, and only by character amount sounds the best way.

     

    And what about an added twist. Can regexp, all on its own, split the value not at the 300th word, but at the period sign that's closest to the 300th character? This second question has a much easier, though clumsier, solution: doing a second regexp to extract what's left of the sentence in the result, which ends at, say, the 310th character. But I'm wondering if anyone can envision a regexp where it all happens in one go?

     

    Thanks in advance for any help.

  •  12-02-2012, 7:15 PM 87144 in reply to 87142

    Re: How can I select the first 300 words of a value with regexp?

    I think we need to be clear on the terminology being used here. For example, in your first question, you say you want to match the first 300 characters but then mention other "characters" that you imply should not be counted. Similarly, when you talk about 300 "words" in the title and 2nd question, you don't define what a "word" is in your context (there are many posts in this forum where this has been discussed and probably as many definitions determined as there were situations put forward).

    I'm going to assume that you want to count only the alphanumeric characters plus the underscore (this happens to match the definition of the '\w' character set shortcut) and match all characters from the start of the text until either the end of the text or 300 such characters have been matched.

    To start with, we can use

    ^.{0,300}

    (with the "singleline" option set) to match the first 300 of any character. Next we use the fact that the '\W' character set shortcut is the complement of the '\w' one that we are targeting, and that if we try to match any character with either the set or its complement, then we will always have a match. Therefore we can express the same pattern as

    ^(\w|\W){0,300}

    Now, we want to not count any character that is matched by the '\W' part. Therefore we need to skip over any such character and so we change this pattern to

    ^(\W*\w){0,300}

    As you can see, we skip over (and therefore don't count) any character that matches the '\W' character set, and then count each single character that matches the '\w' set.
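
    A minimal sketch of this counting pattern, assuming Python's re module as a stand-in for the .NET engine (the pattern itself is unchanged, apart from making the group non-capturing). The function name first_n_word_chars and the sample text are made up for illustration.

```python
import re

def first_n_word_chars(text: str, n: int) -> str:
    """Return the prefix of text that contains at most n '\\w' characters."""
    # Each repetition skips any run of non-word characters, then counts one word character.
    pattern = r"^(?:\W*\w){0,%d}" % n
    return re.match(pattern, text).group(0)

sample = "Hi, there!\nNew line."
print(repr(first_n_word_chars(sample, 5)))  # 'Hi, the'
```

    The comma, the space and the line break are part of the match but are not counted, and the match always ends on a counted character.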

    You can now change this pattern to count whatever "character set" you want. For example, to match the first 10 vowels, you could use

    ^([^aeiou]*[aeiou]){0,10}

    (with the "ignore case" option set as necessary)
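
    The vowel variant can be tried the same way. This is a sketch in Python rather than .NET, with re.IGNORECASE playing the role of the "ignore case" option; first_n_vowels is a hypothetical helper name.

```python
import re

def first_n_vowels(text: str, n: int) -> str:
    """Return the prefix of text that contains at most n vowels."""
    # Skip non-vowels, then count one vowel, at most n times.
    pattern = r"^(?:[^aeiou]*[aeiou]){0,%d}" % n
    return re.match(pattern, text, re.IGNORECASE).group(0)

print(first_n_vowels("Regular Expressions", 3))  # Regula
```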

    Note that in this whole discussion, we will include the skipped (and uncounted) characters in the overall match, but that the counting itself will be driven by the character set we have specified.

    To count words, we can use a similar structure but we first need to define what we mean by a "word". In this case I'm going to assume a word made up of alphabetic characters plus the apostrophe and hyphen (i.e. the pattern element '[a-z'-]+' with the "ignore case" option set to include "words" with capitalised letters).

    Therefore we can set up the pattern:

    ^([^a-z'-]*([a-z'-]+)){0,300}

    What this does is to skip over any non-"word" characters and then capture the characters in a "word". I have included a capture group around the "word" part so that it is easy for us to get to the individual "words". (I note that you are using the .NET regex engine: this will work in this case because of an extension in the .NET regex capability to capture the text in each repeated capture group - look up "captures"; in other regex engines, each repeated capture will overwrite the previously captured text and other techniques are needed.)

    To make it a bit easier to identify the "words", we can tell the regex engine not to "capture" the group that includes the non-word characters, as in:

    ^(?:[^a-z'-]*([a-z'-]+)){0,300} 

    By the way, I tested this on the text of your question (as it is not 300 "words" long, I matched the first 100 only - that is to the "And" and "what" of the 2nd paragraph). This matched the words "wasn't" and "you're" correctly. If you expand the count a bit, you will find that it captured "th" and not "300th" as it would have skipped the leading digits in the same way it did the leading whitespace for the "word". This can be fixed but you will need to define exactly what you want to do first.
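
    Here is a rough sketch of the word-counting pattern in Python. Python's re module is one of the engines that keeps only the last repetition of a capture group, so this sketch re-scans the matched prefix with findall instead of reading the .NET "captures" collection. The helper name first_n_words and the sample sentence are assumptions for illustration.

```python
import re

WORD = r"[a-z'-]+"  # a "word": letters, apostrophes and hyphens

def first_n_words(text: str, n: int):
    """Return (prefix, words) covering at most the first n 'words' of text."""
    pattern = r"^(?:[^a-z'-]*%s){0,%d}" % (WORD, n)
    prefix = re.match(pattern, text, re.IGNORECASE).group(0)
    # No per-repetition captures in Python, so list the words from the prefix.
    words = re.findall(WORD, prefix, re.IGNORECASE)
    return prefix, words

prefix, words = first_n_words("It wasn't the 300th time you're late.", 5)
print(prefix)  # It wasn't the 300th time
print(words)   # ['It', "wasn't", 'the', 'th', 'time']
```

    The output shows the same effect described above: the digits of "300th" are skipped like whitespace, so the counted "word" is just "th".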

    The last part of the 2nd question talks about finding the "closest" period to the 300th character. Unfortunately regex engines can't do maths and so they have no concept of "closest". Also, a characteristic of a regex engine is that it can't go back and rescan text without first forgetting any previously captured text that was scanned.

    In this situation, to find the "nearest", you need to:
    - locate the 300th character
    - scan back to the previous period character and record the number of characters
    - scan forward to the next period character and record the number of characters
    - compare the 2 counts and keep the shorter
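
    These steps need ordinary code rather than a pattern. A minimal sketch in Python, assuming a 0-based index stands in for "the 300th character", that a tie goes to the earlier period, and that text without any period is simply cut at the target; cut_at_nearest_period is a hypothetical name.

```python
def cut_at_nearest_period(text: str, target: int = 300) -> str:
    """Cut text just after the period closest to index target."""
    if len(text) <= target:
        return text
    before = text.rfind(".", 0, target)  # last period before the target index
    after = text.find(".", target)       # first period at or after the target index
    if before == -1 and after == -1:
        return text[:target]             # assumption: no period at all, cut at the target
    if before == -1:
        return text[:after + 1]
    if after == -1:
        return text[:before + 1]
    # Compare the two distances and keep the shorter one (ties go to the earlier period).
    if target - before <= after - target:
        return text[:before + 1]
    return text[:after + 1]

print(cut_at_nearest_period("One. Two three four. Five six seven eight nine.", 12))
```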

    If I was trying to do this, I would create a pattern that captured "sentences" defined (in some way) based on the occurrence of a period character. For example

    [^.]+\.

    (with the "singleline" option set) will match characters up to and including a period. (Actually it will have problems with, for example, the text of your question where it will see the ".NET" as the end of a sentence followed by "NET" at the start of the next one - another example of how careful you need to be to define what you really want to match.) You can then look at the array of the matches and count the length of each and keep a running sum. When the sum ticks over the 300 mark, you can see if you really should include the last "sentence" or not.
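
    The running-sum idea might look like this in Python. It is only a sketch under stated assumptions: the rule for the sentence that crosses the limit (keep it only if that leaves the total closer to the limit) is one possible choice, take_sentences is a made-up name, and consecutive periods such as "..." would not be handled well.

```python
import re

def take_sentences(text: str, limit: int = 300) -> str:
    """Keep whole 'sentences' until the running length reaches about limit."""
    total = 0
    kept = []
    for match in re.finditer(r"[^.]+\.", text):
        sentence = match.group(0)
        if total + len(sentence) > limit:
            # This sentence crosses the limit: keep it only if the overshoot
            # is smaller than the gap that dropping it would leave.
            overshoot = total + len(sentence) - limit
            undershoot = limit - total
            if overshoot < undershoot:
                kept.append(sentence)
            break
        kept.append(sentence)
        total += len(sentence)
    return "".join(kept)

print(take_sentences("One. Two three four. Five six.", 14))
```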

    Susan 

  •  12-03-2012, 1:34 AM 87145 in reply to 87144

    Re: How can I select the first 300 words of a value with regexp?

    Once again, thanks a ton for such a comprehensive response; it is a huge help.

    Yes, I should have been more precise. I am dealing with normal paragraphs consisting of all sorts of characters, and would like to save not the words in a wordlist, but the whole design of the sentences and paragraphs too. So that would include the whitespace, the paragraphs, the numbers and anything that may come up in normal reading. I would then have a resulting block of text that would keep the previous block's formatting untouched.

    XQueries, insofar as I understand that they're useful for isolating similar features of text, wouldn't help because there's no recognisable pattern in each of the raw texts from my array. One can be about numbers, another can have no paragraphs, and some can even contain fewer than 300 words.

    If regexp can't do this, I'd be ok with an expression that would match the first 1400 entities of the text area, as it were. These would include any alpha numeric, non alpha numeric and empty space up until the 1400th, while keeping the original format, including features like line breaks or paragraphs. I suspect this would probably be an easier expression, but it'll be interesting to see if the other can be done.

     

  •  12-03-2012, 4:48 PM 87146 in reply to 87145

    Re: How can I select the first 300 words of a value with regexp?

    If you look at the patterns I have suggested, then you should be able to see how to extend the first one to include up to as many characters as you like (just replace the '300' in the quantifier with 1400 or some other number).
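
    For instance, the "first 1400 of anything" case could be sketched like this in Python, where re.DOTALL corresponds to the .NET "singleline" option so that '.' also matches line breaks. The function name first_n_chars is an assumption.

```python
import re

def first_n_chars(text: str, n: int = 1400) -> str:
    """Return up to the first n characters of text, line breaks included."""
    return re.match(r"^.{0,%d}" % n, text, re.DOTALL).group(0)

sample = "ab\ncd\n\nef"
print(repr(first_n_chars(sample, 5)))  # 'ab\ncd'
```

    The match is returned exactly as it appears in the source text, so line breaks and blank lines survive unchanged.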

    None of the suggested patterns will alter any part of the matched text if you look at the complete match. The internal matching groups are there to let you count the words (etc.) but there is no need for you to use them if you don't want to.

    Therefore I'm a little confused as to what you mean by the "formatting" of the text. You mentioned that you are extracting text from a table and THEN applying the regex pattern to the extracted text. It therefore all depends on what mechanism you used to extract the text - if it is the HTML (or XML) DOM and you are reading the "innerText" property, then all you will have is text with any HTML tags (such as "<p>") removed. If you are using some other technique, and you are expecting the formatting markup that is embedded in the text to be preserved, then there may be other ways of achieving what you want.

    For example, if you have a block of text that contains the "<p>" tags, and you want to break the text up by "paragraphs", then you can use the regex "split()" operation using a pattern such as:

    <p\b.*?>

    (If your text also has the "</p>" closing tag then you may want to handle those separately after you have each paragraph from the above split operation).
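
    A small sketch of that split in Python (re.split standing in for the .NET split operation). The helper name split_on_p_tags and the sample markup are made up, and removing "</p>" with a plain string replace is just one simple way to handle the closing tags.

```python
import re

def split_on_p_tags(html: str) -> list:
    """Split html into paragraph texts at each opening <p ...> tag."""
    parts = re.split(r"<p\b.*?>", html)
    # Drop empty pieces (such as the one before the first <p>) and strip closing tags.
    return [part.replace("</p>", "").strip() for part in parts if part.strip()]

sample = '<p>First paragraph.</p><p class="x">Second one.</p>'
print(split_on_p_tags(sample))  # ['First paragraph.', 'Second one.']
```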

    On the other hand, if you have just plain text, you might define a "paragraph" as 2 consecutive line-breaks in which case you could separate the paragraphs using a split pattern such as:

    \n\n

    or, depending on how the line breaks are actually represented in your text:

    \r?\n\r?\n

    (I prefer this one as it takes into account nearly all variations you are likely to come across - except for old-style Apple documents that used "\r" only).
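
    A quick way to try the second pattern, again sketched in Python with a made-up helper name (split_paragraphs):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split plain text wherever two consecutive line breaks occur."""
    return re.split(r"\r?\n\r?\n", text)

print(split_paragraphs("one\n\ntwo\r\n\r\nthree"))  # ['one', 'two', 'three']
```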

    On the subject of XQuery, remember that what the XQuery operators work on are the tags and other structural elements of the document. In this case I suspect that you might be able to use an XQuery to (for example) locate the cell within the table (e.g. find the 4th table in the page, then the 3rd row and the 5th column) and then extract the text. Also you could use an XQuery based on the paragraph tags in much the same way as I did above but on the text of the cell.

    As I mentioned in another response, if you are trying to extract items from an HTML (or XML) text, then the corresponding DOM and XPath queries are generally a much easier way to go.
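
    As a rough illustration of the DOM/XPath route, here is a sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset and needs well-formed markup (real-world HTML would need an HTML parser instead). The tiny document and the function name cell_text are assumptions for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<html><body>
<table>
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>b1</td><td>b2</td></tr>
</table>
</body></html>"""

def cell_text(xml_text: str, table: int, row: int, col: int) -> str:
    """Return the text of a table cell; table, row and col are 1-based."""
    root = ET.fromstring(xml_text)
    tables = root.findall(".//table")  # all tables, in document order
    cell = tables[table - 1].find("./tr[%d]/td[%d]" % (row, col))
    return "".join(cell.itertext())

print(cell_text(DOC, 1, 2, 1))  # b1
```

    Once the cell text is in hand, any of the patterns from earlier in the thread can be applied to it.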

    Susan 
