
Michael Ash's Regex Blog

Regex Musings

Word Break.

The list-detail design is very commonly used with web pages, where you have a list of links that lead to more detailed information on each entry. Sometimes the text of the list is simply a snippet of a much longer string of text in the detail. A common way to handle this is to use a string function to return the first n characters of the string and display that in the list. The problem with this is that it tends to make the break right in the middle of a word, which isn’t a major problem but can be aesthetically displeasing or may accidentally form another word you didn’t mean to put on your site.

When facing this issue I came up with a simple regex to allow me to break on whole words.

^(?:[ -~]{n,m}(?:$|(?:[\w!?.])\s))

Where n = the minimum number of characters to match
And m = the maximum number of characters to allow in the match.

Now in this instance I’m considering a word to be one or more ASCII non-white-space characters. The way it works is that after matching at least n ASCII characters it tries to match either the end of the string or a word character (letter, digit or underscore) or sentence-ending punctuation followed by a white space. So it will accept as many characters, including white spaces, as it can up to m and still satisfy the rest of the match. Otherwise it backtracks until the regex is satisfied. Note that the closing word character and white space sit outside the {n,m} count, so the match itself can be up to m + 2 characters long.

So if you wanted a minimum of 2 characters and a maximum of 75 the regex would be

^(?:[ -~]{2,75}(?:$|(?:[\w!?.])\s))

and if you applied it to the Gettysburg Address

“Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal. …”

(only the 1st paragraph is shown for the example, but you could apply the full text)

Taking the first match you get

“Four score and seven years ago, our fathers brought forth upon this ”

There are a few problems with the regex that can be improved. First off, it only accepts basic ASCII displayable characters, decimal 32 to 126, which means the text must be in that range. I did it this way because it gives you the US alphabet, digits and commonly used symbols and punctuation, which was all I needed at the time. Other characters would need to be added. Also, if the character count of the first word exceeds your maximum length, no match will be found.

You can make this regex a little more dynamic by putting it inside a function that takes your string and the max and min values as input.
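One way such a function might look, as a minimal sketch in Python rather than anything from the original post: the name word_break_snippet and its parameter names are made up for illustration, and it simply drops the min and max values into the regex above.

```python
import re


def word_break_snippet(text, min_len, max_len):
    """Return the leading snippet of `text` that ends on a whole word.

    Hypothetical helper: builds the post's regex with the given bounds.
    Returns None when nothing matches, for example when the first word
    is longer than max_len or the text contains non-ASCII characters.
    """
    if min_len < 0 or max_len < min_len:
        raise ValueError("need 0 <= min_len <= max_len")
    pattern = r"^(?:[ -~]{%d,%d}(?:$|(?:[\w!?.])\s))" % (min_len, max_len)
    match = re.match(pattern, text)
    return match.group(0) if match else None


gettysburg = (
    "Four score and seven years ago, our fathers brought forth upon "
    "this continent a new nation: conceived in liberty, and dedicated "
    "to the proposition that all men are created equal."
)

print(repr(word_break_snippet(gettysburg, 2, 75)))
# 'Four score and seven years ago, our fathers brought forth upon this '
```

The snippet keeps the trailing white space that the regex consumes, so you may want to strip it before display. Returning None for the no-match case leaves it up to the caller to decide on a fallback, such as a plain character cut.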
Published Wednesday, February 9, 2005 1:17 PM by mash



mash said:

This is so cool! Great regex!

February 11, 2005 4:55 PM