Hi All,
I hope someone out there can give me some ideas on this one. I'm using a PCRE environment to search HTML and URLs for words. In some cases, the word I'm looking for also appears inside other, longer words. I'm not interested in those longer words, only in my word on its own. For example:
My word is "asses". Of the following URLs I only want to match the first.
1. http://mysite.com/asses.html
2. http://mysite.com/assessments.html
3. http://mysite.com/biasses.html
4. http://mysite.com/glasses.html
5. http://mysite.com/gasses.html
6. http://mysite.com/bypasses.html
7. http://mysite.com/molasses.html
8. http://mysite.com/reassessments.html
9. http://mysite.com/crassest.html
10. http://mysite.com/assessor.html
I've looked a little at negative lookarounds. For example, I could exclude numbers 2 and 8 by doing this:
asses(?!sment)
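To make that concrete, here is a minimal sketch that runs the lookahead against the ten URLs above. It uses Python's re module as a stand-in for PCRE (an assumption on my part; the lookahead syntax is the same for this pattern), and the helper name matching_numbers is made up for illustration.

```python
import re

# The ten example URLs from the list above, in the same order.
URLS = [
    "http://mysite.com/asses.html",
    "http://mysite.com/assessments.html",
    "http://mysite.com/biasses.html",
    "http://mysite.com/glasses.html",
    "http://mysite.com/gasses.html",
    "http://mysite.com/bypasses.html",
    "http://mysite.com/molasses.html",
    "http://mysite.com/reassessments.html",
    "http://mysite.com/crassest.html",
    "http://mysite.com/assessor.html",
]

# The negative-lookahead pattern from the post.
# Assumption: Python's re is used here in place of a real PCRE engine.
LOOKAHEAD = re.compile(r"asses(?!sment)")


def matching_numbers(pattern, urls):
    """Return the 1-based list positions of the URLs the pattern matches."""
    return [i for i, url in enumerate(urls, start=1) if pattern.search(url)]


print(matching_numbers(LOOKAHEAD, URLS))
# prints [1, 3, 4, 5, 6, 7, 9, 10]
```

Only numbers 2 and 8 drop out, so each remaining false positive would need its own lookaround.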
But excluding everything in this list with lookarounds in a single expression seems tedious and error-prone. I'm looking for one clean expression that matches the correct word but excludes a list of other words. This site shows a list of words that contain the word "asses": http://www.morewords.com/contains/asses/
I don't necessarily need to exclude ALL of them, but certainly the most common ones. Performance is also a concern, so whatever I use has to run reasonably fast. Is it possible? Is there a way to do it with word boundaries?
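For reference, a rough sketch of what the word-boundary idea might look like, again using Python's re module as a stand-in for PCRE (an assumption; \b behaves the same way for this ASCII input). The names WORD, LETTERS_ONLY and matching_numbers are hypothetical, and the letters-only variant is just one possible alternative, not a known-good filter.

```python
import re

# The ten example URLs from the list above, in the same order.
URLS = [
    "http://mysite.com/asses.html",
    "http://mysite.com/assessments.html",
    "http://mysite.com/biasses.html",
    "http://mysite.com/glasses.html",
    "http://mysite.com/gasses.html",
    "http://mysite.com/bypasses.html",
    "http://mysite.com/molasses.html",
    "http://mysite.com/reassessments.html",
    "http://mysite.com/crassest.html",
    "http://mysite.com/assessor.html",
]

# \b matches between a word character (letter, digit, underscore)
# and a non-word character such as "/" or ".".
WORD = re.compile(r"\basses\b", re.ASCII)

# Hypothetical variant: treat only letters as part of a word, so that
# "_" and digits also count as separators.
LETTERS_ONLY = re.compile(r"(?<![A-Za-z])asses(?![A-Za-z])")


def matching_numbers(pattern, urls):
    """Return the 1-based list positions of the URLs the pattern matches."""
    return [i for i, url in enumerate(urls, start=1) if pattern.search(url)]


print(matching_numbers(WORD, URLS))
# prints [1]

# Caveat: "_" is a word character, so \b finds no boundary before "asses" here.
underscore_url = "http://mysite.com/my_asses.html"
print(bool(WORD.search(underscore_url)))          # prints False
print(bool(LETTERS_ONLY.search(underscore_url)))  # prints True
```

On this list the boundary version keeps only number 1 without naming any of the other words. Both patterns are a single fixed literal plus zero-width checks, so I would not expect them to cost much more than a plain substring search, though I have not measured that.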
Any ideas or suggestions would be appreciated!
Thanks,
Eric
P.S. You may have guessed that I'm working on a content filter (of sorts). I'm trying to catch "asses" as a standalone word without over-filtering pages that only contain it inside a longer, harmless word.