
Word Filter

Last post 12-05-2008, 5:06 PM by mash. 5 replies.
  •  12-05-2008, 12:42 PM 49167

    Word Filter

     I'm creating a filter in Java that filters out words based on skills.

    The skill I'm working on now is called Phonetic Skill #1.  The rule is that a vowel is short if it is followed by a single consonant sound.  I'm using the following regex to determine this.

    ^(y??(\w*?[\w&&[^aeiouy]]+?)*?)([aeiouy])(ch|sh|wh|th|ph|gn|kn|ck|wr|[\w&&[^aeiouyrw]])([\w&&[^aeiou]]+?[aeiouy]+?\w*?)*?$

     This works, with one exception (that I've found so far, anyway): blends (br, cr, sc, sk, bl, pl, etc.) in multi-syllable words such as zebra.  Typically consonants in multi-syllable words split (zeb-ra), but blends (br) stay together (ze-bra).  So my regex accepts this word when it really shouldn't.  How do I modify my regex to factor in the blend exceptions?

     Sorry for the lesson in Phonics, but thanks for any help.

  •  12-05-2008, 1:32 PM 49169 in reply to 49167

    Re: Word Filter

    Well, I'm not 100% clear on what you are trying to match, but this won't match zebra:

    Original Expression ^(y??(\w*?[\w&&[^aeiouy]]+?)*?)([aeiouy](?!br|cr|sc|sk|bl|pl))(ch|sh|wh|th|ph|gn|kn|ck|wr|[\w&&[^aeiouyrw]])([\w&&[^aeiou]]+?[aeiouy]+?\w*?)*?$

    as a Java string "^(y??(\\w*?[\\w&&[^aeiouy]]+?)*?)([aeiouy](?!br|cr|sc|sk|bl|pl))(ch|sh|wh|th|ph|gn|kn|ck|wr|[\\w&&[^aeiouyrw]])([\\w&&[^aeiou]]+?[aeiouy]+?\\w*?)*?$"
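
    For what it's worth, here is a minimal sketch of how the two versions could be compared from Java. The class and method names (BlendLookaheadDemo, original, withLookahead) are made up for illustration; only the two pattern strings come from this thread.

```java
import java.util.regex.Pattern;

class BlendLookaheadDemo {
    // Pattern from the original question (no blend handling).
    static final Pattern ORIGINAL = Pattern.compile(
        "^(y??(\\w*?[\\w&&[^aeiouy]]+?)*?)([aeiouy])"
        + "(ch|sh|wh|th|ph|gn|kn|ck|wr|[\\w&&[^aeiouyrw]])"
        + "([\\w&&[^aeiou]]+?[aeiouy]+?\\w*?)*?$");

    // Same pattern with the negative lookahead added after the vowel class.
    static final Pattern WITH_LOOKAHEAD = Pattern.compile(
        "^(y??(\\w*?[\\w&&[^aeiouy]]+?)*?)([aeiouy](?!br|cr|sc|sk|bl|pl))"
        + "(ch|sh|wh|th|ph|gn|kn|ck|wr|[\\w&&[^aeiouyrw]])"
        + "([\\w&&[^aeiou]]+?[aeiouy]+?\\w*?)*?$");

    // matches() requires the whole word to fit the pattern.
    static boolean original(String word) {
        return ORIGINAL.matcher(word).matches();
    }

    static boolean withLookahead(String word) {
        return WITH_LOOKAHEAD.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"ad", "zebra"}) {
            System.out.println(w + ": original=" + original(w)
                + ", withLookahead=" + withLookahead(w));
        }
    }
}
```

    With these two words, the only difference should be zebra: the original pattern accepts it, and the version with the lookahead rejects it.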


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  12-05-2008, 3:13 PM 49171 in reply to 49169

    Re: Word Filter

    Wow, it worked.

    If it's not evident by now I'm not very experienced in regular expressions.  Can you explain what the added statement does?

    Let me give some examples of what I'm doing.  If you have any suggestions on improving my regex I'd appreciate it.

    ad : true, because it has a vowel followed by a single consonant - [aeiouy][\w&&[^aeiouyrw]]

    add: false, because it has two consonants - $

    ban: true, single consonant after vowel, any consonant can come before the vowel - (\w*?[\w&&[^aeiouy]]+?)*?

    band: false, two consonants

    bath: true, th is a digraph and makes a single sound - [aeiouy](ch|sh|wh|th|ph|gn|kn|ck|wr|[\w&&[^aeiouyrw]]) (r and w are excluded because of words like war (ar) and bow (ow), which are a whole different case)

    mad: true

    maid: false, can't have adjacent vowels - (\w*?[\w&&[^aeiouy]]+?)*?[aeiouy]

    yap: true, y is a consonant at the beginning of a word - (y??(\w*?[\w&&[^aeiouy]]+?)*?)

    When you get into multi-syllable words you have to take each syllable individually

    diner: false, di-ner single consonant goes with next syllable

    dinner: true, din-ner consonants split, then use the above rules on each syllable - ([\w&&[^aeiou]]+?[aeiouy]+?\w*?)*? (tacked on to disregard any following syllables)

    zebra: false, ze-bra typically consonants split, but blends (br) stay together and go with next syllable

     

    Let me know if I can explain the rules or parts of my regex better.

  •  12-05-2008, 3:46 PM 49173 in reply to 49171

    Re: Word Filter

    Buddha24:

    Wow, it worked.

    If it's not evident by now I'm not very experienced in regular expressions.  Can you explain what the added statement does?

     The construction I added to your pattern is called a negative lookahead, written (?!pattern goes here). It basically says: continue trying to complete the match only if the lookahead's pattern does not match at the current position. The lookahead itself consumes no characters. A more detailed explanation can be found here: http://www.regular-expressions.info/lookaround.html

    I simply added it after the character class of vowels with the values you listed as blends.
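
    To see the lookahead in isolation, here is a small sketch that uses only the vowel class and the blend list, without the rest of the pattern. The class and method names are hypothetical, and the vowel class is shortened to [aeiou] for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaheadSketch {
    // A vowel that is NOT immediately followed by one of the listed blends.
    static final Pattern VOWEL_NOT_BEFORE_BLEND =
        Pattern.compile("[aeiou](?!br|cr|sc|sk|bl|pl)");

    // Index of the first such vowel, or -1 if there is none.
    static int firstIndex(String word) {
        Matcher m = VOWEL_NOT_BEFORE_BLEND.matcher(word);
        return m.find() ? m.start() : -1;
    }

    // Text of the first match, or null; the lookahead adds nothing to it.
    static String firstMatch(String word) {
        Matcher m = VOWEL_NOT_BEFORE_BLEND.matcher(word);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"zeb", "zebra", "ebr"}) {
            System.out.println(w + " -> index " + firstIndex(w)
                + ", match " + firstMatch(w));
        }
    }
}
```

    In zebra the e is skipped because br follows it, so the first hit is the final a. The matched text is always a single vowel, because the lookahead only inspects what follows and does not consume it.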


    Michael
  •  12-05-2008, 5:02 PM 49174 in reply to 49173

    Re: Word Filter

    Thanks again!
  •  12-05-2008, 5:06 PM 49175 in reply to 49171

    Re: Word Filter

    Buddha24:

    Let me give some examples of what I'm doing.  If you have any suggestions on improving my regex I'd appreciate it.


    While I'm sure the rules of the English language are too complex to handle purely with a regex, this pattern handles all the test cases you provided. And since you are just dealing with English words, I replaced \w with a-z in the character classes. \w also matches digits, the underscore and uppercase letters (in Java it is ASCII-only by default), which don't seem to be relevant to your task.

    ^(?:(?:[a-z&&[^aeiou]]+|\b)([aeiou](?!br|cr|sc|sk|bl|pl))(?:(?:ch|sh|wh|th|ph|gn|kn|ck|wr)|(?:[a-z&&[^aeiouy]](?![aeiouy]))))*\s*$
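
    As a side illustration of the \w versus a-z swap, this sketch compares the two "consonant" classes on single characters. The class and method names are invented for the example; the two character classes are the ones used in this thread.

```java
import java.util.regex.Pattern;

class ConsonantClassSketch {
    // Intersection: any \w character that is not a lowercase vowel.
    static final Pattern WORD_CONSONANT = Pattern.compile("[\\w&&[^aeiou]]");
    // Intersection: any lowercase ASCII letter that is not a vowel.
    static final Pattern AZ_CONSONANT = Pattern.compile("[a-z&&[^aeiou]]");

    static boolean wordClass(String s) {
        return WORD_CONSONANT.matcher(s).matches();
    }

    static boolean azClass(String s) {
        return AZ_CONSONANT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "a", "7", "_", "B"}) {
            System.out.println("'" + s + "': \\w-based=" + wordClass(s)
                + ", a-z-based=" + azClass(s));
        }
    }
}
```

    Both classes accept b and reject a, but only the \w-based class also accepts 7, _ and B. If mixed-case input is expected, lowercasing the word before matching would be one option.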

    Regular Expression
    Original Expression ^(?:(?:[a-z&&[^aeiou]]+|\b)([aeiou](?!br|cr|sc|sk|bl|pl))(?:(?:ch|sh|wh|th|ph|gn|kn|ck|wr)|(?:[a-z&&[^aeiouy]](?![aeiouy]))))*\s*$
    as a Java string "^(?:(?:[a-z&&[^aeiou]]+|\\b)([aeiou](?!br|cr|sc|sk|bl|pl))(?:(?:ch|sh|wh|th|ph|gn|kn|ck|wr)|(?:[a-z&&[^aeiouy]](?![aeiouy]))))*\\s*$"
    Replacement: (empty)
    groupCount(): 1

    Test  Target String  matches()  replaceFirst()  replaceAll()  lookingAt()  find()  group(0)  group(1)
    1     ad             Yes                                      Yes          Yes     ad        a
    2     add            No         add             add           No           No
    3     ban            Yes                                      Yes          Yes     ban       a
    4     band           No         band            band          No           No
    5     mad            Yes                                      Yes          Yes     mad       a
    6     maid           No         maid            maid          No           No
    7     yap            Yes                                      Yes          Yes     yap       a
    8     diner          No         diner           diner         No           No
    9     dinner         Yes                                      Yes          Yes     dinner    e
    10    zebra          No         zebra           zebra         No           No

     It probably doesn't follow all the rules you will ultimately want to use, but it is at least a starting point.
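
    If it helps, the test cases above could be re-run from Java with something like the following sketch. The class and method names (PhoneticSkillOneCheck, isShortVowelWord, lastVowel) are placeholders; the pattern string is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneticSkillOneCheck {
    // Each repetition of the outer group is one syllable:
    // leading consonants (or a word boundary), a vowel not followed by a blend,
    // then a digraph or a single consonant that is not followed by a vowel.
    static final Pattern SHORT_VOWEL = Pattern.compile(
        "^(?:(?:[a-z&&[^aeiou]]+|\\b)([aeiou](?!br|cr|sc|sk|bl|pl))"
        + "(?:(?:ch|sh|wh|th|ph|gn|kn|ck|wr)|(?:[a-z&&[^aeiouy]](?![aeiouy]))))*\\s*$");

    static boolean isShortVowelWord(String word) {
        return SHORT_VOWEL.matcher(word).matches();
    }

    // group(1) holds the vowel of the last syllable matched, or null if none.
    static String lastVowel(String word) {
        Matcher m = SHORT_VOWEL.matcher(word);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] words = {"ad", "add", "ban", "band", "bath", "mad", "maid",
                          "yap", "diner", "dinner", "zebra"};
        for (String w : words) {
            System.out.println(w + ": " + isShortVowelWord(w)
                + " (group 1: " + lastVowel(w) + ")");
        }
    }
}
```

    Two things to keep in mind with this sketch: the pattern only knows lowercase a-z, and because the syllable group is under *, an empty string also matches, so the caller may want to reject empty input separately.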


    Michael