
beginer advice plz

Last post 07-02-2009, 7:17 PM by Aussie Susan. 6 replies.
  •  06-30-2009, 9:39 AM 54443

    beginer advice plz

    Hi guys,

     

    Hope someone can help me out. I am building a chat filter using regex, and I am trying to ban words along with their derivatives and extensions, such as "duck" (with an F of course, but keeping it clean for the forum). So far I have the following:

     duck(s|er|ed|ers|ing|in|wit|head|heid|heads|less)?

     However, what I am really looking for is a regex which will allow me to ban the following: ddddddduuuuuuuuuucccccccckkkkkkkkkkkk*N (as in any number of repeats of each char in duck) + (s|er|ed|ers|ing|in|wit|head|heid|heads|less)?

    any suggestions? thanks all :)

    Bart

  •  06-30-2009, 10:17 AM 54446 in reply to 54443

    Re: beginer advice plz

    d+u+c+k+

    would match one or more of each character in "duck", in that order.
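
As a quick check of that pattern, here is a minimal Python sketch. It assumes Python's `re` module (the thread does not say which regex engine the chat filter uses), and the helper name `is_stretched_duck` is made up for illustration.

```python
import re

# "duck" stands in for the real word, as in the original post.
stretched = re.compile(r"d+u+c+k+")

def is_stretched_duck(text):
    """True when the whole text is d's, then u's, then c's, then k's."""
    return stretched.fullmatch(text) is not None

for sample in ("duck", "ddduuuucckkk", "duk", "dck"):
    print(sample, is_stretched_duck(sample))
```

Because every letter carries a `+`, each one must appear at least once, so "duk" is not matched by this version.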


  •  06-30-2009, 11:09 AM 54449 in reply to 54446

    Re: beginer advice plz

    thanks for your answer,

     

    And how do I add the derivatives? This is what I currently have; where am I going wrong?

     I have:

    d+(u+c+k+|u+k+)

    How do I link it to (s|er|ed|ers|ing|in|wit|head|heid|heads|less)?

    So that, for example, dduuuuucckkkers is banned?

    I also don't want to ban common words like truck, truckers etc.

    This is what I currently have: (f+(u+c+k+|u+k+))+(s|er|ed|ers|ing|in|wit|head|heid|heads|less) which sometimes works and sometimes does not!

    thanks for all help

  •  06-30-2009, 7:15 PM 54459 in reply to 54449

    Re: beginer advice plz

    Just a thought - can you use a regex to remove duplicated characters as a first pass and then check for whatever words you want in a dictionary-like search? I can see that there will be many root words that you will need to test for and there are a multitude of prefixes and suffixes that could be applied - hence the use of dictionary-search techniques.
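
A rough sketch of that two-pass idea in Python, under stated assumptions: the `BANNED` list, and the helper names `collapse_runs` and `contains_banned`, are hypothetical, and "duck" again stands in for the real word. The first pass collapses every run of a repeated character, the second is a plain set lookup.

```python
import re

def collapse_runs(text):
    """Lower-case the text and replace each run of a repeated character with one copy."""
    return re.sub(r"(.)\1+", r"\1", text.lower())

# Hypothetical banned list. Entries are stored in collapsed form too,
# so a word with a legitimate double letter would still be found.
BANNED = {collapse_runs(w) for w in ("duck", "duk", "ducker", "ducking")}

def contains_banned(message):
    """Collapse the message, split it into words, and look each one up."""
    words = re.findall(r"[a-z]+", collapse_runs(message))
    return any(word in BANNED for word in words)
```

One consequence of this design is that collapsing also alters ordinary words ("good" becomes "god"), which is why the lookup list is collapsed the same way rather than compared against raw spellings.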

    When I am faced with issues about how to achieve some outcome, I sit down and determine how I would do it mechanically. Generally this leads me to creating a state transition diagram which then allows me to create a regex pattern. In this case, where you want to ban certain obscenities but leave other (similar) words alone, you need to work out exactly how you can differentiate between them. Perhaps using the '\b' anchor to try to match the beginning of a word (you may need a more rigorous test) would stop the classic case of "Scunthorpe" being caught as an obscenity.
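
To illustrate the '\b' suggestion, a small sketch assuming Python's `re` module. The test word "geoduck" (a kind of clam) is an example chosen here, not one from the thread; it plays the role "Scunthorpe" plays for the real word.

```python
import re

unanchored = re.compile(r"d+u+c*k+", re.IGNORECASE)
anchored = re.compile(r"\bd+u+c*k+", re.IGNORECASE)  # must start at a word boundary

def flagged(pattern, message):
    """True when the pattern is found anywhere in the message."""
    return pattern.search(message) is not None

print(flagged(unanchored, "fresh geoduck"))  # the stem is found inside a longer word
print(flagged(anchored, "fresh geoduck"))    # \b rejects a match that starts mid-word
print(flagged(anchored, "you duck"))
```

The anchor only guards the start of the word, so words that merely begin with the stem are still caught; that is part of the "more rigorous test" caveat.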

    If you do use regex, then you might be able to factor some of your expressions a bit. For example

    (u+c+k+|u+k+)

    is the same as

    u+(c+)?k+

    which is also the same as

    u+c*k+
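
One way to convince yourself that the three forms agree is a brute-force check over every short string built from the letters u, c and k. This is only a sketch assuming Python's `re` module; the function name `agree_on_all` and the length limit are arbitrary choices.

```python
import re
from itertools import product

patterns = [r"(u+c+k+|u+k+)", r"u+(c+)?k+", r"u+c*k+"]
compiled = [re.compile(p) for p in patterns]

def agree_on_all(max_len=6, alphabet="uck"):
    """True if all patterns accept or reject every string up to max_len alike."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            verdicts = {c.fullmatch(candidate) is not None for c in compiled}
            if len(verdicts) != 1:
                return False
    return True

print(agree_on_all())
```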

    Similarly, a lot of your suffixes are the same as others but with "s" on the end.  Therefore something like

    s|er|ed|ers

    would be (almost) the same as

    (e(r|d))?s?
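
The "(almost)" matters. A short sketch (again assuming Python's `re` module) lists the strings on which the two forms disagree; the candidate list is hand-picked for illustration.

```python
import re

original = re.compile(r"s|er|ed|ers")
factored = re.compile(r"(e(r|d))?s?")

candidates = ["", "s", "er", "ed", "ers", "eds"]

# Keep the candidates that one pattern accepts in full and the other does not.
differences = [
    c for c in candidates
    if (original.fullmatch(c) is None) != (factored.fullmatch(c) is None)
]
print(differences)
```

The factored form additionally accepts the empty string and "eds", because both of its parts are optional.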

    Susan 

  •  06-30-2009, 11:48 PM 54463 in reply to 54459

    Re: beginer advice plz

    You might also need to consider the many variations that can be used to obfuscate the blocked word.

    Using "create variant expression" here:

    http://cs.medicine.ufl.edu/regex/

    With the f-word as the source text (and the 10-character whitespace option on their page), it returns this pattern (not the most efficient pattern, but it is a starting point):

    [f](\W|_){0,10}[uúùûü](\W|_){0,10}([cçg]|(\[|\{|\())(\W|_){0,10}(([k])|(\|))

    But note that no matter what pattern you use for blocking a word the user will simply retype some variant until it is not blocked.

    I would submit that if the base word is blocked there would be no purpose to block the remaining suffix chars of s/er/ers/ed/etc.  Just block the main offensive word and leave the remaining non-offensive characters as-is.
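
A sketch of what blocking only the base word could look like, assuming "blocking" means masking the matched characters with asterisks (the thread does not say how the chat filter reacts to a hit). It uses Python's `re` module; the `mask` helper is hypothetical and "duck" is the stand-in word.

```python
import re

# Base word only: no suffix list at all.
base = re.compile(r"\bd+u+c*k+", re.IGNORECASE)

def mask(message):
    """Replace each match of the base word with asterisks, leaving any suffix alone."""
    return base.sub(lambda m: "*" * len(m.group()), message)

print(mask("you duckers"))
print(mask("DDUUCK head"))
```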


  •  07-01-2009, 5:47 AM 54467 in reply to 54463

    Re: beginer advice plz

    Hi Again,

     

    Thanks for all your advice. Some of it is a bit complex at this stage of my knowledge of regex, but hopefully I'll have a better understanding of it later on.

     

    For now I think I am going to go with something along the lines of:

     

    d+(u+c+k+|u+k+)*(s|er|ed|ers|ing|in|wit|( )?head|( )?heid|( )?heads|less)

    Can someone please check the syntax of the above? It's weird, as the above finds things like duckhead, duck head and ducker, but does not find duck or duk. What is wrong with my syntax?

     

    If I add a second entry just for duck, as shown below, then everything seems to work:

    d+(u+c+k+|u+k+)*(s|er|ed|ers|ing|in|wit|( )?head|( )?heid|( )?heads|less)
    d+(u+c+k+|u+k+)

  •  07-02-2009, 7:17 PM 54524 in reply to 54467

    Re: beginer advice plz

    The problem is that you are telling the regex engine that there must be 1 or more "d"s, that there can be 0 or more repetitions of the next group (the u, c and k letters), but that there MUST be a single instance of the suffix. Try putting a '?' after the ')' that closes the suffix match group, which will make it optional.

    Also, there is no real need to put the space in its own match group as in '( )?head'. ' ?head' will do just as well, unless you really need to know that there is a space for later processing outside of the regex matching.
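
Putting both suggestions together, here is a sketch assuming Python's `re` module. The names `as_posted`, `with_optional_suffix` and `stem_required` are labels invented here. The third variant goes one step beyond the advice above: it drops the `*` after the u/c/k group, since with that `*` in place a lone "d" (or "d" plus a suffix) also matches.

```python
import re

# Suffix list from the thread, using ' ?head' instead of '( )?head'.
SUFFIX = r"(s|er|ed|ers|ing|in|wit| ?head| ?heid| ?heads|less)"

as_posted = re.compile(r"d+(u+c+k+|u+k+)*" + SUFFIX)
with_optional_suffix = re.compile(r"d+(u+c+k+|u+k+)*" + SUFFIX + r"?")
stem_required = re.compile(r"d+u+c*k+" + SUFFIX + r"?")  # assumption: stem is mandatory

def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

for word in ("duck", "duk", "ducker", "duck head", "d"):
    print(word, whole(as_posted, word),
          whole(with_optional_suffix, word), whole(stem_required, word))
```

With the `*` left in, a search with the posted pattern also hits ordinary words such as "wedding" (the "dding" part is read as d's followed by the "ing" suffix), so requiring the stem is probably worth considering alongside the '?' fix.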

    Susan 
