
Remove \W while ignoring White Space

Last post 08-02-2010, 7:23 AM by Asbo Panda. 4 replies.
  •  07-29-2010, 3:46 AM 70207

    Remove \W while ignoring White Space

    Hi Folks.

    I am not too familiar with regex and I am trying to remove all non-alphanumeric chars while leaving white spaces, in C#.

    Here is my code.

        string searchString = Regex.Replace(search.Text.Trim().ToLower(), @"[\W]", "");
        Response.Redirect(searchString.Replace(" ", "-") + "-search.htm");

    I need to put the - char where there is a space.  So I've looked around a bit and I am still stumped.  I get that [\s] is all white spaces, but I do not get how to make this an exception to the rule [\W].

    I did try  string searchString = Regex.Replace(search.Text.Trim().ToLower(), @"[!\s]|[\W]", "");  This was OK but then !"$£ style chars came through.

    And I also tried string searchString = Regex.Replace(search.Text.Trim().ToLower(), @"[\W]", "", RegexOptions.IgnorePatternWhitespace) but this was wrong as well.

    Help....

    Thanks Panda.

    PS.  What does the @ symbol mean at the start of the pattern?  Lots of examples omitted this, can I?

  •  07-29-2010, 4:51 AM 70212 in reply to 70207

    Re: Remove \W while ignoring White Space

    Hi Asbo,

    The @ at the beginning of the string makes this a verbatim string literal, which means that the text is interpreted exactly as spelled: escape sequences are not processed, so a backslash is just a backslash and you don't have to double it in a regex pattern.
    The only exception is the double quote, which is represented by 2 double quotes.

    See the examples at
    http://msdn.microsoft.com/en-us/library/aa691090%28VS.71%29.aspx
    to get a feeling for what the @ stands for.
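
    A minimal sketch of the difference between the two kinds of literal. The class and field names here are made up for illustration only:

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Regular literal: the backslash has to be doubled to survive C# escaping.
    public const string RegularLiteral = "\\W";

    // Verbatim literal: the backslash is taken exactly as written.
    public const string VerbatimLiteral = @"\W";

    // Inside a verbatim literal a double quote is written as two double quotes.
    public const string Quoted = @"say ""hi""";

    public static void Main()
    {
        // Both literals hold the same two characters: a backslash and a W.
        Console.WriteLine(RegularLiteral == VerbatimLiteral);           // True
        Console.WriteLine(Quoted);                                      // say "hi"

        // So the @ can be omitted, as long as every backslash is doubled.
        Console.WriteLine(Regex.Replace("a!b", RegularLiteral, ""));    // ab
        Console.WriteLine(Regex.Replace("a!b", VerbatimLiteral, ""));   // ab
    }
}
```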

    Back to your original question on how to remove all non-alphanumeric chars while leaving white spaces.
    Can you give an example of the input and how you want it to be output?

    Use the following C# statement :

    resultString = Regex.Replace(subjectString, @"\W(?<!\s)", "");

    Here follows the explanation of the regex \W(?<!\s)

    Match a single character that is a “non-word character” «\W»
    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\s)»
       Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»

    I am using an advanced technique called negative lookbehind, which you may want to read up on to understand the regex better.
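
    As a rough sketch of how that statement behaves, here it is wrapped in a small program. The class name, method name and sample input are assumptions made up for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Removes every non-word character unless that character is whitespace.
    public static string StripPunctuation(string input)
    {
        // \W consumes one non-word character; (?<!\s) then looks back at that
        // same character and rejects the match if it was whitespace.
        return Regex.Replace(input, @"\W(?<!\s)", "");
    }

    public static void Main()
    {
        Console.WriteLine(StripPunctuation("hello, world! 100%"));  // hello world 100
    }
}
```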

    Regards, Tom


     

  •  07-29-2010, 8:02 PM 70244 in reply to 70207

    Re: Remove \W while ignoring White Space

    Panda,

    Can I suggest that you read Michael Ash's blog entry at  http://regexadvice.com/blogs/mash/archive/2008/01/31/A-touch-of-Character-Class.aspx where you can learn how to use character set definitions correctly.

    For example, while not exactly wrong, '[\W]' can be written as '\W'. Also '[!\s]' creates a character set that contains all of the space characters and the exclamation point. I think you want to create a character set that contains all of the characters except those you specify - the correct syntax for that is '[^\s]' with the '^' character as the first inside the set definition. In fact, this set can equally be specified as '\S'.

    By the way, the keyword for you in "IgnorePatternWhitespace" is the "Pattern" - this allows you to put whitespace (and comments, line breaks etc) inside your pattern to make it more readable. It has NOTHING to do with the way the regex engine processes whitespace within the string it is scanning, which seems to be your intent in specifying this option.
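
    A small sketch of what the option does and does not do. The class, constant and method names are invented for illustration, and the commented pattern is just one way of writing "a non-word character that is not whitespace":

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternWhitespaceDemo
{
    // With IgnorePatternWhitespace, unescaped whitespace and #-comments in
    // the PATTERN are skipped, so it can be laid out over several lines.
    private const string CommentedPattern = @"
        \W        # one non-word character
        (?<!\s)   # that is not itself a whitespace character
    ";

    public static string StripWithCommentedPattern(string input)
    {
        return Regex.Replace(input, CommentedPattern, "",
                             RegexOptions.IgnorePatternWhitespace);
    }

    public static string StripWithPlainW(string input)
    {
        // The option does not protect whitespace in the INPUT:
        // a space still matches \W and is removed.
        return Regex.Replace(input, @"\W", "",
                             RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        Console.WriteLine(StripWithCommentedPattern("a b!"));  // a b
        Console.WriteLine(StripWithPlainW("a b!"));            // ab
    }
}
```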

    To remove all non-alphanumeric characters but leave the whitespace characters, you can use a special pattern syntax capability that the .NET regex has by using the pattern:

    [\W-[\s]]

    [Note that the square brackets around the '[\s]' alter the "normal" meaning of the '-' character within a character set from specifying a range (as in '[a-e]' being the first 5 lower case alphabetic characters) to being a character set "subtraction" - as far as I know this is specific to the .NET regex pattern syntax]

    This takes the test string:

    I need to {6^&*$^* put the - Char where .' there !"$£ is a space

    and with a null replacement string, the replace function outputs:

    I need to 6 put the  Char where  there  is a space
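
    For anyone who wants to reproduce this, a minimal sketch follows. The class and method names are assumptions for the example; the pattern and test string are the ones above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubtractionDemo
{
    // [\W-[\s]] = every non-word character, minus the whitespace characters.
    public static string StripNonAlphanumeric(string input)
    {
        return Regex.Replace(input, @"[\W-[\s]]", "");
    }

    public static void Main()
    {
        // The double quote in the test string is doubled because this is a verbatim literal.
        string test = @"I need to {6^&*$^* put the - Char where .' there !""$£ is a space";

        // Spaces on either side of a removed run are kept, so some gaps are two spaces wide.
        Console.WriteLine(StripNonAlphanumeric(test));
        // I need to 6 put the  Char where  there  is a space
    }
}
```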

    By the way, if we take your pattern '[!\s]|[\W]', correct the syntax to '[^\s]|[\W]' and simplify it to '\S|\W', what this says is that it will match any non-space character OR any non-alphanumeric (non-word) character. Every character is one or the other (a whitespace character is itself a non-word character), so the corrected pattern would match everything and the replace would strip the whole string. As you actually wrote it, '[!\s]' only adds the exclamation point and the whitespace characters, all of which '\W' already covers, so your pattern behaves just like '\W' on its own and still removes the spaces.

    However, going back to the original problem statement, we want to match any character that is NOT alphanumeric and NOT whitespace. Turning the problem on its head, we can say we want to locate any character that is alphanumeric or a whitespace and leave it alone. To do this, the pattern:

    [\w\s]

    will match any alphanumeric (or underscore) or whitespace character. In reality we want to match exactly the opposite set of characters so we can use:

    [^\w\s]

    If you use the null replacement string in a replace() operation, you end up with exactly the same as above. This is a simpler representation than above and will work on just about any regex variant.
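
    Putting this together with the hyphen requirement from the original post, one possible sketch is shown here. The class and method names are hypothetical, and collapsing each run of whitespace into a single hyphen is my own assumption about what is wanted, not something stated in the question:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchSlugDemo
{
    // Builds the hyphenated part of the redirect target from raw search text.
    public static string ToSearchSlug(string text)
    {
        // 1. Drop everything that is neither a word character nor whitespace.
        string cleaned = Regex.Replace(text.ToLower(), @"[^\w\s]", "");

        // 2. Trim after stripping, so removed punctuation cannot leave
        //    leading or trailing spaces behind.
        cleaned = cleaned.Trim();

        // 3. Assumption: each run of whitespace becomes one hyphen.
        return Regex.Replace(cleaned, @"\s+", "-");
    }

    public static void Main()
    {
        Console.WriteLine(ToSearchSlug("Hello, World !"));                 // hello-world
        Console.WriteLine(ToSearchSlug("Red & Blue") + "-search.htm");     // red-blue-search.htm
    }
}
```

    Note that '\w' also keeps the underscore, so add it to the negated set's exclusions if it should be removed as well.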

    Susan

  •  07-30-2010, 6:16 AM 70257 in reply to 70244

    Re: Remove \W while ignoring White Space

    Hello Susan,

    Character class subtraction works in .NET and in XML Schema regexes

    http://www.regular-expressions.info/xmlcharclass.html#subtract

    I prefer your regex [^\w\s] since it avoids lookaround

    Regards, Tom Pester

  •  08-02-2010, 7:23 AM 70360 in reply to 70257

    Re: Remove \W while ignoring White Space

    Woof thanks people. I will need to digest this and write a coherent reply.

    I'm nearly there with your answers.  

     

    Hope you all had a good weekend. 
