
Polish characters problem

Last post 11-09-2007, 12:59 AM by Aussie Susan. 3 replies.
  •  11-08-2007, 5:22 PM 36351

    Polish characters problem

    Hi,

    I have to match all the words in a text file. I used \w+ for this, but now I have text with Polish characters and the \w+ regex doesn't work correctly: it treats these characters as separators. Please, I need your help!

    Thanks!

     

  •  11-08-2007, 9:40 PM 36357 in reply to 36351

    Re: Polish characters problem

    What is the regex and surrounding language that you are using?

    My guess is that you were dealing with only ASCII characters (or, at least 8-bit characters) and are now trying to process Unicode characters. If that is the case, the options open to you will depend heavily on your answers to the above questions.

    Susan 

  •  11-09-2007, 12:43 AM 36365 in reply to 36357

    Re: Polish characters problem

    Language: Java (Eclipse) //java.util.regex

    OS: WinXP SP2 (PL) 

     

  •  11-09-2007, 12:59 AM 36368 in reply to 36365

    Re: Polish characters problem

    Off the top of my head (can't test this just now) try

    (\p{L}|\p{N})+

    That matches any run of Unicode letters or Unicode numbers. There may be some other combination that is closer to \w (which also matches the underscore), but I would need to look it up. However, given this as a start, I'm sure that you can look up what Java can and cannot handle with this style of shortcut (note that Java only handles a subset of the Unicode properties, scripts and blocks that are available in other regex flavors).
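
    For what it's worth, here is a minimal sketch of how that could look with java.util.regex. The class name PolishWords, the helper words() and the sample sentence are made up for illustration; the character class [\p{L}\p{N}]+ is just a compact way of writing "one or more Unicode letters or numbers". The Polish letters are written as \u escapes so the example does not depend on the encoding of the source file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PolishWords {
    // One or more Unicode letters (\p{L}) or Unicode numbers (\p{N}).
    // Unlike \w, this does not match the underscore.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Hypothetical helper: returns every word-like token found in the text.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "Zażółć gęślą jaźń, 123!" written with Unicode escapes
        String sample = "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144, 123!";
        for (String w : words(sample)) {
            System.out.println(w);
        }
    }
}
```

    With the sample above this should find four tokens (three Polish words and the number), with the comma, spaces and exclamation mark acting as separators. If the words still get split when reading from a file, it is also worth checking that the file is being decoded with the charset it was actually saved in.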

    Susan 
