
Finding Extended and Unicode Characters

Last post 02-28-2010, 11:42 PM by Aussie Susan. 3 replies.
  •  01-30-2010, 5:31 PM 59122

    Finding Extended and Unicode Characters

    In a list of file names contained in a TXT document that I am editing with Vim, what regular expression can I use to locate those entries containing one or more Unicode characters?

    I would also like to search for those entries containing extended code characters but excluding Unicode characters.

     


    --
    RJ Emery
  •  02-27-2010, 12:42 PM 60182 in reply to 59122

    Re: Finding Extended and Unicode Characters

    Unicode defines a set of characters, not an encoding. I guess you want to recognise non-ASCII encoded text, which might be UTF-16, UTF-8, UTF-7 or whatever else the jungle of standards defines.

    I assume you want to recognise the most popular of them, which is UTF-8. It is defined in RFC 3629 (http://www.faqs.org/rfcs/rfc3629.html), which includes a formal definition of the encoding in ABNF:

    For the convenience of implementors using ABNF, a definition of UTF-8 in ABNF syntax is given here. A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].

       UTF8-octets = *( UTF8-char )
       UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
       UTF8-1      = %x00-7F
       UTF8-2      = %xC2-DF UTF8-tail
       UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                     %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
       UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                     %xF4 %x80-8F 2( UTF8-tail )
       UTF8-tail   = %x80-BF

    I have translated it into PCRE syntax below; you should not find it hard to rewrite for egrep, Vim and other tools.

    Because no multi-byte UTF-8 sequence contains any byte from \x00 to \x7f, the quick check is simply to look for anything outside US-ASCII:

    [^\x00-\x7f]
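
    As a rough illustration of that quick check outside Vim, here is a minimal Python sketch. It assumes the list of file names is read as raw bytes with one entry per line; the function name non_ascii_entries and the sample data are made up for the example.

```python
import re

# Any byte outside the 7-bit US-ASCII range.
NON_ASCII = re.compile(rb"[^\x00-\x7f]")

def non_ascii_entries(data: bytes) -> list:
    """Return the lines of `data` that contain at least one non-ASCII byte."""
    return [line for line in data.splitlines() if NON_ASCII.search(line)]

# Hypothetical sample list: only the middle entry has non-ASCII characters.
sample = "report.txt\nr\u00e9sum\u00e9.doc\nnotes.md\n".encode("utf-8")
for entry in non_ascii_entries(sample):
    print(entry)
```

    This only says that a high byte is present. It does not say whether that byte belongs to UTF-8, Latin-1 or something else.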

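    If you want to properly delimit each multi-byte UTF-8 character (a variable number of bytes per character), use the following. It is written with free spacing, so it needs PCRE's x option or the spaces removed: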

     (  (   (  [\xf1-\xf3][\x80-\xbf]  | \xf0[\x90-\xbf] | \xf4[\x80-\x8f] |  [\xe1-\xec\xee-\xef]  ) [\x80-\xbf]   )   |   [\xc2-\xdf]  |  \xe0[\xa0-\xbf]  |  \xed[\x80-\x9f]  )  [\x80-\xbf]
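
    To try the pattern without PCRE, the sketch below feeds it to Python's re module as a bytes pattern (re.VERBOSE plays the role of the x option). It also attempts the second half of the original question: entries with high bytes that are not well-formed UTF-8 are labelled "extended". The names classify and HIGH_BYTE are invented for the example, and the labelling is a heuristic, since a pair of Latin-1 bytes can happen to look like valid UTF-8.

```python
import re

# Multi-byte UTF-8 sequences only (per the RFC 3629 grammar), as in the post.
UTF8_MULTIBYTE = re.compile(rb"""
    ( ( ( [\xf1-\xf3][\x80-\xbf] | \xf0[\x90-\xbf] | \xf4[\x80-\x8f]
        | [\xe1-\xec\xee-\xef] ) [\x80-\xbf] )
      | [\xc2-\xdf] | \xe0[\xa0-\xbf] | \xed[\x80-\x9f]
    ) [\x80-\xbf]
""", re.VERBOSE)

HIGH_BYTE = re.compile(rb"[\x80-\xff]")

def classify(name: bytes) -> str:
    """Return 'ascii', 'utf8' or 'extended' for one file-name entry."""
    if not HIGH_BYTE.search(name):
        return "ascii"
    # Strip every well-formed multi-byte sequence; any high byte that is
    # left over cannot be part of valid UTF-8.
    leftover = UTF8_MULTIBYTE.sub(b"", name)
    return "extended" if HIGH_BYTE.search(leftover) else "utf8"

print(classify(b"plain.txt"))
print(classify("r\u00e9sum\u00e9.doc".encode("utf-8")))
print(classify("r\u00e9sum\u00e9.doc".encode("latin-1")))
```

    An entry that mixes valid UTF-8 with stray high bytes is reported as "extended" by this sketch.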

    You might also discover that some OSes, tools and utilities put \xfe\xff or \xff\xfe at the start of a file, in which case it is probably UTF-16. See http://www.faqs.org/rfcs/rfc2781.html . UTF-16 encodes each character in 2 or 4 bytes, and a file may begin with one of these "byte order marks".

    (\xfe\xff|\xff\xfe)  at the very start of the file marks UTF-16 encoded Unicode.
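
    A byte order mark test does not really need a regex. As a small sketch, assuming the file content is available as bytes, Python's codecs module already defines both two-byte marks; the function name utf16_bom is made up for the example.

```python
import codecs
from typing import Optional

def utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 flavour suggested by a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # b"\xfe\xff"
        return "utf-16-be"
    # Caveat: the UTF-32 little-endian mark (ff fe 00 00) also starts with
    # these two bytes, so this is only a hint.
    if data.startswith(codecs.BOM_UTF16_LE):   # b"\xff\xfe"
        return "utf-16-le"
    return None

print(utf16_bom(b"\xff\xfea\x00b\x00"))
print(utf16_bom(b"plain ascii"))
```

    A file without a mark can still be UTF-16, so a None result here proves nothing on its own.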

     

    Since you didn't provide an example of what you wanted to match, and the moderator will kill me for responding to a request that doesn't comply with the guidelines, can I suggest you check exactly what you wish to catch with a regexp?

    Dump your .txt file using a Unix/Linux tool like od:

    od -c -x file.txt | more

    If you see 00 bytes alternating with something else, the file is likely to be UTF-16. If instead you find scattered bytes of 0x80 or above, in pairs or triples, it is probably UTF-8.
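
    The same eyeball test can be roughed out in code. This is only a sketch of the heuristic described above, not a real encoding detector: the function name guess_encoding and the 0.3 threshold for the share of 00 bytes are arbitrary assumptions.

```python
def guess_encoding(data: bytes) -> str:
    """Very rough guess: 'ascii', 'utf-16?', 'utf-8' or 'other 8-bit'."""
    if not data:
        return "ascii"
    # Mostly-Latin text in UTF-16 has a 00 byte in nearly every 2-byte unit.
    if data.count(0) / len(data) > 0.3:   # threshold is an arbitrary assumption
        return "utf-16?"
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")              # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "other 8-bit"

name = "r\u00e9sum\u00e9.doc"
for codec in ("ascii", "utf-8", "utf-16-le", "latin-1"):
    raw = name.encode(codec, errors="replace")
    print(codec, "->", guess_encoding(raw))
```

    UTF-16 text that is mostly non-Latin has few 00 bytes and would slip past the first check, which is one reason to look at the od output yourself.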

     

  •  02-28-2010, 4:21 PM 60196 in reply to 60182

    Re: Finding Extended and Unicode Characters

    Peadarin:

    Since you didn't provide an example of what you wanted to match, and the moderator will kill me for responding to a request that doesn't comply with the guidelines, can I suggest you check exactly what you wish to catch with a regexp?

    LOL  Don't be silly. First offense is merely 20 lashes. ;-)

    As your response is based on guesses and assumptions, it just points out that the more of the info requested in the posting guidelines is provided, the less guesswork is needed to give an answer.

     


    what regular expression can I use to locate those entries containing one or more Unicode characters?

    Your question as stated is basically the same as saying you want to match at least one character. Every line in this message contains at least one Unicode character.

    So unless you are looking for Klingon or some such language, the characters you are looking for are likely part of Unicode as well. I suspect Peadarin's assumptions are correct and that your description of what you want to match is poorly phrased. You need to be more specific in your criteria.


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  02-28-2010, 11:42 PM 60204 in reply to 59122

    Re: Finding Extended and Unicode Characters

    To add my 2c, and not knowing the regex variant that Vim uses: you need to be careful how you go about this. Some regex variants work on bytes regardless of the encoding (i.e. a 16-bit character can be seen as 2 bytes), others work on characters (each combination of Unicode values and modifiers is seen as a single character no matter how many bytes are needed to represent it), and some do both (PCRE, for example, if compiled with the necessary options, will work on characters within the string but use and report byte offsets in the API).
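
    To make the byte versus character distinction concrete, here is a small Python sketch (Python is just a convenient stand-in, not what Vim or PCRE do internally). The same name is searched once as text and once as UTF-8 bytes, and the counts and offsets differ.

```python
import re

name = "r\u00e9sum\u00e9"       # 6 characters: r, é, s, u, m, é
raw = name.encode("utf-8")       # 8 bytes: each é takes two bytes in UTF-8

# "." matches one character in a str pattern but one byte in a bytes pattern.
char_count = len(re.findall(r".", name))
byte_count = len(re.findall(rb".", raw))

# The same literal is found at different offsets in the two views.
char_offset = re.search("sum", name).start()
byte_offset = re.search(b"sum", raw).start()

print(char_count, byte_count)    # 6 8
print(char_offset, byte_offset)  # 2 3
```

    Note that Python's str patterns work on code points, so a letter followed by a separate combining accent still counts as two characters here.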

    Susan
