Tuesday, March 31, 2009

Another Chinese Encoding Puzzle

Someone on the Unicode list had a text in which strange escape codes had replaced accented characters. For example, the word "clichés" was printed as clich\x{5ee5}. The escape code presumably represents Unicode U+5EE5, or 廥. How could that happen? It turns out that this character has the code E973 in Big5, and that the bytes E9 73 in Latin-1 are "és". So somehow a Latin-1 text was read as Traditional Chinese in Big5, then read again as Unicode, with the non-Latin bits converted to escape sequences.
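The chain of events can be reproduced in a few lines of Python. This is only a sketch of the presumed mechanism, not a reconstruction of whatever software actually produced the text. The function name garble is made up for illustration, and it assumes that Python's built-in "big5" codec uses the same mapping as the Big5 table consulted above.

```python
def garble(text):
    """Mimic the presumed accident: Latin-1 bytes misread as Big5, then escaped."""
    raw = text.encode("latin-1")      # "és" becomes the two bytes E9 73
    misread = raw.decode("big5")      # E9 73 is taken as one two-byte Big5 character
    # Replace everything outside ASCII with a \x{....} escape (hypothetical format
    # handling, modelled on the escapes seen in the text).
    return "".join(
        ch if ord(ch) < 128 else "\\x{%04x}" % ord(ch)
        for ch in misread
    )

print(garble("clichés"))
```

Note that the "s" disappears together with the "é": Big5 treats E9 as a lead byte and swallows the following byte as the second half of the character.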

To make such a text readable, one can convert the \x{abcd} escapes to the &#xabcd; HTML format, view the text in a browser, copy and paste it into a text document, save it as Big5, and open it as Latin-1.
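The same repair can also be scripted without the browser detour. The following is a minimal sketch in Python. The names unescape and repair are invented here, and it assumes that every escaped character exists in Python's "big5" codec with the same byte values as in the original misreading. Text that contains characters outside Big5 would raise a UnicodeEncodeError.

```python
import re

# Matches escapes of the form \x{5ee5}
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")

def unescape(text):
    """Turn each \\x{abcd} escape back into the character U+ABCD."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def repair(text):
    """Undo the misreading: back to Big5 bytes, then read those bytes as Latin-1."""
    return unescape(text).encode("big5").decode("latin-1")

print(repair(r"clich\x{5ee5}"))
```

Plain ASCII passes through unchanged, since Big5 and Latin-1 agree on those bytes, so the function can be applied to the whole text at once.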

