Wednesday, November 22, 2006

Why Does My French Turn Into Chinese?

One of the common questions I see in the Apple Mail forum is from people using European languages who find their messages contain strange Chinese characters when received on a PC with Windows Outlook.

Here is my understanding of how this happens.

Certain kinds of messages are sent by Mail with two copies -- one in plain text with the charset UTF-8, and one in HTML with the charset Latin-1 (ISO-8859-1). There appear to be two bugs in Outlook. The first one causes it to confuse the two encodings and read Latin-1 characters beyond ASCII in the HTML copy as if they were UTF-8. So, for example, in the French phrase

pensé qu'il

it sees the é + space + q as a series of 3 bytes, E9 20 71, forming one character. (In UTF-8, a lead byte in the range E0 through EF signals a 3-byte character.)
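To see the bytes involved, here is a small Python sketch (Python 3.8 or later). It encodes the phrase as Latin-1 and checks how long a sequence a UTF-8 decoder would expect after the é byte. The helper name utf8_sequence_length is my own, made up for illustration; it is not part of any library.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 decoder expects for this lead byte.

    Returns 0 if the byte cannot start a UTF-8 sequence.
    (Hypothetical helper, written only for this illustration.)
    """
    if lead < 0x80:
        return 1            # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3            # E9 falls in this range
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0                # continuation byte (80-BF) or never-valid byte


phrase = "pensé qu'il"
latin1_bytes = phrase.encode("latin-1")

# Hex dump of the Latin-1 copy: 70 65 6E 73 E9 20 71 75 27 69 6C
hex_dump = latin1_bytes.hex(" ").upper()
print(hex_dump)

# Position of the é byte (E9) and the sequence length it announces in UTF-8
e_acute_index = latin1_bytes.index(0xE9)
expected_length = utf8_sequence_length(latin1_bytes[e_acute_index])
print(expected_length)

# The three bytes that get treated as one character: E9 20 71
print(latin1_bytes[e_acute_index:e_acute_index + expected_length].hex(" ").upper())
```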

E9 20 71 is not in fact a valid UTF-8 sequence, but Windows or Outlook has another bug: It doesn't care whether the sequence is valid or not. It looks at the binary for the last two bytes of this sequence, which is

(E9) 00100000 01110001

and reads only the low 6 bits of each of them, assuming that the top 2 bits are 10 (which is what every continuation byte in valid UTF-8 has) instead of 00 and 01. So it interprets this as (E9) 10100000 10110001 or E9 A0 B1, which is valid UTF-8 for 頱 (U+9831). Thus "pensé qu'il" becomes "pens頱u'il."
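The behavior described above can be imitated in a few lines of Python. This is only a sketch of my understanding, not Outlook's actual code: the function lenient_utf8_decode is a made-up name, it handles only the 3-byte case discussed here, and it passes every other byte through as Latin-1.

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Imitate a decoder that never validates continuation bytes.

    Hypothetical model of the bug: after a 3-byte lead (E0-EF) it takes
    the low 6 bits of each of the next two bytes, whatever their top bits are.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xE0 <= b <= 0xEF and i + 2 < len(data):
            code_point = (
                ((b & 0x0F) << 12)               # low 4 bits of the lead byte
                | ((data[i + 1] & 0x3F) << 6)    # low 6 bits, top 2 ignored
                | (data[i + 2] & 0x3F)           # low 6 bits, top 2 ignored
            )
            out.append(chr(code_point))
            i += 3
        else:
            out.append(chr(b))                   # pass through as Latin-1
            i += 1
    return "".join(out)


latin1_bytes = "pensé qu'il".encode("latin-1")

# A strict decoder rejects E9 20 71 outright.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# The lenient model swallows é, the space, and q into one character.
garbled = lenient_utf8_decode(latin1_bytes)
print(garbled)

# Same character as the valid UTF-8 sequence E9 A0 B1.
print(bytes([0xE9, 0xA0, 0xB1]).decode("utf-8"))
```

A strict decoder would either report an error or substitute a replacement character at the E9 byte and leave the space and the q alone, so only the accented letter would be damaged.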

Other accented characters may give different results, including question marks or complete absence of the character.

I don't know whether Vista will have the same behavior.

For fixes for this problem, see this note.
