Sunday, August 25, 2013

Java fails to convert char to UTF-8

From wiki:
Consider a text file containing the German word für in the ISO-8859-1
encoding. This file is now opened with a text editor that assumes the
input is UTF-8. As the first byte (0x66) is within the range 0x00–0x7F,
UTF-8 correctly interprets it as an f. The second byte (0xFC) is not a
legal value for the start of any UTF-8 encoded character. A text editor
could therefore replace the byte with the replacement character symbol to
warn the user that something went wrong. The last byte (0x72) is also
within the code range 0x00–0x7F and can be decoded correctly. The whole
string now displays like this: f�r.
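
The behaviour quoted above can be reproduced in Java. This is a minimal sketch, not part of the quoted article: the class and method names are made up for illustration, and it relies on new String(byte[], Charset) substituting U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // The bytes of "für" in ISO-8859-1: f = 0x66, ü = 0xFC, r = 0x72.
    static final byte[] LATIN1_BYTES = {0x66, (byte) 0xFC, 0x72};

    // Decoding with the charset the bytes were written in recovers the word.
    static String decodeAsLatin1() {
        return new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
    }

    // Decoding the same bytes as UTF-8: 0xFC cannot start a UTF-8 sequence,
    // so the String constructor substitutes the replacement character U+FFFD.
    static String decodeAsUtf8() {
        return new String(LATIN1_BYTES, StandardCharsets.UTF_8);
    }

    // Upper-case hex dump of a byte array, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeAsLatin1().equals("f\u00FCr")); // true
        System.out.println(decodeAsUtf8().equals("f\uFFFDr"));   // true
        // The replacement character itself is EF BF BD when written as UTF-8.
        System.out.println(hex("\uFFFD".getBytes(StandardCharsets.UTF_8))); // EFBFBD
    }
}
```

Seeing EFBFBD in UTF-8 output therefore usually means that an earlier decoding step met bytes it could not interpret in the charset it was told to use.
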
I have a Chinese character, 漢. In GB2312 it is 6f22; in UTF-8 it is
e6bca2. But when I try to convert it to Unicode, I get EFBFBD (the
replacement character).
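import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
String correctedstring = new String(bbuf, "GB2312"); // bbuf contains 6f22
Charset charset = Charset.forName("UTF8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder(); // created but never used
ByteBuffer bb = charset.encode(correctedstring); // String -> UTF-8 bytes
CharBuffer cbuf = decoder.decode(bb); // UTF-8 bytes -> chars again
String utf8s = cbuf.toString();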
The result, utf8s, is EFBFBD rather than e6bca2.
Other Chinese characters convert without trouble; only this one fails.
I can't figure out the reason. If anyone knows why this happens and how
to solve it, please let me know. Thanks.
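
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the UTF-8 bytes e6bca2 belong to the character 漢 (U+6F22), so 6f22 matches its Unicode code point, not a GB2312 byte pair. GB2312 double-byte codes use bytes of 0xA1 and above, and 漢 is a traditional form outside the GB2312 repertoire. The class and method names below are hypothetical, and the GB2312 part is guarded because that charset is not guaranteed to exist on every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HanBytesDemo {
    // U+6F22, the character whose UTF-8 encoding is E6 BC A2.
    static final String HAN = "\u6F22";

    // Upper-case hex dump of a byte array, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8Hex() {
        return hex(HAN.getBytes(StandardCharsets.UTF_8));
    }

    // UTF-16BE bytes of a BMP character are simply its code point.
    static String utf16Hex() {
        return hex(HAN.getBytes(StandardCharsets.UTF_16BE));
    }

    // Treat 6F 22 as UTF-16BE (the code point) and convert to UTF-8.
    static String codePointBytesToUtf8Hex() {
        byte[] bbuf = {0x6F, 0x22};
        String s = new String(bbuf, StandardCharsets.UTF_16BE);
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex());                 // E6BCA2
        System.out.println(utf16Hex());                // 6F22
        System.out.println(codePointBytesToUtf8Hex()); // E6BCA2
        if (Charset.isSupported("GB2312")) {
            Charset gb2312 = Charset.forName("GB2312");
            // Both 0x6F and 0x22 are below 0x80, so a GB2312 decoder reads
            // them as the two ASCII characters 'o' and '"'.
            System.out.println(new String(new byte[] {0x6F, 0x22}, gb2312));
            // Reports whether this runtime's GB2312 can represent U+6F22 at all.
            System.out.println(gb2312.newEncoder().canEncode(HAN.charAt(0)));
        }
    }
}
```

If bbuf really holds the code point bytes 6F 22, decoding it as UTF-16BE instead of GB2312 gives e6bca2. If the data truly comes from a GB-encoded source, a wider charset such as GBK or GB18030 may be the one to try, since those include traditional characters.
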
