Unicode到UTF-8转换的规则见笔记字符编解码的故事，下面将转换过程代码化，以演示如何手工对byte数据进行操作，以及需要注意的问题（字节位的高低定义是：high<-->low）。

package encoding; import java.io.UnsupportedEncodingException; import org.junit.Test; public class Converter { public static void main(String[] args) throws UnsupportedEncodingException { char chnChar = '联'; System.out.println("UTF-8 bytes of " + chnChar + ": " + convUnicode2utf8(chnChar)); } /* * 演示unicode到utf-8的转换过程。 * @param 要进行转换的汉字 * @throws UnsupportedEncodingException * @return 16进制表示的汉字UTF-8编码字节序列 / public static String convUnicode2utf8(char input) throws UnsupportedEncodingException { int lowByte = input & 0xff; int highByte = (input & 0xff00) >>> 8; // 第二次运行时注释掉本行 // byte highByte = (byte) ((input & 0xff00) >>> 8); // 第二次运行时取消注释本行 System.out.println("Unicode bytes of " + input + ": " + Integer.toHexString(highByte) + ", " + Integer.toHexString(lowByte)); // UTF-8的第1个字节是1110 + highByte前4位 int high4inHighByte = highByte >>> 4; System.out.println("highByte>>>4: hex:" + Integer.toHexString(high4inHighByte) + ", demical:" + high4inHighByte); int utf8Byte1 = (7 << 5) + high4inHighByte; // UTF-8的第2个字节是10 + highByte后4位 + lowByte前2位 int low4inHighByte = highByte & 0xf; int high2inLowByte = lowByte >>> 6; int utf8Byte2 = (1 << 7) + (low4inHighByte << 2) + high2inLowByte; // UTF-8的第3个字节是10 + lowByte后6位 int utf8Byte3 = (1 << 7) + (lowByte & 0x3f); String result = Integer.toHexString(utf8Byte1) + ", " + Integer.toHexString(utf8Byte2) + ", " + Integer.toHexString(utf8Byte3); return result; } public static String bytes2HexString(byte[] b) { String ret = ""; for (int i = 0; i < b.length; i++) { String hex = Integer.toHexString(b[i] & 0xFF); if (hex.length() == 1) { hex = '0' + hex; } ret += hex; } return ret; } }

第1次运行结果：

Unicode bytes of 联: 80, 54 highByte>>>4: hex:8, demical:8 UTF-8 bytes of 联: e8, 81, 94

第2次运行结果：

Unicode bytes of 联: ffffff80, 54 highByte>>>4: hex:ffffff8, demical:268435448 UTF-8 bytes of 联: 100000d8, 81, 94

结果分析

当highByte为byte型时（第二次运行），"Integer.toHexString(highByte)"的运行结果是ffffff80，这是由于toHexString(int i)方法会先将i转换为int型，Java中没有无符号数的概念，0x80作为byte型数据的值是-128，转换成int型的-128就是ffffff80。后面的highByte>>>4也一样，移位操作符（>>和>>>）要求左右的操作数是int或者long，highByte首先被转换为int值0xffffff80，然后无符号右移4位，变为0x0ffffff8，即268435448，最后导致utf8Byte1得到错误的值。可见错误的根本原因在于移位运算符对byte型数据按有符号数进行了“私下”转换。

解决方法

当要对数据进行字节位操作时，要特别注意数据是有符号还是无符号的，如果是无符号的（例如汉字编码），并且要进行移位操作（主要是向右移位，向左移位不受是否有符号位影响），应将被处理的byte保存在int型变量中。

参考http://stackoverflow.com/questions/3948220/behaviour-of-unsigned-right-shift-applied-to-byte-variable，移位运算符的说明参考"Java in a Nutsehll"一书中"Bitwise and Shift Operators"一节。

网络传输中的汉字和特殊字符

Java程序用字节数组接收网络传输数据时，字节的值只要超过0x7f，直接打印出来就是负数，这是Java对byte类型的定义（-128~127）造成的，不是错误，需要打印日志或者屏显时用上面代码中的bytes2HexString()打印hex码。

Unicode到UTF-8编码转换的Java实现

结果分析

解决方法

网络传输中的汉字和特殊字符

Published

Last Updated

Category

Tags

Contact