Java Character Encoding Issues

I looked into this today and am writing down what I found.

The tests use Redis as the channel in the middle; any other kind of I/O could be substituted and would behave the same way.

Test1

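String s1 = "我要测试";
String s2 = "I want to test";
String s3 = "경쟁력, 네이버";
redis.lpush("testencode", s1);
redis.lpush("testencode", s2);
redis.lpush("testencode", s3);
// Pop all three values back (lpush followed by lpop returns them in reverse order)
for (int i = 0; i < 3; i++) {
    System.out.println(redis.lpop("testencode"));
}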

Result: all correct

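Note: Java strings are Unicode internally, so if both the sender and the receiver are written in Java, no manual transcoding is needed (provided both ends use the same default encoding).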

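Java transcodes by default when a String is written to I/O and when it is read back, generally using the system default encoding; the encoding of the source file itself seems to take higher priority.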

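So here the strings are converted to UTF-8 when they are sent and converted from UTF-8 back to Unicode when they are received, which is why nothing goes wrong.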
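The round trip described above can be reproduced without Redis. This is a minimal sketch, not what the Redis client does internally: the class name `Utf8RoundTrip` and the method `roundTrip` are made up for illustration, and the charset is passed explicitly so the result does not depend on the platform default. The Chinese sample is written with `\u` escapes so the source file's own encoding cannot affect it.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Encode the String to UTF-8 bytes (what would go over the wire),
    // then decode those bytes back into a String.
    static String roundTrip(String s) {
        byte[] wire = s.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u6211\u8981\u6d4b\u8bd5" is the same text as s1 in Test1
        String[] samples = {"\u6211\u8981\u6d4b\u8bd5", "I want to test"};
        for (String s : samples) {
            // Prints whether the decoded text equals the original
            System.out.println(s.equals(roundTrip(s)));
        }
    }
}
```

As long as both sides agree on the charset, the text survives unchanged; the byte form only exists in between.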

Test2

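String s1 = "我要测试";
byte[] key = "testencode".getBytes();
byte[] b1 = s1.getBytes("gb2312"); // encode explicitly instead of relying on the default charset
redis.lpush(key, b1);
System.out.println(new String(redis.lpop(key),"gb2312"));
// Alternative that decodes with the default charset and produces garbled text:
//System.out.println(new String(redis.lpop(key)));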

Result: correct

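Note: because the bytes were already encoded as GB2312 when they were sent, they must be decoded as GB2312 when they are received. Using the default (the commented-out line) decodes them with the default encoding, UTF-8 here, and the text comes out garbled.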
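The mismatch can be shown in isolation. The following is a small sketch under the assumption that the JDK in use ships the GB2312 charset (standard JDK distributions do); the class name `CharsetMismatch` and the helper `decode` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatch {
    static final Charset GB2312 = Charset.forName("GB2312");

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String s1 = "\u6211\u8981\u6d4b\u8bd5"; // same text as s1 in Test2
        byte[] b1 = s1.getBytes(GB2312);        // 2 bytes per character in GB2312

        // Matching charset: the original text comes back
        System.out.println(s1.equals(decode(b1, GB2312)));
        // Mismatched charset: the bytes are not valid UTF-8, so the text is garbled
        System.out.println(s1.equals(decode(b1, StandardCharsets.UTF_8)));
    }
}
```

The bytes themselves are never wrong; only the charset chosen for decoding decides whether the text is readable.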

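The conversions above all assume the original encoding is known. Sometimes the receiving end has no way of knowing it, and in that case the encoding has to be detected.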

I used JCharDet for this. Its interface is not well designed and it is rather awkward to use.

Reference: http://blog.csdn.net/chenvsa/article/details/7445569

I modified it slightly:

import java.io.IOException;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class CharsetDetector{
    private boolean found = false;
    private String result;
    private int lang = nsPSMDetector.ALL;

    public String[] detectCharset(byte[] bytes) throws IOException
    {
        String[] prob;
        // Reset state so the same instance can be reused for several inputs
        found = false;
        result = null;
        // Initialize the nsDetector
        nsDetector det = new nsDetector(lang);
        // Set an observer...
        // The Notify() will be called when a matching charset is found.
        det.Init(
            new nsICharsetDetectionObserver(){
                public void Notify(String charset)
                {
                    found = true;
                    result = charset;
                }
            });
        int len = bytes.length;
        // Pure ASCII input needs no further detection
        boolean isAscii = det.isAscii(bytes, len);
        // Feed the bytes to the detector only if they are not pure ASCII
        if (!isAscii){
            det.DoIt(bytes, len, false);
        }
        det.DataEnd();
        if (isAscii){
            found = true;
            prob = new String[] {"ASCII"};
        } else if (found){
            prob = new String[] {result};
        } else {
            prob = det.getProbableCharsets();
        }
        return prob;
    }

    public String[] detectChineseCharset(byte[] bytes) throws IOException
    {
        // Restrict the detector to Chinese charsets before detecting
        lang = nsPSMDetector.CHINESE;
        return detectCharset(bytes);
    }
}

Usage:

CharsetDetector cd = new CharsetDetector();
String[] probableSet = {};

try {
     probableSet = cd.detectChineseCharset(b1);
} catch (IOException e) {
     e.printStackTrace();
}
for (String charset : probableSet)
{
    System.out.println(charset);
}
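When the set of possible encodings is small and known in advance, a dependency-free check is sometimes enough. The sketch below is not how JCharDet works; it swaps in a simpler technique, strict decoding, which tries each candidate charset and rejects those for which the bytes are malformed. The class name `StrictDecodeCheck` and both method names are made up for illustration, and the order of the candidates matters because some byte sequences are valid in more than one charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeCheck {
    // True if the bytes form a valid sequence in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Name of the first candidate that decodes without errors, or null if none does.
    static String firstCleanCharset(byte[] bytes, String... candidates) {
        for (String name : candidates) {
            if (decodesCleanly(bytes, Charset.forName(name))) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "\u6211\u8981\u6d4b\u8bd5"; // same text as s1 above
        byte[] gbBytes = text.getBytes(Charset.forName("GB2312"));
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(firstCleanCharset(gbBytes, "UTF-8", "GB2312"));
        System.out.println(firstCleanCharset(utf8Bytes, "UTF-8", "GB2312"));
    }
}
```

This only proves that the bytes are valid in a charset, not that the charset is the one the sender used, so a statistical detector such as JCharDet is still the better fit when the candidates are unknown.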

posted on 2014-05-14 17:20 fxjwind
