Character Length in Java

The problem

While learning about character set encodings, I ran into a question about String.length(). The details are as follows:

public class Main {
  public static void main(String[] args) {
    String str = "\uD834\uDD1E"; // 𝄞 (U+1D11E, MUSICAL SYMBOL G CLEF)
    System.out.println(str.length()); // prints 2
  }
}

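Asking for the length of the single character 𝄞 returns 2, which is not what I expected. We usually think of the length of a string as the number of characters in it, but the output shows that this is not what the method measures.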

What does String.length() actually return in Java?

The JDK source documents String.length() as follows:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.

According to this description, String.length() returns the number of Unicode code units, not the number of characters. So what is a code unit?

What is a code unit?

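A code unit is the smallest group of bits that a transformation format (UTF) uses to represent encoded text. Every character (code point) is therefore encoded as a whole number of code units, never a fraction of one.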

Code units in each UTF encoding scheme

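UTF-8: the 8 means that a code unit is 8 bits, i.e. one byte. A character is encoded as one, two, three, or four code units, which is one, two, three, or four bytes.
UTF-16: the 16 means that a code unit is 16 bits, i.e. two bytes. A character is encoded as one or two code units, which is two or four bytes. When we work with UTF-16, a single code unit is the basic unit we operate on.
UTF-32: the 32 means that a code unit is 32 bits, i.e. four bytes. Every character is encoded as exactly one code unit, so it always takes four bytes.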
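The code unit counts listed above can be checked by encoding a string and dividing the byte count by the size of one code unit. The following is a minimal sketch rather than anything from the JDK documentation: the class and method names are made up for illustration, and it assumes the runtime provides the "UTF-32BE" charset (OpenJDK does, but the Java specification does not guarantee it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CodeUnitDemo {
    // Number of 8-bit code units when the text is encoded as UTF-8.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units; UTF_16BE is used because it writes no byte order mark.
    static int utf16Units(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length / 2;
    }

    // Number of 32-bit code units; assumes "UTF-32BE" is available on this runtime.
    static int utf32Units(String s) {
        return s.getBytes(Charset.forName("UTF-32BE")).length / 4;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E
        System.out.println(utf8Units(clef));  // 4
        System.out.println(utf16Units(clef)); // 2
        System.out.println(utf32Units(clef)); // 1
    }
}
```

For a string of well-formed text, utf16Units gives the same number as String.length().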

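The number X in UTF-X is the bit width of that format's code unit. Java's String API is defined in terms of char values, and a char is a 16-bit UTF-16 code unit. How the JVM lays the string out in memory is an implementation detail (recent JDKs may store Latin-1-only strings in one byte per char), but the API always behaves as a sequence of UTF-16 code units, so String.length() returns the number of UTF-16 code units.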

The solution

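If we really want to know how many characters a string contains, length() clearly cannot give the right answer. To get the actual length, that is, the number of code points, use String.codePointCount(beginIndex, endIndex), for example str.codePointCount(0, str.length()). Its Javadoc reads: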

Returns the number of Unicode code points in the specified text range of this String. The text range begins at the specified beginIndex and extends to the char at index endIndex - 1. Thus the length (in chars) of the text range is endIndex-beginIndex. Unpaired surrogates within the text range count as one code point each.
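As a small illustration of the difference between the two counts, the sketch below compares length() with codePointCount() and walks a string one code point at a time. The class and method names are hypothetical; the String and Character calls are standard JDK methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class CodePointCountDemo {
    // Number of code points in the whole string.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Walks the string by code point, advancing 1 or 2 chars per step.
    static List<String> listCodePoints(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        System.out.println(s.length());         // 4
        System.out.println(countCodePoints(s)); // 3
        System.out.println(listCodePoints(s));  // [U+0061, U+1D11E, U+0062]
    }
}
```

Indexing with charAt(i) in such a loop would return the two surrogate halves separately, which is why the loop uses codePointAt and Character.charCount.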

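The JDK source shows that the method counts code points by detecting UTF-16 surrogate pairs: it starts from the number of chars and subtracts one for every high surrogate that is immediately followed by a low surrogate.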

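static int codePointCountImpl(char[] a, int offset, int count) {
  int endIndex = offset + count;
  int n = count; // start by assuming one code point per char
  for (int i = offset; i < endIndex; ) {
    // a high surrogate immediately followed by a low surrogate is one code point
    if (isHighSurrogate(a[i++]) && i < endIndex &&
        isLowSurrogate(a[i])) {
      n--; // the pair occupies two chars but counts once
      i++; // skip the low surrogate
    }
  }
  return n;
}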
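To see where the surrogate pair \uD834\uDD1E comes from, the sketch below splits a supplementary code point into its high and low surrogate by hand, following the standard UTF-16 rule: subtract 0x10000, then put the upper 10 bits after 0xD800 and the lower 10 bits after 0xDC00. The class and method names are hypothetical; in real code Character.toChars does the same job.

```java
// Hypothetical helper class, for illustration only.
class SurrogateDemo {
    // Splits a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D11E);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D834 DD1E
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
    }
}
```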