Working Around JNI UTF-8 Strings
To access the string, the C++ side retrieves its bytes with the JNI function GetStringUTFChars(), like so:
JNIEXPORT void JNICALL
Java_Example_printString(JNIEnv *env, jclass, jstring text) {
  const char* text_input = env->GetStringUTFChars(text, NULL);
  for (int i = 0; text_input[i] != 0; ++i) {
    printf("jni[%d] = %x\n", i, ((unsigned char *) text_input)[i]);
  }
  env->ReleaseStringUTFChars(text, text_input);
}
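For context, the Java side that produces the run below looks something like this (a sketch reconstructed from the later examples; the library name in loadLibrary is an assumption):

class Example {
    static {
        System.loadLibrary("example");  // assumed library name
    }

    private static native void printString(String text);

    void examplePrintString() {
        String str = "A\u00ea\u00f1\u00fcC";  // "AêñüC"
        System.out.println("String = " + str);
        printString(str);
    }
}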
In a sample run, I get the following output:
String = AêñüC
jni[0] = 41
jni[1] = c3
jni[2] = aa
jni[3] = c3
jni[4] = b1
jni[5] = c3
jni[6] = bc
jni[7] = 43
The five-character string “AêñüC” occupies eight bytes under UTF-8, because the three accented characters take two bytes each (ê = c3 aa, ñ = c3 b1, ü = c3 bc) while A (41) and C (43) take one byte each.
This works fine in this example. What isn’t yet apparent is that the UTF-8 strings JNI hands you are not standard UTF-8, but modified UTF-8. According to the JNI spec:
There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
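You can observe both differences without touching JNI, because DataOutputStream.writeUTF is documented to emit the same modified UTF-8. Here is a minimal, self-contained sketch comparing the two encodings for a string containing an embedded null and the same supplementary character used in the next example:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        // U+2070E (a supplementary character) followed by an embedded null.
        String s = "\uD841\uDF0E\u0000";

        // Standard UTF-8: f0 a0 9c 8e 00 (a four-byte form and a one-byte null).
        dump("standard", s.getBytes(StandardCharsets.UTF_8));

        // writeUTF is documented to emit modified UTF-8, prefixed with a
        // two-byte length that we skip below.
        // Result: ed a1 81 ed bc 8e c0 80 (two three-byte forms, a two-byte null).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(s);
        byte[] withLength = buf.toByteArray();
        dump("modified", Arrays.copyOfRange(withLength, 2, withLength.length));
    }

    private static void dump(String label, byte[] bytes) {
        System.out.print(label + ":");
        for (byte b : bytes) System.out.printf(" %02x", b & 0xff);
        System.out.println();
    }
}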
If there’s a technical reason JNI does not use standard UTF-8, I have not seen it discussed, and I cannot fathom what it would be. A case can be made for avoiding embedded nulls, but that’s easy to work around by carrying an explicit length instead of relying on a null terminator to mark the end. The avoidance of four-byte UTF-8 sequences seems more mysterious.
Here’s an example of passing a valid four-byte character. The Java routine now builds the string explicitly:
import java.nio.charset.StandardCharsets;

class Example {
    ...
    void examplePrintString() {
        byte[] bb = new byte[4];
        bb[0] = (byte) 0xf0;
        bb[1] = (byte) 0xa0;
        bb[2] = (byte) 0x9c;
        bb[3] = (byte) 0x8e;
        // StandardCharsets.UTF_8 avoids the checked exception thrown by
        // the String(byte[], String) constructor.
        String str = new String(bb, StandardCharsets.UTF_8);
        System.out.println("String = " + str);
        printString(str);
    }
}
And the output is now:
String = <unprintable>*
jni[0] = ed
jni[1] = a1
jni[2] = 81
jni[3] = ed
jni[4] = bc
jni[5] = 8e
* This blog can’t handle that character.
The Java example sets the four bytes of the character explicitly, so it is obvious the character was converted: standard UTF-8’s f0 a0 9c 8e became the six-byte sequence ed a1 81 ed bc 8e, which is the character’s two UTF-16 surrogates (d841 and df0e) each encoded in the three-byte form.
Suppose your native function relied on a string-processing library to manipulate the strings coming from the Java call, and suppose that library expects and produces standard UTF-8, because why would it not use the standard? And suppose it reacted unpredictably when handed the non-standard, or more politely, “modified” encoding: at best it discards the bytes it can’t interpret; at worst it crashes. The reverse direction has the same problem: when passing strings from native code back to Java, JNI definitely crashes if they are not correctly modified UTF-8.
Chances are you’d never encounter the lurking problem, because four-byte characters seem sufficiently rare. But I wouldn’t want to rely on the scarcity of these characters to avoid a potential bug. As I’ve learned from running code that drives popular websites, given enough data even the unlikeliest bugs become commonplace.
So how do you work around this without writing a converter in native code? It turns out that converting to UTF-8 in Java (as opposed to JNI) produces the standard encoding. The workaround, therefore, is to convert in Java and send a byte array in lieu of a String.
Now, the Java example looks like:
import java.nio.charset.StandardCharsets;

class Example {
    ...
    private static native void printBytes(byte[] text);
    ...
    void examplePrintString() {
        byte[] bb = new byte[4];
        bb[0] = (byte) 0xf0;
        bb[1] = (byte) 0xa0;
        bb[2] = (byte) 0x9c;
        bb[3] = (byte) 0x8e;
        String str = new String(bb, StandardCharsets.UTF_8);
        System.out.println("String = " + str);
        printBytes(str.getBytes(StandardCharsets.UTF_8));  // Do the conversion here.
    }
}
JNIEXPORT void JNICALL
Java_Example_printBytes(JNIEnv *env, jclass, jbyteArray text) {
  jbyte* text_input = env->GetByteArrayElements(text, NULL);
  jsize size = env->GetArrayLength(text);
  for (int i = 0; i < size; ++i) {
    printf("bytes[%d] = %x\n", i, ((const unsigned char *) text_input)[i]);
  }
  // Mode 0 copies back and frees; the mode parameter is a jint, not a pointer.
  env->ReleaseByteArrayElements(text, text_input, 0);
}
This now prints the expected four bytes:
String = <unprintable>
bytes[0] = f0
bytes[1] = a0
bytes[2] = 9c
bytes[3] = 8e
For these reasons, when the native code works with a UTF-8 library, I prefer passing a byte array rather than a String from Java.
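The same approach covers the native-to-Java direction mentioned above: instead of building a jstring with NewStringUTF (which requires modified UTF-8), the native side can return a byte array built with NewByteArray and SetByteArrayRegion, and the Java side does the decoding. A sketch, where the readBytes() native method is hypothetical:

import java.nio.charset.StandardCharsets;

class Example {
    // Hypothetical native method that returns standard UTF-8 bytes,
    // e.g. built in C++ with NewByteArray and SetByteArrayRegion.
    private static native byte[] readBytes();

    String exampleReadString() {
        // Decode in Java, where standard UTF-8 is handled correctly.
        return new String(readBytes(), StandardCharsets.UTF_8);
    }
}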