Java Modified UTF-8 Strings In Python
Solution 1:
In most cases you can ignore Modified UTF-8 (MUTF-8) and just treat the data as UTF-8. On the Python side, you can handle it like this:
- Convert the string into normal UTF-8 and store the bytes in a buffer.
- Write the buffer length in bytes (not the string length) as a 2-byte big-endian integer.
- Write the whole buffer.
I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).
MUTF-8 is mainly used for JNI and other systems with null-terminated strings. It differs from normal UTF-8 in two ways. First, U+0000 is encoded in 2 bytes (0xC0 0x80) instead of 1 byte (0x00). U+0000 is a valid code point, but it rarely appears in ordinary text, and DataInputStream.readUTF() doesn't enforce the encoding, so it happily accepts either one. Second, characters above U+FFFF are encoded as a surrogate pair of two 3-byte sequences instead of one 4-byte sequence. readUTF() rejects the 4-byte form, so the shortcut above is only safe for text in the Basic Multilingual Plane.
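To make the byte-level difference concrete, here is a minimal sketch of a MUTF-8 encoder. The names `encode_mutf8` and `_three_bytes` are made up for illustration and are not part of any library. Besides the U+0000 case, the sketch also covers characters above U+FFFF, which the Java format stores as a surrogate pair with each surrogate written in 3 bytes.

```python
def _three_bytes(unit):
    # Hypothetical helper: encode one 16-bit value in the 3-byte UTF-8 form.
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def encode_mutf8(text):
    """Encode a str as MUTF-8 payload bytes (no length prefix)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'             # U+0000 takes 2 bytes, never 0x00
        elif cp < 0x80:
            out.append(cp)                 # plain ASCII, same as UTF-8
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))   # 2-byte form, same as UTF-8
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)        # 3-byte form, same as UTF-8
        else:
            # Above U+FFFF: split into a UTF-16 surrogate pair and write
            # each surrogate as 3 bytes (6 bytes total, not 4).
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)
```

For text that contains neither U+0000 nor characters above U+FFFF, the result is byte-for-byte identical to `text.encode('utf-8')`, which is why the plain UTF-8 shortcut usually works.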
EDIT: The Python code should look like this,
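import struct

def writeUTF(data, text):
    # 'data' is a list of byte chunks; 'text' must not shadow the built-in str().
    utf8 = text.encode('utf-8')
    length = len(utf8)
    # 2-byte big-endian byte count, followed by the raw UTF-8 bytes.
    data.append(struct.pack('!H', length))
    fmt = '!' + str(length) + 's'
    data.append(struct.pack(fmt, utf8))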
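A variant of the same idea, sketched under the assumption that you write to a binary file-like object rather than a list. The name `write_utf` is hypothetical. It also guards the 2-byte length prefix, which cannot describe more than 65535 bytes.

```python
import io
import struct


def write_utf(stream, text):
    """Write text as a 2-byte big-endian length followed by UTF-8 bytes."""
    payload = text.encode('utf-8')
    if len(payload) > 0xFFFF:
        # The length prefix is an unsigned 16-bit integer.
        raise ValueError('encoded string exceeds 65535 bytes')
    stream.write(struct.pack('>H', len(payload)))
    stream.write(payload)


buf = io.BytesIO()
write_utf(buf, 'h\u00e9llo')
print(buf.getvalue())  # b'\x00\x06h\xc3\xa9llo'
```

The prefix is 6 rather than 5 here because the accented character takes two bytes, which is the byte-length versus string-length distinction from the steps above.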
Solution 2:
I know this question is very old, but I still want to contribute, since I ran into the same problem and solved it.
I found the implementation of this modified UTF-8 in the OpenJDK sources and translated it to Python. Here is a link to the gist I created.
Solution 3:
Okay, if you need to read the format of DataInput.readUTF, I suspect you'll just have to convert the (well-documented) format into Python.
It doesn't look like it would be particularly hard to do. After reading the length and then the binary data itself, I suggest you use a first pass to work out how many Unicode characters will be in the output, then construct a string accordingly in a second pass. Without knowing Python I don't know the ins and outs of how to efficiently construct a string, but given the linked specification I can't imagine it would be very hard. You might want to look at the source for the existing UTF-8 decoder as a starting point.
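As a rough illustration of that approach, here is a minimal Python sketch of a reader. The name `read_utf` is hypothetical, and instead of the two passes suggested above it collects UTF-16 code units in a single pass and lets Python's `utf-16-be` codec join surrogate pairs. It assumes well-formed input and does not validate continuation bytes.

```python
import io
import struct


def read_utf(stream):
    """Read one string in the 2-byte-length + MUTF-8 layout from a binary stream."""
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError('missing length prefix')
    (length,) = struct.unpack('>H', header)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError('truncated string payload')

    units = []  # UTF-16 code units, like Java's char[]
    i = 0
    while i < length:
        b = data[i]
        if b < 0x80:                       # 0xxxxxxx: one byte
            units.append(b)
            i += 1
        elif b >> 5 == 0b110:              # 110xxxxx: two bytes (includes C0 80)
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif b >> 4 == 0b1110:             # 1110xxxx: three bytes
            units.append(((b & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
        else:                              # 10xxxxxx or 1111xxxx is not allowed
            raise ValueError('malformed input at byte %d' % i)

    # Pack the code units as UTF-16 and decode; surrogate pairs become
    # single characters, and lone surrogates are passed through.
    raw = struct.pack('>%dH' % len(units), *units)
    return raw.decode('utf-16-be', 'surrogatepass')
```

For example, `read_utf(io.BytesIO(b'\x00\x02\xc0\x80'))` yields a one-character string containing U+0000.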
Solution 4:
Maybe this can help you, although it looks like it's the reverse of what you're doing:
Solution 5:
There's a Python package that handles both reading and writing MUTF-8 strings, with an optional C extension: https://github.com/TkTech/mutf8
from mutf8 import encode_modified_utf8, decode_modified_utf8
# Names chosen so they don't shadow the built-in bytes type.
text = decode_modified_utf8(byte_like_object)
data = encode_modified_utf8(text)