unistr.c: Enable encoding broken UTF-16 into broken UTF-8, A.K.A. WTF-8.
author Erik Larsson <mechie@users.sourceforge.net>
Fri, 8 Apr 2016 03:39:48 +0000 (05:39 +0200)
committer Erik Larsson <mechie@users.sourceforge.net>
Fri, 8 Apr 2016 03:39:48 +0000 (05:39 +0200)
commit d9c61dd60ec484909f70b7a916ada3a93af94b60
tree 4f287be39bc4bebb85137a9259f01c4f78718ed9
parent ebdff7d4ee4e5bdcc6f08681e5858e9a50058fac

Windows filenames may contain invalid UTF-16 sequences (specifically
broken surrogate pairs), which cannot be converted to UTF-8 if we do
strict conversion.

This patch enables encoding broken UTF-16 into similarly broken UTF-8 by
encoding any surrogate code unit that doesn't have a match as a separate
3-byte UTF-8 sequence.
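
A minimal sketch of the encoding rule described above, not the code in
unistr.c: the function name utf16_to_wtf8 and its signature are hypothetical.
A matched high/low pair is combined into one code point and written as 4
bytes, while a surrogate without a match falls through to the ordinary 3-byte
case.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not the ntfs-3g function).
 * Encodes len UTF-16 code units from in[] into out[] as WTF-8.
 * Returns the number of bytes written, or -1 if out[] is too small.
 */
int utf16_to_wtf8(const uint16_t *in, size_t len,
                  unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < len) {
        uint32_t cp = in[i++];

        /* High surrogate followed by a low surrogate: a valid pair. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i < len &&
            in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
        }
        /* Otherwise cp may be an unmatched surrogate (0xD800..0xDFFF);
         * it is below 0x10000, so it takes the 3-byte branch below. */

        if (cp < 0x80) {
            if (o + 1 > cap) return -1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            if (o + 2 > cap) return -1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (o + 3 > cap) return -1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            if (o + 4 > cap) return -1;
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return (int)o;
}
```

With this sketch, the units 0x0041 0xD800 0x0042 become the bytes
41 ED A0 80 42: the lone high surrogate occupies three bytes of its own.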

This is "sort of" valid UTF-8, but not valid Unicode since the code
points used for surrogate pair encoding are not supposed to occur in a
valid Unicode string... but on the other hand the source UTF-16 data is
also broken, so we aren't really making things any worse.

This format is sometimes referred to as WTF-8 (Wobbly Transformation
Format, 8-bit) and is a common solution for representing broken
UTF-16 as UTF-8.

It is a lossless round-trip conversion, i.e. converting from broken
UTF-16 to "WTF-8" and back to UTF-16 yields the same broken UTF-16
sequence. Because of this property it enables accessing these files
by filename through ntfs-3g and the ntfsprogs (e.g. ls -la works as
expected).
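
The reverse direction of the round trip can be sketched in the same way. The
decoder below is an illustration under stated assumptions, not the ntfs-3g
implementation: the name wtf8_to_utf16 is hypothetical, and it only checks
sequence length and continuation bytes (it does not reject overlong forms).
The point is that a 3-byte sequence in the surrogate range is accepted and
emitted as a single code unit, so the original broken UTF-16 comes back
unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not the ntfs-3g function).
 * Decodes len bytes of WTF-8 from in[] into UTF-16 code units in out[].
 * Returns the number of code units written, or -1 on malformed input
 * or if out[] is too small.
 */
int wtf8_to_utf16(const unsigned char *in, size_t len,
                  uint16_t *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < len) {
        unsigned char b = in[i++];
        uint32_t cp;
        size_t extra;

        /* The lead byte gives the number of continuation bytes. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;

        if (len - i < extra) return -1;          /* truncated sequence */
        while (extra--) {
            if ((in[i] & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (in[i++] & 0x3F);
        }

        if (cp >= 0x10000) {
            /* Supplementary code point: split into a surrogate pair. */
            if (cp > 0x10FFFF || o + 2 > cap) return -1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 + (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            /* Includes 0xD800..0xDFFF: a lone surrogate is passed
             * through as-is instead of being rejected. */
            if (o + 1 > cap) return -1;
            out[o++] = (uint16_t)cp;
        }
    }
    return (int)o;
}
```

Feeding it the bytes 41 ED A0 80 42 yields the units 0x0041 0xD800 0x0042,
i.e. the lone surrogate survives the trip through the 8-bit form.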

To disable this behaviour you can pass the preprocessor/compiler flag
'-DALLOW_BROKEN_SURROGATES=0' when building ntfs-3g.
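
As a rough illustration of how such a compile-time switch typically works
(an assumption about the general pattern, not a copy of unistr.c): the macro
defaults to enabled unless the build overrides it, and the converter consults
it when it meets an unmatched surrogate. The function can_encode_unit and its
parameters are hypothetical.

```c
#include <stdint.h>

/* Assumed pattern: enabled by default, overridable with
 * -DALLOW_BROKEN_SURROGATES=0 on the compiler command line. */
#ifndef ALLOW_BROKEN_SURROGATES
#define ALLOW_BROKEN_SURROGATES 1
#endif

/*
 * Hypothetical check (illustration only): returns 1 if the code unit may
 * be encoded, 0 if a strict conversion should fail. has_partner is
 * nonzero when the unit is part of a correctly matched surrogate pair.
 */
int can_encode_unit(uint16_t unit, int has_partner)
{
    int is_surrogate = unit >= 0xD800 && unit <= 0xDFFF;

    if (!is_surrogate || has_partner)
        return 1;                       /* ordinary unit or valid pair */
    return ALLOW_BROKEN_SURROGATES ? 1 : 0;
}
```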
libntfs-3g/unistr.c