Corrected UTF-8

UTF-8 is decent and all, but it contains some design errors, partly because its original designers just messed up, and partly because of ISO and Unicode Consortium internal politics. We’re probably going to be using it forever, so it would be good to correct these design errors before they become any more entrenched than they already are.

Corrected UTF-8 is almost the same as UTF-8. We make only three changes: overlength encodings become impossible instead of just forbidden; the C1 controls and the Unicode surrogate characters are not encoded; and the artificial upper limit on the code space is removed.

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Eliminating overlength encodings

The possibility of overlength encodings is the design error in UTF-8 that’s just a plain old mistake. As originally specified, the codepoint U+002F (SOLIDUS, /) could be encoded as the one-byte sequence 2F, or the two-byte sequence C0 AF, or the three-byte sequence E0 80 AF, etc. This led to security holes and so the specification was revised to say that a UTF-8 encoder must produce the shortest possible sequence that can represent a codepoint, and a decoder must reject any byte sequence that’s longer than it needs to be.

Corrected UTF-8 instead adds offsets to the codepoints encoded by all sequences of at least two bytes, so that every possible sequence is the unique encoding of a single codepoint. For example, a two-byte sequence, 110xxxxx 10yyyyyy, encodes the codepoint 0000 0xxx xxyy yyyy plus 160; therefore, C0 AF becomes the unique encoding of U+00CF (LATIN CAPITAL LETTER I WITH DIAERESIS, Ï).
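
In code, the two-byte rule is nothing more than a shift, an OR, and an addition. Here is a sketch in Rust (decode_two_byte is my name for it, not anything normative):

    // A sketch, not a normative implementation.
    fn decode_two_byte(b1: u8, b2: u8) -> u32 {
        // Concatenate the eleven payload bits xxxxx yyyyyy...
        let v = (u32::from(b1 & 0x1F) << 6) | u32::from(b2 & 0x3F);
        // ...then add the two-byte offset.
        v + 160
    }

    // decode_two_byte(0xC0, 0xAF) == 0x00CF, i.e. Ï, as above.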

Not encoding C1 controls or surrogates

The C1 control character range (U+0080 through U+009F) is included in Unicode primarily for backward compatibility with ISO/IEC 2022, an older character encoding standard in which the byte ranges 00 through 1F and 7F through 9F are reserved for control characters.

It is never appropriate to use the C1 controls in text intended for interchange, as they are very likely to be misinterpreted according to one of the DOS code pages that defined bytes 80 through 9F as graphic characters. Corrected UTF-8 skips over them entirely; this is why the offset for two-byte sequences is 160 rather than 128. (I would like to discard almost all of the C0 controls as well—preserving only U+0000 and U+000A—but that would break ASCII compatibility, which is a step too far.) If there is a need to represent U+0080 through U+009F, perhaps for round-tripping historical documents, they can be mapped to some convenient private-use codepoints.

Similarly, the only reason the surrogate space (U+D800 through U+DFFF) exists is to support UTF-16. These codepoints will never appear in well-formed Unicode text, and the current UTF-8 specification actually forbids the three-byte sequences ED A0 80 through ED BF BF from being emitted or accepted at all, rather like the overlength sequences. In Corrected UTF-8, we skip this range just like we do for the C1 controls. (This unfortunately does mean that the three-byte sequences are split into two ranges with two different offsets.) Again, programs that need to represent actual surrogates (perhaps for the same reasons that motivated the creation of WTF-8) can map them into private-use space.
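
In a decoder, the split costs one comparison on the sixteen payload bits. A sketch, using the offsets from the table below:

    // A sketch of the three-byte rule with the surrogate gap.
    fn decode_three_byte(b1: u8, b2: u8, b3: u8) -> u32 {
        let v = (u32::from(b1 & 0x0F) << 12)
              | (u32::from(b2 & 0x3F) << 6)
              | u32::from(b3 & 0x3F);    // sixteen payload bits
        if v < 0xCF60 {
            v + 2_208    // E0 80 80 … EC BD 9F → U+08A0 … U+D7FF
        } else {
            v + 4_256    // EC BD A0 … EF BF BF → U+E000 … U+1109F
        }
    }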

Removing the artificial upper limit

The original design of UTF-8 (as FSS-UTF, by Pike and Thompson; standardized in 1996 by RFC 2044) could encode codepoints up to U+7FFF FFFF. In 2003 the IETF changed the specification (via RFC 3629) to disallow encoding any codepoint beyond U+10 FFFF. This was purely because of internal ISO and Unicode Consortium politics; they rejected the possibility of a future in which codepoints would exist that UTF-16 could not represent. UTF-16 is now obsolete, so there is no longer any reason to stick to this upper limit, and at the present rate of codepoint allocation, the space below U+10 FFFF will be exhausted in something like 600 years (less if private-use space is not reclaimed). Text encodings are forever; the time to avoid running out of space is now, not 550 years from now.

Corrected UTF-8 reverts to the original definition of four-, five-, and six-byte sequences from RFC 2044; after taking the offsets into account, the highest encodable codepoint is U+8421 109F. The encoding scheme could be extended still further by use of the lead bytes FE and FF, which RFC 2044 leaves undefined. FE would begin a seven-byte sequence, and FF would indicate that the unary count of tail bytes extends into the next byte. 1111 1111 110x xxxx would be the first two bytes of an eight-byte sequence, 1111 1111 1110 xxxx would begin a nine-byte sequence, and so on; in this way the encoding scheme would not have any upper limit at all.

We are leaving that extension for the future, because the original rationale for not using bytes FE and FF (avoiding conflicts with UTF-16 byte order marks and Telnet IAC bytes) is still somewhat relevant, even though both UTF-16 and Telnet are obsolete. However, to preserve the possibility of longer byte sequences being used in the future, Corrected UTF-8 decoders MUST treat sequences beginning with FE or FF as reserved for future use and as extending until the next recognized lead byte, rather than as invalid.
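
In practice, that means a decoder that sees FE or FF skips forward to the next byte that can begin a sequence. A sketch (how far “the next recognized lead byte” reaches is my reading of the rule above, not something this document pins down):

    // A sketch; the treatment of FE/FF bytes *inside* a reserved
    // sequence is my interpretation. Given the index of an FE or FF
    // byte, return the index just past the reserved sequence.
    fn skip_reserved(buf: &[u8], start: usize) -> usize {
        let mut i = start + 1;    // consume the FE or FF byte itself
        while i < buf.len() {
            match buf[i] {
                // ASCII bytes and the lead bytes C0 through FD end the
                // reserved sequence; tail bytes (80 through BF) and
                // further FE/FF bytes remain inside it.
                0x00..=0x7F | 0xC0..=0xFD => break,
                _ => i += 1,
            }
        }
        i
    }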

Putting it all together

Here is a complete table of byte sequences up to 6 bytes long, with their offsets and the codepoint ranges they encode. Byte and codepoint values are shown in hexadecimal, offsets in decimal.

Byte Sequence Range                        Offset  Codepoint Range
00 … 7F                                         0  0000 0000 … 0000 007F
C0 80 … DF BF                                 160  0000 00A0 … 0000 089F
E0 80 80 … EC BD 9F                         2 208  0000 08A0 … 0000 D7FF
EC BD A0 … EF BF BF                         4 256  0000 E000 … 0001 109F
F0 80 80 80 … F7 BF BF BF                  69 792  0001 10A0 … 0021 109F
F8 80 80 80 80 … FB BF BF BF BF         2 166 944  0021 10A0 … 0421 109F
FC 80 80 80 80 80 … FD BF BF BF BF BF  69 275 808  0421 10A0 … 8421 109F
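
To make the table concrete, here is a sketch of an encoder covering the whole range; the name, and the use of Option for unencodable codepoints, are illustrative rather than normative:

    // A sketch of an encoder, one match arm per row of the table.
    fn encode(cp: u32) -> Option<Vec<u8>> {
        if cp <= 0x7F {
            return Some(vec![cp as u8]);    // plain ASCII, unchanged from UTF-8
        }
        // (offset, sequence length, lead-byte prefix) per row of the table.
        let (offset, len, prefix): (u32, usize, u8) = match cp {
            0x00A0..=0x089F          => (0x00A0, 2, 0xC0),
            0x08A0..=0xD7FF          => (0x08A0, 3, 0xE0),  // below the surrogate gap
            0xE000..=0x1_109F        => (0x10A0, 3, 0xE0),  // above the surrogate gap
            0x1_10A0..=0x21_109F     => (0x1_10A0, 4, 0xF0),
            0x21_10A0..=0x421_109F   => (0x21_10A0, 5, 0xF8),
            0x421_10A0..=0x8421_109F => (0x421_10A0, 6, 0xFC),
            _ => return None,    // C1 control, surrogate, or beyond U+8421 109F
        };
        let v = cp - offset;
        let mut out = Vec::with_capacity(len);
        out.push(prefix | (v >> (6 * (len - 1))) as u8);
        for k in (0..len - 1).rev() {
            out.push(0x80 | ((v >> (6 * k)) & 0x3F) as u8);
        }
        Some(out)
    }

As a check: encode(0xCF) yields [C0, AF], matching the two-byte example above, while encode(0x88) (a C1 control) and encode(0xD800) (a surrogate) both yield None.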

The eight-byte sequence EF B7 9D ED B2 AE 00 0A is defined as the magic number signaling text using Corrected UTF-8. It SHOULD be present at the beginning of any file encoded in Corrected UTF-8, but need not be prepended to strings whose encoding is known by other means. Like byte order marks in UTF-16, when it appears at the beginning of a file, it should not be considered part of the text.

This byte sequence is the Corrected encoding of the four-codepoint sequence U+10E7D U+ED4E U+0000 U+000A. If interpreted as traditional UTF-8, it instead encodes U+FDDD U+DCAE U+0000 U+000A, which is forbidden on two counts: U+FDDD is a noncharacter and U+DCAE is a surrogate (and an unpaired one at that). U+10E7D is RUMI FRACTION ONE THIRD, and U+ED4E is the private use character assigned by the Under-ConScript Unicode Registry to NIJI CONSONANT CH; these choices are largely arbitrary.
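
Recognizing the magic number requires no decoding at all; a reader can match the raw bytes and drop them. A sketch (MAGIC and strip_magic are my names):

    // A sketch; names are illustrative.
    const MAGIC: [u8; 8] = [0xEF, 0xB7, 0x9D, 0xED, 0xB2, 0xAE, 0x00, 0x0A];

    fn strip_magic(buf: &[u8]) -> &[u8] {
        // The magic number, when present, is not part of the text.
        buf.strip_prefix(&MAGIC[..]).unwrap_or(buf)
    }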

Other legacy control characters

As mentioned above, the major reason why the C0 controls are still encodable in Corrected UTF-8 is to preserve compatibility with ASCII, which is still important. However, these characters are also largely obsolete; the only one that should appear in a normal text file is U+000A. The others’ functions are, nowadays, better handled by binary-safe transport protocols and markup languages, or else they’re simply redundant.

Because of the common use of the lone byte 00 as a string terminator, U+0000 MUST NOT appear in a Corrected UTF-8 document except as part of the magic number defined above. Corrected UTF-8 documents SHOULD conform to the Unix definition of a text file, which means that U+000A is used by itself as a line terminator (NOT a line separator; the last character in the file should be U+000A) and U+000D and U+2028 SHOULD NOT appear. The other C0 controls, and additionally U+2029 PARAGRAPH SEPARATOR, also SHOULD NOT appear.
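
A checker for these rules might look like the following sketch, which operates on already-decoded codepoints (magic number already stripped) and treats the SHOULD NOTs as hard failures; a more lenient implementation could downgrade those to warnings:

    // A sketch of the rules in this section, strictest interpretation.
    fn acceptable_text(cps: &[u32]) -> bool {
        // MUST NOT: U+0000 anywhere outside the magic number.
        if cps.contains(&0x0000) {
            return false;
        }
        // U+000A terminates every line, so a nonempty file ends with it.
        if cps.last().is_some_and(|&c| c != 0x000A) {
            return false;
        }
        // SHOULD NOT: C0 controls other than U+000A (this covers U+000D),
        // plus U+2028 and U+2029.
        !cps.iter().any(|&c| (c <= 0x001F && c != 0x000A) || c == 0x2028 || c == 0x2029)
    }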