Character Encoding and TEI XML

Hugh A. Cayless

TEI is XML. XML is "plain text." But what does that mean? One meaning is that it is text without the additional apparatus you get in a word processor like MS Word: no differences in font, no bold, italic, or underline, and so on. Simpler. But even plain text is not so simple under the covers.

To a computer, everything is made of numbers. Specifically, numbers in binary (base 2), ones and zeros only. Computers deal with binary numbers in chunks, called bytes, which are 8 binary digits long (leading zeroes are ok). So a byte can be in the range 00000000 (0) to 11111111 (255). That's 256 numbers. These numbers are often written using base 16 (hexadecimal, whose digits are 0-F instead of 0-9) because every byte can be represented as a 2-digit hexadecimal number (00 to FF).
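If you have Python handy, you can see all three notations for yourself. This little sketch is just an illustration, nothing to do with TEI itself:

value = 255                      # the biggest number a single byte can hold
print(value)                     # 255 (decimal)
print(format(value, '08b'))      # 11111111 (binary, padded to 8 digits)
print(format(value, '02X'))      # FF (hexadecimal)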

Representing Characters

So it is unsurprising that plain text itself is just numbers to the computer. Even the digits 0-9 are just numbers (but the character '0' is not the byte 00000000; in ASCII it is number 48). Historically, text has been represented in different ways on the computer, but probably the most basic is called ASCII (American Standard Code for Information Interchange). This uses the numbers from 0 (00000000 - 00) to 127 (01111111 - 7F) to represent the characters you see on an American English computer keyboard. That's fine for most uses—if you're an American writing in English—but it's not enough characters if you want to use other languages with characters like ç or ψ or 丕.

So what to do? Well, ASCII only uses half the numbers available in a byte, so there are another 128 to play with. Windows computers used to use a character encoding scheme called Windows-1252 or CP-1252 (http://en.wikipedia.org/wiki/Windows-1252) that uses all 256 numbers in a byte, so you can type French or German and use symbols like £. A similar system is called ISO-8859-1 (http://en.wikipedia.org/wiki/ISO/IEC_8859-1). This doesn't help if you need to write Greek, or Russian, or Arabic, though. For that, a bunch of other encodings were invented that map, say, Greek characters into the same spaces used for English letters, so Greek alpha (α) would be in the same spot as the English 'a' (decimal 97, hexadecimal 61, binary 01100001), for example. So a Russian or Greek computer would use a different character encoding (and different fonts) to do everything. You could also just make a font (for use in a word processing program) that uses the 256 code points to do (for example) Greek, and this is how people used to write in other languages.
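Here is a quick Python sketch of the same idea (the encoding names are just ones Python happens to support; nothing here is TEI-specific). The point is that a single byte value means different things depending on which encoding you read it with, and that plain ASCII has no number at all for a character like α:

print(ord('a'))                      # 97: the number behind the letter 'a'
print('£'.encode('cp1252'))          # b'\xa3': £ fits in one byte in Windows-1252
print('£'.encode('iso-8859-1'))      # b'\xa3': and in ISO-8859-1 too
print('α'.encode('iso-8859-7'))      # b'\xe1': Greek needs its own single-byte encoding
print(b'\xe1'.decode('iso-8859-1'))  # á: the same byte read as ISO-8859-1
try:
    'α'.encode('ascii')              # ASCII simply has no slot for α
except UnicodeEncodeError as err:
    print(err)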

Multi-byte encodings

This still doesn't help with languages like Chinese, which don't use alphabets: Chinese has something like 40,000 characters, so one byte isn't enough. If you use two bytes (so a 16-digit binary number), then you can have 65,536 characters (1111111111111111 = 65,535), so that's better, and there are a bunch of 2-byte encodings out there for Chinese, Japanese, and Korean.
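For instance (a Python sketch; GB2312 is just one of several East Asian encodings Python ships with), a Chinese character takes two bytes in such an encoding:

encoded = '中'.encode('gb2312')   # a common Chinese character, "middle"
print(encoded)                    # b'\xd6\xd0'
print(len(encoded))               # 2 bytes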

Unicode

But it still wasn't enough to represent all human writing systems. In order to do this, an international standard called Unicode was developed (http://en.wikipedia.org/wiki/Unicode). The latest version of Unicode covers more than 109,000 characters. Like other systems, Unicode specifies what character is to be represented by what number. It is a bit different, however, in that there are a few different ways of encoding that character; that is, there are different ways of writing the number in bytes. The most popular of these is called UTF-8.
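In Python (the unicodedata module is part of the standard library, nothing to do with TEI), you can look up the number and official name Unicode assigns to a character, and go from a number back to its character:

import unicodedata

print(ord('ψ'))                   # 968, hexadecimal 3C8
print(unicodedata.name('ψ'))      # GREEK SMALL LETTER PSI
print(chr(0x4E15))                # 丕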

UTF-8

UTF stands for UCS (Universal Character Set) Transformation Format, and the 8 is for the length of the binary chunks (8-bit bytes) used to encode the characters. Remember, there are 109,000-odd characters in Unicode, and an 8-digit binary number only goes up to 255, so how does this work? For the characters covered by ASCII, UTF-8 is actually identical to ASCII. For characters above 127, UTF-8 uses between 2 and 4 bytes to represent the character (the original design allowed sequences of up to 6 bytes, but no Unicode character needs more than 4). Let's think about this in binary: for characters in the ASCII range, the binary numbers (bytes) all start with zero. For characters above that, UTF-8 uses 2 or more bytes, and the first few digits of the first byte indicate how long the sequence is. If it is a 2-byte sequence, the first byte starts with '110', a 3-byte sequence starts with '1110', and so on. The remaining bytes in the sequence all start with '10'. The rest of the digits in each byte are put together to make up the actual character number in Unicode.
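You can watch the length vary in a short Python sketch (the characters here are picked purely for illustration):

for ch in ['a', 'ç', 'ψ', '“', '𐍈']:
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded)
# a 1 b'a'
# ç 2 b'\xc3\xa7'
# ψ 2 b'\xcf\x88'
# “ 3 b'\xe2\x80\x9c'
# 𐍈 4 b'\xf0\x90\x8d\x88'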

So, for example, the left curly double quote character “ is number 8220 (hexadecimal 201C). In UTF-8, that's:

11100010 10000000 10011100

The first byte starts with '1110', so it's a 3-byte sequence. Strip the 1110 off the front of the first byte, and that leaves 0010. The next two bytes start with '10', as they are supposed to, so we strip those prefixes off too. That leaves:

0010 000000 011100

or, putting the pieces back together, 0010000000011100, which is 8220 in decimal notation.

Now, all this has a couple of interesting implications that are actually relevant to your everyday experience with computers and text. The first is that, because UTF-8 uses these structured byte sequences, the computer can tell when they are broken. If it sees something like 11100010 10000000 11000010, it knows that the third byte of the sequence starting with 11100010 is missing (it should start with '10'), and something has gone wrong.

The second implication, and this is something you're bound to have seen on the web, is that if a program reads text using the wrong encoding, it will display garbage characters instead of what you expect. For example, our UTF-8 left double quote, if read as ISO-8859-1 (or its close relative Windows-1252), is treated as three separate one-byte characters and comes out as something like "â€œ". In a browser, we could "fix" this by changing the encoding the browser uses to read the page. The web server, or the page itself, is supposed to tell the browser which encoding to use, but this doesn't always work properly.
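All of this can be watched in a few lines of Python (a sketch for illustration; the byte values are the same ones worked through above, and Windows-1252 stands in for ISO-8859-1 because that is what browsers commonly substitute for that label):

quote = '\u201C'                  # the left curly double quote, number 8220
encoded = quote.encode('utf-8')
print(encoded)                    # b'\xe2\x80\x9c'
print(' '.join(format(b, '08b') for b in encoded))
                                  # 11100010 10000000 10011100

# Read with the wrong encoding, the three bytes become three garbage characters.
print(encoded.decode('cp1252'))   # â€œ

# A damaged sequence is detectable: the third byte below is not a valid
# continuation byte, so the decoder complains instead of guessing.
try:
    bytes([0b11100010, 0b10000000, 0b11000010]).decode('utf-8')
except UnicodeDecodeError as err:
    print(err)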

Back to XML

XML, like a web page, can tell whatever's reading it what character encoding to use, but the default (what happens if nothing is specified) is UTF-8. So that's what's actually underneath the "plain text" it uses.
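The usual way to specify it is the XML declaration at the very top of the file:

<?xml version="1.0" encoding="UTF-8"?>

If the encoding is left out (or the whole declaration is), a conforming parser assumes UTF-8, unless the first bytes of the file mark it as UTF-16.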