Unicode is simple

There are many, many introductions to Unicode. Here is another.

Of the various introductions I've seen, one of the best and briefest is Joel Spolsky's aptly named The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). However, these introductions tend not to make what I think is the key observation about encodings:

Unicode and character encodings are really simple, once you have the right concepts.

If you have the wrong concepts – if you think that the words ‘character’ and ‘font’ mean what they obviously mean – then the whole sorry business is maddeningly, perversely confusing.

The short version

- Unicode is a big numbered list of glyphs (abstract characters).
- A glyph's number in that list is its codepoint.
- A text is a sequence of codepoints.
- An encoding (UTF-8, UTF-16, ...) is a recipe for turning a sequence of codepoints into a sequence of bytes, and back again.
- A font is a collection of pictures of glyphs.

We can expand each of those points.

Terminology

There are several layers, and the key terms are glyph, codepoint, encoding and font.

Glyphs and codepoints

Unicode has a specific meaning for the term 'glyph'. A glyph is an abstract character in a writing system. Thus the letter ‘Latin lowercase a’ is a glyph. Different fonts will represent this in different ways – a Times Roman 'a', a sans-serif 'a', and a black-letter 'a' look very different, but are all the same glyph. The 'Latin lowercase a' and the 'Cyrillic lowercase a' may look identical on the page, but they are separate glyphs. Uppercase and lowercase (in writing systems which have that distinction) are different glyphs. Individual Chinese/Japanese ideographs are also glyphs (there's a very long and complicated story about these which we will completely ignore). Klingon and Tengwar aren't in the set of Unicode glyphs: they have been proposed in the past, but no-one has yet volunteered the committee slog required to get a definitive list agreed and taken through the standardisation process.

The Unicode Consortium has collected a large number of such glyphs (120,737 of them, in Unicode version 8.0), and has given each a unique number. Thus 'Latin lowercase a' is glyph number 97 in that list, 'Latin uppercase A' is number 65, and so on.

Each of these numbers is referred to as a 'codepoint'.

So a text, in Unicode terms, is a sequence of glyphs, in the form of a sequence of numeric codepoints. For example, the text "Hello" is the sequence of codepoints 72, 101, 108, 108, and 111.

Notice that we haven't referred to computers at all, so far. You could send a Unicode text to someone else by writing those numbers down on a bit of paper, or by Morse code or semaphore.
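
If you do happen to have a computer to hand, though, it can check those numbers for you. As a minimal sketch, Python's built-in ord and chr functions convert between a one-glyph string and its codepoint:

    # The text "Hello" as a sequence of codepoints
    print([ord(c) for c in "Hello"])   # [72, 101, 108, 108, 111]

    # And back again: codepoint 97 is 'Latin lowercase a'
    print(chr(97), chr(65))            # a A

    # Latin 'a' and Cyrillic 'а' may look identical on the page,
    # but they are distinct glyphs with distinct codepoints
    print(ord('a'), ord('а'))          # 97 1072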

Fonts

A font is a collection of pictures of glyphs. That is, it records (one choice about) what 'Latin lowercase a' actually looks like, so that you know what to put on the paper (or screen) when someone says ‘draw the glyph at Unicode codepoint 97’.

We still haven't talked about computers. If you upended a type case of metal font letters, and scratched a Unicode codepoint number on the side of each ‘sort’ (the metal block with the letter on top), I think you could claim that was ‘a Unicode font’.

Possibly obviously, a particular font will represent only a subset of the huge possible set of codepoints.

Encodings (computers, at last!)

But of course we don't send texts around with Morse code or semaphore, and we don't print them out by picking numbered sorts from a box.

Computers don't store or transmit abstract numbers; they store and transmit bytes. If you want to store a text on disk, or send it across the network, perhaps in the form of a web page, then you will have to turn the sequence of codepoints (abstract numbers, remember) into bytes. The recipe for doing so is an ‘encoding’. UTF-8, UTF-16, and UTF-32 are well-known encodings for Unicode.
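
To see what those recipes actually produce, here is a minimal sketch in Python, whose str.encode method turns a sequence of codepoints into bytes using a named encoding:

    text = "héllo"                     # five codepoints: 104, 233, 108, 108, 111

    # The same five codepoints, three different sequences of bytes
    print(text.encode('utf-8'))        # b'h\xc3\xa9llo' -- 6 bytes
    print(len(text.encode('utf-16')))  # 12 bytes (including a byte-order mark)
    print(len(text.encode('utf-32')))  # 24 bytes (including a byte-order mark)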

Thus if you have ‘a Unicode file’, then the process of reading it consists of determining which encoding it is in, and then using the corresponding recipe to decode the sequence of bytes in the file into a sequence of Unicode codepoints. The machine will have to be somehow told which encoding has been used, if it is to decode the bytes into numbers correctly.
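
A sketch of that decoding step, again in Python; the bytes here stand in for a file's contents, and notice that decoding with the wrong recipe quietly produces the wrong codepoints:

    data = "héllo".encode('utf-8')     # the bytes 'on disk'

    print(data.decode('utf-8'))        # héllo  -- the right recipe
    print(data.decode('iso-8859-1'))   # hÃ©llo -- the wrong recipe: no error
                                       # is raised, but the wrong codepoints
                                       # come back

    # Reading an actual file works the same way: you tell open() which
    # encoding to use (the filename here is made up for illustration):
    #     open('message.txt', encoding='utf-8')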

That's it! You now understand Unicode.

Other ‘encodings’

This is not the only way to do this, of course. The principal other technique for turning characters into bytes is to use a ‘character set’ (which is also sometimes referred to as an ‘encoding’). Such character sets include things like ASCII, or the sequence of ISO-standardised character collections which include, for example, ISO-8859-1 and friends, or ‘Shift JIS’, or ‘MacOS Roman’, or ‘KOI8-R’, or ‘Big5-HKSCS’, or other acronym insanity along those lines.

Each of these works the same way: you identify a set of characters (we should more properly call them glyphs, of course) that are useful in some particular linguistic context, and compile a table which associates one or more bytes with each one. Thus ISO-8859-1 is a list of 191 characters useful for writing western European languages, and KOI8-R is a set of 223 characters for writing Cyrillic. If you read ‘an ISO-8859-1 file’, you read the file byte by byte, and look up the corresponding glyph in the table. If you mistakenly read the same file as ‘a KOI8-R file’, the process would proceed without obvious error, but the glyphs you'd look up would be completely different.
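
Python knows both tables, so here is a minimal sketch of that silent misreading:

    # One byte, two tables, two entirely different glyphs
    print(b'\xe6'.decode('iso-8859-1'))  # æ  (Latin small letter ae)
    print(b'\xe6'.decode('koi8_r'))      # Ф  (Cyrillic capital letter Ef)

    # Neither call raises an error: every byte is 'valid' in both
    # tables, so the misreading proceeds silently, as described above.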

Character sets are referred to as ‘code pages’ in Windows.

Different ideas: platforms and keyboards

All of these ideas are platform-independent. A UTF-8 file (or an ISO-8859-1 file, or...) is just a bunch of bytes. Thus such a file on a Windows machine is the same thing as it is on a Mac, or a Unix box, or an Atari, a BBC Micro, or VMS. The apparent platform dependence of some encodings comes about because the various OS companies developed solutions to this problem independently: Microsoft developed a long list of ‘code pages’ for Windows, and Apple similarly developed collections of encodings for MacOS and OS X. Although there's nothing intrinsically platform-specific about it, a Mac, for example, would typically not know what table to use to decode a file marked as being ‘Windows 20866’ (never mind that this names a lookup table identical to standard KOI8-R).

Unicode (and its encodings) and the various ISO-xxx encodings are international standards; KOI8-R is a Russian standard (defined during the time of the Soviet Union); the Windows and Mac codepages are manufacturer standards. The underlying files are still just sequences of bytes.

These ideas are also distinct from the idea of ‘keyboard layouts’. A keyboard layout has nothing to do with character encoding (although since it's part of the same jigsaw, it's very easy to presume that it does). A keyboard layout simply tells the machine which glyph – which codepoint – to insert when you press a particular physical key on the keyboard.

So...

From the top:

- A glyph is an abstract character in some writing system.
- Unicode is a big list of glyphs, each with a unique number: its codepoint.
- A text is a sequence of codepoints.
- An encoding is a recipe for turning a sequence of codepoints into a sequence of bytes, and back again; a character set does the same job with a lookup table, for a smaller repertoire of glyphs.
- A font is a collection of pictures of glyphs.
- None of this depends on the platform, and none of it has anything to do with keyboard layouts.

Easy, yes?

Norman, 2015 September 19