Codepoints and Unifon
Jan. 17th, 2010 12:36 pm
Someone was asking about this, so I thought I would write it up here. I apologise in advance if I'm telling people things they already know.
Computers don't deal with letters directly. Instead they give a number to each letter. This number is called a "codepoint". The most famous set of numbers is ASCII, which only contains 128 codepoints, numbered 0 to 127. For example, 65 in ASCII is "A", and 90 is "Z".
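If you have Python to hand, you can see these numbers for yourself. This is only a small sketch using the built-in `ord` and `chr` functions; the variable names are my own.

```python
# ord() gives the codepoint of a character; chr() goes the other way.
codepoint_of_a = ord("A")
letter_at_90 = chr(90)
print(codepoint_of_a)  # 65
print(letter_at_90)    # Z

# Encoding ASCII text gives one byte per letter, and each byte has
# the same value as the letter's codepoint.
ascii_bytes = list("AZ".encode("ascii"))
print(ascii_bytes)     # [65, 90]
```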
Not so long ago, computers were only capable of dealing with codepoints between 0 and 255. For that reason, when people wanted to use symbols outside ASCII, they either had to use codepoints between 128 and 255 (which ASCII didn't define), or re-use the ASCII codepoints.
Constructed scripts tended to re-use the ASCII codepoints. For example, 65 was simultaneously the Latin letter A, the Shavian letter Ash, and the Unifon letter At. The problem then was that you had to tell the computer somehow whether you were using the constructed script or ASCII. You could do this by switching the font, but it was a bit of a nuisance, and there was no way to construct a single font containing both the ordinary alphabet and the constructed alphabet.
Scripts from natural languages often used the vacant space between 128 and 255, which wasn't quite as bad, but there were two problems. One: scripts like Chinese don't fit in the space. Two: even for scripts that did fit, you could only use one at once. You couldn't mix Latin-alphabet, Hebrew-alphabet, and Cyrillic-alphabet text in the same document, at least not without changing the font in the middle.
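To illustrate the "only one at once" problem, here is a rough sketch that reads the same single byte under three of the old 8-bit character sets. I've picked ISO 8859-1, ISO 8859-5 and ISO 8859-8 as examples of Latin, Cyrillic and Hebrew sets; they are my choice for the demonstration, not the only ones that were in use.

```python
raw = bytes([0xE0])  # one byte with the value 224, in the 128-255 range

# The same number means a different letter in each legacy character set.
as_latin = raw.decode("iso-8859-1")     # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_hebrew = raw.decode("iso-8859-8")    # Hebrew

for label, ch in [("Latin", as_latin), ("Cyrillic", as_cyrillic), ("Hebrew", as_hebrew)]:
    # Show the letter and the Unicode codepoint it eventually received.
    print(label, ch, f"U+{ord(ch):04X}")
```

The byte never changes; only the table used to interpret it does, which is why a document could not mix these scripts without extra information.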
Various solutions were proposed, but eventually everyone got together and invented Unicode, which is a bit like ASCII but hugely larger, with room for over a million codepoints. Now the Latin, Cyrillic, Hebrew, Chinese, and all the other scripts could have their own sets of codepoints and there would be space for everyone.
Some constructed scripts were also given their own space in Unicode. For example, Shavian has the codepoints between 66640 and 66687. Other constructed scripts were not given space in Unicode, although perhaps they will one day: there's a lot of space free.
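As a quick check of the Shavian numbers, this sketch converts them to the usual U+ hexadecimal notation and asks Python's `unicodedata` module what it calls the first one. The constant names are mine, and what the name lookup returns depends on the Unicode data shipped with your Python.

```python
import unicodedata

SHAVIAN_FIRST = 66640
SHAVIAN_LAST = 66687

# The same range written in hexadecimal notation.
shavian_range = f"U+{SHAVIAN_FIRST:04X} to U+{SHAVIAN_LAST:04X}"
print(shavian_range)  # U+10450 to U+1047F

# How many codepoints the range covers, counting both ends.
shavian_count = SHAVIAN_LAST - SHAVIAN_FIRST + 1
print(shavian_count)  # 48

# The official character name of the first codepoint in the range.
first_name = unicodedata.name(chr(SHAVIAN_FIRST))
print(first_name)
```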
However, there is a space in Unicode, between codepoints 57344 and 63743, which is for "private use". Michael Everson has been allocating space here for constructed scripts, so that nobody steps on one another's toes. (Since Mr Everson is very busy, someone else is also running an addendum to his list.)
In Mr Everson's list, Unifon has been given the codepoints between 59200 and 59239 (U+E740 to U+E767 in hexadecimal notation). Given a font with the correct characters at these codepoints (I have one), an entire page of text can be set in the same font and include both Unifon and Latin-alphabet characters (as well as Shavian, Arabic, Hebrew, Chinese...). Since web pages can now embed fonts which will be downloaded by browsers, this means that everything can be done simply, with a single font.
The disadvantage here is that it is no longer possible to type in the Latin alphabet and pretend it's Unifon (or Shavian, or whatever). Instead, you need to set up an input method on your computer, just as you would if you wanted to type in Greek or Russian. It's like graduating to become a "proper" alphabet within the system.
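To make the private-use numbers above concrete, here is a minimal sketch that checks the Unifon range sits inside the private use area and builds a string mixing Latin letters with a Unifon codepoint. The constant and function names are my own, and the Unifon character will only display properly if you have a font with glyphs at those codepoints.

```python
import unicodedata

PUA_FIRST, PUA_LAST = 57344, 63743        # the private use area
UNIFON_FIRST, UNIFON_LAST = 59200, 59239  # Unifon, per the list above

def in_private_use(codepoint):
    """Return True if the codepoint lies in the private use area."""
    return PUA_FIRST <= codepoint <= PUA_LAST

unifon_range = f"U+{UNIFON_FIRST:04X} to U+{UNIFON_LAST:04X}"
print(unifon_range)  # U+E740 to U+E767

# Every Unifon codepoint should fall inside the private use area.
all_private = all(in_private_use(cp) for cp in range(UNIFON_FIRST, UNIFON_LAST + 1))
print(all_private)  # True

# Unicode itself only records these as "Co" (private use); it says
# nothing about which script they belong to.
category = unicodedata.category(chr(UNIFON_FIRST))
print(category)  # Co

# One string can hold both alphabets, because the codepoints no longer clash.
mixed = "Latin A, then Unifon: " + chr(UNIFON_FIRST)
print(len(mixed), ord(mixed[-1]))
```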