Jun. 15th, 2010

marnanel: (Default)
(You may know this already, but if not I thought you might be interested.)

With reference to your column at http://www.drdobbs.com/blog/archives/2010/06/unicode_and_the.html : the reason the translator at pinyin.info choked on the Shavian characters you gave it is because all Shavian characters have codepoints above 0xFFFF, and therefore (if you're using UTF-16, which the pinyin.info translator appears to be) they won't fit in a single word and will have to be represented using surrogate pairs. Wikipedia has a reasonable coverage of surrogate pairs: http://en.wikipedia.org/wiki/Surrogate_pair , but briefly, it's a way to represent a Unicode character whose codepoint is too high by using a pair of otherwise illegal characters, both of whose codepoints are low enough. Hence the effect you noted of having "the wrong codes, and twice too many of them".

The fault is presumably with the pinyin.info translator, which shouldn't give out surrogate pairs unless explicitly asked, but it does go to show that, as Wikipedia puts it, "code is often not tested thoroughly with surrogate pairs. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software", or as you put it, "computing still is not mature".

Thomas (author of the transliterator script on shavian.org)

Profile

marnanel: (Default)
Monument

January 2022

S M T W T F S
      1
2345678
9101112131415
1617 1819202122
23242526272829
3031     

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Jun. 9th, 2025 05:58 pm
Powered by Dreamwidth Studios