marnanel: (Default)
I'm typesetting Alice in an interlinear Shavian edition. Here's the first chapter; there's more to come, obviously. Your feedback is greatly valued.

The only problem I can foresee is this exchange:
“I’m a poor man, your Majesty,” the Hatter began, in a trembling voice, “--and I hadn’t begun my tea—not above a week or so—and what with the bread-and-butter getting so thin—and the twinkling of the tea——”

“The twinkling of what?” said the King.

“It began with the tea,” the Hatter replied.

“Of course twinkling begins with a T!” said the King sharply. “Do you take me for a dunce? Go on!”
The word twinkling (𐑑𐑢𐑦𐑙𐑒𐑤𐑦𐑙) begins with the Shavian letter 𐑑, which is usually known as "tot". The names given in Unicode and used canonically these days were originally only examples (thanks, [personal profile] pne) given on the card which came with Androcles. The joke clearly doesn't work if the Hatter says, "the twinkling of the tot".

However, in Shaw Script, that letter was known as "tea". I may add a footnote to this effect. If you have other solutions, I'd like to hear them.

Update: [personal profile] pne points out that Shavian and italics don't mix well, and suggests some alternatives.
marnanel: (Default)
I mentioned earlier that I had a system that could go some way towards typesetting "Alice" in Shavian. Demonstration:



This is kind of fun, though the text should really diminish towards the bottom of the page.
marnanel: (Default)
I mentioned earlier an idea I had for automatic part-of-speech disambiguation based only on the part of speech of the preceding word. I also mentioned that I believe this would be a workable solution for disambiguating the pronunciation of most homonyms.

I would therefore like to create a distributable database mapping conventional spellings of English words either to (part of speech, phonemic representation) pairs, or (in the case of ambiguous spellings) to a mapping from sets of parts of speech to such pairs; the part of speech of the previous word would be used to choose which pair applies.

Sources of data would be:
  • the Shavian wiki, where possible (licence is cc-by)
  • cmudict where the Shavian wiki wasn't possible (licence is BSD-like)
  • the Brown tagger for the parts of speech (licence is MIT)
So, two things I need to consider:
  1. what this database would be called
  2. how to evaluate it.
I think one way to evaluate it might be to take a corpus which is already POS-tagged, and tag it again in four ways:
  1. assuming all words are nouns
  2. assuming all words which the Shavian wiki believes are ambiguous are nouns, and using the Brown tagger for the rest
  3. using the POS-of-the-previous-word method outlined above
  4. using the Brown tagger
and checking that (3) is closer to (4) than (2) is.  Other ideas are welcome, of course.
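To make "closer" concrete, here is a minimal sketch of the comparison. It assumes that closeness simply means the fraction of words on which two taggings agree; the function name, the one-letter tags and the toy tag lists are all made up for illustration and don't come from any real corpus.

```python
def agreement(predicted, reference):
    """Fraction of positions where two tag sequences agree."""
    if len(predicted) != len(reference):
        raise ValueError("tag sequences must be the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up tags for "He does love to play with does", purely for illustration.
reference      = ["P", "V", "V", "I", "V", "I", "N"]  # strategy 4: the full tagger
nouns_if_ambig = ["P", "N", "V", "I", "V", "I", "N"]  # strategy 2: ambiguous words are nouns
previous_word  = ["P", "V", "V", "I", "V", "I", "N"]  # strategy 3: previous-word method

score_2 = agreement(nouns_if_ambig, reference)
score_3 = agreement(previous_word, reference)
```

If score_3 comes out higher than score_2 on a real corpus, the previous-word method is earning its keep. It might be fairer to count only the words the Shavian wiki marks as ambiguous, since the other words are tagged the same way under both strategies.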
marnanel: (Default)
One of the difficulties inherent in automated transliteration is that of homonyms (strictly speaking, heteronyms): words which are spelt the same in the Latin alphabet but pronounced differently.
  • I live near a live wire.
  • I like to read. I read a book yesterday. I will read one tomorrow.
  • I advocate happiness to the advocate.
  • He does love to play with does.
  • Please lead me to the box of lead.
  • It used to be used for that. Now it is used for this.
  • I never knew a number number.
On the Shavian wiki we've solved this problem manually, but it's a bit of a pain. With things like the Shavian Firefox extension, it's just been necessary so far to pick one randomly.

The other day I was on a plane and got to thinking. In most of these cases, the two words have different parts of speech: for example, does (he does) is a verb, does (more than one doe) is a noun. What if we could do part-of-speech tagging? But that's rather a complicated field. How simple can we make a part-of-speech tagger?

So I took the lexicon from the Shavian wiki and added the default part-of-speech tags from the Brown tagger to each word. Then rather than marking the homonyms with the part of speech they represented, I marked them with the part of speech which should precede them. That is, rather than having "does" choose between "N" and "V", it chooses between "D/J/I" (for N) and "N/P/V" (for V).
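A minimal sketch of that lookup, for the curious. The one-letter tags follow the "D/J/I" and "N/P/V" groups above, but their exact meanings here (determiner, adjective, preposition, noun, pronoun, verb) are my assumption, as are the tags given to the other words. The pronunciations are rough ASCII placeholders rather than real lexicon data, and the names LEXICON and transliterate are hypothetical.

```python
# Unambiguous words map to a single (tag, pronunciation) pair.
# Ambiguous words map to a list of
# (tags that may precede this reading, tag, pronunciation);
# the first reading in the list doubles as the fallback.
LEXICON = {
    "he": ("P", "hee"),
    "does": [
        (frozenset("NPV"), "V", "duz"),   # as in "he does"
        (frozenset("DJI"), "N", "dohz"),  # as in "with does"
    ],
    "love": ("V", "luv"),
    "to": ("I", "too"),
    "play": ("V", "play"),
    "with": ("I", "with"),
}


def transliterate(words):
    """Return a (tag, pronunciation) pair for each word, reading left to right."""
    result = []
    prev_tag = None  # nothing precedes the first word
    for word in words:
        entry = LEXICON[word.lower()]
        if isinstance(entry, tuple):
            tag, pron = entry
        else:
            # Take the reading whose "preceded by" set contains the
            # previous word's tag; otherwise fall back to the first reading.
            tag, pron = next(
                ((t, p) for before, t, p in entry if prev_tag in before),
                entry[0][1:],
            )
        result.append((tag, pron))
        prev_tag = tag
    return result


readings = transliterate("He does love to play with does".split())
```

Note that the tag chosen for an ambiguous word is fed forward as the previous tag for the next word, so one wrong guess can in principle cause another further along.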

This works surprisingly well-- it even handles the "read" case as well as can be expected-- and I think it's simple and effective enough to use whenever I get around to updating the Firefox extension, and possibly elsewhere.  It handles all the most common cases; other than occasional misclassifications, the only major failing I can find is that it cannot distinguish nouns and adjectives.  There are only a few cases where an adjective and a noun are known to the system to have a different pronunciation:
  • agape
  • arithmetic
  • content
  • invalid
  • minute
  • number
  • pasty
I think in all these cases there's one option which is far more common than the other, and so we can get away with choosing that one always.  ("Content" is probably the least unbalanced; I think the noun form still has the edge.)

Sunday

Jun. 28th, 2009 09:37 pm
marnanel: (Default)
Woke up at a good time, around seven.  Promptly and stupidly decided to go back to sleep to see what the end of the dream was; it turned out to be a nightmare.  Woke up again at about eleven and went to the gym.  Continued the run of stupid mistakes by forgetting to get lunch for Rio.  Sharon came by and brought her lunch instead.  I hate getting up late. :(

Later, went to the diner for dinner.  Talked to Alex about a shelving project he's working on.

Did a little tidying, but not very much.  But I've got some way towards Inbox Zero: I'm now down to four emails.

Today I learned that cd - changes to the directory you were in before the current one.

Fin gave me an old notebook of zirs to use as a logbook.  It's lovely.

It occurs to me that the simple system I built a while ago which mostly allows Ubuntu to come up in Shavian would also work to get Deseret, Unifon and Tengwar.  I wonder whether there's much of a market for Ubuntu in Tengwar.  Possibly good Slashdot fodder, anyway.

Joule-for-Dreamwidth is edging closer.  I also need to implement a per-day view with a paging system to get around this problem.

Five days until GCDS starts.

marnanel: (Default)
You might have discovered by now that I'm rather a fan of the Shavian alphabet. That doesn't mean I'm entirely uncritical of its design. Here are some of my gripes:
  1. Most of the letters are visually distinct enough. But 𐑓 and 𐑝 (f and v) are too similar to 𐑐 and 𐑚 (p and b), to which they are unrelated.  Likewise for the vowels 𐑩 uh 𐑨 a 𐑧 e 𐑪 o:  they are far too similar to one another, especially when handwritten.
  2. Similarly 𐑯 and 𐑥 (n and m) are too similar when handwritten to the rather rare vowels 𐑷 and 𐑭 (awe and ah).
  3. Since most Americans merge 𐑷 and 𐑪 anyway, and some merge both with 𐑭, we could avoid the previous problem simply: just write them all as 𐑪 and be done with it. I don't believe this merger causes the Americans to have trouble understanding one another. (And Shavian does without a character for wh already, presumably because mergers have brought it to extinction in most dialects of English.)
  4. The rule about pairing off voiced and unvoiced consonants is a good one. But 𐑘 and 𐑢 (y and w) bear no relation to one another and shouldn't be paired.
  5. In the same way, it's perhaps not unreasonable to pair 𐑙 and 𐑣 (ng and h) since these sounds occur in opposition. But they should probably have been written the other way up, since 𐑙 is now the only voiced tall letter.
  6. All the ligatures, 𐑸 ar, 𐑹 or, 𐑼 uhr, 𐑺 air, 𐑽 ear, 𐑻 err, and especially 𐑾 ia and 𐑿 yu were a mistake (though it's nice to be able to write "𐑲♥𐑿"). People would already run screaming from an alphabet with forty letters; there's no call to add eight more redundant ones. Even Shaw Script didn't use them, though that's because they're too wide for a typewritten character.
  7. The naming dot (𐑥𐑸𐑒 is mark, ·𐑥𐑸𐑒 is Mark) is a nuisance for automated transliteration, though I understand that this was less of a big deal in 1960. It doesn't add much that's useful.  The caselessness of Shavian is a strength, and this seems to be a concession to case.
  8. The Alphabet Trust marketed it in the wrong way (though this wasn't really their fault, since the money was taken away). What they should have done, even before printing Androcles, is sponsor classes across the country in institutes of further education. (They were legally obliged to print Androcles under the terms of the will, and they did a good job with it. It was the right decision to print it rather than produce a facsimile of calligraphy.)
  9. They should also have produced a standard lexicon so that people could look up the Shavian transliteration of any common word in the Latin alphabet. The lack of such a lexicon made adoption much harder.
  10. Shaw wanted the script to represent English as spoken in the North, yet Androcles standardised on RP spelling throughout.
  11. Also, whoever transliterated Androcles was not as enlightened as the alphabet's designer. In particular they represent syllabic consonants with a leading schwa: "battle" is transliterated 𐑚𐑨𐑑𐑩𐑤 and not 𐑚𐑨𐑑𐑤 as you might reasonably expect.
  12. The designer of Shavian, Kingsley Read, conducted a large number of trials after Shavian was released, and produced a new script called Quikscript (also known as "Second Shaw"). It was based on Shavian, but with fixes for the problems identified by the trials.  Such a large-scale trial should really have been done before Shavian was ever launched.
Yet I'm not calling for Shavian to be abandoned (more than it already is) and a new alphabet to be started like Kingsley Read's or others.  There's little enough life in the trunk, and branches would wither immediately.  Whatever problems Shavian may have, the conventional spelling is a thousand times worse. And once you have a well-known and fairly standard form like Shavian, which anyone can read about in the history of spelling reform and which is in both ISO 15924 and Unicode, I like to stick to it unless there's a really compelling reason not to.  As a parallel, Esperanto may have been a failure as a constructed world language, but it still has around a million speakers.  How many speakers can its various reforms boast?

a few days

May. 12th, 2009 11:42 pm
marnanel: (Default)
On Saturday we went to help a friend of ours move house; then we went and ate at a diner called Tom Jones, which was rather good really. On Sunday we went and played D&D again at Bae's house; my elven cleric used up several saving throws against dying in battle. And today I made dinner: it was spaghetti.

The Mutter maintainers have decided that Mutter will henceforth be a proper fork of Metacity and that the projects will go their own ways. This means, of course, that Metacity will not ship as standard in GNOME 3. I am wondering what should happen to Metacity now; I have a couple of branches to merge, and then I think I would really rather work on Mutter than carry on with a project that practically nobody will use. It would be good to work with a team of others again, too: I've been mostly alone on Metacity for a while now.

I have modified the Shavian wiki so that the metadata is held on article pages instead of talk pages. It looks like this. I have been discussing some ideas about this wiki with some people, and I am wondering whether it would be generally more useful if the data was held in IPA format and the Shavian text was produced using a transformation on that data, just as Unifon and so on are now. I am also wondering whether allowing anonymous editing would increase participation enough to be worth the risk of vandalism.
marnanel: (Default)
It's been raining for days. Rio (whose website is now a little out of date) says we should put the rain into jars and call it "bottled annoyance".

Speaking of Rio, she's been learning the trumpet for a few months now. Tonight we went to a concert her school were putting on. There was a high school jazz band, too, and now she's decided she wants to be a jazz trumpeter. She's asking for trumpet jazz CDs, and Fin is asking whether you have any recommendations. All this makes me want to pick up the bass again. Perhaps I need to take lessons.

We had to take Rothko to the vet. He'll be fine. The other cats are missing him rather.

I didn't get much done this weekend; I've been feeling kind of out of sorts recently. I did manage to spend an hour or so on Sunday adding Digg support to Joule, and later I added support for Doug Ewell's spiky rune-like Ewellic alphabet to the Shavian wiki here. Which is your favourite of the scripts we have so far? (You'll need IE, Safari, or Firefox 3.5 to see them without downloading fonts.)
marnanel: (Default)

The lexicon of the Shavian wiki has passed 14,000 words, and the system is now smart enough to transliterate all of Alice's Adventures in Wonderland. (That might be worth printing, too.)

Some important policy questions we're currently turning over before we go much further include:

3. We currently have a rule that all spellings in Androcles are canonical, and set precedents. Should this rule be kept?

7, 8. Should syllabic -n or -l have a schwa ("ado") before them? For example, should "bottle" be "𐑚𐑪𐑑𐑤" or "𐑚𐑪𐑑𐑩𐑤"? Language Log has a good argument why it should be the former (not that they mention Shavian there).

10, 11. Should we retain apostrophes where they're used in the Latin alphabet? Androcles appears to use them for possession ("beast's") but not for elision ("don't"). Two editors have suggested removing them for possession too, which accords with GBS's own practices.

21. Should we retain the trap/bath split? Androcles has it, but GBS specifically asked for the play to use Northern English, which does not have it.

I've also considered using the existing system, now that we have a substantial lexicon of phonemic spellings, with other spelling reform systems; we'd have tables in the wiki to map each character to whatever other character it would be, and then you'd be able to choose whether to view each page in Shavian, Deseret, Unifon, any of the various Latin-alphabet respelling systems...
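As a rough sketch of what such a table might look like in practice: the respelling below is invented for the example and isn't any real reform system, the table only covers the letters needed for one word, and the names SHAVIAN_TO_RESPELLING and convert are hypothetical.

```python
# Hypothetical per-script table: each Shavian letter maps to its
# counterpart in a made-up Latin respelling.
SHAVIAN_TO_RESPELLING = {
    "𐑑": "t",
    "𐑢": "w",
    "𐑦": "i",
    "𐑙": "ng",
    "𐑒": "k",
    "𐑤": "l",
}


def convert(text, table):
    """Map each character through the table, leaving unknown characters alone."""
    return "".join(table.get(ch, ch) for ch in text)


respelled = convert("𐑑𐑢𐑦𐑙𐑒𐑤𐑦𐑙", SHAVIAN_TO_RESPELLING)  # "twinkling" in Shavian
```

Supporting another script would then just mean adding another table of the same shape, so long as the target script has a one-to-one answer for every letter. Where it doesn't, a plain character table won't be enough.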
