[personal profile] marnanel
One of the difficulties inherent in automated transliteration is that of homonyms (strictly speaking, heteronyms): words which are spelt the same in the Latin alphabet but pronounced differently.
  • I live near a live wire.
  • I like to read. I read a book yesterday. I will read one tomorrow.
  • I advocate happiness to the advocate.
  • He does love to play with does.
  • Please lead me to the box of lead.
  • It used to be used for that. Now it is used for this.
  • I never knew a number number.
On the Shavian wiki we've solved this problem manually, but it's a bit of a pain. With things like the Shavian Firefox extension, it's just been necessary so far to pick one randomly.

The other day I was on a plane and got to thinking. In most of these cases, the two words have different parts of speech: for example, "does" (as in "he does") is a verb, while "does" (more than one doe) is a noun. What if we could do part-of-speech tagging? But that's rather a complicated field. How simple can we make a part-of-speech tagger?

So I took the lexicon from the Shavian wiki and added the default part-of-speech tags from the Brown tagger to each word. Then rather than marking the homonyms with the part of speech they represented, I marked them with the part of speech which should precede them. That is, rather than having "does" choose between "N" and "V", it chooses between "D/J/I" (for N) and "N/P/V" (for V).
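The idea of choosing by the preceding word's tag can be sketched roughly as follows. This is a minimal illustration, not the code used on the wiki: the tiny lexicon, the function name `resolve`, the informal respellings ("duz", "dohz") and the reading of the tag letters (D determiner, J adjective, I preposition, N noun, P pronoun, V verb) are all assumptions made for the example. Falling back to the first listed reading when the preceding tag is unknown is also an assumption.

```python
import re

# Hypothetical default tags for ordinary words (a real lexicon is far larger).
# Assumed letter meanings: D determiner, J adjective, I preposition,
# N noun, P pronoun, V verb.
DEFAULT_TAG = {
    "he": "P", "the": "D", "a": "D",
    "with": "I", "of": "I",
    "love": "V", "play": "V", "box": "N",
}

# Each heteronym maps to a list of options:
# (tags allowed on the preceding word, this word's own tag, reading).
# Readings are informal respellings for illustration, not Shavian.
HETERONYMS = {
    "does": [("NPV", "V", "duz"), ("DJI", "N", "dohz")],
    "lead": [("NPV", "V", "leed"), ("DJI", "N", "led")],
}

def resolve(sentence):
    """Return (word, reading) for each heteronym found in the sentence."""
    prev_tag = None
    picks = []
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in HETERONYMS:
            options = HETERONYMS[word]
            # Pick the option whose "preceding tags" include the previous
            # word's tag; if there is none, fall back to the first option.
            match = next(
                (o for o in options if prev_tag and prev_tag in o[0]),
                options[0],
            )
            _, prev_tag, reading = match
            picks.append((word, reading))
        else:
            # Unknown words leave the previous tag undefined.
            prev_tag = DEFAULT_TAG.get(word)
    return picks

print(resolve("He does love to play with does."))
```

Here the first "does" follows the pronoun "he" and so takes the verb reading, while the second follows the preposition "with" and takes the noun reading. Note that no tagging of the heteronym itself is needed up front; only a lookup of the previous word's tag.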

This works surprisingly well (it even handles the "read" case as well as can be expected), and I think it's simple and effective enough to use whenever I get around to updating the Firefox extension, and possibly elsewhere. It handles all the most common cases; other than occasional misclassifications, the only major failing I can find is that it cannot distinguish nouns from adjectives. There are only a few cases where an adjective and a noun are known to the system to have a different pronunciation:
  • agape
  • arithmetic
  • content
  • invalid
  • minute
  • number
  • pasty
I think in all these cases there's one option which is far more common than the other, and so we can get away with choosing that one always.  ("Content" is probably the least unbalanced; I think the noun form still has the edge.)
