Very simple automated disambiguation
Aug. 17th, 2009 05:55 pm
One of the difficulties inherent in automated transliteration is that of homographs (loosely, homonyms): words which are pronounced differently but spelt the same in the Latin alphabet.
- I live near a live wire.
- I like to read. I read a book yesterday. I will read one tomorrow.
- I advocate happiness to the advocate.
- He does love to play with does.
- Please lead me to the box of lead.
- It used to be used for that. Now it is used for this.
- I never knew a number number.
The other day I was on a plane and got to thinking. In most of these cases, the two words have different parts of speech: for example, "does" (he does) is a verb, while "does" (more than one doe) is a noun. What if we could do part-of-speech tagging? But that's rather a complicated field. How simple can we make a part-of-speech tagger?
So I took the lexicon from the Shavian wiki and added the default part-of-speech tags from the Brown tagger to each word. Then, rather than marking the homonyms with the part of speech they represented, I marked them with the parts of speech which should precede them. That is, rather than having "does" choose between "N" and "V", it chooses between "D/J/I" (the preceding tags that select N) and "N/P/V" (the preceding tags that select V).
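A minimal sketch of that idea in Python follows. It is not the actual code or data: the tiny lexicon is made up for the example, the names `DEFAULT_TAG`, `HETERONYMS` and `disambiguate` are hypothetical, and I am assuming the tag letters mean determiner (D), adjective (J), preposition (I), noun (N), pronoun (P) and verb (V). Two further assumptions: when the preceding word has no usable tag, the first listed reading wins, and the reading chosen for a homonym is carried forward as the preceding tag for the next word.

```python
import re

# Toy default-tag lexicon: one coarse tag per word (invented for this sketch).
DEFAULT_TAG = {
    "he": "P", "me": "P", "i": "P",
    "the": "D", "a": "D",
    "with": "I", "of": "I", "to": "I",
    "love": "V", "play": "V", "please": "V",
    "box": "N",
}

# Each ambiguous word lists (tags allowed on the preceding word, reading chosen).
# The reading is given as a tag here; a real lexicon would store a spelling instead.
HETERONYMS = {
    "does": [(frozenset("DJI"), "N"), (frozenset("NPV"), "V")],
    "lead": [(frozenset("DJI"), "N"), (frozenset("NPV"), "V")],
}


def disambiguate(sentence):
    """Return (word, reading) pairs for every ambiguous word in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    found = []
    prev_tag = None  # nothing precedes the first word
    for word in words:
        if word in HETERONYMS:
            candidates = HETERONYMS[word]
            # Assumed fallback: the first listed reading if nothing matches.
            reading = candidates[0][1]
            for preceding_tags, candidate in candidates:
                if prev_tag in preceding_tags:
                    reading = candidate
                    break
            found.append((word, reading))
            prev_tag = reading  # the chosen reading tags this word for the next one
        else:
            prev_tag = DEFAULT_TAG.get(word)  # None if the word is unknown
    return found


print(disambiguate("He does love to play with does."))
print(disambiguate("Please lead me to the box of lead."))
```

In the first sentence "he" is tagged P, which selects the verb reading, and "with" is tagged I, which selects the noun reading; no real tagging of the sentence is needed, only a dictionary lookup on the previous word.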
This works surprisingly well (it even handles the "read" case as well as can be expected), and I think it's simple and effective enough to use whenever I get around to updating the Firefox extension, and possibly elsewhere. It handles all the most common cases; other than occasional misclassifications, the only major failing I can find is that it cannot distinguish nouns from adjectives. Fortunately, there are only a few cases where an adjective and a noun are known to the system to have a different pronunciation:
- agape
- arithmetic
- content
- invalid
- minute
- number
- pasty
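To illustrate why the noun/adjective cases above slip through, here is a small Python sketch. It is my guess at the reason rather than anything measured: both readings tend to follow the same kind of word, so the preceding tag carries no information. The lexicon and the helper `preceding_tag` are hypothetical, with D assumed to mean determiner.

```python
# Toy default-tag lexicon (invented for this sketch): D = determiner.
DEFAULT_TAG = {"a": "D", "the": "D"}


def preceding_tag(sentence, target):
    """Return the default tag of the word just before `target`, or None."""
    words = sentence.lower().split()
    position = words.index(target)
    if position == 0:
        return None  # nothing precedes the first word
    return DEFAULT_TAG.get(words[position - 1])


# "minute" as a noun and as an adjective: the preceding tag is the same.
print(preceding_tag("a minute passed", "minute"))
print(preceding_tag("a minute detail", "minute"))
```

Both calls see a determiner before "minute", so a rule keyed only on the preceding tag has to give the same answer for both; telling them apart would need the following word as well.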