Monument (marnanel) wrote, 2009-09-12 02:22 pm

Shavian and disambiguation

I mentioned earlier an idea I had for automatic part-of-speech disambiguation based only on the part of speech of the preceding word. I also mentioned that I believe this would be a workable way to disambiguate the pronunciation of most homographs (words spelled alike but pronounced differently).

I would therefore like to create a distributable database that maps conventional spellings of English words either to (part of speech, phonemic representation) pairs or, for ambiguous spellings, to a mapping from sets of parts of speech to such pairs. For an ambiguous spelling, the part of speech of the previous word would select which pair to use.
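To make the shape of that database concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the tag names (`DET`, `TO`, `NOUN`, `VERB`), the IPA-style phoneme strings, the toy `LEXICON`, and the fallback of taking the first listed reading when the previous word's part of speech matches nothing. The real thing would be built from the sources listed below, not typed in by hand.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple, Union

# A reading is a (part of speech, phonemic representation) pair.
Reading = Tuple[str, str]
# An ambiguous spelling maps sets of previous-word POS tags to readings.
Ambiguous = Dict[FrozenSet[str], Reading]
Entry = Union[Reading, Ambiguous]

# Hypothetical toy data; tags and phoneme strings are illustrative only.
LEXICON: Dict[str, Entry] = {
    "the": ("DET", "ðə"),
    "to": ("TO", "tuː"),
    "record": {
        frozenset({"DET", "ADJ"}): ("NOUN", "ˈrɛkɔːd"),
        frozenset({"TO", "PRON", "MODAL"}): ("VERB", "rɪˈkɔːd"),
    },
}


def lookup(word: str, prev_pos: Optional[str]) -> Optional[Reading]:
    """Return the reading for `word`, given the POS of the previous word."""
    entry = LEXICON.get(word.lower())
    if entry is None:
        return None  # unknown spelling
    if isinstance(entry, tuple):
        return entry  # unambiguous spelling
    for prev_tags, reading in entry.items():
        if prev_pos in prev_tags:
            return reading
    # Assumed fallback: no set matched, so take the first listed reading.
    return next(iter(entry.values()))


def transcribe(words: List[str]) -> List[Optional[Reading]]:
    """Walk left to right, feeding each chosen POS into the next lookup."""
    prev_pos: Optional[str] = None
    readings: List[Optional[Reading]] = []
    for word in words:
        reading = lookup(word, prev_pos)
        readings.append(reading)
        prev_pos = reading[0] if reading is not None else None
    return readings
```

With this toy data, `transcribe(["the", "record"])` picks the noun reading and `transcribe(["to", "record"])` picks the verb reading. Using frozen sets as keys keeps the "sets of parts of speech" idea literal; a real file format would probably flatten this.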

Sources of data would be:
  • the Shavian wiki, where possible (licence is cc-by)
  • cmudict where the Shavian wiki wasn't possible (licence is BSD-like)
  • the Brown tagger for the parts of speech (licence is MIT)
So, two things I need to consider:
  1. what this database would be called
  2. how to evaluate it.
I think one way to evaluate it might be to take a corpus that is already POS-tagged, and proceed by:
  1. assuming all words are nouns
  2. assuming all words which the Shavian wiki believes are ambiguous are nouns, and using the Brown tagger for the rest
  3. using the POS-of-the-previous-word method outlined above
  4. using the Brown tagger
and checking that (3) is closer to (4) than (2) is, that is, that it agrees with (4) on more words. Other ideas are welcome, of course.
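The comparison in that last step could be sketched like this in Python. It is only a sketch under assumptions: "closer" is taken to mean per-word agreement, the function names `agreement` and `compare_methods` are made up here, and the tag sequences in the usage note are hand-written toy data rather than output from any real tagger.

```python
from typing import Dict, Sequence


def agreement(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the predicted tag equals the reference tag."""
    if len(predicted) != len(reference):
        raise ValueError("tag sequences must be the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def compare_methods(
    methods: Dict[str, Sequence[str]], reference: Sequence[str]
) -> Dict[str, float]:
    """Score every method's tags against the reference tagging, i.e. (4)."""
    return {name: agreement(tags, reference) for name, tags in methods.items()}
```

For example, with a five-word reference tagging of `["DET", "NOUN", "VERB", "DET", "NOUN"]`, an all-nouns baseline agrees on two words out of five, while a method that gets only the last word wrong agrees on four out of five. The check above then amounts to the score for (3) being higher than the score for (2).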