marnanel: (Default)
[personal profile] marnanel
I mentioned earlier an idea I had for automatic part-of-speech disambiguation based only on the part of speech of the preceding word. I also mentioned that I believe this would be a workable way to disambiguate the pronunciation of most homographs (words which share a spelling but are pronounced differently).

I would therefore like to create a distributable database which maps conventional spellings of English words either to (part of speech, phonemic representation) pairs or, in the case of ambiguous spellings, to a mapping from sets of parts of speech to such pairs; the part of speech of the previous word would be used to choose between them.
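To make that shape concrete, here is a rough sketch of how such a database might look in Python. Everything in it is an assumption made for illustration: the tag names, the tiny `LEXICON`, the transcriptions, the `START` pseudo-tag, and the rule of falling back to the first listed reading when the previous word's part of speech matches none of the sets.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

# (part of speech, phonemic representation)
Entry = Tuple[str, str]
# For ambiguous spellings: sets of preceding parts of speech -> Entry
Ambiguous = Dict[FrozenSet[str], Entry]

# Hypothetical toy data; a real database would be built from the sources
# listed in this post.
LEXICON: Dict[str, Union[Entry, Ambiguous]] = {
    "the": ("DET", "ðə"),
    "i": ("PRON", "aɪ"),
    "record": {
        frozenset({"DET", "ADJ"}): ("NOUN", "ˈrɛkɔːd"),
        frozenset({"PRON", "AUX", "PART"}): ("VERB", "rɪˈkɔːd"),
    },
}


def lookup(word: str, prev_pos: str) -> Entry:
    """Return the (POS, pronunciation) pair for word, given the previous POS."""
    entry = LEXICON[word.lower()]
    if isinstance(entry, tuple):
        # Unambiguous spelling: only one reading.
        return entry
    for pos_set, pair in entry.items():
        if prev_pos in pos_set:
            return pair
    # Assumed fallback: the first listed reading.
    return next(iter(entry.values()))


def pronounce(words: List[str], start_pos: str = "START") -> List[str]:
    """Walk a sentence left to right, feeding each chosen POS to the next word."""
    prev = start_pos
    result = []
    for word in words:
        pos, phonemes = lookup(word, prev)
        result.append(phonemes)
        prev = pos
    return result
```

With this toy data, `pronounce(["the", "record"])` picks the noun reading and `pronounce(["I", "record"])` picks the verb reading, because the only thing consulted is the part of speech chosen for the word before.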

Sources of data would be:
  • the Shavian wiki, where possible (licence is cc-by)
  • cmudict where the Shavian wiki wasn't possible (licence is BSD-like)
  • the Brown tagger for the parts of speech (licence is MIT)
So, two things I need to consider:
  1. what this database would be called
  2. how to evaluate it.
I think one way to evaluate it might be to take a corpus which is already POS-tagged, and evaluate it by:
  1. assuming all words are nouns
  2. assuming all words which the Shavian wiki believes are ambiguous are nouns, and using the Brown tagger for the rest
  3. using the POS-of-the-previous-word method outlined above
  4. using the Brown tagger
and checking that (3) is closer to (4) than (2) is. Other ideas are welcome, of course.
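The comparison could be sketched like this, taking "closer" to mean simple per-token agreement with the Brown tagger's output. That measure is my assumption, as are the function names and the made-up tag sequences; nothing here is real evaluation data.

```python
from typing import Sequence


def agreement(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of tokens on which two tag sequences agree."""
    if len(candidate) != len(reference):
        raise ValueError("tag sequences must be the same length")
    if not reference:
        return 0.0
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)


def method_beats_baseline(
    baseline_tags: Sequence[str],
    method_tags: Sequence[str],
    reference_tags: Sequence[str],
) -> bool:
    """True if the method agrees with the reference more often than the baseline."""
    return agreement(method_tags, reference_tags) > agreement(
        baseline_tags, reference_tags
    )


# Made-up five-token example, purely to show the calls.
reference = ["DET", "NOUN", "VERB", "DET", "NOUN"]      # (4) Brown tagger
all_nouns = ["NOUN"] * len(reference)                    # (1) everything a noun
ambiguous_nouns = ["DET", "NOUN", "NOUN", "DET", "NOUN"]  # (2) ambiguous words as nouns
previous_word = ["DET", "NOUN", "VERB", "DET", "NOUN"]  # (3) previous-word method

print(agreement(all_nouns, reference))
print(agreement(ambiguous_nouns, reference))
print(method_beats_baseline(ambiguous_nouns, previous_word, reference))
```

Since the corpus is already POS-tagged, the same `agreement` function could also score each method against the corpus's own tags, provided the tag sets are mapped onto each other first.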
