Sep. 12th, 2009

marnanel: (Default)
I mentioned earlier an idea I had for automatic part-of-speech disambiguation based only on the part of speech of the preceding word. I also mentioned that I believe this would be a workable way to disambiguate the pronunciation of most homographs (words which are spelled alike but pronounced differently).

I would therefore like to create a distributable database which maps conventional spellings of English words either to (part of speech, phonemic representation) pairs, or (in the case of ambiguous spellings) to a mapping from sets of parts of speech to such pairs. For an ambiguous spelling, the part of speech of the previous word would be used to choose which pair applies.
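Here is a minimal sketch in Python of how such a database and its lookup might be shaped. Everything in it is an assumption made for illustration: the names `LEXICON`, `lookup` and `transcribe`, the part-of-speech labels, the sample entries, and the cmudict-style phoneme strings. The fallback to the first listed reading when the previous part of speech matches nothing is also just one possible choice.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple, Union

# (part of speech, phonemic representation)
Pair = Tuple[str, str]
# For ambiguous spellings: set of preceding parts of speech -> pair
Ambiguous = Dict[FrozenSet[str], Pair]
Lexicon = Dict[str, Union[Pair, Ambiguous]]

# Hypothetical sample data; tags and phoneme strings are illustrative only.
LEXICON: Lexicon = {
    "the": ("determiner", "DH AH0"),
    "to": ("to", "T UW1"),
    "record": {
        frozenset({"determiner", "adjective"}): ("noun", "R EH1 K ER0 D"),
        frozenset({"to", "pronoun", "modal"}): ("verb", "R IH0 K AO1 R D"),
    },
}


def lookup(word: str, prev_pos: Optional[str],
           lexicon: Lexicon = LEXICON) -> Optional[Pair]:
    """Return the (POS, phonemes) pair for a word, given the previous POS."""
    entry = lexicon.get(word.lower())
    if entry is None:
        return None              # unknown spelling
    if isinstance(entry, tuple):
        return entry             # unambiguous spelling
    for pos_set, pair in entry.items():
        if prev_pos in pos_set:
            return pair
    # Assumed fallback: no set matched, so take the first listed reading.
    return next(iter(entry.values()))


def transcribe(words: List[str],
               lexicon: Lexicon = LEXICON) -> List[Optional[Pair]]:
    """Tag words left to right, feeding each chosen POS to the next lookup."""
    prev_pos: Optional[str] = None
    result: List[Optional[Pair]] = []
    for word in words:
        pair = lookup(word, prev_pos, lexicon)
        result.append(pair)
        prev_pos = pair[0] if pair else None
    return result
```

With this toy data, `transcribe(["the", "record"])` picks the noun reading of "record", and `transcribe(["to", "record"])` picks the verb reading.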

Sources of data would be:
  • the Shavian wiki, where possible (licence is cc-by)
  • cmudict where the Shavian wiki wasn't possible (licence is BSD-like)
  • the Brown tagger for the parts of speech (licence is MIT)
So, two things I need to consider:
  1. what this database would be called
  2. how to evaluate it.
I think one way to evaluate the database might be to take a corpus which is already POS-tagged, and tag it again in four ways:
  1. assuming all words are nouns
  2. assuming all words which the Shavian wiki believes are ambiguous are nouns, and using the Brown tagger for the rest
  3. using the POS-of-the-previous-word method outlined above
  4. using the Brown tagger
and then to check that (3) is closer to (4) than (2) is.  Other ideas are welcome, of course.
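The comparison above could be sketched as follows, taking "closer" to mean per-token agreement with the tags from method (4). That is only one possible measure, and the function name `agreement` and the toy tag sequences are invented for illustration; they are not real results.

```python
from typing import Sequence


def agreement(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the reference tag."""
    if len(predicted) != len(reference):
        raise ValueError("tag sequences must be the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy tag sequences for a five-word sentence (made up for this example).
reference = ["det", "noun", "verb", "det", "noun"]      # stands in for method (4)
all_nouns = ["noun"] * len(reference)                   # stands in for method (1)
previous_word = ["det", "noun", "verb", "det", "verb"]  # stands in for method (3)

print(agreement(all_nouns, reference))
print(agreement(previous_word, reference))
```

The check would then be that the score for method (3) comes out higher than the score for method (2) over the whole corpus.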
marnanel: (Default)
Everything under *.marnanel.org is currently down.  I have no ETA for its return at present.

Two people have recently asked for a list of the programs I've worked on. This will be a long list. It's not in any particular order.  Links to *.marnanel.org sites would be fairly useless at present.
  1. I maintain Metacity, the GNOME window manager.
  2. I maintain matchbox2, the Maemo window manager, as part of my work for Collabora.
  3. Theoretically I maintain fast-user-switch-applet, but in fact I don't, because anyone sensible has started using the version that's maintained by the GDM maintainers, and everyone else is doing fine looking after it on their own.
  4. I maintain Joule, which tracks changes in friendslists on LiveJournal, Twitter, and a half-dozen other sites.
  5. Joule also exists as a Mozilla addon.
  6. I maintain the Welsh spellcheck dictionary for Mozilla; an update for Firefox 3.5 is desperately needed and is coming this weekend.
  7. I wrote a program to transliterate all webpages in Mozilla into the Shavian alphabet. It needs some work to take data from the Shavian wiki and to do basic automated disambiguation.
  8. Yarrow is a web client for the Cambridge RGTP protocol. It is also the blogging engine which powers marnanel.org.  It is fairly mature and reliable.
  9. Spurge is a free server for the same RGTP protocol, because the original rgtpd was not available.  It implements only a subset of the protocol, but it's enough for everyday use.
  10. Archangel was an experimental Mozilla RGTP client, so you could just go to URLs like "rgtp://...". It has rotted.
  11. Gnusto is a pure JavaScript z-machine compiler. It has mostly rotted, but was reincarnated by someone else as Parchment.
  12. Raeddit is a reddit client for the N900 which I'm writing as a demo.
  13. I maintain the port of robotfindskitten to the N900.
  14. Belltower is an N900 app to find belltowers.  (Screencast here.)
  15. I also have a working N900 gopher client, but I haven't released it.  I imagine it might not be the best way to advertise the platform. :)
  16. And I'm contributing some code to a rememberthemilk client for the N900.
  17. Gehazi is a rather nice photo gallery app which one day may be good enough to use; it exists in several versions, none of which are completed.
  18. Plough is a simple system to map arbitrary SQL queries to Perl structures and to run Template Toolkit over them; it powers several of the sites on dorothy.  It hasn't been released, but it could be.
  19. The Shavian wiki is a system for automatic transliteration of the conventional alphabet to Shavian and several other phonemic alphabets.  It has allowed me to transliterate several books, which I may print one day.
  20. There will eventually also be a transliteration of Ubuntu into Shavian.  (This can't be done in Launchpad's Rosetta subsystem, for reasons I don't fully understand.)
  21. I maintain several Perl modules in CPAN: Lingua::EN::Phoneme, Lingua::EN::Alphabet::Shaw, Lingua::EN::Alphabet::Deseret (whose purposes should be fairly clear), DateTime::Calendar::Liturgical::Christian (which finds which liturgical feast corresponds with a given date; I really want to port this to Maemo and include the relevant part of the Daily Office, which is public domain), Net::RGTP (whose use should again be fairly obvious), and Flickr::Embed (which embeds photos in blog posts, and is currently broken).
  22. blt is a Twitter client for the command line, written in Perl. It's working, but needs some further development.
  23. I used to maintain the Picons plugin to squirrelmail (which added logos to incoming mail representing the sender's domain), but I stopped using squirrelmail and the plugin rotted. I think this was the first piece of free software I produced, back in 2001.
  24. Avaricius was a graphical adventure game for DOS, produced in the late nineties.
  25. Avalot was another graphical adventure game for DOS.
  26. There were various other small games I wrote back then, including one about a wizard called Spellchick (I wasn't familiar at the time with the slang meaning of "chick" and Spellchick was a male wizard).
My blog is called "full of grandiose schemes" because I was given my medical notes when I emigrated, and the psychiatrist had written that about my explanations of my programming projects.
