improbable research
Mar. 29th, 2009 06:23 pm

Both of the existing freely available electronic pronouncing dictionaries (CMUDICT and MOBYPRON) are based on American English. Because of this, both dictionaries exhibit the splits and mergers common to that dialect, most notably the father-bother merger. Any work which requires distinguishing these phonemes therefore has no freely available pronouncing dictionary to draw on, and creating a pronouncing dictionary of any size from scratch is a prohibitively difficult and time-consuming process.
There is a small online community of enthusiasts of the Shavian alphabet, an alphabet designed in the 1950s to represent British English phonology more exactly than the Latin alphabet can. This community frequently produces versions of public domain texts such as The Wizard of Oz transliterated into the Shavian alphabet. In this discussion I present a method of aligning these texts with their Latin-alphabet equivalents (henceforth, "the Shavian version" and "the Latin version"), and producing a pronouncing lexicon from the alignment.
However, there are features which CMUDICT and MOBYPRON mark which the Shavian alphabet can never supply; in particular, Shavian has no marking for stress.
The texts used for this experiment were A Christmas Carol by Charles Dickens and the United States Constitution. Permission was sought and received from the transliterator to perform this experiment on the transliterated texts. The Latin and Shavian versions were placed into one XML file per text, and a Perl script was then used to retrieve the two versions word by word and output the results together.
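The pairing step can be sketched as follows. The original script was written in Perl, and the actual XML format is not described here, so the element names and file layout below are assumptions for illustration only:

```python
# Sketch of retrieving the two versions word by word from one text's XML
# file. The <latin>/<shavian> element names are hypothetical.
import re
import xml.etree.ElementTree as ET

def word_pairs(xml_text):
    """Yield (Latin word, Shavian word) pairs from parallel versions
    stored in a single XML document."""
    root = ET.fromstring(xml_text)
    latin = re.findall(r"\S+", root.findtext("latin", ""))
    shavian = re.findall(r"\S+", root.findtext("shavian", ""))
    # Pair the versions word by word, as the Perl script did.
    return list(zip(latin, shavian))

pairs = word_pairs(
    "<text><latin>a merry christmas</latin>"
    "<shavian>𐑩 𐑥𐑧𐑮𐑦 𐑒𐑮𐑦𐑕𐑥𐑩𐑕</shavian></text>")
```

Pairing by position alone is what makes copyist errors dangerous: a single dropped word shifts every later pair, which is why the error-detection system described below was needed.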
A problem inherent in any translation or transliteration effort is copyist error. The transliteration of the United States Constitution was remarkably free from transliteration slips, but errors had crept into the transliteration of A Christmas Carol: for example, "over and over and over" became "over and over", and an entire clause about Spanish friars had been dropped.
In order to find these errors, a checking system was developed and added to the Perl script. Each word in the Latin alphabet was reduced to its consonants, and each word in the Shavian alphabet was reduced to the equivalent consonants in the Latin alphabet; vowels were marked as nulls and not compared, since the correspondence between vowels in the two alphabets is far from simple. The two reduced versions were then compared, with special-case rules allowing a Latin "S" to match "S" or "Z" (for plurals) and a Latin "C" to match "K" or "S". If the percentage of correspondence dropped below 50% for six consecutive words, the versions were judged to have become misaligned and the entire process was halted.
The letter "H" was also treated as a null, because of its common and misleading use in digraphs. However, the first version of this system was shown to be deficient when a character's exclamation of "Ha, ha! Ha, ha, ha, ha!" stopped the process: each of those words reduced to an empty string, producing six consecutive comparisons of empty strings with empty strings, each scoring 0%. The problem was solved by allowing an empty string to match an empty string without calculating a percentage.
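The whole check, including the empty-string fix, can be sketched like this. The original was a Perl script; the Shavian-to-Latin consonant table below is a hypothetical partial mapping, not the one actually used:

```python
# Sketch of the misalignment check. SHAVIAN_CONSONANTS is an assumed,
# deliberately partial mapping; Shavian vowels simply fail the lookup
# and so are treated as nulls, like Latin vowels and "H".
SHAVIAN_CONSONANTS = {
    "𐑐": "P", "𐑚": "B", "𐑑": "T", "𐑛": "D", "𐑒": "K", "𐑜": "G",
    "𐑓": "F", "𐑝": "V", "𐑕": "S", "𐑟": "Z", "𐑥": "M", "𐑯": "N",
    "𐑤": "L", "𐑮": "R",
}

def latin_skeleton(word):
    # Vowels and "H" are nulls and are not compared.
    return [c for c in word.upper() if c.isalpha() and c not in "AEIOUH"]

def shavian_skeleton(word):
    return [SHAVIAN_CONSONANTS[c] for c in word if c in SHAVIAN_CONSONANTS]

def consonants_match(latin_c, shavian_c):
    # Special cases: Latin "S" may surface as S or Z (plurals),
    # Latin "C" as K or S.
    if latin_c == "S":
        return shavian_c in ("S", "Z")
    if latin_c == "C":
        return shavian_c in ("K", "S")
    return latin_c == shavian_c

def correspondence(latin_word, shavian_word):
    """Percentage of matching consonants between the reduced versions."""
    a, b = latin_skeleton(latin_word), shavian_skeleton(shavian_word)
    if not a and not b:
        # "Ha, ha!" reduces to nothing on both sides: count as a match
        # rather than computing a percentage (the empty-string fix).
        return 100
    hits = sum(consonants_match(x, y) for x, y in zip(a, b))
    return 100 * hits // max(len(a), len(b))

def check_alignment(pairs, threshold=50, run=6):
    """Return the index where a run of `run` consecutive sub-threshold
    words begins, or None if the texts stay aligned throughout."""
    bad = 0
    for i, (lat, shav) in enumerate(pairs):
        bad = bad + 1 if correspondence(lat, shav) < threshold else 0
        if bad >= run:
            return i - run + 1
    return None
```

Requiring six consecutive failures rather than one keeps the check robust to single odd words (loanwords, unusual spellings) while still catching genuine misalignment quickly.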
When given A Christmas Carol alone, the system produced 4,334 unique lexemes. Adding the United States Constitution brought the total up to 4,818. Merging in the existing word list from Androcles and the Lion (the only book ever printed in the Shavian alphabet), which contained 2,454 unique lexemes, gave 6,012 unique lexemes in total.
This is a disappointingly low result, compared to the 133,827 words in CMUDICT and the 177,267 in MOBYPRON. However, the corpus used is very small. It seems reasonable to assume that adding new Shavian works will increase the number of lexemes further. (Because of the Zipfian distribution of words in the English language, the new words will mostly be less and less common as new works are added; this means that even the existing results are likely to match many of the words of any given English document.)
(what would be useful here is taking a random English document and saying how many of the words matched)
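That measurement is straightforward to sketch: count what fraction of a document's word tokens appear in the lexicon. The lexicon and document below are toy stand-ins, not real results:

```python
# Sketch of a token-coverage measurement: the proportion of a document's
# running words found in the extracted lexicon. The sets and text here
# are toy examples only.
import re

def coverage(lexicon, document_text):
    tokens = re.findall(r"[a-z']+", document_text.lower())
    if not tokens:
        return 0.0
    hits = sum(t in lexicon for t in tokens)
    return hits / len(tokens)

lex = {"the", "ghost", "of", "christmas", "past"}
print(round(coverage(lex, "The Ghost of Christmas Yet to Come"), 2))  # → 0.57
```

Because of the Zipfian distribution mentioned above, token coverage should be much higher than the raw lexeme counts suggest: the few thousand most common lexemes account for most running words in ordinary English text.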