marnanel: (Default)
The main purpose of the Shavian wiki is to build a lexicon. But when I started the site, it also allowed you to upload documents in the Latin alphabet, and it would transliterate them on the fly. It got to be quite clever, allowing you to add unknown words to the lexicon inline, and disambiguate homonyms.

However, the transliteration was done with a MediaWiki extension (called George), and therefore was written in PHP. It needed to build a fairly large cache of transliterations in memory, and because PHP runs in-process in the webserver, this resulted in the process taking up far too much memory. So I turned the transliteration off.

The sources of these documents used wiki markup. About the same time, I started building a set of typesetting tools which used DocBook markup. This was much more flexible. But then there were two sets of incompatible source documents, and the tools which were available to the wiki documents were not available to the DocBook documents.

It occurred to me a while ago that an equivalent CGI script would be just as good as the MediaWiki extension, and would return the memory once it was done. It occurred to me today that I should take this opportunity to stop using wiki markup and use DocBook for everything. I could easily write a translator for the documents which already exist. And writing the word-adding and disambiguating tools as CGIs would mean a lot more flexibility.

I really like this idea. (I won't be doing it just yet, because I'm busy, but it's certainly something worth thinking about.)
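
A minimal sketch of the CGI idea, in Python rather than the PHP of the extension. The two-word lexicon, the `text` query parameter and the function names are all made up for illustration; nothing here is the George extension's actual interface. The point is only that the lookup table lives in a short-lived process, so its memory goes back to the system as soon as the request is finished.

```python
import os
import re
import sys
from urllib.parse import parse_qs

# Hypothetical stand-in for the lexicon; a real script would load it from the wiki's data.
LEXICON = {"sita": "𐑕𐑰𐑑𐑨", "rama": "𐑮𐑭𐑥𐑨"}


def transliterate(text, lexicon):
    """Replace each known word with its Shavian spelling; leave unknown words alone."""
    def lookup(match):
        word = match.group(0)
        return lexicon.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", lookup, text)


def respond(query_string, lexicon=LEXICON):
    """Build a complete CGI response (header block plus body) for one request."""
    params = parse_qs(query_string)
    text = params.get("text", [""])[0]
    header = "Content-Type: text/plain; charset=utf-8\r\n\r\n"
    return header + transliterate(text, lexicon)


if __name__ == "__main__":
    # For GET requests the web server passes the query string in the environment.
    sys.stdout.write(respond(os.environ.get("QUERY_STRING", "")))
```

In real use the script's standard output would need to be UTF-8 for the Shavian body to come out intact.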
marnanel: (Default)
(You may know this already, but if not I thought you might be interested.)

With reference to your column at http://www.drdobbs.com/blog/archives/2010/06/unicode_and_the.html : the reason the translator at pinyin.info choked on the Shavian characters you gave it is that all Shavian characters have codepoints above 0xFFFF, and therefore (if you're using UTF-16, which the pinyin.info translator appears to be) they won't fit in a single 16-bit word and have to be represented using surrogate pairs. Wikipedia has reasonable coverage of surrogate pairs: http://en.wikipedia.org/wiki/Surrogate_pair , but briefly, it's a way to represent a Unicode character whose codepoint is too high for one 16-bit word by using a pair of reserved values, neither of which is a valid character on its own, and both of which are low enough to fit. Hence the effect you noted of having "the wrong codes, and twice too many of them".

The fault is presumably with the pinyin.info translator, which shouldn't give out surrogate pairs unless explicitly asked, but it does go to show that, as Wikipedia puts it, "code is often not tested thoroughly with surrogate pairs. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software", or as you put it, "computing still is not mature".

Thomas (author of the transliterator script on shavian.org)
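
The arithmetic behind the surrogate pairs described in the letter can be sketched in a few lines of Python. The function name is only illustrative; the constants are the standard UTF-16 ones, and the result is checked against Python's own UTF-16 encoder.

```python
def surrogate_pair(codepoint):
    """Split a code point above 0xFFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    offset = codepoint - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


first_shavian = "\U00010450"  # the Shavian block runs from U+10450 to U+1047F
high, low = surrogate_pair(ord(first_shavian))
utf16 = first_shavian.encode("utf-16-be")

print(f"U+{ord(first_shavian):05X} -> {high:04X} {low:04X}; UTF-16 bytes: {utf16.hex()}")
# prints: U+10450 -> D801 DC50; UTF-16 bytes: d801dc50
```

One character becomes two 16-bit values, which is exactly "the wrong codes, and twice too many of them".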

nice

Jun. 4th, 2010 10:40 pm
marnanel: (Default)
The Christian Science Monitor just linked to shavian.org rather prominently in the text of an article. (For those outside the US, despite its name, the CS Monitor is a largely secular national paper which mostly publishes online these days, although it still puts out a weekly print edition.)
marnanel: (Default)


Of course now it's a trivial matter to make Firefox in other phonemic scripts; I wonder if Boing Boing would be interested in Firefox in Tengwar...
marnanel: (Default)
Today it occurred to me that there is a fifth option to add to my pondering of the other day: keep it on the command line, and write it in Perl.  I already have a Shavian transliterator in CPAN which could do with an overhaul.  Most of the file formats it would need to read already have code available to read them (although .po support is kind of iffy).  And command-line tools built in Perl can be easily distributed over CPAN, as ack has shown.

The problem of updating the wiki is then easily fixed because we can use CPAN's support for the MediaWiki API, and upload a list of the missing words to a user's userspace on the wiki.

One thing I'd like to get right in my head before I start is where the database would go.  Currently, the CPAN Shavian transliterator keeps it with other Perl data in /usr.  But it would be really useful to be able to check for updates from shavian.org.uk and download them, without constant updates of the CPAN package.  So maybe the data should go in ~/.cache/shavian ?
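
For what it's worth, the usual way to pick a per-user cache directory follows the XDG Base Directory convention: use $XDG_CACHE_HOME if it is set and non-empty, and fall back to ~/.cache otherwise. A small sketch follows, in Python rather than Perl purely for illustration; the function name and the "shavian" subdirectory are assumptions, not anything the CPAN module does today.

```python
import os
from pathlib import Path


def lexicon_cache_dir(environ=os.environ):
    """Return the directory where a downloaded copy of the lexicon could live."""
    # An unset or empty XDG_CACHE_HOME means "use the default", which is ~/.cache.
    base = environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "shavian"


print(lexicon_cache_dir())
```

Keeping the data here rather than in /usr also means that updating it never needs root.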

Someone is apparently already using the name "Sparkle" for a free software project, so maybe I'll rename this to "makeshaw" or something.

marnanel: (Default)
(written while waiting for a compile; this is me thinking aloud...)

Earlier, I mentioned my existing tool to transliterate into the Shavian alphabet, and my thoughts of doing it as a web tool rather than tidying up and releasing the existing command-line stuff.  There are four ways I can see of doing this:
  1. Make it almost entirely separate from the wiki.  You would upload your translation catalogue, and the page would let you specify how you wanted it transliterated, and then download the result.  The only connection with the wiki would be the ability to specify transliterations of each word which would then be rolled back into the wiki.
  2. Allow people to upload catalogues to the wiki, but then convert it to wiki markup.  In this case, we would have a system which could turn various formats of translation catalogue into standard MediaWiki markup, and another system which would turn them back; then there could be a bot which updated the MediaWiki text.  This would mean that hand-editing would be fairly easy, but brings in the great nuisance of writing the conversion systems.
  3. Allow people to upload catalogues to the wiki, and treat them as files.  MediaWiki allows file storage, which is usually used for images.  It's turned off on the Shavian wiki, but it could be turned on for translation catalogues.  Then all the faff of making the system upload and download things would be done for us.
  4. Allow people to upload catalogues to the wiki, and treat them as text.  After all, they are text.  This idea means that there would be a separate namespace for translation catalogues within wiki pages, such as Translation:Metacity.po.  The system would probably render all these as a message saying "Please view source", at least until I got around to a system to render the various catalogue formats to HTML.  Then a bot controlled from a separate page could easily update them to change the Shavian transliteration, and the files could be hand-edited without too much difficulty.
I think I like the fourth option best.

In other news, part of this journal (the part tagged "nimyad") is now on the conlang aggregator; I am pondering the idea of a "Planet Shavian".
marnanel: (Default)
I have a number of versions of the Shavian transliterator around (a program that can be used to make Shavian-alphabet versions of programs). The most general of them is a tool called Sparkle. It does .po files, and in theory files in Mozilla's formats and .srt subtitles files.

I was thinking of packaging up Sparkle in case anyone else wants to use it, since it's pretty useful to me.

However, there's the question of what happens to words found during translation which aren't in the dictionary. At the moment, it spits out an HTML page with links to the dictionary pages so you can fill them in. That requires you to load the page in your browser and follow each link, which isn't very helpful. I was also thinking of making it able to output a simple XML file, where you could fill in transliterations; then it would have a switch to read the file back and update the wiki using the MediaWiki API. That's kind of clunky, though.
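
To make the XML idea concrete, here is a rough sketch in Python of the round trip: write out the unknown words, let someone fill them in, and read back only the ones that were filled in. The element and attribute names (`missing`, `word`, `latin`) are invented for this example, and the final step of pushing the results to the wiki through the MediaWiki API is left out.

```python
import xml.etree.ElementTree as ET


def missing_words_xml(words):
    """Serialise the unknown words as empty <word> elements to be filled in by hand."""
    root = ET.Element("missing")
    for word in sorted(set(words)):
        ET.SubElement(root, "word", latin=word)
    return ET.tostring(root, encoding="unicode")


def read_filled_in(xml_text):
    """Return {latin: shavian} for every <word> that now has some text in it."""
    root = ET.fromstring(xml_text)
    return {
        element.get("latin"): element.text.strip()
        for element in root.iter("word")
        if element.text and element.text.strip()
    }


blank = missing_words_xml(["sita", "rama", "sita"])
filled = '<missing><word latin="rama"/><word latin="sita">𐑕𐑰𐑑𐑨</word></missing>'
new_entries = read_filled_in(filled)   # only "sita" has been filled in
```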

There is another solution: I could just add an HTML front end and make Sparkle into a web tool instead. You'd be able to upload translatable files, press a few buttons, and download them again in Shavian (or another altscript, at your choice); you'd also be given a list of words to fill in which weren't found in the dictionary, and they would be added to the wiki automatically.

I'm not sure who else would use such a tool, though, in either of its possible incarnations. Let me know if it would be interesting to you.
marnanel: (Default)
I made this earlier this week; it was pretty simple. I thought I'd post an image in case anyone would like to see what it looks like.



I doubt anyone would want the whole thing, but if you do, it exists.
marnanel: (Default)
Then again, it might show people how to do this sort of thing.


marnanel: (Default)
I mentioned the other day that I'd been playing around with subtitling Sita into Shavian. I said I wasn't going to work on it for a while, but in the back seat on the long drive back from seeing Amy and John at the New Year, I was too sleepy to do anything that wasn't very repetitive, so I finished off the translation.

I believe this is (unsurprisingly) the only film ever to have been given a full set of Shavian subtitles.

Here's the .srt file. It's a static copy of generated content; the master copy is this wiki page, which also contains drakar2007's original Latin-alphabet subtitles.

However, the .srt file isn't very useful to most people, and besides, most players won't play it because of flaky Unicode support (Totem being an honourable exception). You can play it with mplayer, but it involves creating a special font and running the .srt through a shifting filter.

I've run off a copy of the entire film in this way, and it's 1.3GB. I am wondering what to do with it now so that I can show it to people:
  • put it on YouTube. More people look at this than anything else, but this would involve breaking it up into twelve ten-minute chunks.  Workable, but I'd rather be able to link to one particular page from shavian.org.uk than twelve.
  • put it on Vimeo. This would involve breaking it up into 500MB chunks (or buying a pro account, which I'm not going to). It's possible that I could get it down to two chunks by shrinking the images a little (they're currently at quite a high resolution).
  • put it on the moving images section of archive.org. This would allow it to be complete, but uploads don't seem to work for me: I get the whole file uploaded and then my session has timed out. Maybe it's worth the trouble of asking in the forums, though.
  • all or many of the above.
  • just leave it as an .srt and expect people to work it out for themselves.
What do you think?

Image from Sita Sings the Blues copyright © Nina Paley, licensed under the Creative Commons Attribution Share-Alike licence.
marnanel: (Default)
Some people have shown interest in the Shavian-subtitled Sita.

I'm not going to do any work on it in the next few weeks, because I'm quite busy at the moment, but I foresee a fairly major stumbling-block being the transcription of names, and in particular how to represent the vowels.  Obviously there won't be a one-to-one mapping with English-language vowels anyway, but I want to do the best I can.

I am thinking that
  • Sita = ·𐑕𐑰𐑑𐑨, that is, the vowels in tree and cat, not the vowel in ago
  • Rama = ·𐑮𐑭𐑥𐑨, that is, the vowels in mast (in the RP pronunciation) and cat
  • Hanuman = ·𐑣𐑨𐑯𐑵𐑥𐑨𐑯, that is, the vowels in hand, soup, and man.  Or is the final vowel more like the a in mast?
Wikipedia isn't helping much, since it doesn't give IPA.
marnanel: (Default)
Someone asked about handwritten Shavian.  There are a number of letters which can possibly be confused.  Just wanted to jot this down and come back to it later.
  • The worst group: 𐑲, 𐑳, 𐑶 can all turn into one another if written fast.
  • Also some pairs: 𐑯/𐑷, 𐑥/𐑭, 𐑱/𐑬.
  • 𐑯𐑥 can be written as straight diagonal lines and 𐑷𐑭 as zigzags.
  • We could perhaps suggest that 𐑶𐑬 be written crossed to distinguish them from 𐑱𐑲𐑳.
  • That takes care of everything except the confusion of 𐑲 with 𐑳.  I'm not sure what to suggest for that.  Any thoughts?
  • There are many more possible confusions if you don't take care to distinguish tall/deep and short letters: 𐑓/𐑨 𐑝/𐑩 etc.
marnanel: (Default)
I haven't posted for a while. Here are four things:

  1. Collabora have been supporting the CSS-on-window-borders project recently by letting me work on it during work hours. Here is a status update.

  2. Recent updates to shavian.org.uk include a gentle Shavian tutorial and translations of all the recent XKCDs into Shavian.

  3. Many years ago, I wrote a sonnet for use on a server's custom 404 page:
    So many years have passed since first you sought
    the lands beyond the edges of the sky,
    so many moons reflected in your eye,
    (familiar newness, fear of leaving port),
    since first you sought, and failed, and learned to fall,
    (first hope, then cynicism, silent dread,
    the countless stars, still counting overhead
    the seconds to your final voyage of all…)
    and last, in glory gold and red around
    your greatest search, your final quest to know!
    yet… ashes drift, the embers cease to glow,
    and darkened life in frozen death is drowned;
    and ashes on the swell are seen no more.
    The silence surges. Error 404.

    It's been spreading itself around, mostly without my permission, so I'm releasing it under a Creative Commons licence. Fly free, little sonnet! Please feel free to copy it onto your own sites, and if you would, let me know you've done so.

  4. I need to write more of the Maemo tutorials. They will be coming soon. Sorry; things have been busy.

marnanel: (Default)
I have redesigned the front page a little.

I have released the lessons and the software which creates their web pages under a Creative Commons licence.  You can get your own copy using:

git clone http://shavian.org.uk/learn/.git

Also, I want to bring the Shavian Wiki back.  This needs some thought.  The original purpose of the wiki was threefold:
  1. For hosting useful pages about Shavian.  But shavian.org.uk can do that now.
  2. For automatic translation of documents.  But that's computationally expensive, and we now have two other ways of doing this: in one SQL join for long documents (say, if we wanted to put up content from Wikibooks), and as generated static content (which is how most of the current pages are made).
  3. For hosting a mapping of conventional to Shavian spellings, since there are no adequate ways of doing this using existing databases.
Number three is the only function which needs to continue.  However, it doesn't necessarily have to happen using a wiki, and it certainly doesn't have to be MediaWiki, which is large, complicated, and written in PHP.  I could
  • write my own front end to let people edit the database,
  • use a different and perhaps simpler wiki system, or
  • stick with the familiar and use MediaWiki.
There's also no particular reason, if we do use MediaWiki, that it has to run on the same server as the static-HTML-and-Python-CGI remainder of the site.

Update: I did indeed bring the old version back.  The extensions are turned off, so it doesn't do transliterating, but it's enough to let people continue updating the lexicon.  It also needs re-skinning rather a lot.
marnanel: (Default)
The first CC picture Flickr gave me for "cook" was the King of Jordan making dinner.

You can see all the "learn" pictures at once: quite a feast of randomness. What I would like to do is make an automated quiz at the end of each lesson, a bit like the one Rosetta Stone has, where you get given a word you've learned and have to click on the one out of four pictures which represents it.

There is a partial samizdat of Shaw Script.

I also made a game of hangman last night, but I think the player needs more chances than normal because there are more letters in the alphabet. Also, a letter frequency chart would be really useful.
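
A letter frequency chart is easy to produce once there is a reasonable amount of Shavian text to count. A minimal sketch in Python: the function name is made up, and the sample is just three names, far too little text to say anything real about the alphabet.

```python
from collections import Counter


def shavian_letter_frequencies(text):
    """Count only characters in the Shavian block (U+10450 to U+1047F)."""
    return Counter(ch for ch in text if 0x10450 <= ord(ch) <= 0x1047F)


SAMPLE = "𐑕𐑰𐑑𐑨 𐑮𐑭𐑥𐑨 𐑣𐑨𐑯𐑵𐑥𐑨𐑯"   # Sita, Rama, Hanuman
freq = shavian_letter_frequencies(SAMPLE)
total = sum(freq.values())

# Print code points rather than the letters themselves, so this works on any terminal.
for letter, count in freq.most_common(3):
    print(f"U+{ord(letter):05X}  {count}  {count / total:.0%}")
```

Run over the whole lexicon or a long text, the same counts would also suggest how many extra chances a hangman player ought to get.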
marnanel: (Default)
Some people have suggested I produce a Shavian edition of Borrowable. This is an interesting idea, because it would be easy to do, and all the Shavian addicts in the world would want a copy. It would also make rather an interesting book for kids (and adults) interested in codes and ciphers.

However, in order to do this, there would have to be an introduction to the alphabet aimed approximately at middle-grade readers. So I wrote one. I'd like to know what you think, and what you think should be changed.

There are still a lot of spaces where I'm going to put pictures, but all the text is there.
marnanel: (Default)
  • IT IS RIORDON'S BIRTHDAY.  Happy birthday to the kid who is officially the most awesome kid in the world.
  • The nicest Joule comment ever.
  • Two reviewers in Canada want physical copies of Borrowable: one says they will probably review it and one might review it. One other reviewer doesn't want it because it's self-published. I will therefore be ordering more author copies when my paycheque comes through.
  • The Launchpad people will allow Shavian translations only if we first fix the bugs in Launchpad which are holding it up. Arc does not seem to think this will be a major difficulty.
  • Cambridge University Library "would be delighted" to add Borrowable to their collections.
  • I have just received a (free) review copy of Writing Children's Books For Dummies.
  • I have finished A Tale of Two Guinea-Pigs and thoroughly enjoyed it. A review follows, later today.
  • Did I mention there was a quiz on the Borrowable site now? And the start of a recipe collection?
marnanel: (Default)
A demo of:
  • Transliteration done with an SQL join (not that you can see that)
  • My POS-based automated disambiguation system (can you find the ambiguous word which was correctly resolved?)
  • Fin's interlinear highlighting system applied to Shavian text
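
The first item, transliteration with an SQL join, might be sketched like this, with the document stored as one row per word. It uses Python's sqlite3 module and an in-memory database; the table and column names are invented for the example and are not the schema behind the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon (latin TEXT PRIMARY KEY, shavian TEXT NOT NULL);
    CREATE TABLE doc_words (position INTEGER PRIMARY KEY, latin TEXT NOT NULL);
""")
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [("sita", "𐑕𐑰𐑑𐑨"), ("rama", "𐑮𐑭𐑥𐑨")],
)
# One record per word of the document, in reading order.
conn.executemany(
    "INSERT INTO doc_words VALUES (?, ?)",
    list(enumerate("sita and rama".split())),
)

# The left outer join keeps every word of the document; words that are
# missing from the lexicon come back with NULL in the shavian column.
rows = conn.execute("""
    SELECT d.latin, l.shavian
    FROM doc_words AS d
    LEFT OUTER JOIN lexicon AS l ON l.latin = d.latin
    ORDER BY d.position
""").fetchall()

transliterated = " ".join(shavian or latin for latin, shavian in rows)
missing = [latin for latin, shavian in rows if shavian is None]
```

The NULLs double as the list of words which still need adding to the lexicon.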
marnanel: (Default)
  1. I wonder whether it should still be called "the Shavian wiki" when it can be used to show lots of other alphabets too.  Maybe "the phonemic wiki" or "the spelling reform wiki" or something.  On the other hand, it's still primarily in and about Shavian.
  2. Maybe it could do with a visual redesign; I've just been using the Cologne Blue theme from the standard MediaWiki distribution.
  3. Maybe it should move from shavian.marnanel.org and have its own .org.
Your thoughts are, as ever, appreciated.
marnanel: (Default)
Three things:
  1. Who are the audience?  There are the few but determined people who go to the Shavian wiki because they want to build a lexicon.  I know a lot about that audience.  But there are also people who want to read texts in Shavian or one of the other alphabets, or who want to transliterate text into Shavian or one of the other alphabets (even if only for a joke), or to learn about the alphabets.  I'm not sure who or how many these people are, or how best to cater for them, or how best to attract the people who would like to know about the site but don't.  For example, would people be better served having Shavian in one column and conventional spelling in the other, or would they prefer all Shavian all the way through?  Would people like to be able to read Wikipedia in Shavian?  It could easily be done once the transliteration script had been made separate from the wiki.
  2. How do we store disambiguation information?  At the moment we do this inline in the texts.  I would like to find a way of separating the disambiguation data from the text itself, so that we could keep pristine copies of (say) the contents of en.wikisource and just add disambiguation notes to override automatic disambiguation (e.g. "read 'number' as 𐑯𐑳𐑥𐑼, not 𐑯𐑳𐑥𐑚𐑼") in a separate place.  (I'm talking about the back end here, not the user interface.)  We could of course use character or lexeme offsets, but then we should consider what would happen if the text got updated.  One possible option, which I tried with the "existing" system, is to store the position as the nth occurrence of a particular word.
  3. How can we be most efficient?  Caching makes everything faster, obviously, but also gives us a huge memory footprint.  It would be possible to download the lexicon to a local copy every so often to make things faster.  I'm also toying with the idea of storing each document as a series of word records, rather than as a single record, and doing the lookup on the database side using a left outer join.
Your thoughts are, as ever, welcomed.
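
To illustrate the "nth occurrence" idea from point 2, here is a small sketch in Python which keeps the overrides in a separate table keyed by (word, occurrence number) and leaves the text itself untouched. The data structures and names are assumptions made for the example; choosing the first spelling in the list stands in for whatever the automatic disambiguation would have picked.

```python
import re
from collections import defaultdict

# Each ambiguous word lists its possible spellings; the first is the default.
LEXICON = {"number": ["𐑯𐑳𐑥𐑚𐑼", "𐑯𐑳𐑥𐑼"]}

# Stored apart from the text: "the 2nd occurrence of 'number' is the other one".
OVERRIDES = {("number", 2): "𐑯𐑳𐑥𐑼"}


def transliterate(text, lexicon, overrides):
    """Transliterate known words, applying per-occurrence overrides where given."""
    seen = defaultdict(int)

    def replace(match):
        word = match.group(0).lower()
        seen[word] += 1                       # occurrences are counted from 1
        override = overrides.get((word, seen[word]))
        if override is not None:
            return override
        options = lexicon.get(word)
        return options[0] if options else match.group(0)

    return re.sub(r"[A-Za-z']+", replace, text)


result = transliterate("The number of cold fingers grew number.", LEXICON, OVERRIDES)
```

An override stored this way survives edits elsewhere in the text, which character offsets would not, but it still shifts if someone adds or removes an earlier occurrence of the same word.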
