Sep. 22nd, 2009

marnanel: (Default)
I'm typesetting Alice in an interlinear Shavian edition. Here's the first chapter; there's more to come, obviously. Your feedback is greatly valued.

The only problem I can foresee is this exchange:
“I’m a poor man, your Majesty,” the Hatter began, in a trembling voice, “—and I hadn’t begun my tea—not above a week or so—and what with the bread-and-butter getting so thin—and the twinkling of the tea——”

“The twinkling of what?” said the King.

“It began with the tea,” the Hatter replied.

“Of course twinkling begins with a T!” said the King sharply. “Do you take me for a dunce? Go on!”
The word twinkling (𐑑𐑢𐑦𐑙𐑒𐑤𐑦𐑙) begins with the Shavian letter 𐑑, which is usually known as "tot". The names given in Unicode, and used canonically these days, were originally only examples (thanks, [personal profile] pne) given on the card which came with Androcles. The joke clearly doesn't work if the Hatter says, "the twinkling of the tot".

However, in Shaw Script, that letter was known as "tea". I may add a footnote to this effect. If you have other solutions, I'd like to hear them.

Update: [personal profile] pne points out that Shavian and italics don't mix well, and suggests some alternatives.
marnanel: (Default)
Introduction. I've had a few people ask me what happened to *.marnanel.org, which has been down for several days now. The short answer is that there were software memory problems in each case; the long answer differs for each application.

joule.marnanel.org: I'm putting this first because I expect most people reading this will want to know about Joule.

One of the main parts of Joule is the comparator, which compares the old state of your friends list with the new. The old comparator, "currant", had worked fine for around a year, but could only compare about 1000 records a second. Several months ago I introduced Twitter support to Joule, and because I wondered whether people with millions of followers might like to use it, I rewrote the comparator entirely, producing "raisin". This was a bad move for two reasons:

Firstly, although it worked, if you have millions of followers you also have hundreds of friendings and unfriendings every day, more than anyone would want to wade through. So Joule is rather useless for such people.

Secondly, the new comparator compared everything in memory rather than in the database. Although it worked with large datasets in testing, it ran into scalability problems in production: eventually it allocated so much memory that it crashed the server Joule was running on. That isn't acceptable, because that server is used for many other things and by many other people.
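
For what it's worth, the comparison itself needn't hold either list in memory: pushed down to the database, the old and new snapshots can be diffed with a pair of outer joins. Here's a minimal sketch in Python with SQLite; the followers table, its columns, and the snapshot ids are invented for illustration, not Joule's actual schema.

    import sqlite3

    # Hypothetical schema, for illustration only:
    #   followers(snapshot_id INTEGER, follower TEXT)

    def diff_snapshots(db_path, old_id, new_id):
        """Yield ('added'|'removed', follower) without loading either list into memory."""
        conn = sqlite3.connect(db_path)
        try:
            # In the new snapshot but not the old: friendings.
            for (name,) in conn.execute(
                    """SELECT new.follower FROM followers AS new
                       LEFT OUTER JOIN followers AS old
                         ON old.follower = new.follower AND old.snapshot_id = ?
                       WHERE new.snapshot_id = ? AND old.follower IS NULL""",
                    (old_id, new_id)):
                yield ('added', name)
            # In the old snapshot but not the new: unfriendings.
            for (name,) in conn.execute(
                    """SELECT old.follower FROM followers AS old
                       LEFT OUTER JOIN followers AS new
                         ON new.follower = old.follower AND new.snapshot_id = ?
                       WHERE old.snapshot_id = ? AND new.follower IS NULL""",
                    (new_id, old_id)):
                yield ('removed', name)
        finally:
            conn.close()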

I thought I had fixed both of these problems by adding a check that a user didn't have more than a few thousand followers. However, that wasn't enough, and the server eventually crashed again. The server admins asked me, quite reasonably, not to run Joule in its present state on that server.

Ways forward from here.

I don't know. Out of all the sites, this is the toughest one to bring back. Unfortunately, it's also by far the most used.

I could revert to "currant", although that would be fiddly: the database structure is now rather different, and I'd need to write something to convert back to the old format.

I suppose I could run Joule in its present state with "raisin", on a dedicated server.

I could convert Joule to CGI so that it was easier to profile.

The Yarrow sites: marnanel.org and rgtp.thurman.org.uk. These are less of a problem because they're CGI, not in-process. The problem with them is that they occasionally load the entire index into memory while creating a cache of it; this is easily fixed.
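
The fix I have in mind is roughly: walk the index one entry at a time and write the cache as we go, rather than reading the whole thing first. A sketch of the shape of it, in Python; the file layout and the summarise step here are placeholders, not Yarrow's real code.

    def rebuild_cache(index_path, cache_path):
        """Build the cache entry by entry instead of loading the whole index first."""
        with open(index_path, encoding='utf-8') as index, \
             open(cache_path, 'w', encoding='utf-8') as cache:
            for line in index:              # files iterate lazily, one line at a time
                entry = line.rstrip('\n')
                if entry:
                    cache.write(summarise(entry) + '\n')

    def summarise(entry):
        # Placeholder for whatever per-entry processing the real cache wants.
        return entry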

The Shavian wiki. This was also a memory hog, for a quite different reason: it cached all the transliterations while rendering. I could easily turn the cache off, but rendering would then become very slow. Ways forward: I would like to move the rendering into a separate script, so that I was no longer writing it inside MediaWiki, and therefore in PHP. I'd also like it to be able to render text taken from sources other than its own wiki, such as simple.wikipedia.org and en.wikisource.org; to take pronunciations from another, less perfect source such as CMUdict where they aren't already supplied (except in wiki-building mode); and to make some attempt at automatic disambiguation. That's a medium-sized rewrite, but this may be a good excuse for it.
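
To illustrate the CMUdict fallback: look a word up in the wiki's own lexicon first, and only consult CMUdict when the wiki has nothing. The sketch below (Python) assumes the lexicon is a plain word-to-Shavian dict, uses the copy of CMUdict that ships with NLTK, and leaves the ARPAbet-to-Shavian mapping as a stub; it isn't the wiki's actual code.

    from nltk.corpus import cmudict   # needs a one-off nltk.download('cmudict')

    _cmu = cmudict.dict()             # maps a lowercase word to its ARPAbet pronunciations

    def transliterate_word(word, lexicon):
        """Prefer the wiki lexicon; fall back to CMUdict; otherwise leave the word alone."""
        key = word.lower()
        if key in lexicon:                       # a hand-checked wiki entry wins
            return lexicon[key]
        if key in _cmu:                          # the 'less perfect' automatic fallback
            return arpabet_to_shavian(_cmu[key][0])
        return word                              # unknown word: keep the Latin spelling

    def arpabet_to_shavian(phones):
        # Stub: a real mapping from ARPAbet phones to Shavian letters would go here.
        return '/'.join(phones)
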
marnanel: (Default)
Three things:
  1. Who are the audience?  There are the few but determined people who go to the Shavian wiki because they want to build a lexicon.  I know a lot about that audience.  But there are also people who want to read texts in Shavian or one of the other alphabets, or who want to transliterate text into Shavian or one of the other alphabets (even if only for a joke), or who want to learn about the alphabets.  I'm not sure who these people are, how many of them there are, how best to cater for them, or how best to attract the people who would like to know about the site but don't.  For example, would people be better served by having Shavian in one column and conventional spelling in the other, or would they prefer Shavian all the way through?  Would people like to be able to read Wikipedia in Shavian?  It could easily be done once the transliteration script had been made separate from the wiki.
  2. How do we store disambiguation information?  At the moment we do this inline in the texts.  I would like to find a way of separating the disambiguation data from the text itself, so that we could keep pristine copies of (say) the contents of en.wikisource and just add disambiguation notes to override automatic disambiguation (e.g. "read 'number' as 𐑯𐑳𐑥𐑼, not 𐑯𐑳𐑥𐑚𐑼") in a separate place.  (I'm talking about the back end here, not the user interface.)  We could of course use character or lexeme offsets, but then we would have to consider what happens when the text gets updated.  One possible option, which I tried with the "existing" system, is to store the position as the nth occurrence of a particular word; there's a sketch of this after the list.
  3. How can we be most efficient?  Caching makes everything faster, obviously, but it also gives us a huge memory footprint.  It would be possible to download the lexicon to a local copy every so often to make things faster.  I'm also toying with the idea of storing each document as a series of word records, rather than as a single record, and doing the lookup on the database side using a left outer join; that idea is also covered in the sketch after the list.
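
To make points 2 and 3 concrete, here's one possible shape for that back end, sketched in Python with SQLite: each document is stored as one row per word, overrides are keyed by (document, word, nth occurrence), and the lexicon lookup happens on the database side with a left outer join. The table and column names are made up for the sake of the example.

    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS words    (doc TEXT, pos INTEGER, word TEXT);
    CREATE TABLE IF NOT EXISTS lexicon  (word TEXT PRIMARY KEY, shavian TEXT);
    CREATE TABLE IF NOT EXISTS overrides(doc TEXT, word TEXT, nth INTEGER, shavian TEXT);
    """
    # Set up with: conn = sqlite3.connect('shavian.db'); conn.executescript(SCHEMA)

    def render(conn, doc):
        """Return the document in Shavian, resolving each word against the lexicon."""
        rows = conn.execute(
            """SELECT w.pos, w.word, l.shavian
               FROM words AS w
               LEFT OUTER JOIN lexicon AS l ON l.word = w.word
               WHERE w.doc = ?
               ORDER BY w.pos""", (doc,))
        overrides = {(word, nth): shavian
                     for word, nth, shavian in conn.execute(
                         "SELECT word, nth, shavian FROM overrides WHERE doc = ?", (doc,))}
        seen = {}                        # word -> occurrences met so far
        out = []
        for pos, word, shavian in rows:
            n = seen.get(word, 0) + 1    # this is the nth occurrence of this word
            seen[word] = n
            # An override for the nth occurrence beats the automatic choice;
            # an unknown word falls back to its Latin spelling.
            out.append(overrides.get((word, n), shavian or word))
        return ' '.join(out)

In this arrangement the overrides table would be the only thing an editor's correction writes to; the original text and the lexicon stay pristine, and the nth-occurrence key survives edits elsewhere in the document better than a character offset would.
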
Your thoughts are, as ever, welcomed.
