nlsplit -- a small tool to split natural language text in natural chunks

Another one of those standard small tools which should exist but that I couldn't find... nlsplit is a tool to split natural language text in chunks at reasonable language boundaries. The program takes as argument a maximal size for chunks, reads stdin and produces chunks smaller than the maximal size on stdout.

If you want more info, see the README. You can also retrieve the C source directly.

Imitating GLaDOS with espeak

If you've played Portal, you must be acquainted with GLaDOS and her very odd way of speaking, which sounds like TTS but was actually recorded by an actress (Ellen McLain) and digitally edited.

It turns out that there is a common free TTS program, espeak, packaged for Ubuntu and Debian, which is not a very convincing imitation of a human, but which sounds a bit like GLaDOS:

echo "Hello, and, again, welcome to the Aperture Science " \
  "Computer-aided Enrichment Center" |
  espeak -ven+f3 -m --stdout -p 60 -s 180 |
  oggenc -q1 - > espeak1.ogg

In my opinion, though, the most striking thing about GLaDOS's speech is its seemingly random pitch. My point is that we can approximate this by adding some SSML markup.

echo "Hello, and, again, welcome to the Aperture Science " \
  "Computer-aided Enrichment Center" |
  for a in `cat`; do
    V=$(((($RANDOM) % 100) - 50))
    echo -n "<prosody pitch=\"+$V\">$a</prosody> " |
      sed 's/+-/-/' 
  done |
  espeak -ven+f3 -m --stdout -p 60 -s 180 |
  oggenc -q1 - > espeak2.ogg

Approximating Still alive using this technique is left as an exercise to the reader. ;-)

Russian roulette: a mathematical analysis

A pompous title, because the underlying math isn't that complicated, but still, here we go.

The game of Russian roulette is played as follows:

We have a revolver with a p-round cylinder loaded with only one round. The cylinder is spun so that the round lies in a position which is assumed to be random. We have n players numbered 0, 1, 2, ..., n-1 standing in a circle. Each player, starting at player 0, will try to shoot himself in the head. If the player dies, the game stops. If the player misses, they pass the revolver to the next player (player n-1 passes to player 0).

There are two main variants: either players don't spin the cylinder when they give it to the next player, or they spin it (and we assume that it puts the bullet in a new random position chosen uniformly at random and independently from other draws).

Players spin

In this version, the cylinder is spun before passing the revolver. This is equivalent to each player rolling a q-sided die, or drawing from an urn with replacement. In this section, we will write f = 1/p for brevity.

Here is my take at solving this one. Let us denote by q_i the probability that player i is the one to die. Notice that this game is memoryless in the sense that if a certain number k of rounds have been played without anyone dying, then the situation is exactly the same as in the beginning of the game (except the revolver might be in someone else's hands). For this reason, we have q_i = (1 - f)·q_i-1 for all i: a player's odds of dying are that of the previous player times the probability of the revolver not firing. It isn't very hard to see that this is true for the first round, and the observation about the game being memoryless explains why this continues to be true.

Now, we use the fact that the probabilities sum to one, ie. q₀ + q₁ + ⋯ + q_n-1 = 1. This means that q₀ + q₀(1 - f) + ⋯ + q₀(1 - f)^n-1, which is q₀((1 - f)⁰ + (1 - f)¹ + ⋯ + (1 - f)^n-1), is equal to 1. Using the geometric series formulae, we can write that q₀(1 - (1 - f)ⁿ) = 1 - (1 - f) = f. Now that we know q₀ and can express q_i as a function of q₀, we have the general solution:

q_i = f(1 - f)ⁱ / (1 - (1 - f)ⁿ)

Notice that for an (countable) infinite number of players, the denominator becomes 1, and the formula q_i = f(1 - f)ⁱ. This makes sense: for an infinite number of players, each players gets the revolver at most once, and their odds of dying is the odds that each of the previous players survived times the odds that they die. Also note that, obviously, the game is not fair, and that for i < j, q_i > q_j. Of course, the game is more unfair for larger values of f (because the penalty for being among the first to play is higher if the probability of dying is higher and the number of expected rounds is lower).

Players don't spin

This would be the version where players draw from an urn without replacement.

At first glance, this might seem more complicated, because the game isn't memoryless anymore. You might be tempted to think of it as a tree: either player 1 dies with probability 1/p, or player 2 dies with probability (1 - 1/p) · (1/(p-1)) = 1/p (interesting!), or player 3 dies with probability (1 - 1/p) · (1 - 1/(p-1)) · (1/(p-2)) = 1/p (very interesting!), and a pattern seems to emerge...

Well, this is not at all the right way to think about the problem, and I am ashamed to admit that I did not see why before someone explained to me. The right way to see things is that when you're spinning the cylinder at the beginning, you are effectively choosing which player will die, uniformly at random. If p ≤ n, then it is now obvious that the first p players have probability 1/p of dying and the others have probability 0. If p > n, then it's harder to write but not much more complicated: writing p = u n + v (where u is the quotient and v the remainder), we have q_i = (u + 1)/p for i < v and q_i = u/p for i ≥ v.

As a consequence, this variant is fair if (and only if) p is a multiple of n.

Efficient email management

I love email. I really love email. It is federated, which means I can run it on my own server. It is flexible, which means I can bend it to unusual uses. It is pretty simple, which means I can automate stuff with it or even write my own client. It has been around forever, which means it'll probably stay around for some more time. And so, as for everything which I spend a lot of time dealing with (and don't feel guilty about), I find myself trying to manage email the right way.

This post is not about the technical details of my current email setup (which is not perfect yet), but a general description of what I consider to be desirable features for an email setup. I might blog about the technical side once I get it to work exactly like I'd like it to work. Also, this post isn't about what you shouldn't do when sending email to someone, it's about dealing with incoming mail.

Archiving

Most people seem to keep all their email in their inbox (with occasional mass-archiving of all the inbox in a "sort later" folder), and either manage incoming email immediately or hope that they will somehow think about it later. It amazes me that people find this usable and usually remember important things. I use my inbox for messages which require some action on my part (like reading it carefully, reading something they refer to, doing something, answering, etc., or maybe just archiving the message). Whatever is in the inbox has to be moved out at some point. My inbox is a todo list.

The reason why people never archive their mail is because they usually have complicated schemes to archive messages. For instance, I've seen people archive messages in folders labeled by sender (hey, I also used to do that back in the days), which is quite pointless because this information is in the message anyway. My solution is different: one big archive folder. I'll sort later if I need to. Most stuff which gets there is never needed again. Archives like that should be optimized for writing.

Actually, I'll probably never need to sort anything because I can just search my mail (using indexing to get decent speed). True, it can be useful to keep some tags in order to add information which is not easy to find in the message (say I want to find quickly all email about some project...), but I'll only do it if I think I could probably need those messages later. If this isn't the case, just archive, and search later if need be.

Speed

The system should be keyboard-driven, and fast. "keyboard-driven" isn't just about having keyboard shortcuts, it's about having an interface designed with the keyboard in mind. "fast" isn't just about optimizing the hell out of some operations, but about having no useless noticeable delays, which is surprisingly rare.

Integrity

Unless you're working under very tight space constraints or you're being mailbombed, I don't see why you would delete a message, ever. Seriously. Archive it, and have a tag for things that you don't want to see unless you explicitly search for them. There is no reason for this to be hard (just bind it to the key you would ordinarily use to delete stuff).

This section is called "integrity", because I feel that you should distinguish the pristine data that you received from the stuff you add on top of it. Deleting a message violates this rule; modifying a message also does. Mainstream email clients usually don't do that (except to toggle flags), but I've seen mutt plugins silently change the message headers when you did some thing or the other. There should be a pristine copy of the mail which arrives, augmented by a datastore of what you did to the mail, probably supplemented by a cache to speed up things like search. The point is that you should distinguish the base data (email), the user data (labels, annotations, whatever) and the cache (which is just there for speed and can be regenerated).

Threading

Email threading is an absolute must. I cannot imagine what my inbox would look like without threading. I remember I was very surprised to learn that threading isn't guesswork based on subject and citations but a standardized, elegant and simple mechanism (with the headers Message-ID, In-Reply-To and References) which is unexpectedly followed by almost all email clients, even the ones which do not use them for display.

By email threading, I mean regular threads, ie. trees. The Gmail linear approximation (put messages in a thread, but don't show the tree, order messages by date) is better than nothing, but not satisfactory.

Priority inbox and spam filtering

People seem to rely on spam filtering to get rid of unwanted messages and, for Gmail users, rely on Priority inbox to cherry-pick important messages. Those two tools are useful, but it is unacceptable to delegate the task of choosing what to read and what to ignore to some software outside your control which tries to be intelligent.

As it turns out, I don't need those two tools, because I don't get much spam and I am able to deal with individual messages manually. If I needed to, though, I would insist on rolling my own there, more than anywhere else, to be sure that the filtering is sensible and that no important message gets lost because someone (or rather something) thought they knew what was interesting to me.

Mailing-lists

Email is not necessarily designed for messages that you want to read. It's also a convenient way to receive stuff you might want to catch up on later. Think of it as a neat way to keep a mirror of a mailing-list archive and browse it at some point if needed. It's okay to subscribe to tons of mailing-lists, as long as you filter them and redirect them in a folder which is separated from your real mail.

There is very little motive to unsubscribe from anything except technical limitations. If a list used to interest you, it might interest you again later--just redirect it directly to the archive, and tag it so it won't pollute your search results unless you want to.

Choices

If the inbox is a todo list, then there is a tradeoff to find between guidance ("next step is: do X") and flexibility ("actually, I feel motivated about Y right now"). Presenting the list of messages in the inbox and asking the user to pick one is very flexible, but the high number of possible choices can be daunting. It's interesting to imagine what the opposite approach would feel like: the system would display threads one after the other, and requires you to either do what is to be done, or add a note explaining what you're waiting for...

Scanning ISBN bar codes

I have a book collection which I would to have an index of. Fortunately, most recent books have a barcode on their cover which indicates their ISBN, an identifier which is unique for the book and can be used (at least in theory) to retrieve details about the book (title, author, cover image sometimes...). Thus, to keep a list of what is in my bookshelves, I just have to scan the barcode of each book and save the list somewhere.

To do this, I need a tool which takes a video stream as input and outputs the list of recognized EAN-13 codes, one per line. I do not need a fancy app with a slick captive user interface which would try to be smart and store a database in some mysterious place while making it harder for me to script things. Do one thing, and do it well.

If someone else intends to do this, here are two possible ways, and ideas to use the ISBNs.

Barcode Scanner on Android

This is actually a pretty natural thing to do with an Android phone, except you have to find the right barcode scanning app. (Most turn out to be riddled with ads, or ridiculous limitations, or don't have the required feature.)

The ZXing library has an Android port called Barcode Scanner which was able to read almost all tested barcodes with the webcam of my HTC Desire. Activate the Bulk scan mode option in Settings, scan items, go to History, select Send history (beware, Clear history is right next to it and has no confirmation step), and send an email to yourself with the csv attachment. It is free as in freedom (Apache License 2.0).

zbarcam

The zbarcam command (packaged for Debian) is exactly what I needed, but my computer's webcam was too crappy to scan barcodes. This might work for you if you have a better webcam. For the record, before learning about Barcode Scanner, I tried to use my HTC Desire as a webcam over USB using AndroidUsbCamera. Sadly, it does not work on recent kernels because it requires V4L1 support which was dropped in kernel 2.6.38.

What to do with the ISBNs

I thought that an ISBN lookup program would be packaged for Debian, but apparently not. The simplest API to use seems to be that of Open Library (a nice project, by the way, which provides dumps which are apparently in the public domain): example for ISBN 9780517223628. 56 out of my 115 scanned books were found.

For better options, a good starting point seems to be this StackOverflow thread. It seems that the Google Books API is the best one, but interfacing with Google APIs seems to require API keys, registration, etc., so I didn't bother yet.

a3nm's blog