a3nm's blog

Haspirater -- identifying initial aspirated 'h's in French words

— updated


I just wrote haspirater, a system to detect whether the initial 'h' of a French word is aspirated or not. (I happened to need this, and apparently no one had done it yet.) For those unfamiliar with the context: in French words which start with an 'h', the 'h' can be aspirated or non-aspirated, which changes nothing about the pronunciation but changes how the word behaves with respect to elision and liaison. Of course, there is no known rule to tell from the structure of a word whether its 'h' is aspirated or not...

The simple approach would be to use word lists, but of course such lists are never complete and will fail on unseen words. A natural solution for unseen words is to assume that they behave in the same way as the closest known word. This forces us to define "closest", and it seems reasonable to pick the known word with the longest common prefix, because whether the 'h' is aspirated or not should be mostly determined by the beginning of the word. (Actually, it would be better to look at the pronunciation of the beginning of the word, if we could afford it.)

This suggests a simple optimization. If we are going to take the result of the closest word in this fashion, we might as well drop words from the known words list which do not contribute anything to the result. In other words, if all words starting with "hach" in the list have an aspirated 'h', then there is no need to store them all; just storing "hach" will be enough. In fact, this means that the appropriate structure for our word list is a trie, and the optimization that I mentioned is apparently called compression.
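
To make this concrete, here is a minimal sketch of such a trie lookup in Python (an illustration, not haspirater's actual code or data format): each node is a dict whose children are indexed by letter, a decision can be attached to a node, and looking a word up means following it letter by letter and keeping the decision of the deepest annotated node reached. The trie contents below are made up for the example.

# Sketch of a prefix-trie lookup (illustration only, not haspirater's code).
# Each node is a dict: the key "" holds the decision attached to the prefix
# leading to this node (True = aspirated), other keys are child letters.
EXAMPLE_TRIE = {
    "h": {
        "": False,                       # default for unknown h-words
        "a": {
            "c": {"h": {"": True}},      # "hach..." -> aspirated (hache, ...)
            "u": {"t": {"": True}},      # "haut..." -> aspirated
        },
        "i": {"": False},                # "hi..." -> non-aspirated (hirondelle, ...)
    },
}

def lookup(word, trie=EXAMPLE_TRIE):
    """Return the decision of the deepest known prefix of `word`."""
    node = trie
    result = node.get("")
    for letter in word:
        if letter not in node:
            break
        node = node[letter]
        if "" in node:
            result = node[""]
    return result

print(lookup("hachette"))    # True: aspirated
print(lookup("hirondelle"))  # False: non-aspirated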

Another trick is that we can try to infer word lists automatically from a corpus using some simple rules. If we read "la hache" in a text, it means that the initial 'h' is aspirated, whereas "l'hirondelle" indicates that "hirondelle" starts with a non-aspirated 'h'. We can easily process megabytes of text and get word lists in this way.
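
As an illustration, the extraction could be sketched as follows (simplified to the two clues just mentioned, and not the actual extraction code): scan the text for a non-elided article before an h-word, or an elided article before an h-word, and count a vote for the corresponding class.

# Sketch of inferring aspiration from a corpus (simplified, illustrative only).
# "la hache"     -> aspirated 'h' (the article is not elided)
# "l'hirondelle" -> non-aspirated 'h' (the article is elided)
import re
from collections import defaultdict

ASPIRATED = re.compile(r"\b(?:la|le)\s+(h\w+)", re.IGNORECASE)
NON_ASPIRATED = re.compile(r"\bl'(h\w+)", re.IGNORECASE)

def extract(text):
    """Map each h-word seen in the text to vote counts for each class."""
    votes = defaultdict(lambda: {"aspirated": 0, "non_aspirated": 0})
    for match in ASPIRATED.finditer(text):
        votes[match.group(1).lower()]["aspirated"] += 1
    for match in NON_ASPIRATED.finditer(text):
        votes[match.group(1).lower()]["non_aspirated"] += 1
    return dict(votes)

print(extract("Il prend la hache et regarde l'hirondelle."))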

This approach works quite well, yielding a dataset which is under 5 KB of JSON (and less than 1 KB compressed) and a lookup program which is just 40 lines of Python. It gives the right result for every example I could come up with (though I had to add a few exceptions for words missing from the corpus, both manually and using Wikipedia and Wiktionary); hopefully it will also work for you, and please tell me if it doesn't. Note that we could even compile the trie into a very efficient C program if speed were of the essence.

Another amusing thing is that we can draw the data and try to see what the "rules" are. Here is the trie (it is quite messy). The node border and edge thickness are a logarithmic function of the number of occurrences, the node labels indicate the prefix, and the color is red for an aspirated 'h' and blue for a non-aspirated one (or a mix thereof for ambiguous cases, depending on the proportion).
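
For the curious, a picture along these lines can be produced with Graphviz. Here is a rough sketch of the idea (not the script used for the actual drawing), assuming each trie node carries its prefix, its counts of aspirated and non-aspirated occurrences, and its children; the counts below are invented.

# Sketch of emitting a Graphviz dot file from a counted trie (illustrative only).
# Assumed node format: (prefix, aspirated_count, non_aspirated_count, children).
import math

EXAMPLE = ("h", 50, 120, [
    ("ha", 45, 5, [("hach", 20, 0, [])]),
    ("hi", 0, 60, []),
])

def to_dot(node, out, parent=None):
    prefix, aspirated, non_aspirated, children = node
    total = aspirated + non_aspirated
    proportion = aspirated / total if total else 0.5
    width = 1 + math.log10(1 + total)        # logarithmic thickness
    hue = (1 - proportion) * 2 / 3           # HSV hue: 0 = red, 2/3 = blue
    out.append('"%s" [penwidth=%.2f, color="%.3f 1 1"]' % (prefix, width, hue))
    if parent is not None:
        out.append('"%s" -> "%s" [penwidth=%.2f]' % (parent, prefix, width))
    for child in children:
        to_dot(child, out, prefix)

statements = []
to_dot(EXAMPLE, statements)
print("digraph trie {\n  " + "\n  ".join(statements) + "\n}")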

The Wikipedia article I linked above mentions that this approach is used to store Unicode character properties efficiently; it seems like it could be used for a lot of other things. For instance, you could imagine a trie indicating the gender of nouns (though you would probably build it on suffixes rather than prefixes in that case), a trie of the preposition to use with a given noun (say "в" vs. "на" in Russian), and probably tries for a lot of other arbitrary things that are so obvious to native speakers.


A review of the TypeMatrix 2030

— updated

A friend who wanted to buy a TypeMatrix 2030 convinced me to buy one too, so that we could get a discount. Here is a review, after roughly a month and a half of nearly exclusive use. The version I have is the blank one (because looking at the keyboard is a bad idea, and because it's kinda pretty).

To give a bit of background: I've been using the Dvorak US layout (with dead keys to get French accents) for about two years. I touch-type, and can reach around 110 WPM in speed typing games (but am nowhere near as fast in real-life use). Before that, I touch-typed on the French Dvorak layout by Josselin Mouette, adapted from that of Francis Leboutte (which got removed from Xorg for stupid reasons), and before that, I hunted and pecked on the Azerty layout. I love exclusive keyboard usage whenever I can afford it, and hate having to move my hands to reach things away from the home row like the arrow keys, the numpad, or (gasp) the mouse. Oh, and I love the command line and command-line apps, and use vim.

General comments

The blank version of the keyboard is very stylish (though it's sad that the design is so asymmetrical). It is guaranteed to confuse or impress people, which can be fun, and, if you're using an alternative layout, it is a gentle hint to other people that they'd better not try to use your keyboard.

The keyboard does have some unexpected features, like a hardcoded Dvorak layout that you can toggle, which is managed by the keyboard and not by the OS. In other words, turning this on makes the keyboard interpret what you type as Dvorak and send the translated keys to the OS, which does what you expect if the OS is configured for Qwerty. Of course, if the OS already expects Dvorak, then you get garbage. This is useless to me because I need non-standard dead keys (and seldom have to share my machines with Qwerty users anyway), but it can be useful to others. Or you might be disgusted to see the keyboard try to do fancy logic like this. Or maybe regret that, since it does, it would have been cool to also get fancier features like the ability to remap keys and record macros on the fly...

There are also a few multimedia shortcuts. Some of them are actual multimedia keys which you can map to whatever you want, and some of them (like cut, copy, and paste) are hardcoded sequences which are indistinguishable from pressing the corresponding separate keys. The precise status of those keys is described in this document.

The dots on the index home row keys are there, but at the center of the keys, which can be surprising if you expect them at the bottom. There is also a dot on the delete key (which I don't see the use for), a dot on the lower-row pinky key ('/' on Qwerty but 'z' on Dvorak) which I find useless and slightly confusing, and a dot on the down arrow (a nice touch to help you reach the arrow keys without looking whenever you have to use software which requires them).

The TypeMatrix 2030 does not have N-key rollover (NKRO). That's a bit disappointing for a keyboard of this kind...

Adaptation period

Adapting to the TypeMatrix 2030 takes a bit of time. It's nowhere as hard as learning a new layout, but it is definitely not instantaneous, and you might want to count one week before you're up to speed. Here is a list of the things that I had to adapt to:

Touch
The key touch isn't really special (and I'm not really picky about that sort of thing anyway), but it is slightly hard. This, along with the fact that the position of the modifier keys confused me at first, meant that my wrists suffered a bit and that typing was even slightly painful at the beginning. Fortunately, this didn't last.
Modifiers
The main modifiers that I use are left shift, left control, super, alt, and altgr. Finding out where they are, so as to press the right ones without even thinking about it, takes some time.
Enter and backspace
One of the most original things about this layout is that the enter and backspace keys are in the middle of the keyboard (and pressed with the index finger) rather than far to the right (and pressed with the pinky). This means that you have to replace the very low-level reflex of reaching right for enter when done and backspace when wrong with the reflex of going to the center of the keyboard.
Matrix keys
The other important feature of the keyboard is that the keys are aligned in a matrix (duh). This isn't that big of a deal, except for those keys which end up off by one relative to their position on usual keyboards. The worst for me were the right half of the lower row, and the numbers.
Real touch-typing
Unless you touch-type completely (and it's easy to be mistaken about that), the absence of markings will make you notice the keys for which you sometimes peek at the keyboard. In my case, the letter keys were fine, but not the numbers and symbols...

Assessment

Overall, I have to admit that I am not that enthusiastic about the benefits of this keyboard. Though aligning the keys in a matrix seems more logical, I do not feel it makes much of a difference. Maybe it's better somehow--but then, maybe not.

Another slight problem is that damn enter key. Putting it in the middle seems like a good idea; however, this means that pressing it by mistake still happens now and then, whereas I never had this problem with a regular keyboard. Yes, it's easier to reach, but then the usual enter and backspace can also be reached with the pinky almost without moving the hand, so it's not much of a benefit.

My main disappointment, though, is that it doesn't make you type faster, or give you the impression that you're typing faster, or make you feel better, or whatever. It feels like just another keyboard; a good keyboard, with a cool design, but definitely not worth the effort of carrying it around when you're using a laptop, and probably not worth the money and adaptation time. Granted, if you have RSI, you might want to see if it helps. If you do most of your work on a fixed workstation and you're willing to pay extra, it might still be a reasonable choice. But otherwise, if you're just a normal typist who isn't especially dissatisfied with normal keyboards and is merely intellectually pleased by the TypeMatrix design choices, or if you're using a laptop or multiple computers, don't buy it expecting it to be enormously better to use. If you're like me, you won't really notice much.

Why are conlangs so obsessed with vocabulary?

— updated

Quite a few of the conlang grammars I've read start by introducing their alphabet, their writing system and their vocabulary, and leave the grammar for the very end. This is often frustrating: what I'm interested in usually isn't the alphabet or the choice of words used by the language, and the interesting stuff (the grammar) is introduced using all the vocabulary that I'd like to skip (bridi? selbri? and don't get me started on Ithkuil...).

I wonder why conlangs are never described the other way round. You could start with the grammar, explaining how to build valid ASTs and what they mean, representing the ASTs with s-expressions rather than with sentences full of complicated morphology and vocabulary. Then you can get on with the rest, but it would be great to have an idea of the grammatical workings of the language before worrying about how it is actually written or spoken.

In other words, it is one thing to describe the grammar of your language using ASTs, and it is another thing entirely to describe the real syntax of your language which will serialize the ASTs into written or spoken sentences; in my opinion, things should be done in this order.

[The same goes for programming languages: when trying to invent a language, it is tempting to start with the details of the syntax and keep the hard business (the grammar) for later, whereas I think it would be more productive in most cases to start with a Lisp-like syntax, think about the grammar, and optionally invent a specific syntax later.]

Of course, in the case of conlangs, you could object that you cannot describe the grammar without introducing at least some words of the conlang. Indeed, your conlang is likely to have some basic grammatical words with no English equivalents (what would be the point, otherwise?), and you would need them to describe the grammar. Even in this case, things are likely to be more readable if you use a short English periphrasis instead; but if you really can't, using specific words is reasonable. However, this does not mean that you have to use the same words as in the final syntax of your language! For example, assume that your language uses declension, has a case called "baritive", and has a word "foo" which means "sheep": don't explain how the baritive of "foo" is actually "bazquuux" for convoluted reasons, just write "(baritive sheep)".

Most importantly, you should not start with real world words. It's hard to think of an interesting way to come up with words for the various kinds of animals, fruit, colors and the like; if you just say something like "(attribute (any [violet]) [color] [blue])" instead of "PéargnH i'ch'bẃļorthnutŝb", I really won't mind.

Encode stdin in UTF-8

— updated

I just spent three miserable hours doing something which seemed easy enough. I wouldn't wish it on anyone, so I'll explain it here; hopefully this will be useful to someone.

The task is: write a program which reads text on stdin and outputs it in UTF-8 to stdout. The painful job of guessing the encoding of a text stream has already been done, but I couldn't find any program which implemented the seemingly straightforward task of using this to convert text to UTF-8 in a shell pipe. I wasted some time with enca, which can be useful if you know the language of the file (and even then, it doesn't support latin1, which was a problem), before deciding that I had to write this very simple thing myself.

If you know that the input encoding is either UTF-8 or latin1, I have a simpler solution which does not depend on chardet.
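
For reference, a minimal sketch of that kind of fallback (an illustration of the idea, not the linked solution itself): try UTF-8 first, and fall back to latin1, which accepts any byte sequence.

#!/usr/bin/python3
# Minimal sketch of a UTF-8-or-latin1 converter (illustration only).
import sys

data = sys.stdin.buffer.read()
try:
    text = data.decode("utf-8")
except UnicodeDecodeError:
    text = data.decode("latin1")   # latin1 can decode any byte sequence
sys.stdout.buffer.write(text.encode("utf-8"))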

Well, it took more time than expected, and here is the result, using chardet, in Python 3. Note that we assume UTF-8 by default and only fall back to chardet when that doesn't work.

#!/usr/bin/python3

"""a2utf8: print stdin in UTF-8 using chardet if needed"""

import sys
import chardet

# Work on raw bytes: detach() gives the underlying binary buffers instead of
# the text wrappers, so that we can decode and re-encode explicitly.
sys.stdin = sys.stdin.detach()
sys.stdout = sys.stdout.detach()

data = sys.stdin.read()
try:
  # Optimistic case: the input is already valid UTF-8.
  sys.stdout.write(data.decode().encode())
except UnicodeDecodeError:
  # Otherwise, ask chardet to guess the encoding.
  encoding = chardet.detect(data)['encoding']
  if encoding is None:
    # No guess at all: output the raw bytes unchanged.
    sys.stdout.write(data)
  else:
    sys.stdout.write(data.decode(encoding).encode())

Notice the mysterious calls to detach(), which are the main surprise. I thought that you just had to pass the -u option to Python to get this behavior, but it turns out that it no longer switches stdin to binary mode like it did in Python 2. By the way, to get that behavior when you open a file, you would use:

f = open("file", mode="rb")

The rest is easier. We read the data, try to decode it as UTF-8 and write it back out, and, if that fails, try to detect the encoding. If one is found, we use it to decode; it might also happen that none is found, in which case we output the data as-is.

Annoyingly enough, Mark Pilgrim disappeared and took down chardet, diveintopython3, and the rest of his projects, so the following remark is outdated.

[On a side note, you might be interested to know that as of this writing, the Python 3 version of Chardet won't detect UTF-16 and UTF-32 correctly because of a bug in the BOM detection. That's quite unlucky, since the porting of Chardet to Python 3 is the subject of a case study in the Dive Into Python 3 book by Mark Pilgrim, the developer of chardet. I'm just writing this in case someone also got confused trying to test the code above on UTF-16 or UTF-32 files.]

Internet investigations

— updated

We don't notice often enough the incredible exploration games that the Internet has to offer. When you're looking at the tip of the iceberg, you see well-known, high-pagerank content: it usually makes sense, and you can easily find context information if you need to. But when you start exploring obscure things, it can be a lot more puzzling. Here are three examples of armchair investigations.

The Hybrid RPG

Wikipedia has a mysterious article (which got flagged for deletion, was kept for lack of consensus, and was finally deleted in 2018, though you can still see it on archive.org) about something described as "a role-playing game", "a model of physical reality", and "unmitigated nonsense". It links to an active blog, a review, and the official site, which seems to have vanished.

The blog is enough to give you a glimpse of the thing, but is far from complete. Actually, it only dates back to April 2011. The Wayback Machine has a 2006 copy; it turns out that the author has been deleting old posts. But the archive of the dead original site is much, much more impressive. Over 2 megabytes of purely nonsensical plaintext which is of a particularly intense and interesting flavor of madness. I would really love to believe that this is actually computer-generated, but it doesn't seem likely.

Apparently, the first version of the Hybrid rules was posted, massively and mercilessly, to the Usenet group rec.games.frp.super-heroes. Searching for Hybrid in the Google Groups archives yields a mass of people complaining about these posts, but strangely I did not manage to find a single message by the guy who posted them; they must have been filtered out in some way (or maybe carried a do-not-archive header).

Philippe Tromeur seems to have done the dirty work of compiling these posts into the now-defunct webpage mentioned earlier. The page, of course, is incomplete: it is hard to tell how far back it goes, and it does not include the author's more recent creations, some of which might have been lost forever given his annoying tendency to erase and start over. In a way, it doesn't matter: more than two megabytes is more than enough to pick a random passage and wonder. But in a way, it does matter: this madman has been writing this thing for over ten years now, and I can't help but feel it ought to be saved...

The Ethereal Convent

If you study the OpenPGP web of trust (which I did for school), you might notice a large set of extremely weird keys referring to ranks in an organization called the "Ethereal Convent". The original website seems dead, but blog.nun.org still points to a mostly empty blog, not updated since December 2008.

However, we can still find former versions of nun.org ranging from 1998 to 2004. The Ethereal Convent seems to be some vague religious organization which claims ties with the Order of Perpetual Indulgence, which itself seems associated with a defunct site selling the well-worn undergarments of young men (no, I'm not making that up); its activities apparently included selling absolute indulgences at a special cut-price rate. And apparently, it was based in Thailand.

I should add for completeness' sake that you can also find another related blog (last updated in November 2005), a member of which claims to have been reborn into Gay Male Nunhood. It points to another address for the Convent, apparently down. The rabbit hole continues: it turns out that there are more URLs, more obscure mysticism, and more strange partnerships. (Whois lookups can yield information too, but I won't say more, for privacy's sake.) The world is a very weird place.

The main question, though, is to understand why this appears on the OpenPGP web of trust. The answer is that apparently, from the very beginning, nun.org suggested the use of crypto for email privacy. As time passed, they continued to use crypto: they signed their messages and only accepted encrypted mail. They also had the unexpected idea of selling cryptographically signed "Plenary indulgence" and "Excellence" certificates. Hence, probably, the diverse array of keys on the web of trust, which must have been used to sign the various certificates they sold or intended to sell.

Prime numbers in Haiti

This one will probably appeal more to my French readers. The trail starts on the Haitian Creole Wikipedia article on prime numbers. You might think you don't know this language, but if you can read French, you can read Haitian Creole to some extent: just read it out loud. I'll start you off: "Un nombre qui pas capable divisé par aucun autre nombre sinon que par lui même ou sinon 1." (The similarity is no coincidence, of course, and there are a lot of interesting linguistic observations to make, but I won't go into that now.)

On the face of it, the article seems to have some content (it is even a featured article). Look closer, though. This "Lainé Jean Lhermite Junior" should not be confused with Charles Hermite, and most of the article is actually devoted to proving the two big formulae you see near the top. This looks like math. However, if you look at the "Premye konsekans imedyat" and "Dezyèm konsekans imedyat" (first and second immediate consequences), you're bound to notice that they are actually totally trivial. What is going on here?

Well, it turns out that these formulae are a convoluted but correct way of computing the n-th prime (simpler variants exist). However, Lhermite doesn't seem to know about the existing results, and apparently derived his own. This still makes sense.

However, one of the sections is weird. Arrows, sheep, red, blue, and a reference to the Bible, along with two dead links. Hmm. Here's the document mentioned by one of the dead links, which confirms that prime numbers are supposed to have a link with 1 Samuel 20:20 ("And I will shoot three arrows on the side thereof, as though I shot at a mark."). Here's the page, mentioning a few "models": "Modèle des boules rouges et des boules bleues" (red and blue balls), "Modèle des flèches" (arrows), "Modèle des quadrilatères et des triangles" (quadrilaterals and triangles), "Modèle des brebis de l'autre pâturage" (ewes from the other pasture), "Modèle du successeur" (successor), "Modèle des petits triangles et des grands triangles" (small triangles and big triangles), "Modèle du fils prodigue" (prodigal son). Fittingly enough, the page ends with a tribute to the creator of the universe...