a3nm's blog

Why are conlangs so obsessed with vocabulary?

Quite a few conlang grammars I've read start by introducing their alphabet, their writing system and their vocabulary, and leave the grammar for the very end. This is often frustrating: what I'm interested in usually isn't the alphabet or the choice of words used by the language, and the interesting stuff (the grammar) is introduced using all the vocabulary that I'd like to skip (bridi? selbri? and don't get me started on Ithkuil...).

I wonder why conlangs are never described the other way round. You could start with the grammar, explaining how to build valid ASTs and what they mean, and representing the ASTs as s-expressions rather than as sentences with complicated morphology and vocabulary. Then you can get on with the rest, but it would be great to have an idea of the grammatical workings of the language before worrying about how it is actually written or spoken.

In other words, it is one thing to describe the grammar of your language using ASTs, and it is another thing entirely to describe the real syntax of your language which will serialize the ASTs into written or spoken sentences; in my opinion, things should be done in this order.

[The same goes for programming languages: when trying to invent a language, it is tempting to start with the details of the syntax and keep the hard business (the grammar) for later, whereas I think it would be more productive in most cases to start with a Lisp-like syntax, think about the grammar, and optionally invent a specific syntax later.]

Of course, in the case of conlangs, you could object that you cannot describe the grammar without introducing at least some words of the conlang. I mean, your conlang is likely to have some basic grammatical words with no English equivalents (what would be the point, otherwise?) and you would need them to describe the grammar. Even in this case, things are likely to be more readable if you use a short English periphrasis instead, but, if you really can't, using specific words is reasonable. However, this does not mean that you have to use the same words as in the final syntax of your language! For example, assume that your language uses declension, and has a case called "baritive" and a word "foo" which means "sheep": don't explain how the baritive of "foo" is actually "bazquuux" for convoluted reasons, just write "(baritive sheep)".

Most importantly, you should not start with real world words. It's hard to think of an interesting way to come up with words for the various kinds of animals, fruit, colors and the like; if you just say something like "(attribute (any [violet]) [color] [blue])" instead of "PéargnH i'ch'bẃļorthnutŝb", I really won't mind.
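To make the AST-versus-syntax distinction concrete, here is a minimal sketch in Python. Everything in it is a toy assumption: the nested-tuple representation, the function names, and the tiny lexicon (which only reuses the made-up "foo" and "bazquuux" from the example above). The point is simply that the same tree can be printed either as an s-expression or as a surface form, and that only the second serializer needs to know about vocabulary and morphology.

```python
# Toy AST: a node is either a string (a leaf concept, named in English)
# or a tuple (operator, child, child, ...). Purely illustrative.

def to_sexp(node):
    """Serialize an AST as an s-expression string."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(to_sexp(child) for child in node) + ")"


# Hypothetical surface vocabulary and irregular forms of the conlang.
LEXICON = {"sheep": "foo"}
IRREGULAR = {("baritive", "sheep"): "bazquuux"}


def to_surface(node):
    """Serialize the same AST into a made-up written form."""
    if isinstance(node, str):
        return LEXICON.get(node, node)
    if node in IRREGULAR:
        return IRREGULAR[node]
    head, *children = node
    # Invented regular rule: the operator becomes a suffix.
    return " ".join(to_surface(child) for child in children) + "-" + head


tree = ("baritive", "sheep")
print(to_sexp(tree))     # the form a grammar description could use
print(to_surface(tree))  # the form a reader only needs much later
```

A grammar description written against to_sexp never has to mention the lexicon at all; the second function can be specified separately, once the reader knows what the trees mean.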

Encode stdin in UTF-8

I just spent 3 miserable hours doing something which seemed easy enough. I wouldn't wish it on anyone, so I'll explain it here; hopefully this will be useful to someone.

The task is: write a program which reads text on stdin and outputs it in UTF-8 to stdout. The painful job of guessing the encoding of a text stream has already been done, but I couldn't find any program which implemented the seemingly straightforward task of using this to convert text to UTF-8 in a shell pipe. I wasted some time with enca, which can be useful if you know the language of the file (and even then, it doesn't support latin1, which was a problem) before deciding I had to write this very simple thing myself.

If you know that the input encoding is either utf8 or latin1, I have a simpler solution which does not depend on chardet.

Well, it took more time than expected, and here is the result, using chardet, in Python 3. Note that we are assuming UTF-8 as a default and only fall back to chardet when it doesn't work.

#!/usr/bin/python3

"""a2utf8: print stdin in UTF-8 using chardet if needed"""

import sys
import chardet

# Switch stdin and stdout to binary mode (see below)
sys.stdin = sys.stdin.detach()
sys.stdout = sys.stdout.detach()

data = sys.stdin.read()
try:
    # Default case: the input is already valid UTF-8
    sys.stdout.write(data.decode().encode())
except UnicodeDecodeError:
    # Not UTF-8: ask chardet to guess the encoding
    encoding = chardet.detect(data)['encoding']
    if encoding is None:
        # No guess: output the bytes as-is
        sys.stdout.write(data)
    else:
        sys.stdout.write(data.decode(encoding).encode())

Notice the mysterious calls to detach(), which are the main surprise. I thought that you just had to use the -u option to Python to get this behaviour, but it turns out that it does not switch stdin to binary mode anymore like it did in Python 2. By the way, to get that behavior when you open a file, you would use:

f = open("file", mode="rb")

The rest is easier. We read the data, try to write it as UTF-8, and, if it fails, try to detect the encoding. If one is found, we use it to decode, but it might also happen that none is found, in which case we output as-is.
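For the restricted case mentioned earlier, where the input is known to be either UTF-8 or latin1, chardet is not needed at all. What follows is only a sketch under that assumption, not necessarily the simpler solution referred to above; the names to_utf8 and main are mine. It relies on the fact that latin1 assigns a character to every possible byte, so the fallback decode cannot fail. It also uses sys.stdin.buffer and sys.stdout.buffer, which give access to the underlying binary streams without calling detach().

```python
import sys


def to_utf8(data: bytes) -> bytes:
    """Re-encode data as UTF-8, assuming it is either UTF-8 or latin1."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, pass it through unchanged
    except UnicodeDecodeError:
        # latin1 maps each byte to one code point, so this always succeeds
        return data.decode("latin-1").encode("utf-8")


def main():
    # .buffer is the binary stream underneath the text-mode wrapper
    sys.stdout.buffer.write(to_utf8(sys.stdin.buffer.read()))
```

Calling main() at the end of the script turns this into a filter usable in a shell pipe. The obvious limitation is that anything which is neither UTF-8 nor latin1 will be silently misread as latin1.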

Annoyingly enough, Mark Pilgrim disappeared and took down chardet, diveintopython3 and the rest of his projects, so the following remark is outdated.

[On a side note, you might be interested to know that as of this writing, the Python 3 version of Chardet won't detect UTF-16 and UTF-32 correctly because of a bug in the BOM detection. That's quite unlucky, since the porting of Chardet to Python 3 is the subject of a case study in the Dive Into Python 3 book by Mark Pilgrim, the developer of chardet. I'm just writing this in case someone also got confused trying to test the code above on UTF-16 or UTF-32 files.]

Internet investigations

We don't notice often enough the incredible exploration games that the Internet has to offer. When you're looking at the tip of the iceberg, you see well-known, high-pagerank content: it usually makes sense, and you can easily find context information if you need to. But when you start exploring obscure things, it can be a lot more puzzling. Here are three examples of armchair investigations.

The Hybrid RPG

Wikipedia has a mysterious article (which got flagged for deletion, was kept for lack of consensus, and was finally deleted in 2018, though you can still see it on archive.org) about something described as "a role-playing game", "a model of physical reality", and "unmitigated nonsense". It links to an active blog, a review, and the official site which seems to have vanished.

The blog is enough to give you a glimpse of the thing, but is far from complete. Actually, it only dates back to April 2011. The Wayback Machine has a 2006 copy; it turns out that the author has been deleting old posts. But the archive of the dead original site is much, much more impressive: over 2 megabytes of purely nonsensical plaintext, of a particularly intense and interesting flavor of madness. I would really love to believe that this is actually computer-generated, but it doesn't seem likely.

Apparently, the first version of the Hybrid rules was posted, massively and mercilessly, on the Usenet group rec.games.frp.super-heroes. Searching for Hybrid on the Google Groups archives will yield a mass of people complaining about these posts, but strangely, I did not manage to find a single message by the guy who posted them; they must have been filtered in some way (or maybe had a do-not-archive header)?

Philippe Tromeur seems to have done the dirty work of compiling these posts into the now-defunct webpage mentioned earlier. The page, of course, is incomplete: it is hard to tell how far back it goes, and it does not include the author's more recent creations, of which some might have been lost forever given his annoying tendency to erase and start over. In a way, it doesn't matter--more than two megabytes is more than enough to pick a random passage and wonder. But in a way, it does matter--this madman has been writing this thing for over ten years now, and I can't help but feel it ought to be saved...

The Ethereal Convent

If you study the OpenPGP web of trust (which I did for school), you might notice a large set of extremely weird keys referring to ranks in an organization called the "Ethereal Convent". The original website seems dead, but blog.nun.org still points to a mostly empty blog, not updated since December 2008.

However, we can still find former versions of nun.org ranging between 1998 and 2004. The Ethereal Convent seems to be some vague religious organization which claims to have ties with the Order of Perpetual Indulgence, which seems associated with a defunct site selling the well worn undergarments of young men (no, I'm not making that up), and its activities apparently included selling absolute indulgences at a special cut-price rate. And apparently, it was based in Thailand.

I should add for completeness' sake that you can also find another related blog (last updated November 2005), a member of which claims to be reborn into Gay Male Nunhood. It points to another address for the Convent, apparently down. And the rabbit hole continues, and it turns out that there are more URLs, and more obscure mysticism and strange partnerships. (Whois lookups can yield info too, but I won't say more, for privacy's sake.) The world is a very weird place.

The main question, though, is to understand why this appears on the OpenPGP web of trust. The answer is that apparently, from the very beginning, nun.org suggested the use of crypto for email privacy. As time passed, they continued to use crypto: they signed their messages and only accepted encrypted mail. They also had the unexpected idea of selling cryptographically signed "Plenary indulgence" and "Excellence" certificates. Hence, probably, the diverse array of keys on the web of trust, which must have been used to sign the various certificates they sold or intended to sell.

Prime numbers in Haiti

This one will probably appeal more to my French readers. The trail starts on the Haitian Creole Wikipedia article on prime numbers. You might think you don't know this language, but if you can read French, you can read Haitian Creole to some extent--just read it out loud. I'll start you off: "Un nombre qui pas capable divisé par aucun autre nombre sinon que par lui même ou sinon 1." (The similarity is no coincidence, of course, and there are a lot of interesting linguistic observations to make, but I won't go into this now.)

On the surface of it, the article seems to have some content (it is even a featured article). Look closer, though. This "Lainé Jean Lhermite Junior" should not be confused with Charles Hermite, and actually, most of the article is devoted to proving his two big formulae you see near the top. This looks like math. However, if you look at the "Premye konsekans imedyat" and "Dezyèm konsekans imedyat", you're bound to notice that this is actually totally trivial. What is going on here?

Well, it turns out that these formulae are a convoluted but correct way of computing the n-th prime (simpler variants exist). However, Lhermite doesn't seem to know about the existing results, and apparently derived others on his own. This still makes sense.

However, one of the sections is weird. Arrows, sheep, red, blue, and a reference to the Bible, along with two dead links. Hmm. Here's the doc mentioned with a dead link, which confirms that prime numbers have a link with 1 Samuel 20:20 ("And I will shoot three arrows on the side thereof, as though I shot at a mark."). Here's the page, mentioning a few "models": "Modèle des boules rouges et des boules bleues" (blue and red balls), "Modèle des flèches" (arrows), "Modèle des quadrilatères et des triangles" (quadrilaterals and triangles), "Modèle des brebis de l'autre pâturage" (ewes from the other pasture), "Modèle du successeur" (successor), "Modèle des petits triangles et des grands triangles" (small triangles and big triangles), "Modèle du fils prodigue" (prodigal son). Fittingly enough, the page ends with a tribute to the creator of the universe...

Curses continuity

Having played the Continuity Game, I didn't like having to use Flash. So I rewrote a similar game in Python and curses. It's an old, quick job in fewer than 500 lines of code, and doesn't really include a lot of levels, but it's playable and hackable.

You can get continuity on its repository. You can also directly download a source archive which is current as of this writing. Public domain.

Playing a note with the PC speaker on Linux

At some point, I needed to design a command that you could use to play a note (identified by its scientific pitch notation) on the PC speaker. For instance, I would like to play an A-sharp, 3rd octave on the speaker with:

play_note 'a#3'

For this, we need the beep program, which allows us to beep the PC speaker at a given frequency. We then need something to go from "a#3" to the frequency of that particular note (233.081879 Hz in this case). The main point of this post is actually to advertise a2freq, a small tool that I wrote to do precisely this, i.e., convert scientific pitch notation to a frequency. It also supports MIDI note numbers (which it uses as an intermediate representation), if you ever need to convert a MIDI file to a beep sequence... For instance:

$ a2freq a4
440.000000

Hopefully the next person who needs to do this can land here and save half an hour by using this code rather than writing it.
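If you are curious about what such a conversion involves, here is a rough sketch in Python. It is not a2freq's actual code: the function names, the accepted syntax (a letter, an optional "#" or "b", an optional octave) and the default octave of 4 are all assumptions of mine. It goes through a MIDI note number as an intermediate step, then applies the usual equal-temperament formula with A4 (MIDI note 69) tuned to 440 Hz.

```python
import re

# Semitone offset of each natural note within an octave starting at C
SEMITONES = {"c": 0, "d": 2, "e": 4, "f": 5, "g": 7, "a": 9, "b": 11}


def note_to_midi(note: str, default_octave: int = 4) -> int:
    """Convert a note such as 'a#3' to a MIDI note number.

    The default octave is an assumption, not necessarily what a2freq does.
    """
    match = re.fullmatch(r"([a-gA-G])([#b]?)(-?\d+)?", note)
    if match is None:
        raise ValueError(f"cannot parse note: {note!r}")
    letter, accidental, octave = match.groups()
    semitone = SEMITONES[letter.lower()]
    if accidental == "#":
        semitone += 1
    elif accidental == "b":
        semitone -= 1
    octave = default_octave if octave is None else int(octave)
    # In MIDI numbering, C-1 is note 0, so C4 is 60 and A4 is 69
    return 12 * (octave + 1) + semitone


def midi_to_freq(midi: int) -> float:
    """Equal temperament: each semitone multiplies by 2**(1/12)."""
    return 440.0 * 2 ** ((midi - 69) / 12)


def note_to_freq(note: str) -> float:
    return midi_to_freq(note_to_midi(note))


print(f"{note_to_freq('a4'):f}")
```

With this numbering, 'a#3' is MIDI note 58, eleven semitones below A4, which gives a frequency of about 233.08 Hz.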

To put the pieces together, we can now define play_note as:

beep "${@:1:$(($#-1))}" -f "$(a2freq "${!#}")"

The complex crap that you see here is used to make it possible to pass additional arguments to beep, so that you can play a 100-millisecond A with:

play_note -l 100 a

The last parameter is taken to be a note, the rest is passed to beep. (Another decent solution would have been to play multiple arguments as a note sequence, rather than just playing one note and passing all previous arguments to beep.)
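The argument handling may be easier to follow outside of shell syntax. Here is the same idea sketched in Python: the last argument is the note, everything before it is forwarded to beep untouched. The function name build_beep_command and the freq_of parameter (any function turning a note into a frequency in Hz) are hypothetical; the sketch only builds the command line and does not run beep.

```python
def build_beep_command(args, freq_of):
    """Build a beep command line from play_note-style arguments.

    args:    list of strings, e.g. ["-l", "100", "a"]
    freq_of: function mapping a note name to a frequency in Hz
             (a stand-in for calling a2freq)
    """
    if not args:
        raise ValueError("expected at least a note")
    # Last argument is the note, the rest goes to beep as-is
    *beep_args, note = args
    return ["beep", *beep_args, "-f", f"{freq_of(note):f}"]


# Example with a dummy converter that always answers 440 Hz
print(build_beep_command(["-l", "100", "a"], lambda note: 440.0))
```

The resulting list could be handed to subprocess.run to actually play the note, provided beep is installed.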

One caveat of this: sometimes frequencies get rounded in a bad way, meaning that the notes aren't right. This isn't a2freq's fault; it might be beep's, the kernel's, or the hardware's. On my machine, a#5 and b5 sound the same because of that. So don't assume that this will always work.