a3nm's blog

Trois petits chats assistés par ordinateur

This post is about a kind of French song where syllabes are repeated from one verse to the next. To my knowledge this does not exist in English; I'd be delighted to be proven wrong! In the meantime this post is only in French.

La chanson Trois petits chats est une ritournelle bien connue :

Trois p'tits chats, trois p'tits chats, trois p'tits chats, chats, chats,
Chapeau d'paille, chapeau d'paille, chapeau d'paille, paille, paille,
Paillasson, paillasson, paillasson, -son, -son,
Somnambule, somnambule, somnambule, -bule, -bule...

La chanson continue de la sorte, chaque vers répétant en première syllabe la dernière syllabe du vers suivant, pour généralement finir par revenir à "Trois p'tits chats".

zorun a eu la mauvaise idée de me poser la question : serait-il possible de générer des chansons de ce type, voire des chansons plus longues, de façon automatique ? Ce billet répond par l'affirmative à cette importante question. Voici un exemple de chanson générée automatiquement, avec 86 vers, en utilisant Lexique :

Lingala, lingala, lingala, -la, -la,
Laticlave, laticlave, laticlave, -clave, -clave,
Clavecin, clavecin, clavecin, -cin, -cin,
Sainte nitouche, sainte nitouche, sainte nitouche, -touche, -touche,
Touchotter, touchotter, touchotter, -ter, -ter,
Télégramme, télégramme, télégramme, -gramme, -gramme,
Graminée, graminée, graminée, -née, -née,
Nécessaire, nécessaire, nécessaire, -saire, -saire,
Cervidés, cervidés, cervidés, -dés, -dés,
Démesure, démesure, démesure, -sure, -sure,
Zurichois, zurichois, zurichois, -chois, -chois,
Quadrature, quadrature, quadrature, -ture, -ture,
Turquerie, turquerie, turquerie, -rie, -rie,
Richissime, richissime, richissime, -sime, -sime,
Symétrie, symétrie, symétrie, -trie, -trie,
Triqueballe, triqueballe, triqueballe, -balle, -balle,
Baliveau, baliveau, baliveau, -veau, -veau,
Vocalise, vocalise, vocalise, -lise, -lise,
Liséré, liséré, liséré, -ré, -ré,
Réticule, réticule, réticule, -cule, -cule,
Culbuto, culbuto, culbuto, -to, -to,
Taurobole, taurobole, taurobole, -bole, -bole,
Bolcheviks, bolcheviks, bolcheviks, -viks, -viks,
Victimaire, victimaire, victimaire, -maire, -maire,
Mercredi, mercredi, mercredi, -di, -di,
Dipsomanes, dipsomanes, dipsomanes, -manes, -manes,
Managers, managers, managers, -gers, -gers,
Jerricane, jerricane, jerricane, -cane, -cane,
Canapé, canapé, canapé, -pé, -pé,
Pécheresse, pécheresse, pécheresse, -resse, -resse,
Restaurant, restaurant, restaurant, -rant, -rant,
Rancissure, rancissure, rancissure, -sure, -sure,
Survenue, survenue, survenue, -nue, -nue,
Numismate, numismate, numismate, -mate, -mate,
Matelot, matelot, matelot, -lot, -lot,
Locataires, locataires, locataires, -taires, -taires,
Thermostat, thermostat, thermostat, -stat, -stat,
Talmudiste, talmudiste, talmudiste, -diste, -diste,
Dysthymie, dysthymie, dysthymie, -mie, -mie,
Misogyne, misogyne, misogyne, -gyne, -gyne,
Gynéco, gynéco, gynéco, -co, -co,
Connétable, connétable, connétable, -table, -table,
Tableautin, tableautin, tableautin, -tin, -tin,
Tintamarre, tintamarre, tintamarre, -marre, -marre,
Marabout, marabout, marabout, -bout, -bout,
Boulangère, boulangère, boulangère, -gère, -gère,
Gerberas, gerberas, gerberas, -ras, -ras,
Radicale, radicale, radicale, -cale, -cale,
Califat, califat, califat, -fat, -fat,
Phallophore, phallophore, phallophore, -phore, -phore,
Formica, formica, formica, -ca, -ca,
Capitale, capitale, capitale, -tale, -tale,
Thalamus, thalamus, thalamus, -mus, -mus,
Muscadine, muscadine, muscadine, -dine, -dine,
Dynamo, dynamo, dynamo, -mo, -mo,
Monoplaces, monoplaces, monoplaces, -places, -places,
Placebo, placebo, placebo, -bo, -bo,
Bogomiles, bogomiles, bogomiles, -miles, -miles,
Milonga, milonga, milonga, -ga, -ga,
Galalithe, galalithe, galalithe, -lithe, -lithe,
Liturgie, liturgie, liturgie, -gie, -gie,
Gyrophare, gyrophare, gyrophare, -phare, -phare,
Pharmacie, pharmacie, pharmacie, -cie, -cie,
Silicate, silicate, silicate, -cate, -cate,
Catalpa, catalpa, catalpa, -pa, -pa,
Palikare, palikare, palikare, -kare, -kare,
Carambar, carambar, carambar, -bar, -bar,
Barricade, barricade, barricade, -cade, -cade,
Caducée, caducée, caducée, -cée, -cée,
Séfarade, séfarade, séfarade, -rade, -rade,
Radical, radical, radical, -cal, -cal,
Calamines, calamines, calamines, -mines, -mines,
Minéral, minéral, minéral, -ral, -ral,
Rallumage, rallumage, rallumage, -mage, -mage,
Magistrat, magistrat, magistrat, -trat, -trat,
Traversine, traversine, traversine, -sine, -sine,
Cinéma, cinéma, cinéma, -ma, -ma,
Martingale, martingale, martingale, -gale, -gale,
Galoubet, galoubet, galoubet, -bet, -bet,
Betterave, betterave, betterave, -rave, -rave,
Ravioli, ravioli, ravioli, -li, -li,
Libérale, libérale, libérale, -rale, -rale,
Ralenti, ralenti, ralenti, -ti, -ti,
Tisserande, tisserande, tisserande, -rande, -rande,
Rendement, rendement, rendement, -ment, -ment,
Mandoline, mandoline, mandoline, -line, -line,
Lingala, lingala, lingala, -la, -la,

Le reste de ce billet donne plus de détails sur la façon dont cette chanson a été calculée ; le code est également disponible.

J'utilise Lexique, que je trie en mettant d'abord les formes non-dérivées (e.g., le singulier avant le pluriel, même s'il y a parfois des erreurs), puis par ordre de fréquence décroissance (pour privilégier les mots courants en faveur des mots rares, même si le programme n'hésitera pas à utiliser un mot rare quand c'est la seule manière d'aller d'une syllabe à une autre).

Lexique indique heureusement des informations de découpage en syllabes au niveau phonétique et orthographique, donc il est facile de retenir les mots de trois syllabes. Je ne retiens que les noms, à peu près comme dans la chanson originale (mais il y a là encore des erreurs). Je ne garde que les mots de lexique avec des informations de découpage complètes (elles sont parfois absentes dans Lexique).

La chanson originale contient, outre des noms, des groupes nominaux (par exemple "Trois p'tits chats"), que Lexique ne contient pas. J'ai envisagé de tenter de récupérer de tels groupes (par exemple, à partir de titres d'articles Wikipédia), qu'il faudrait ensuite consolider avec Lexique, mais finalement je ne l'ai pas fait, eu égard par exemple à la difficulté qu'il y aurait à déterminer si les élisions entre un mot et le suivant sont possibles.

Je filtre encore les mots : étant donné qu'en répétant la dernière syllabe on veut toujours garder une consonne avant, on peut éliminer tous les mots qui commencent par une voyelle, ou qui se terminent par deux sons vocaliques en succession (par exemple "hévéa" ou "casoar").

J'observe ensuite que, dans la chanson originale, on aime bien répéter un 'e' muet final sur les lignes paires (par exemple "pail-leuh", "somnambu-leuh"). J'identifie les 'e' muets en testant pour une finale en 'e' ou "es" qui ne soit pas précédée de 'i', 'u', ou 'é', et je décide d'imposer l'alternance entre de telles finales et des mots qui n'en disposent pas, un peu comme l'alternance entre rimes féminines et masculines en poésie (les règles diffèrent, cependant, par exemple "-ie" est tout de même une finale féminine en poésie).

Je considère qu'un mot u peut être atteint à partir d'un mot v si la dernière syllabe de la prononciation v, entière, est un préfixe de la prononciation de u. Ainsi, j'impose que la dernière syllabe soit entièrement préservée, afin de pouvoir la répéter ("radiogramme, -gramme, -gramme", et non "radiogramme, -ramme, -ramme"), mais on peut aboutir à un mot dont la première syllabe est suivie de davantage de sons consonantiques que la finale du vers précédent.

J'obtiens ainsi un graphe orienté de 205 sommets et 629 arêtes : pour chaque passage d'un couple "syllabe, genre" (masculin ou féminin) à l'autre, je ne retiens que le premier mot qui le permet dans la liste d'entrée (donc, normalement, le plus courant). Ce graphe a été écrémé pour ne conserver que les sommets accessibles et co-accessibles pour un point de départ arbitrairement choisi. J'interdis également, par souci esthétique, les arêtes qui restent sur la même syllabe en changeant de genre, e.g., "trinitrine"

Je recherche ensuite un cycle dans ce graphe, que je tente de rendre aussi grand que possible, mais je lui interdis de passer deux fois par la même paire "syllabe, genre", parce que dans ce cas on peut raccourcir la chanson. Ce problème d'optimisation est difficile : il est en fait NP-dur, vu qu'on peut facilement y réduire celui de la recherche d'un cycle hamiltonien. Je ne connais pas de manière simple d'approximer ce problème pour chercher un grand cycle, donc j'ai à la place implémenté une approche naïve qui énumère tous les chemins par parcours en profondeur. Cette approche génère le cycle de longueur 86 que j'utilise pour formatter les paroles ensuite. Sans doute peut-on en trouver un plus long ; c'est certainement le cas si on tente de corriger Lexique pour pouvoir conserver les mots avec des données incomplètes...

À titre expérimental (et pour récompenser le lecteur patient qui est parvenu jusqu'ici), j'ai tenté de combiner la mélodie avec le système de synthèse vocale Festival. Malheureusement, je ne sais pas comment lui indiquer la prononciation des paroles, et je ne peux donc indiquer que les paroles sous forme orthographique, qui est par ailleurs interprétée comme de l'anglais par le logiciel, avec des problèmes manifestes de gestion des caractères accentués. Le résultat se passe de commentaires...

Wake-on-LAN: suspend and resume machines from the network

Do you leave computers running, just on the off chance that you will need to ssh into them or otherwise access them remotely? I used to do this and feel a bit bad about wasting energy to power mostly idle machines, but I recently discovered Wake-on-LAN, which allows you to power or resume your computer from the network. I vaguely knew this existed, but I had no idea that it was something stable and available on normal computers. In fact, as it turns out, it works mostly out-of-the-box on my machines.

The basic principle of Wake-on-LAN is to setup your computer's network interface to react to specific incoming packets by powering up or resuming the machine, allowing you to suspend or power down the machine, and turn it up or wake it up remotely.

When to use it

When can you use Wake-on-LAN? First, of course, you cannot do this on machines that have to stay on because an actual server is running on them. Of course, however, such services should probably be run on low-power machines to avoid wasting energy. Second, it only works with Ethernet, so if your machine doesn't have a wired connection you are out of luck.

Third, your network interface must support Wake-on-LAN, which you can check as follows, using the ethtool command (packaged in Debian with the same name). I'm assuming that eth0 is your network interface:

sudo ethtool eth0 | grep 'Wake-on: g'

If Wake-on: g is indeed returned, then your machine supports Wake-on-LAN.

How to enable it

You then need to active Wake-on-LAN on the target machine (the one to remotely wake up), which you can do punctually by issuing:

sudo ethtool -s eth0 wol g

To have it persist across reboots, what worked for me on Debian is to use udev following these instructions. In fact, this is an excellent guide and I am paraphrasing a lot from it.

For future purposes you should also retrieve the MAC address of the interface of the target machine, I will call this MACADDR:

ip addr show dev eth0 | grep 'link/ether' | awk '{print $2}'

And retrieve its public IP address (I use IPv4), I will call this IPADDR:

curl -4 icanhazip.com

How to send the packets

Now that Wake-on-LAN is enabled, you will need another machine to send the magic packet to wake up the target machine. To do this, I use the command wakeonlan from the Debian package of the same name.

The easiest is to do it from a computer on the same LAN, where you can just do:

wakeonlan MACADDR

If you are not on the same LAN, then either you should ssh to a computer of the LAN where you can do this, or you should set up forwarding. For instance, with a Freebox, you can make it work like this:

  • enable the specific Wake-on-LAN proxy option;
  • set up a permanent DHCP lease for your machine based on MACADDR so it always has the same IP address in the LAN (if you haven't done so already),
  • configure a redirection from UDP and TCP port 9 from the WAN to the target machine (use different public ports if you have different target machines on the LAN);
  • reboot your Freebox and renew the DHCP leases from the target machines

You can then do, from any computer:

wakeonlan -i IPADDR MACADDR

Of course, in terms of security it is a bit worrying that this is possible, because there is no authentication. You can be reassured by the fact that an attacker would need to know your MACADDR to be able to wake up your machines in your stead, but as it is sent in the clear when doing a Wake-on-LAN, that's not so reassuring. Also, you have to hope that your network controller, with its fancy Wake-on-LAN features, does not have flaws that could allow an attacker to do more than just wake the computer up...

To test the setup, an easier way than actually suspending the machine, especially if you can't resume it manually in case of failure, is to check with socat that the target machine can actually receive the magic packets. On the target machine, do:

sudo socat -u udp-recv:9,reuseaddr -

Now test sending the magic packet; if things work correctly, some garbage (corresponding to the raw packet) should appear. If nothing appears, the packet is not being received by the target machine and you should troubleshoot this.

If the target machine can receive the packet, you can make it suspend to RAM using pm-suspend (package pm-utils in Debian). Do the following on the target machine:

sudo pm-suspend

The target machine should go to sleep, spending a minimal amount of energy to keep the RAM powered. Now, sending the magic packet from another machine should cause the target machine to resume. Hopefully all network interfaces should become usable (and all hardware should remain in the same state as before), so you can then ssh into the target machine a few seconds later, use it in the normal way, and just put it back to sleep with pm-suspend once you're done with it.

Minifying files with Delta

— updated

In many situations, e.g., when filing bug reports or asking questions on mailing-lists or forums, one needs to take a file which triggers a certain behavior and reduce it to a file of minimal size that still triggers the behavior. For instance, you have written a long program that makes your compiler segfault, and you want to extract from it a minimal program that does the same. This is called minification, and the minimal file is often called a minimal working example.

You can minify your file by hand, testing again each time you remove something, but this is quite inefficient. This post is a brief tutorial on how to use the tool Delta, which does this automatically.

First, you should install Delta. On Debian systems, it is packaged as delta, and its main command, that we will use, is named singledelta.

Second, the interesting part, you should create a shell script test.sh that takes a file as parameter and decides whether this file triggers the behavior of interest, returning 0 if the file is interesting and 1 if it is not. singledelta will use this script to test intermediate versions of the file while minifying.

For instance, to detect a segfault:

#!/bin/bash
myprogram --option "$1"
if ! test $? = 139; then
  exit 1
fi
exit 0

To test whether the output matches the contents of file "reference":

myprogram --option "$1" > output
! diff output reference

To test if the standard output or standard error contain the string "problem":

myprogram --option "$1" 2>&1 | grep problem

Third, you just copy your original file to a different name, say "minified_file", then run singledelta, which will minify "minified_file" in-place.

cp original_file minified_file
singledelta -in_place -test=./test.sh minified_file

The process is very chatty. Once it completes, "minified_file" is a file that still triggers the behavior and is as small as possible.

Well, technically, this is not true, because I have observed that in some cases, for reasons unknown, rerunning singledelta again on the supposedly minified file can minify it further. I have written a trivial script to run singledelta repeatedly until the file no longer shrinks. Use it thus:

cp original_file minified_file
manydelta ./test.sh minified_file

Once again, this will minify "minified_file" in place by invoking singledelta repeatedly. Of course, once the process has completed, you may still be able to apply human intelligence to minify the file further in ways that singledelta cannot do. Indeed, singledelta only tries to remove lines, it will not, e.g., shorten identifiers or strings.

If you need more advanced features, Delta can also be used for other things, e.g., running on multiple files. See for instance this guide.

Mobile apps I use

Smartphones nowadays seem to have a "social" aspect, where people talk about which hardware and accessories they use, and which apps and services they rely on. I don't care so much about this as about my computer setup, but it seems that for most other people the opposite is true. Hence, this post describes the software part of my smartphone setup.

For reasons related to privacy, ethics, and ideology, I try to avoid proprietary software to the extent possible. Hence, my phone runs mostly free software (with exceptions, see below). This implies that my choice of apps to run on my phone is limited, which is part of the reason why my phone is a comparatively unimportant part of my general computing setup, and why this list of apps is short and easy to write. Other reasons why my phone is not important are:

  • personal preference: I'd much rather use a computer with a real keyboard than a phone, and I often have a computer at hand.
  • ethics: the phone ecosystem is much less open-source-friendly and much less privacy-friendly than the computer one (at least when you run Linux); on phones it takes way more effort to evade proprietary software, anti-features, tight integration with cloud services that I don't want, etc.;
  • laziness: it's nevertheless clear that I should be doing more with this Internet-connected computer I always carry with me, even if it means I have to write my own software, but this would be a lot of work.

CyanogenMod

I have never tried non-Android smartphones. Apple and Microsoft smartphones are obviously a no-go in terms of freedom and privacy, but Firefox OS and Ubuntu phone look like decent options that I would need to investigate. For now, however, I use Android.

The precise operating system I use is CyanogenMod. There are multiple reasons. I prefer using a system which is (in principle) community-managed, and I prefer using one I installed myself rather than the default one (which I feel is more likely to come with anti-features). Further, CyanogenMod's slight power-user focus means that it sometimes includes some additional options and/or is less eager on hiding advanced settings to protect users against their own foolishness.1 The most important reason, however, is that I know no easier way to have a mostly open-source Android system without the proprietary Google applications. I do have a Google account for various reasons,2 but nothing I'd like to associate with my phone. Using CyanogenMod allows me to have an Android-system which is as Google-free as possible.3

A slight downside of this choice is that you cannot use some of the proprietary Google applications that have nice features even though they shouldn't in principle require a Google Account, e.g., the Google Maps app. Instead, I use the Web version, even though it is inferior. The major drawback, however, is that you cannot use the Google Play Store client. As it is not possible otherwise to download the APK packages of applications (even the freeware ones), it means you cannot install any of the Android apps which are only available on Play Store, i.e., a vast majority of them. No Google apps also means no Google Play Services, and even some apps available outside of the Play Store, e.g., TextSecure, cannot be used then because they require them (see, e.g., TextSecure's FAQ entry about this).

For many real-world services (banks, restaurants, cinemas, etc.), as "mobile apps" means "Android and iPhone", and as "Android" means "Google Play", this means that I cannot use their apps. This would not be a problem if such services had decent Web interfaces, but many of them spend most of their efforts polishing their closed apps rather than their websites, so it is sometimes annoying.

Another downside is that CyanogenMod still contains proprietary software and drivers, so it would be better to switch, e.g., to Replicant, but its support of my phone is not sufficient for me to consider it yet.

F-Droid

To install apps and keep them up-to-date, instead of the Google Play Store, I use F-Droid, a repository of apps which only includes free software.

The main drawbacks of F-Droid is that there are comparatively little applications available (1563 as of this writing, against 1.4 M on the Play Store), and that the interface is a bit primitive in some respects (e.g., apps are not downloaded in the background, apps must be manually updated one by one, etc.). The advantage is that there is only free software, and that the developers are serious about identifying anti-features, building apps without tracking or advertising libraries, etc. This saves me the effort of figuring out whether all those apps are ethical.

OsmAnd

To display maps and compute routes when moving my body in the real world, I use OsmAnd, which relies on OpenStreetMap (OSM) data. OSM is a free map database which is the main open-source competitor to Google Maps. OsmAnd uses Android's feature to figure out the phone's location (using GPS, for instance), and it can either draw maps retrieved directly using an Internet connection, or it can draw them from offline OSM data downloaded beforehand, which requires no data connection. The offline feature is extremely useful abroad, where data is unreliable and prohibitively priced, and it is something that Google Maps does not always allow (for licensing reasons that don't exist with OSM).

Except the nicety of being usable offline, OsmAnd has a lot of drawbacks relative to the Google Maps app. The interface is ridiculously counter-intuitive. It is sometimes sluggish (especially when drawing the offline maps); this is quite legitimate as it's harder to decompress and draw maps than to just fetch them, but it seems like it could be improved (for instance, it doesn't seem to cache a lot of what is drawn). Offline routing is quite brittle. There is no public transportation routing, and of course no Street View. Also, addresses need to be input in a structured format, there is no good unstructured search.4

As a proud OSM editor, I use OsmAnd's editing features to fix OSM or leave notes whenever it does not match the real world.

WeeChat

I use IRC to chat, mostly on a self-hosted server, and I use XMPP via a bitlbee gateway. On my server, I use weechat. To support this on my phone, I use the Weechat-Relay Android client.

The reason why I use weechat is mostly because of this relay feature, to replace a former setup with irssi and the bip proxy which was unmaintained and had bugs. Still the result doesn't really work that well, and I can't really trust weechat to reconnect reliably whenever my data connection disappears and reappears. (In contrast with previous setups, however, this one has a crucial feature which isn't necessarily a given: when the Android client displays a message that I sent, then it has got the confirmation that the message was actually sent; it won't display my messages before sending them and then silently realize it cannot send them because it disconnected.)

ConnectBot

I use ConnectBot as an SSH client to my various machines. I use it whenever I need to check something on the machines, be it searching mail archives with notmuch,5 administering mailing-lists with listadmin, anything really...

I also use ConnectBot to create SOCKS proxies, for various purposes: whenever I need to evade network restrictions; whenever I have the feeling that my data connection lags when opening new TCP connections but is usable otherwise; when I want to use Tor on my phone through one of my machines. Port forwards and SOCKS proxies can be configured by long-tapping hosts in ConnectBot.

DAVdroid

I use DAVdroid to synchronize my calendar with my server: more details in my blogpost about this. I use the default AOSP Calendar application to manage my calendar.

K-9 mail

I use K-9 mail to manage mail on my phone, accessing my account via IMAP. I use IMAP IDLE to get my mail when it arrives without having to refresh periodically. I haven't yet invested the time to support reading OpenPGP-encrypted mail on my phone.

When I'm abroad and data is expensive, I use a procmail filter to send myself an SMS for each incoming mail using my mobile provider Free's API. I pipe the mail through this script and then through an invocation of this script. This gives me the sender, subject, and the first few lines of the message. That way, I can at least read email in real-time for free, and punctually enable data when I want to get the full email, or reply to it.

Aard Dict

I query Wikipedia very often, and I want to be able to do so when I have no Internet access, and to do it fast when data is too slow. For this, I use Aard Dict, with the French Wikipedia. (I would prefer the English one for most purposes, but it is too large, given that I also want to store music on my phone.) The dump is almost two years old, but it doesn't matter so much for most of what I need it for. Beyond their use when fact-checking, Wikipedia pages make for fairly interesting reading when bored.

I don't use yet version 2 of Aard Dict, because it hasn't been packaged for F-Droid yet. I'm not in a hurry, I'm quite satisfied with version 1.

On my computers, where I have more space, I use Kiwix with more dumps, see the blogpost about this. This is very useful on planes or in other places with no or crappy Wi-Fi.

Firefox Mobile and default browser

To browse the Internet, I use either the default browser or Firefox for Android. I'm not especially fond of either, especially their sluggishness, and their inexplicable habit of flushing pages after they have been downloaded, so that if there is no longer a data connection they can't be displayed anymore.

I need Firefox Mobile whenever I need to use a SOCKS proxy, because the default browser doesn't seem to support it: configuring Firefox Mobile to use a SOCKS proxy must be done via the about:config page, with the following settings: network.proxy.socks, network.proxy.socks_port, network.proxy.socks_remote_dns, and network.proxy.type. Conversely, I rely on the default browser's tolerably functional feature to save pages for offline viewing.

Anki

I haven't used it in a while, but Anki is a very good flashcard application, and spaced repetition, that this app implements, is a nice way to optimize writes to your brain.


  1. Not that I'm not foolish, but with software that takes it for granted that you are stupid, you usually have little opportunity to become wiser. 

  2. The main Google services I use are Webmaster Tools, Alerts, Code Jam, Hash Code, and the occasional Google Calc shared document. 

  3. CyanogenMod, however, is certainly not perfect in terms of Google-proofing, but it's better than other Android alternatives I know of.  

  4. Google Map's search is extraordinarily good, with a brilliant ability to disambiguate between place names, addresses, search queries, and good tradeoffs between popular places and impopular but nearby places (depending on where the user is currently looking). I have the impression that this must be a major engineering achievement precisely because no one notices how hard it must have been to pull off. 

  5. One day I'll write more about my email setup... 

Cruising at -41.8 million feet: Antipodal airports

In the original version of my previous entry about the most distant airports by minimal number of connections, I made an offhand remark about the alternative problem of finding the two most geographically distant airports. I originally answered it by copying a random answer to that question from the Web without thinking much about it, but after stumbling upon a different answer, I started having doubts. After more investigation, it appears that the Web has no clue what It is doing, and that these answers were erroneous.

Having recomputed everything myself from the OpenFlights dataset, I believe that the two most geographically distant airports in the world are the following, with a distance of 20002 km:

Dataset and preparation

OpenFlights gives you a (slightly noisy) dataset with airport codes, latitudes, longitudes, and altitudes. As I am just interested in the one most distant airport pair, and not in the complete rankings, I just had to clean up the one bad offender that I saw, namely, the Budapest Keleti station and its mysterious antipode double: it is probably a dataset bug, and anyway this is a train station so it has limited aerial value. See the whole script for details about the preprocessing.

The silly method: distances on cylinders, and equirectangular projections

The latitude is a decimal value between -90 and 90 that describes on which parallel a point is located: 0 is the equator, 90 is the Geographic South Pole, and -90 is the Geographic North Pole. The longitude is a decimal value between -180 and 180 that describes on which meridian a point is located: 0 is the IERS Reference Meridian, positive values go eastwards, and negative values go westwards.

What's the dumbest way to find out, given a bunch of such latitude/longitude coordinates, which points seem to be exact opposites on Earth? Well, just compute the latitude and longitude of the antipodal point, namely, the opposite point on Earth relative to the center of the Earth. This is fairly easy:

  • given a latitude \(\phi\), the antipode has latitude \(-\phi\) (it is opposite with respect to the equator);
  • given a longitude \(\lambda\), the antipode has longitude \(180+\lambda\) (i.e.,you go around the globe for half a turn) except you have to bring it back to \([-180, 180]\) by adding or subtracting 360.

Now, you estimate the distance of the second point to the antipode of the first point in the crudest possible fashion: just pretend that the latitude and longitude are two-dimensional coordinates and use the Euclidean distance, where the square brackets \([\cdot]_{-180, 180}\) denote the operation of going back to \([-180, 180]\).

\(d_{\text{e}}((\phi_1, \lambda_1), (\phi_2, \lambda_2)) = \sqrt{(\phi_1 + \phi_2)^2 + [\lambda_1 - \lambda_2 - 180]_{-180, 180}^2}\)

Of course, this distance estimation is wildly inaccurate. In geographical terms, it amounts to computing distances directly on the equirectangular projection of Earth. This is inaccurate, because the Earth is not flat, and the distortion in distances depends on latitudes and longitudes. In particular, distances near the poles are wildly overestimated. Yet, my friend Mc, when he suggested this crude approach to me, claimed that this would probably not matter much, because the distances in consideration for nearly antipodal airports are small and there aren't many airports near the poles. As we will see, he was right.

To perform this computation, I simply evaluated it on all pairs with a simple C++ program, that completes in seconds. To get a distance estimation from this, I subtract the distance between the antipode and second point to the distance of going around the Earth obtained from the Earth's mean radius. The best pairs are here, sorted with more distant pairs at the bottom (i.e., the most antipodal, so the most interesting). Hence, according to this method, the best pair is the following:

I did not check that the other results corresponded to anything sensible (and not, e.g., train stations that don't really exist), so take them with a grain of salt. For NVA-PLM, however, you can check from online sources that these airports indeed exist at the given coordinates.

The sensible method: distances on spheres, and haversines

Can we do better? Well, let us notice that the Earth is not a cylinder but a sphere, with mean radius of 6,371.009 kilometers according to the IUGG. Given two airports, what interests us is the length of the shortest route from one to the other on the sphere, which is known as the great-circle distance: the name is because the route from one point to the other will follow a great circle, a circle on the sphere whose center is that of the sphere. The standard way to do so is to use the Haversine formula. So I just plugged it in the previous code.

The results are here, and we can see that Mc was right: they don't change much. In particular, the top pair is still the same, though our distance estimate changed by 3 kilometers, i.e., quite a lot compared to the distance to the second best pair. The first difference is at position 6, where the 6th best pair and the 8th best pair are swapped. The result file contains all 3435 pairs estimated to be at a distance of at least 19800 km by the Haversine formula, with the computed distances differing from up to 107 km compared to the results of the previous section on these pairs.

The serious method: distances on ellipsoids, and WGS-84

Do you recall anything from school about the Earth not being a perfect sphere, but being a bit rounder at the Equator? How bad is this? Well, the radius varies by about 33 km; it's small compared to the overall radius, but huge relative to the distance differences between our top airport pairs. So we have to be more precise to get a more definite answer.

Fortunately, serious people already know how to deal with this problem, and model the Earth, not as a sphere, but as an ellipsoid, more precisely an oblate spheroid. An ellipsoid is less scary than it sounds: it is just what you get by rotating an ellipse around the North-South pole axis, rather than a circle (in which case you get a sphere). The most common such model seems to be the reference ellipsoid of WGS-84, the World Geodetic System used by GPS. I wouldn't want to implement geodesic calculations on ellipsoids, but fortunately other people have done it before: I used GeographicLib, CLI bindings of which are packaged for Debian as geographiclib-tools. For instance, the GeodSolve -i tool eats latitude/longitude pairs as input and produces as output two azimuths (that we don't care about) and more importantly the distance of the shortest path from one point to the other.

I wasn't sure about the feasibility of feeding all 32 million airport pairs to this tool to perform complicated computations, so I restricted the study to the 3435 top pairs given in the previous output. This is more than sufficient to be sure not to miss anything: the difference in radii is < 33 km and the margin to the best is > 200 km, which is clearly more than the maximal error we could have made. The computation on these 3435 pairs is essentially instantaneous: the results are here.

Observe that the top pair changed to the one given at the beginning of this post, with its longest distance estimate decreased by 8 km. Overall, the error between the Haversine and ellipsoid estimates is at most 23 km on these 3435 pairs. Of course, discriminating the first and second best pair, with only about 700 m difference, is a bit problematic, as the "position" of an airport is hard to define (and 700 m is less than the length of a runway...). So I picked the first one as a winner, but we start hitting the limits of the definition of our problem.

The surreal method: accounting for altitude

Having reach the limit, let us go even further! The Earth is not flat, is not a sphere, but it is not a perfect ellipsoid either. Notably, it has local variations in radius, a.k.a. mountains. What if the altitude of the airports changed the ranking? Two airports with a large altitude difference could be more distant than a more antipodal pair where both airports are at sea level, because of the need to cover the altitude difference. Further, even if two airports have the same altitude, the distances between both are greater if the altitude is high, because you will travel further from the ellipsoid center. This is not entirely negligible as there are airports with an elevation of 4-5 km, which is larger than our 700 m margin between the top pair and the second one.

Of course, thinking about it will make you realize that we cannot conceivably account for this. If you start taking precise altitude into account, the shortest path between two points may be twisted because of the need to avoid mountains. Further, this question is totally disconnected from reality: commercial flights going from a point to another first ascend to a cruise altitude of about 11-12 km, and then descend to the target, and of course this contributes to the distance travelled, and means that accounting for airport altitude would not be doable.

As a purely theoretical exercise, though, let me continue to model the Earth as the WGS-84 ellipsoid, with no mountains, and let us position the airports at their latitude/longitude and with an altitude that elevates them away from the spheroid's surface. Now, we must bound the length of the shortest path between two given airports. Clearly, the shortest path with altitude is at least the length of the path at sea level, which we computed before. Further, it is at most the sum of both altitudes plus the sea level path, because one possible path is to go down to sea level, travel, and go up again, so this path must be longer than the shortest path. I computed this dumb upper bound, and it seems like, for our 3435 pairs, the bound suffices to show that the path with altitude is never longer that that of the top pair. So taking altitude into account cannot imply that other pairs will beat the PGK-LMC pair.

The subterranean method (bonus)

In all this post I have made the implicit assumption that we are going from an airport to the other while remaining above sea level. However, the literally minded will note that when I talked about the "most distant" airport pair, I never said anything about having to go around the big spherical, er, ellipsoidal, obstacle that bars the way. So, neglecting the existence of planet Earth, what are the two most distant airports by straight line distance in three-dimensional space, going straight from one to the other?

To answer this pressing question, we can again rely on GeographicLib, whose CartConvert program takes latitude, longitude (and even altitude!) and produces Geocentric coordinates, i.e., coordinates in a three-dimensional Cartesian coordinate system. The straight line distance is then computed with the usual Euclidean distance formula:

\(d_{\text{e2}}((x_1, y_1, z_1), (x_2, y_2, z_2)) = \sqrt{(x_1 - x_2)^2 + (y_1 + y_2)^2 + (z_1 - z_2)^2}\)

Computing this only for the restricted pair subset of the previous sections, we obtain this. Amusingly, the most distant pair by this criterion is NVA-PLM, the same one as with our initial method (which also used Euclidean distance but in two-dimensional space). As far as I can tell, this is entirely coincidental. Note that the distance of 12758 km is slightly more than the average diameter of Earth (though it is less than the maximal diameter, of course): this is explained by the fact that these airports are close to the equator.

Confidence in the result

I now conclude by giving more information about whether my results can be trusted to be accurate.

The GPS coordinates for LMC and PGK in OpenFlights are essentially the same as those given in Wikipedia and are both, according to OpenStreetMap, at some position on each of these airport's main runway (LMC, PGK), which is further confirmed by the satellite imagery of Google Maps (LMC, PGK).

The fact that the distance between these two points is indeed 20001571.135 meters was computed directly by GeographicLib. However, here I must point out that this result seems to disagree with many websites that offer such calculations. This is probably caused by the fact that nearly antipodal points are corner cases for geodesic computations on ellipsoids.

It is conceivable that the sources that overestimate the distance (except Great Circle Mapper, where the difference is too high, and otherwise explained) are using a geoid model, to account for variations in radius beyond what the WGS-84 ellipsoid accounts for. This is beyond the scope of what I did.

That this pair is indeed the most distant depends on whether positions for other airports in OpenFlights are accurate, and whether this source is exhaustive, i.e., contains all airports. I did not really try to check any of these two points.