a3nm's blog

It's a big world after all: Computing the diameter of the flights graph

— updated

What are the two most distant airports on Earth? A natural definition is by geographical distance: see my followup post for this definition. However, an alternative question asks what are the two most distant airports in terms of the number of different flights required to go from one to the other. For the most geographically distant pair, LMC and PGK, the answer appears to be 5 flights, because you must go to BOG, then fly through e.g. FRA and BKK to CGK, and then to PGK. Are there other pairs where more flights are needed? Note: this article was written in 2015, so some of the info here may have gone out of date. Proceed at your own risk.

The only relevant reference I found is this thread. Quora has the awful habit of blocking the content to unregistred users, so I will summarize it: at the time of writing, the only answer estimates that there should be no airport pair requiring more than 5 flights, as you should always be able to fly via a regional hub, a global hub, a global hub, and a regional hub. There is also this presentation which claims that the diameter of the flight graph is more than 20, but no examples are given. This post tries to give a more definite answer; as we will show, up to 8 flights can be needed.

Dataset

While any form of precise timing or pricing information about flights is hard to obtain, one would think that a public database of flights with flight number, company identifier, origin airport, and destination airport, should not be hard to come by. This is not the case: the only public structured database I could find is the OpenFlights route database, which is apparently derived from the one distributed with the Airline Route Mapper, a mysterious proprietary Windows program with no contact information other than a FlyerTalk thread.1 The dataset does not include the flight number, which we will see is extremely problematic.

We can start cleaning up the routes by removing codeshare flights and flights by airlines that the OpenFlights airline database lists as defunct. For flights with an airline unknown to OpenFlights, we generate fresh airline identifiers instead. See the exact code.

Computation

Now we just compute the diameter of this graph, i.e., the maximum over all airport pairs of the length of the shortest path from one airport to the other, using a straightforward C++ program that implements Floyd-Warshall with path reconstruction. This runs in cubic time in the number of airports (around 3100), taking about 20 seconds on my machine.

The problem is that OpenFlights is not perfect, also because of the fuzzy definition of what a flight is. Indeed, the naive result gives the following paths of length 15:

15: XEQ -> YPO: XEQ QUV QFN JNN JJU UAK GOH KEF YYZ YQT YTS YMO YFA ZKE YAT YPO 
15: YPO -> QFN: YPO YAT ZKE YFA YMO YTS YTZ IAD KEF GOH UAK JJU JNN XEQ QUV QFN

However, investigating on the website of Air Greenland2, Tasiusaq Heliport and Nanortalik Heliport are connected by a direct flight (GL9402), with the intermediate stops listed as "technical stops". It a bit problematic to call those changes of flights, as you can probably just remain in the aircraft. This could be accounted for if only we had the flight number in the dataset, but sadly we do not have it. Likewise, Thunder Airlines seems to be selling the flight from Timmins to Peawanauck as one single flight (flight 105).

So I tried to fix these problems until I found an example that I liked, which required an surprising amount of manual intervention. I added a bunch of missing flights by considering stopovers with the same flight number as being free. I had to merge a few airports because some complicated routes were jumping through hoops to get from the domestic airport of a city to its international airport: for instance, to go from Sde Dov to Ben Gurion by plane only, you need to fly to Eilat and back. Last, I had to eliminate many airports for which no flights could actually be bought online from mainstream providers. I only retained flights featured either on Google Flights3 or on Skyscanner. This extremely tedious iterative process took me several hours and I don't know why I did it, but I did, hopefully without introducing grave errors. All paths from length 15 down to length 9 turned out to be spurious according to my criteria.

An eight-flight result

Here is the result for eight flights, after sufficiently refining the dataset to rule out all longer paths as they were erroneous. Now, the vast majority of these paths are probably crap, but here is an example of one that seems correct to me:

8: AUY -> IOQ: AUY TAH VLI SYD LHR KEF KUS CNP OBY

Indeed, Anatom Airport in Vanuatu seems to be served only by a flight to Tanna, from which it is only useful to connect to Port Vila; Air Vanuatu sells them as separate flights (NF253 and NF239) with different planes, and you can buy the tickets from any online source (beware, however, it seems that the flights only run on Saturdays and Tuesdays). From Port Vila you can fly only to Oceanian hubs, for instance Sydney, so you need three hops in total to get there.

Conversely, to go to Ittoqqortoormiit Heliport in Greenland, the only way is from Nerlerit Inaat Airport (and only on Wednesdays). To go there, no matter what the Wikipedia page says, it seems that there are only two reasonable ways: with Air Greenland from Kulusuk on Wednesday early afternoon (in time for your connection), which you can reach with Air Iceland (in the earlier morning, with a tighter connection) from the domestic Reykjavík Airport; or directly with Air Iceland (and on Wednesdays, but too late to connect) but only from Akureyi in Iceland, which you can then reach from Reykjavík on the same day (the two legs are flown by the exact same plane, TF-JMT, but they are sold with different flight numbers, NY-112 and NY-511). So that's three hops altogether from Reykjavík in either case. (Alternate options are to arrive from Nuuk, which is e.g. reachable from Copenhagen, but they are not competitive.) The path via Kulusuk can be bought easily from many online sources (but not Google Flights), though you may have to buy separately the Air Iceland and Air Greenland parts.

Now, to go from Sydney to Reykjavík, I claim that you need at least one stop. This is clear4 because a direct flight would beat, by far, the current longest scheduled flights. However, some flights from Australia to Europe are marketed as a single flight even though they include a refueling stop. This is particular the case of flights BA16 (via SIN) and QF1 (via DXB). Fortunately, however, no such flight goes directly to Reykjavík, so even though you can go from Sydney to London in one "flight" with this trick, you need an additional flight to reach Reykjavík.5 Hence, following our choice of sticking with flight numbers, that's two flights.6

So we have a total of eight flights. Of course, while many websites offer the individual legs, none of them will manage to sell them simultaneously. Still, eight flights is quite a far cry from the five flights initially conjectured. Of course, the fuzziness of the problem means that this numbers does not mean that much: if you include flights that can only be bought from their own companies, charter flights, count differently stops that involve no change of aircraft, etc., your mileage may vary. In particular, the "flights" from SYD to LHR probably require you to leave the plane, whereas I'm not sure that the "change" in Akureyi would require doing this, so this illustrates a bit the absurdity of the problem if you look too closely.

There is a related question on travel.SE.


  1. This leads to an awful dilution in responsibility: you cannot report errors in the data, because OpenFlights says (e.g., in this bug report) that the data comes from a third party, but that third party has no apparent way to report bugs... As of this writing it looks like the discussion is happening on FlyerTalk and there is no way yet to edit the dataset collaboratively... 

  2. By the way, how ironic is it that the website of Air Greenland is one of the neatest and fastest airline websites I ever interacted with? 

  3. I did not keep Google Flights options for which the price had to be checked on the airline website: the redirection usually points to the website's landing page and not to a specific flight, suggesting that Google probably has as little information about the flight as I do. 

  4. A previous version of this post claimed that a direct flight from SYD to KEF would have been impossible because of fuel requirements. This is incorrect, as there have been longer exceptional flights. Thanks to Mc for pointing this out. Also note that, from 2018, [there is now](https://en.wikipedia.org/wiki/Longest_flights#Non-stop_flights_(top_30,_by_great_circle_distance) a direct Quantas flight between London and Perth. Again, all data in this post may have gone out of date. 

  5. What's slightly more surprising is that using such flights is the only solution, e.g., were it not for that marketing trick, three flights would be required to go from Sydney to Reykjavík, not two. I guess that this is because KEF is not connected to anything sufficiently close to Australia or Asia that could play the role of the single stop. 

  6. If you plan to schedule a trip, the better option is then probably to leave AUY with the Saturday flights, to reach KEF on Monday evening, and then take the Wednesday flights and hopefully reach OBY on Wednesday evening. I don't know if anyone ever tried to fly from AUY to OBY, but somehow it looks unlikely. In terms of cost, if you agree to give up on the absurdly expensive "direct" flights from SYD to LHR, the price would be around 2500 EUR for a one-way trip, with around 34:30 hours spent in airplanes. 

Arbitrary-length unambiguous buffalo sentences

Consider the well-known sentence:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

As Wikipedia explains, this is a grammatically correct sentence using the ambiguity of Buffalo (the town) and buffalo (either a noun, the animal, or a verb, "to bully"). Let me illustrate the grammatical structure of the sentence, using words from the corresponding parts of speech, and using square brackets to help the parsing:

Boston deer [(that) Seattle doves annoy] pester Austin kittens.

The Wikipedia article points out that buffalo sentences can be created with any number of repetitions of the word buffalo. For instance:

Buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo buffalo buffalo.

One can easily interpret this as, e.g.,

Puppies tickle Chicago unicorns [(that) Dallas ponies [(that) squirrels mock] nag].

In other words, considering squirrels that mock Dallas ponies that nag Chicago unicorns, then there are puppies tickling such unicorns.1 For legibility we can also write this in the following way, using Nouns, Verbs, and Cities:

N V (C N (C N (N V) V))

The problem is that this sentence can equivalently be understood as:

C N (C N (C N V) V) V

In my silly rephrasing language:

Houston birds [(that) Phoenix bunnies [(that) Memphis butterflies bother] peeve] displease.

Note that I'm allowing here an intransitive use of buffalo (or "displease" in the gloss) as a verb, which I guess is OK. (Indeed, Wikipedia claims that Buffalo! is a valid imperative sentence.)

Hence, the above buffalo sentence has the unpleasant aspect that it can be parsed in many different ways. By contrast, the original one seems to have only one reasonable parse, as long as we can rely on case to distinguish Buffalo and buffalo. Hence the simple and tantalizing question: can we construct arbitrary-length buffalo sentences that are unambiguous, i.e., have a single grammatical interpretation?2

In the rest of this post I show the following, for a certain choice of grammar which I hope covers all reasonable interpretations: for any length greater than one, the (unique) case-insensitive buffalo sentence is ambiguous; by contrast, if we are case sensitive, we can build unambiguous buffalo sentences of any length, even if we have case ambiguity on the first word.

Grammar

I will use the following grammar, in simili-BNF:

S    ->   V NPO | NP V NPO
NPO  ->   NP    | C        | ε
NP   ->   NPS   | NPS PP
NPS  ->   N     | C N
PP   ->   NP V
C    ->   Buffalo
V    ->   buffalo
N    ->   buffalo

In other words:

  • A sentence is either an imperative sentence or a declarative sentence, both of which have an object noun phrase.
  • An object noun phrase can be a noun phrase, the empty sequence (corresponding to an intransitive use of the main verb) or a single Buffalo. The last case handles the case of Buffalo buffalo Buffalo understood as "Dragonflies tease New York" (i.e., its inhabitants). I only allow Buffalo to be an object, because my understanding of grammar suggests that sentence's main verb should be "buffaloes" if Buffalo were the subject.
  • A noun phrase is a simple noun phrase optionally followed by a relative proposition: we will be able to nest relatives but importantly we cannot juxtapose them.
  • A simple noun phrase is either buffalo (the noun) or Buffalo buffalo (the city followed by the noun).
  • A relative proposition has a subject noun phrase and a verb, with the object of the verb being the antecedent of the proposition.

This grammar can uniquely parse the example sentence,3 and includes a few additional features not exemplified by that sentence, which I already commented about: buffalo as an imperative and/or intransitive verb, and Buffalo used as an object.

Case-insensitive case

In the fully case-insensitive case, we write everything as buffalo. As we can show, all sentences of length greater than 1 are ambiguous.

The reason is that S -> V NP and S -> NP V are two different possible parses, and it is not hard to see that we can construct NPs of arbitrary length: we can choose any number n of nested PP in the derivation, and choose either N or C N at each GN, of which there are n+1. So to realize a length of k, choose to create n=(k1)/2 PP, and create kn words in the GN (which is between n+1 and 2(n+1)).

As an example, for buffalo buffalo buffalo buffalo buffalo, parse buffalo buffalo buffalo buffalo as NP, e.g., as, N (C N V), and parse the sentence either as V NP or NP V.

Case-sensitive case

We now assume that we can distinguish Buffalo and buffalo, except maybe for the first word, intuitively because it must carry a capital because it starts the sentence. Let us consider small lengths as examples.

  • For length 1, the sentence Buffalo can only be interpreted as V, namely, the imperative with no complement. My grammar forbids the declarative N, a nominal sentence declaring the existence of buffaloes, but of course this is debatable.
  • For length 2, Buffalo Buffalo has exactly one interpretation because the last Buffalo must clearly be a C, so the unique parse is V C. Again, we are prohibiting the nominal sentence asserting the existence of buffalo in Buffalo.
  • For length 3, considering Buffalo Buffalo buffalo, the first Buffalo cannot be a C as otherwise the next symbol should be buffalo, and it cannot be an N as otherwise the next Buffalo buffalo must start a PP that cannot finish. Hence the first Buffalo is a verb and we have a unique interpretation: V (C N).

For larger lengths, we will first observe that if any sequence of buffalo and Buffalo can be parsed as an PP, then there is only exactly one way to do so. To see why, observe that Buffalo must always be followed by buffalo in which case we always parse both symbols as an NPS. So write b for buffalo and b2 for Buffalo buffalo, and rewrite our sentence (uniquely) as a sequence of b's and b2's. We then see that each b or b2 at the left must pop a b at the right, so that parse fails unless the number of b and b2's is even and the rightmost half contains only b's; conversely, if this condition is satisfied, there is a unique parse. So we have proven our initial observation.

Second, we claim that for any (possibly empty) sequence S, there is at most one way to parse buffalo S or Buffalo buffalo S as an NP. This is clear as we must parse buffalo or Buffalo buffalo as an NPS and the rest as an PP, and we can then use the previous reasoning. It is not so hard to see then that this claim extends to the situation where the first symbol has ambiguous case. Indeed, for any sequence S, at most one of S and buffalo S has a successful parse as a PP (because, rewriting them using b's and b2's, only one rewriting can have an even number of symbols). Hence, for Buffalo buffalo S, only one option succeeds between parsing the initial Buffalo as an N (using case ambiguity) and buffalo S as a PP, or parsing Buffalo buffalo as an NPS and S as a PP. Likewise, for buffalo S, either we parse buffalo as an N and S as a PP, or we parse buffalo as a C (using case ambiguity): in this second case the first symbol of S cannot be Buffalo, so that, writing S as buffalo S', we parse S' as a PP. By the previous reasoning only one of these two options can succeed, and if it does then it succeeds in one unique way. So we have proven our initial claim.

We now show that for any non-empty sequence S, the only way to parse S buffalo Buffalo as a sentence is to parse S as an NP. This is because a final Buffalo is necessarily an NPO. Now, as S buffalo contains at least two symbols, it cannot be V, so it must be NP V, so the buffalo is V and we must parse S as an NP. Hence, combining this with our previous claim, for any sequence S, there is at most one way to parse the following sentences:

  • Buffalo S buffalo Buffalo
  • Buffalo buffalo S buffalo Buffalo

It is now easy to define recursively the language of sequences S that have a successful parse as a PP (or as the empty sequence ϵ), abbreviating buffalo and Buffalo as b and B. By our initial observation, such parses are then unique.

  • L0=ϵ
  • Ln=wLn1{bwb,Bbwb}

So using L0 we can construct buffalo sentences with unique parses for length 3 and 4; and using the first rule in the definition of Ln we can then reach k+2 for any k that we can reach. Hence, as 3 is odd, 4 is even, and 1 and 2 were previously covered, we have shown our claim: we can construct uniquely parsable buffalo sentences of any length, even with ambiguity on the case of the first word.

To illustrate, here is an example of a sentence of length 42 which we have shown to have a unique parse.4

Buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo Buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo Buffalo.


  1. For all animal species in the sentence, it seems semantically unclear whether the individuals are existentially or universally quantified. For instance, we do not know whether the sentence applies to all such unicorns or only some of them. 

  2. By "grammatical interpretation" I mean parsing, not meaning; I am not interested in possible doubts about whether the verb buffalo means "intimidate" or "hunting buffaloes", nor do I care about which buffalo species or Buffalo city the speaker is referring to. 

  3. If the initial Buffalo is V, then buffalo is an NP, we must parse a PP, Buffalo buffalo is an NP, we must parse a PP, then either we start unstacking the V and we are left with Buffalo buffalo which we cannot parse, or we further parse buffalo as an NP and then we can never unstack enough V. If the initial Buffalo is N, then for a similar reason we cannot parse what follows as a PP (if we unstack early then we are left with the second Buffalo where the verb should be), so the second buffalo is V, Buffalo buffalo is then NP, and once again we cannot parse a PP. So the initial Buffalo is C, and buffalo is then N, and Buffalo buffalo must then be a PP. We must unstack immediately, we must parse buffalo as the verb, and we have no more choices. 

  4. Of course, this is just an example, and in general unique parse sentences are not themselves unique. We can in fact build unambiguous sequences following entirely different schemes; e.g., we can also show that Buffalo Buffalo buffalo w has a unique parse (with the initial Buffalo being necessarily a V), for any w in the language inductively defined by L1={Bbb} and Ln=wFn1{BbBbbwbbb,Bbwb}

Configuring screens with xrandr

— updated

There are various GUI tools on Linux to change the screen resolution and configure the various screens of your machine, which is useful to do, e.g., when plugging in a video projector to give talks. From my experience, however, the right tool, the one that does not hide anything from you and allows you to get the job done, is the CLI utility xrandr.

This post is a quick guide to perform common tasks with xrandr. I'm certainly not an X guru, but maybe this can serve as an introduction about how to perform common display configuration tasks with it.

The starting point is always to see the current state with:

xrandr -q

This will print out information about all screens and outputs known to the system. For instance:

Screen 0: minimum 320 x 200, current 3360 x 1080, maximum 16384 x 16384
LVDS connected 1920x1080+1440+0 (normal left inverted right x axis y axis) 382mm x 215mm
   1920x1080     60.01*+
   1680x1050     59.95  
   1400x1050     59.98  
   1280x1024     59.89  
   1440x900      59.89  
   1280x960      59.94  
   1280x854      59.89  
   1280x800      59.81  
   1280x720      59.86  
   1152x768      59.78  
   1024x768      59.92  
   800x600       59.86  
   848x480       59.66  
   720x480       59.71  
   640x480       59.38  
DisplayPort-0 disconnected (normal left inverted right x axis y axis)
DisplayPort-1 disconnected (normal left inverted right x axis y axis)
DisplayPort-2 disconnected (normal left inverted right x axis y axis)
VGA-0 connected 1440x900+0+0 (normal left inverted right x axis y axis) 408mm x 255mm
   1440x900      59.89*+  74.98  
   1280x1024     75.02    60.02  
   1280x800      59.81  
   1152x864      75.00  
   1024x768      75.08    70.07    60.00  
   832x624       74.55  
   800x600       72.19    75.00    60.32    56.25  
   640x480       75.00    72.81    66.67    60.00  
   720x400       70.08

The first thing is to figure out which output corresponds to which of the screens you have in front of you. You can usually determine it from the output names: VGA, HDMI, and DVI usually correspond to the connector type on the motherboard. On the example output above from my work computer, VGA-0 is indeed an external screen plugged in my computer's VGA port. LVDS is usually a laptop's internal screen, which is the case in the example. Sometimes the names are off, e.g., on my home computer, a DVI port is mistakenly labeled as an HDMI port. Alternatively, you can identify screens by their modes if you know which resolutions they support. For instance, above, I know that my main screen supports 1920x1080, so it is necessarily the LVDS one. (If you don't know what a screen resolution is, read on, I'll explain it.)

If you're trying to configure an external screen or video projector, and it doesn't show up in the output of xrandr -q, then you have a problem that's deeper than just persuading X to use the output properly. You should probably investigate whether the kernel or X server can see anything (in case your output is not listed) or check the cable (in case the output is listed but indicated as disconnected). When debugging things, know that all external screen standards that I know of (VGA, DVI, HDMI) are perfectly plug-and-play, meaning that it is safe to play with the cables. However, in some cases (e.g., laptops), it could be the case that an output will not work unless the system has booted with a usable screen on that output, or unless some magic key combination is pressed to enable the output.

If all outputs that you want to use seem to be listed and have a screen connected to them, congrats! You can now configure them. The two simplest commands are the following. The first one enables one output that has a connected screen and sets it to its preferred resolution, which is usually what you want to do. The second one disables one output (any screen connected to it will usually go in power-save mode).

# enable output VGA1 and set its screen to its preferred mode
xrandr --output VGA1 --auto
# disable output VGA1
xrandr --output VGA1 --off

Now, how to decide what to put on each screen? The first possibility is to tile the display across multiple screens. This makes sense when you have a laptop and an external screen, or multiple screens on one machine, and want to show different windows on them. If you are trying to use a video projector, this is reasonable if you are using a fancy presentation tool like pdfpc that can show the slides on one screen and use your laptop screen to show the next slide, notes, etc. Here is an example to illustrate how this is configured1:

# my home setup is as follows, from left to right:
# one VGA screen, one DVI screen, one HDMI screen
xrandr --output VGA1 --left-of DVI1
xrandr --output DVI1 --left-of HDMI1

The second thing that you could want to do is to have two screens that show the same thing. For instance, for presentations using a non-fancy PDF reader, you usually want both your screen and the video projector to show your slides. This is achieved very simply:

# set VGA1 (the projector) to show the same thing as LVDS1 (laptop screen)
xrandr --output VGA1 --same-as LVDS1

However, if your projector is not set to the same resolution as your screen, your presentation will be clipped or mis-centered on one of the displays. Let me explain a bit about this. What a screen displays is made of atomic dots, aka pixels2, and the resolution of a screen is the number of pixels it displays, given as the width times the height, for instance 1024x768 (with the letter 'x' commonly used for the times symbol instead of the Unicode character '×'). Screen resolutions also have names, such as "VGA" (for 640x480), or "FHD" (for Full HD, that is3, 1920x1080). Screen resolution tells you nothing about the actual size of the display, which depends on the size of the physical pixels on the screen, aka pixel density4.

Screens can support multiple resolutions. In the vast majority of cases, the screen is designed to be used at its highest possible resolution, and lower resolutions look a bit ugly because they crudely scale up the lower resolution to the screen's physical resolution (the number of physical pixels it has). Screens can be set at a lower resolution as a poor man's version of zooming, although it is usually far prettier to do the resizing at the software level, e.g., use larger fonts, so that fonts are large but remain crisp. However, lower resolutions are also useful to show the same thing (e.g., your presentation) on two screens with different characteristics.

Doing xrandr -q indicates which modes are supported by the screen connected to each output, with a '+' next to the preferred mode (the one the screen is "designed" to be used in) and a '*' next to the currently used mode. Each mode consists of a screen resolution, given at the left, and a refresh rate5.

Thus, to fix the problem with --same-as, you should determine from the list what is the highest resolution supported by both outputs. (With some luck, it will be the projector's preferred resolution, so only your laptop screen will look a bit blurry; but this is not always the case, especially when aspects ratios do not match.) You should then set the outputs to that resolution; for instance:

xrandr --output VGA1 --mode 1024x768
xrandr --output LVDS1 --mode 1024x768

It turns out that, when you want to use a mode that is not proposed in xrandr -q, you may be able to add one yourself. To create a mode for 1024x768 at 60 Hz, you first run:

gtf 1024 768 60 | grep Modeline | cut -d' ' -f4-

Now you can run xrandr --newmode LINE, where LINE is the output of the above, and run xrandr --addmode DISPLAY NAME, where DISPLAY is the name of the display to which you want to add the mode, and NAME is the first field of LINE (the one in quotes). The new mode should now appear in xrandr -q, and you can try switching to it with --mode. This may fail in creative ways if your screen doesn't like the mode. Thanks to Pierre for pointing this out!

Once you are done giving your presentation, you will probably want to revert your laptop's screen to its preferred resolution using --auto, as previously explained.

You can do a bunch of other things with xrandr, such as rotating displays, applying transformation matrices to them, software brightness and gamma correction, zooming, etc., see man xrandr to know more. Playing with these options should be safe (although it's true that, in ancient days, there were stories about software ways to destroy CRT screens...). A real risk, however, is that if you make all your screens unusable (e.g., turn them off with --off, make them display garbage with bogus scales, flip them, etc.,), xrandr has no way to allow you to revert to sane settings automatically after a time, as GUI tools often do. You will have to undo your mistake in the blind. If you can't do this, one possible option is to switch to a TTY with Ctrl-Alt-F1, and use xrandr from there; you will probably have to pass it an environment variable so that it can talk to your X server, for instance:

DISPLAY=":0" xrandr --output LDVS1 --auto

Worst comes, of course, you can always restart the machine, as these settings do not persist across reboots.


  1. Experts will note that you can combine this in one invocation to the xrandr command, but for simplicity I will not address this here. 

  2. Pixel is the abbreviation of "picture element" and is supposed to be the smallest displayable unit. However, in fact, on LCD screens, individual pixels are further split in red, green, and blue subpixels, as you can easily see with a magnifying glass (or, e.g., a drop of rain on your smartphone's screen). This fact is commonly used to improve the quality of the display to something finer than what the resolution would suggest, especially for text; this is known as subpixel rendering

  3. It is typically quite annoying to see screens advertised as "Full HD" as if it were some magical statement about the quality of the screen, whereas it is nothing more than a technical statement about the number of pixels it can display. 

  4. In particular, what Apple brands as "retina displays" are displays with a certain high pixel density, which supposedly exceeds the human eye's own density for a certain typical viewing distance. Nothing magical here either. 

  5. The refresh rate is the number of times per second the screen should update, in Hertz. You can select it using --rate. I will not talk about it further, because it usually does not matter. 

Google Hash Code 2015

Today was the onsite final of the Google Hash Code contest organized by Google Paris. The task was to route loons to optimize Internet delivery on target cells by adjusting the loons' altitude and taking advantage from winds: see the problem statement in English and French and the input file, which I will not summarize further here.

With Théotime Grohens, Marc Jeanmougin, and Auguste Olivry, I was part of the "ENS Ulm 1" team who got the first place. This post presents our strategy to solve the problem. It is comparatively simpler than the 2014 contest strategies used during the extended round. We also provide cleaned-up code for our approach; it is about 200 lines long, which is twice more than what I published for the selection round; however, unlike the selection round, this code would have sufficed to reach the first place of the contest (like we did), beating the second best score (698678 by the hackEns team).

Strategy

The basic idea of the strategy is to compute the route of the loons one after the other, considering the route of the other loons as fixed, and starting with the default routes of not moving the loons at all.

To route each loon, we use a dynamic programming approach, which computes the optimal route for this loon given the other routes. We maintain (in covered in the code below) which target cells are already covered at which time by the other loons (whose routes are fixed), to focus at each point in time on the target cells which are not covered. We optimize the total number of covered cells (summed over all time units) for the loon that we are routing: we call this the score. We use the following recursive definition to compute the optimal route:

  • once the time is over, the score for a loon is 0 no matter its position
  • at a time 0t<T, the possible options for a loon at row r, column c, altitude a are to adjust altitude by -1, 0, or 1 (within acceptable bounds), and the score is the sum of the following quantities:
    • the number of cells that we cover at time t+1 in the cell where the wind has taken us after our move, counting only those that are not otherwise covered
    • the score at time t+1 for the position where the wind has taken us after our move (or 0 if the loon was lost), which we compute recursively

To make this more palatable, we have precomputed overall, for each cell, the cell where the wind leads us (in dest). We also precompute which target cells are covered by a loon in each cell (in cov; and then in left at each iteration for speed). To implement the recursive definition and compute the optimal path (in function route), we then just need to compute a big table for our loon with each possible row, cell, altitude and time unit, filling it from the end of time to the beginning of time. We store in each cell of the table the optimal score (in best), computed according to the above, and the altitude adjustment (in dir) that achieves the best score.

Will this have a reasonable running time? The bounds of the input were the following: 300 rows, 75 columns, 9 altitudes (counting the ground), and 400 time units. We need two tables, each containing one int for each possibility (so 8 bytes total). The total is 648 MB, which just fits. On my workstation (with Intel Core i5-4570 @ 3.20GHz), this computation takes about one second per loon.

Doing this for all loons in sequence gives a score of 680953 with my implementation, which would have sufficed to reach the 7th place. To improve the solution further, we follow a very simple scheme. As we have computed the optimal route for each loon in isolation, once all routes have been computed, we just recompute the routes for each loon in sequence, given the previous routes. This cannot make the solution worse (the loon can always follow its previous route, which the dynamic programming approach cannot miss), and may improve it, if the loon can follow a better route given the new routes of the other loons. As just one additional tweak, to make this optimization more efficient, when two altitude adjustments are tied in the dynamic algorithm, we choose one of them at random rather than deterministically: this means that the dynamic algorithm still returns the optimal route for its loon, but it adds more opportunities to change the route and makes the iterative optimization more effective.

Code

The following C++ code is under 200 lines in total, and produces solutions with increasingly better score, in a file called "output". On my workstation, it takes a bit under 20 minutes to produce a solution with a score that would have sufficed to reach the first place in the contest.

#include <algorithm>
#include <cstdio>
#include <vector>

using namespace std;

#define MAXR 77
#define MAXC 302
#define MAXL 2252
#define MAXA 11
#define MAXT 402
#define MAXB 54

int R, C, A, L, V, B, T, rs, cs;

struct Pt {
  int r, c;
  Pt(int r=0, int c=0) : r(r), c(c) {}
};

// global variables: auto-initalized, can be larger than what fits in stack
vector<int> cov[MAXR][MAXC]; // which targets are covered by a loon at (r, c)
Pt dest[MAXA][MAXR][MAXC]; // where a loon at (a, r, c) flies; (-1, -1) = out
Pt target[MAXL]; // position of target cells
bool covered[MAXT][MAXL]; // at time t, is target cell l covered yet?
// dynamic programming: best score at (t, a, r, c) and best direction
int best[MAXT][MAXA][MAXR][MAXC], dir[MAXT][MAXA][MAXR][MAXC];
int left[MAXT][MAXR][MAXC]; // how many new cells covered at (r, c) at time t
// solution: at time t, altitude adjustment for loon b 
int solution[MAXT][MAXB];

// given by problem statement
int columndist(int c1, int c2) { return min(abs(c1 - c2), C - abs(c1 - c2)); }
// does a loon at (r, c) cover target cell l?
int iscovered(int l, int r, int c) {
  int u = target[l].r, v = target[l].c;
  return ((r - u) * (r - u) + columndist(c, v) * columndist(c, v) <= V*V);
}

// return next move for loon b at time t when at position (a, r, c)...
// ... using the results of dynamic programming
int decide_dyn(int b, int t, int a, int r, int c) { return dir[t][a][r][c]; }
// ... using the computed solution
int decide_sol(int b, int t, int a, int r, int c) { return solution[t][b]; }

int simulate(int b, int (*decide)(int, int, int, int, int)) {
  // compute the run of loon b using decision function decide
  // return the score improvement
  int score = 0;
  int r = rs, c = cs, a = 0;
  for (int t = 0; t < T; t++) {
    int bestda = (*decide)(b, t, a, r, c);
    // store the move
    solution[t][b] = bestda;
    // apply it
    a += bestda;
    Pt next = dest[a][r][c];
    r = next.r;
    c = next.c;
    if (r < 0)
      break; // loon exited
    if (a > 0) {
      // update covered target cells
      for (unsigned int i = 0; i < cov[r][c].size(); i++) {
        int l = cov[r][c][i];
        // score improved if target cell not already covered
        score += covered[t+1][l] ? 0 : 1;
        covered[t+1][l] = true;
      }
    }
  }
  return score;
}

void route(int b) {
  // route loon b given the already covered targets (in covered)

  // save a factor A: pre-count the uncovered targets for each cell
  for (int t = 0; t <= T; t++)
    for (int r = 0; r < R; r++)
      for (int c = 0; c < C; c++) {
        int score = 0;
        for (unsigned int i = 0; i < cov[r][c].size(); i++) {
          int l = cov[r][c][i];
          score += covered[t][l] ? 0 : 1;
        }
        left[t][r][c] = score;
      }

  // dynamic programming, t == T is a sentinel value
  for (int t = T-1; t >= 0; t--)
    for (int a = 0; a <= A; a++)
      for (int r = 0; r < R; r++)
        for (int c = 0; c < C; c++) {
          int bestda = 0, bestv = 0; // best altitude adjustment, best value
          for (int da = -1; da <= 1; da++) {
            if (a <= 1 && da == -1)
              continue; // can't go down when on ground, or back on ground
            if (a + da > A)
              continue; // can't go above max altitude
            // pretend we moved b by da at time t
            Pt next = dest[a + da][r][c];
            if (next.r < 0)
              break; // loon exited, so score remains 0
            // score if going at these new coordinates, computed dynamically
            int rscore = best[t+1][a+da][next.r][next.c];
            // if above ground, score increased by number of covered targets
            if ((a + da) > 0)
              rscore += left[t+1][next.r][next.c];
            // break ties at random
            if (rscore > bestv || (rscore == bestv && (rand() % 2))) {
              // da is the best altitude adjustment
              bestv = rscore;
              bestda = da;
            }
          }
          // store best altitude adjustment, best value
          best[t][a][r][c] = bestv;
          dir[t][a][r][c] = bestda;
        }
}

void print_sol() {
  FILE *f = fopen("output", "w");
  for (int t = 0; t < T; t++) {
    for (int b = 0; b < B; b++)
      fprintf(f, "%d ", solution[t][b]);
    fprintf(f, "\n");
  }
  fclose(f);
}

int main(int argc, char **argv) {
  srand(42); // make it deterministic
  scanf("%d%d%d", &R, &C, &A);
  scanf("%d%d%d%d", &L, &V, &B, &T);
  scanf("%d%d", &rs, &cs);
  for (int l = 0; l < L; l++) {
    int r, c;
    scanf("%d%d", &r, &c);
    target[l] = Pt(r, c);
  }
  for (int r = 0; r < R; r++)
    for (int c = 0; c < C; c++) {
    }
  for (int a = 0; a <= A; a++)
    for (int r = 0; r < R; r++)
      for (int c = 0; c < C; c++)
        if (!a) {
          // wind at altitude 0 does not move loons
          dest[a][r][c] = Pt(r, c);
        } else {
          int dr, dc;
          scanf("%d%d", &dr, &dc);
          // populate dest with the coordinates where the wind takes the loon
          int destr, destc;
          destr = r + dr;
          destc = (c + dc + C) % C;
          if (destr < 0 || destr >= R)
            destr = destc = -1; // loon goes out
          dest[a][r][c] = Pt(destr, destc);
        }

  // compute for each cell the list of targets covered by a loon at this cell
  for (int r = 0; r < R; r++)
    for (int c = 0; c < C; c++)
      for (int l = 0; l < L; l++)
        if (iscovered(l, r, c))
          cov[r][c].push_back(l);

  int orig_score = 0;
  // iteratively compute a route for each loon given other routes
  while (true) {
    for (int br = 0; br < B; br++) {
      // route br given the routes of the other loons
      // first, forget about which cells are covered
      for (int t = 0; t <= T; t++)
        for (int l = 0; l < L; l++)
          covered[t][l] = false;
      // then, simulate the other loons (default routes are 0 ... 0)
      int score = 0;
      for (int b = 0; b < B; b++)
        if (b != br)
          score += simulate(b, &decide_sol);
      // now, compute the route of br with dynamic programming
      route(br);
      score += simulate(br, &decide_dyn);
      fprintf(stderr, "route %d\n", br);
      if (score > orig_score) {
        fprintf(stderr, "score was %d is %d\n", orig_score, score);
        orig_score = score;
        print_sol();
      }
    }
  }

  return 0;
}

Extensions

The above code gets stuck fairly quickly. After reaching 698684 in about 17 minutes on my machine, it painfully and progressively goes up to 698734 in about 30 more minutes, and then stalls and does not seem to improve further (I left it running for about 30 additional minutes). Of course, this may vary given your choice of RNG seed (I hardcoded one in the code for reproducibility), and indeed for some unlucky seeds (one out of 5 that I tested) it seems that the code fails to reach a winning score. In any case, our final score of 700913 was achieved with a slightly more elaborate approach. If anyone is brave enough to want to have a look at our actual code, it is here.

We used the following additional improvement: whenever iterating over loons no longer makes progress, start trying to reroute two loons (or more than two) instead of just one. When doing this, unlike when rerouting only one loon, the solution score may decrease, so the code must rollback to the previous solution if the solution got worse. (If the score is still the same but the solution is different, however, it is good to keep the change, so that further optimizations with just one loon can be more effective.) We alternated in parallel optimization attempts with a single loon and attempts with two or three loons that did not make things worse. To be precise, we also tweaked solutions twice, by moving as many as 4 loons, at the expense of a slight drop in quality, to escape local optima and reach better solutions.

All of this was very hacky and we did not have a clean way to run many explorations in parallel on multiple machines, passing improved solutions to other running instances as they were found, some being more adventurous and others more conservative, and hopefully following what is known about metaheuristics. I trust that such approaches will be explored to unimaginable depths during the extended online contest. ;) I don't know if a more clever approach could yield substantial score improvements (rather than incremental ones); this may also turn up during the extended contest.

In any case, we had lots of fun during the contest, and we are very grateful to the organizers for making it happen. :)

Noms et verbes français qui n'existent qu'avec des préfixes

This post is about French, so I only write it in French.

Un problème amusant (merci, Valentine !) : quels sont les verbes et noms français qui existent avec différents préfixes, mais n'existent pas sans le préfixe ?

Par exemple, on peut déférer, référer, préférer, conférer, inférer, proférer, et même transférer, mais on ne peut jamais férer. On peut déduire, réduire, enduire, conduire, induire, et produire, mais pas duire. On peut révoquer, évoquer, convoquer, invoquer, et provoquer, mais pas voquer. Si on sait ce qu'est une déjection, une injection, une surjection, une projection, une bijection, et une interjection, qu'est-ce qu'une jection ? Une révolution, une dévolution, une involution, une convolution ; mais une volution ?

Ce billet cherche à trouver de façon automatique d'autres exemples de ce phénomène. Si seuls les résultats finaux vous intéressent, voici les 100 premières formes nominales et verbales trouvées, assorties des formes préfixées qui existent réellement. Bien sûr, il y a des choses inintéressantes et incorrectes dedans, parce que c'est généré automatiquement.

Voici les verbes :

férer:       déférer       référer       préférer      conférer      inférer    proférer  transférer
duire:       déduire       réduire       enduire       conduire      induire    produire
crire:       décrire       récrire       écrire        transcrire    redécrire
voquer:      révoquer      évoquer       convoquer     invoquer      provoquer
noncer:      dénoncer      renoncer      énoncer       prononcer
sister:      désister      résister      consister     insister
ouler:       rouler        couler        bouler        fouler        remouler
ter:         enter         conter        coter         rater         acter
puter:       députer       réputer       computer      disputer
spirer:      respirer      aspirer       conspirer     inspirer
osser:       désosser      rosser        cosser        bosser
stituer:     restituer     constituer    instituer     prostituer
clure:       reclure       conclure      inclure       exclure
cevoir:      décevoir      recevoir      concevoir
gresser:     régresser     agresser      progresser    transgresser
treindre:    retreindre    rétreindre    étreindre
iser:        riser         coniser       biser         remiser
gurgiter:    dégurgiter    régurgiter    ingurgiter
charner:     décharner     écharner      acharner
purer:       dépurer       épurer        apurer
vier:        dévier        envier        convier
voyer:       dévoyer       envoyer       convoyer
baucher:     débaucher     ébaucher      embaucher
iller:       ailler        ciller        piller        railler
arder:       darder        carder        barder        farder
tamer:       rétamer       entamer       étamer
soler:       désoler       consoler      insoler
iler:        piler         biler         filer         exiler
valer:       dévaler       avaler        ravaler
onder:       inonder       bonder        fonder        exonder
iger:        piger         figer         transiger     exiger
gir:         régir         surgir        agir
mailloter:   démailloter   emmailloter   remmailloter
mancher:     démancher     emmancher     remmancher
boîter:      déboîter      emboîter      remboîter
poter:       dépoter       empoter       rempoter
barquer:     débarquer     embarquer     rembarquer
paqueter:    dépaqueter    empaqueter    rempaqueter
créter:      décréter      concréter     excréter
ssortir:     ressortir     assortir      rassortir
ssembler:    ressembler    assembler     rassembler
sumer:       résumer       présumer      consumer
server:      réserver      préserver     conserver
specter:     respecter     inspecter     prospecter
irer:        désirer       airer         cirer
sulter:      résulter      consulter     insulter
membrer:     démembrer     remembrer
oter:        roter         doter         coter
amer:        ramer         damer         camer
tribuer:     rétribuer     contribuer    distribuer
ocher:       rocher        cocher        pocher
outer:       router        douter        bouter
aguer:       raguer        daguer        baguer
aliser:      réaliser      coaliser      baliser
criminer:    récriminer    incriminer    discriminer
ire:         rire          dire          raire
incer:       rincer        pincer        coincer
aser:        raser         caser         baser
pliquer:     répliquer     compliquer    expliquer
ister:       désister      pister        exister
ayer:        rayer         payer         bayer
endre:       rendre        pendre        fendre
uer:         ruer          puer          remuer
coler:       récoler       racoler       accoler
cuser:       récuser       excuser       accuser
ifier:       déifier       réifier
vertir:      avertir       convertir     invertir
piner:       épiner        copiner       rapiner
pier:        épier         copier        expier
iner:        suriner       biner         rainer
volvériser:  revolvériser  révolvériser
champir:     rechampir     réchampir
fréner:      refréner      réfréner
véler:       revéler       révéler
oucher:      doucher       coucher       boucher
ciser:       préciser      inciser       exciser
courager:    décourager    encourager
caisser:     décaisser     encaisser
visager:     dévisager     envisager
tourer:      détourer      entourer
dommager:    dédommager    endommager
grosser:     dégrosser     engrosser
crasser:     décrasser     encrasser
clencher:    déclencher    enclencher
velopper:    développer    envelopper
fourner:     défourner     enfourner
verguer:     déverguer     enverguer
tartrer:     détartrer     entartrer
gourdir:     dégourdir     engourdir
gluer:       dégluer       engluer
asser:       casser        passer        coasser
cimer:       décimer       écimer
pouiller:    dépouiller    épouiller
valuer:      dévaluer      évaluer
ller:        aller         coller        raller
anner:       canner        panner        banner
aver:        caver         paver         baver
aner:        caner         paner         faner
eindre:      ceindre       peindre       feindre
uver:        cuver         couver        prouver

Et les noms :

duction:      déduction      réduction      induction       enduction       conduction     production    diduction     transduction
quet:         coquet         caquet         biquet          baquet          paquet         triquet       parquet       taquet        piquet
jection:      déjection      rejection      injection       surjection      projection     bijection     interjection
férence:      déférence      référence      inférence       conférence      préférence     interférence
ception:      déception      réception      conception      interception    exception      perception
chet:         déchet         cochet         cachet          parchet         archet         pichet
tention:      détention      rétention      intention       contention      prétention
fection:      défection      réfection      infection       confection      perfection
cision:       décision       incision       concision       précision       excision
putation:     députation     réputation     imputation      computation     disputation
céphale:      encéphale      bicéphale      polycéphale     tricéphale      microcéphale   hydrocéphale
quette:       coquette       raquette       biquette        maquette        disquette      piquette
quetage:      coquetage      caquetage      baquetage       paquetage       parquetage     piquetage
ille:         caille         baille         paille          maille          taille         arille
ssement:      cassement      passement      massement       trissement      tassement      pissement
ducteur:      réducteur      inducteur      conducteur      producteur      transducteur
sseur:        casseur        masseur        passeur         tasseur         pisseur        chasseur
ssage:        cassage        massage        passage         tassage         pissage        chassage
posant:       déposant       proposant      composant       disposant       exposant
gression:     régression     ingression     progression     digression      transgression
venance:      survenance     convenance     prévenance      provenance      contrevenance
cepteur:      récepteur      concepteur     précepteur      intercepteur    percepteur
volution:     dévolution     révolution     involution      convolution
tage:         cotage         contage        ratage          matage          partage
lation:       délation       relation       dilation        translation
clamation:    déclamation    réclamation    proclamation    exclamation
scrit:        rescrit        inscrit        conscrit        proscrit
stitution:    restitution    institution    constitution    prostitution
scription:    rescription    inscription    conscription    proscription
chage:        cochage        cachage        trichage        tachage         perchage
sque:         casque         bisque         disque          basque          masque
logue:        prologue       autologue      radiologue      trilogue        hydrologue
nche:         canche         ranche         banche          manche          tanche
illage:       caillage       maillage       paillage        entreillage     taillage
métrie:       télémétrie     radiométrie    micrométrie     photométrie     hydrométrie
spiration:    respiration    inspiration    conspiration    perspiration
quage:        caquage        parquage       taquage         arquage         piquage
clusion:      réclusion      inclusion      conclusion      exclusion
sistance:     résistance     insistance     consistance     persistance
potement:     dépotement     empotement     tripotement     tapotement
portation:    déportation    importation    exportation     transportation
ction:        rection        coction        diction         paction
posante:      déposante      composante     disposante      exposante
vention:      invention      convention     prévention      intervention
hibition:     inhibition     cohibition     prohibition     exhibition
clinaison:    déclinaison    reclinaison    inclinaison
plication:    réplication    implication    complication    explication
collement:    décollement    recollement    encollement
tine:         rétine         ratine         patine          matine
gurgitation:  dégurgitation  régurgitation  ingurgitation
noncement:    dénoncement    renoncement    prononcement
nonciation:   dénonciation   renonciation   prononciation
ture:         enture         rature         biture          triture
posé:         imposé         préposé        composé         exposé
niement:      déniement      reniement      maniement
calement:     décalement     recalement     intercalement
sane:         insane         basane         persane         pisane
posement:     déposement     reposement     entreposement
solation:     désolation     insolation     consolation
bordement:    débordement    rebordement    transbordement
rdon:         cordon         cardon         pardon          chardon
bine:         cabine         bibine         combine         babine
vage:         cavage         ravage         bavage          pavage
llage:        collage        billage        tallage         pillage
sultante:     resultante     résultante     consultante
ngue:         cangue         dingue         mangue          tangue
illeuse:      railleuse      bailleuse      pailleuse       tailleuse
illeur:       railleur       bailleur       pailleur        tailleur
spirateur:    respirateur    inspirateur    conspirateur
tiste:        protiste       batiste        artiste         altiste
crément:      décrément      incrément      excrément
testation:    détestation    contestation   protestation
èdre:         polyèdre       trièdre        parèdre         exèdre
ssette:       cassette       massette       tassette        pissette
pier:         papier         polypier       tripier         pipier
lèvement:     relèvement     enlèvement     prélèvement
lier:         calier         palier         perlier         pilier
logie:        trilogie       radiologie     électrologie    hydrologie
sseuse:       casseuse       masseuse       pisseuse        chasseuse
nière:        panière        manière        tanière         pinière
lage:         calage         parlage        perlage         pilage
vasement:     dévasement     envasement     transvasement
primé:        déprimé        imprimé        comprimé
pette:        tripette       tapette        arpette         pipette
pilation:     dépilation     empilation     compilation
cernement:    décernement    concernement   discernement
pense:        dépense        impense        dispense
ploration:    déploration    imploration    exploration
hérence:      inhérence      déshérence     cohérence
valement:     dévalement     cavalement     ravalement
quillage:     requillage     coquillage     maquillage
crimination:  récrimination  incrimination  discrimination
quin:         requin         coquin         taquin
servation:    réservation    conservation   préservation
quisition:    réquisition    inquisition    perquisition
formé:        réformé        informé        transformé
sonance:      résonance      consonance     dissonance
tribution:    rétribution    contribution   distribution
sonnance:     résonnance     consonnance    dissonnance
paration:     réparation     préparation    disparation

Des détails, pour ceux que ça amuse. La liste de mots utilisée est Lefff, d'où je ne garde, soit que les verbes à l'infinitif, soit que les noms communs au singulier. Il y a 7817 infinitifs et 41680 noms.

La première question qu'il faut élucider, c'est de déterminer ce qu'est un préfixe. Pour cela, j'utilise la méthode suivante : pour chaque mot de la liste, pour chaque façon de le couper en un préfixe et un radical (c'est-à-dire toutes les longueurs possibles de préfixe) qui soit non-triviale (c'est-à-dire que radical et préfixe ne sont pas vides), si le radical est également un mot de la liste, alors cela compte comme une preuve que le préfixe considéré est effectivement un préfixe. Je trie les préfixes par nombre décroissant d'occurrences en ce sens, et je garde les meilleurs avec leur score.

Le code est ici. Pour les verbes, je garde les 30 meilleurs ; pour les noms, les 60 meilleurs, mais j'exclus ensuite ceux qui font une seule lettre, car il y a trop de bruit sinon. Voici les préfixes obtenus et leur nombre d'occurrences, pour les verbes :

dé     542
re     387
ré     166
dés    139
r      120
en     97
sur    97
é      87
auto   60
pré    58
a      49
d      44
em     43
con    39
c      37
p      32
in     30
co     28
com    26
b      23
f      22
pro    21
dis    20
ra     20
entre  20
trans  19
ex     19
redé   19
ac     18
rem    18

Et pour les noms :

dé       651
re       394
ré       223
in       195
dés      166
sur      150
en       147
co       105
con      91
im       90
pré      88
pro      81
em       81
télé     77
sous-    63
contre-  61
ca       59
anti     56
ra       55
auto     53
bi       52
com      52
di       50
ba       49
ma       48
pa       48
poly     47
radio    46
demi-    46
inter    46
tri      46
dis      44
contre   43
par      42
entre    41
électro  40
micro    40
ex       39
ta       38
ar       37
photo    36
hydro    35
trans    34
per      34
pi       32
cha      31
al       31

Ensuite, étant donnée cette liste pondérée de préfixes, je regarde chaque mot du dictionnaire et chaque préfixe. Si on peut retirer ce préfixe au mot, et que le résultat n'existe pas, alors c'est une forme manquante : son poids est la somme du poids de tous les préfixes tels quel la forme avec préfixe existe, mais que j'élève à un exposant (0.2, arbitrairement choisi) pour donner un meilleur score aux situations où de nombreux préfixes sont possibles.

Le code est ici. Pour les noms, j'élimine les propositions de longueur 3 ou moins. L'espacement final est réalisé avec column -t. Le fichier principal qui fait tout est duire.sh ; vous pouvez l'utiliser pour calculer davantage de formes que celles que j'inclus ici. Certains des fichiers intermédiaires sont également inclus dans le dépôt, par commodité.

Comme améliorations supplémentaires possibles, on pourrait appliquer des filtres plus sérieux pour identifier les propositions intéressantes : exiger notamment que le début du mot, par exemple les trois premières lettres, existe déjà dans un mot français ; peut-être filtrer les formes attestées avec préfixe pour vérifier que la prononciation correspond bien. On pourrait aussi imaginer de faire le même travail en tenant compte de la fréquence des mots : pour l'identification des préfixes et des formes manquantes, le vote de tous les mots se valent, mais ce n'est pas réaliste vu que certains mots sont beaucoup plus fréquents que d'autres ; et une forme sans préfixe qui existe mais est beaucoup plus rare que les formes avec préfixe qui existent, cela pourrait aussi être une réponse intéressante.

On peut s'interroger sur la raison pour laquelle certains de ces mots sont manquants. Je reprends ici un commentaire qui m'a été adressé par Maxime (merci à lui) pour décrire certaines situations possibles :

  • Pour certains mots, la racine fut un jour un mot à part entière et elle a cessé de l'être, j'imagine par affaiblissement de sens. Ainsi de "duire" (du latin "ducere", qu'on doit souvent traduire par "conduire"), de "férer" ("ferre", "porter") ou de "voquer" ("vocare", "appeler").

  • Pour d'autres, la racine est toujours un mot à part entière mais elle a subi des modifications et n'est plus perçue comme tel (par les locuteurs ou par l'algorithme). Ainsi "écrire" ne se décompose pas en "é+crire" mais est bien une racine en soi : le "s" initial de "scribere" a pu devenir (ou fusionner avec) une voyelle en début de mot ou derrière une voyelle ("écrire", "décrire") ou survivre après une consonne ("transcrire"). Plus fourbe, "régir" est un mot en soi ; avec le préfixe "sub-", il a donné "surgir" (la transformation "sub+rego" -> "surrego" -> "surgo" avait déjà eu lieu en latin). Quant à "agir", il semble venir d'une autre racine.

  • Parfois, la racine existe encore, mais pas dans la même catégorie grammaticale. Ainsi les verbes "dévaler", "avaler", "ravaler", sur le substantif "val", ou "contribuer", "rétribuer", sur "tribu" (mais ceux-ci entrent aussi dans la première catégorie de cette liste : le verbe "tribuere" existait en latin).

  • Évidemment, il y a dans la liste de simples ressemblances orthographiques ("damer", "ramer"), parfois liées à la suffixation ("passage", "massage", "tassage"), parfois à des résultats identiques partant de racines distinctes ("ratine", "rétine" vs "matine").