a3nm's blog

Cruising at -41.8 million feet: Antipodal airports

In the original version of my previous entry about the most distant airports by minimal number of connections, I made an offhand remark about the alternative problem of finding the two most geographically distant airports. I originally answered it by copying a random answer to that question from the Web without thinking much about it, but after stumbling upon a different answer, I started having doubts. After more investigation, it appears that the Web has no clue what It is doing, and that these answers were erroneous.

Having recomputed everything myself from the OpenFlights dataset, I believe that the two most geographically distant airports in the world are the following, with a distance of 20002 km:

Dataset and preparation

OpenFlights gives you a (slightly noisy) dataset with airport codes, latitudes, longitudes, and altitudes. As I am just interested in the one most distant airport pair, and not in the complete rankings, I just had to clean up the one bad offender that I saw, namely, the Budapest Keleti station and its mysterious antipode double: it is probably a dataset bug, and anyway this is a train station so it has limited aerial value. See the whole script for details about the preprocessing.

The silly method: distances on cylinders, and equirectangular projections

The latitude is a decimal value between -90 and 90 that describes on which parallel a point is located: 0 is the equator, 90 is the Geographic South Pole, and -90 is the Geographic North Pole. The longitude is a decimal value between -180 and 180 that describes on which meridian a point is located: 0 is the IERS Reference Meridian, positive values go eastwards, and negative values go westwards.

What's the dumbest way to find out, given a bunch of such latitude/longitude coordinates, which points seem to be exact opposites on Earth? Well, just compute the latitude and longitude of the antipodal point, namely, the opposite point on Earth relative to the center of the Earth. This is fairly easy:

  • given a latitude \(\phi\), the antipode has latitude \(-\phi\) (it is opposite with respect to the equator);
  • given a longitude \(\lambda\), the antipode has longitude \(180+\lambda\) (i.e.,you go around the globe for half a turn) except you have to bring it back to \([-180, 180]\) by adding or subtracting 360.

Now, you estimate the distance of the second point to the antipode of the first point in the crudest possible fashion: just pretend that the latitude and longitude are two-dimensional coordinates and use the Euclidean distance, where the square brackets \([\cdot]_{-180, 180}\) denote the operation of going back to \([-180, 180]\).

\(d_{\text{e}}((\phi_1, \lambda_1), (\phi_2, \lambda_2)) = \sqrt{(\phi_1 + \phi_2)^2 + [\lambda_1 - \lambda_2 - 180]_{-180, 180}^2}\)

Of course, this distance estimation is wildly inaccurate. In geographical terms, it amounts to computing distances directly on the equirectangular projection of Earth. This is inaccurate, because the Earth is not flat, and the distortion in distances depends on latitudes and longitudes. In particular, distances near the poles are wildly overestimated. Yet, my friend Mc, when he suggested this crude approach to me, claimed that this would probably not matter much, because the distances in consideration for nearly antipodal airports are small and there aren't many airports near the poles. As we will see, he was right.

To perform this computation, I simply evaluated it on all pairs with a simple C++ program, that completes in seconds. To get a distance estimation from this, I subtract the distance between the antipode and second point to the distance of going around the Earth obtained from the Earth's mean radius. The best pairs are here, sorted with more distant pairs at the bottom (i.e., the most antipodal, so the most interesting). Hence, according to this method, the best pair is the following:

I did not check that the other results corresponded to anything sensible (and not, e.g., train stations that don't really exist), so take them with a grain of salt. For NVA-PLM, however, you can check from online sources that these airports indeed exist at the given coordinates.

The sensible method: distances on spheres, and haversines

Can we do better? Well, let us notice that the Earth is not a cylinder but a sphere, with mean radius of 6,371.009 kilometers according to the IUGG. Given two airports, what interests us is the length of the shortest route from one to the other on the sphere, which is known as the great-circle distance: the name is because the route from one point to the other will follow a great circle, a circle on the sphere whose center is that of the sphere. The standard way to do so is to use the Haversine formula. So I just plugged it in the previous code.

The results are here, and we can see that Mc was right: they don't change much. In particular, the top pair is still the same, though our distance estimate changed by 3 kilometers, i.e., quite a lot compared to the distance to the second best pair. The first difference is at position 6, where the 6th best pair and the 8th best pair are swapped. The result file contains all 3435 pairs estimated to be at a distance of at least 19800 km by the Haversine formula, with the computed distances differing from up to 107 km compared to the results of the previous section on these pairs.

The serious method: distances on ellipsoids, and WGS-84

Do you recall anything from school about the Earth not being a perfect sphere, but being a bit rounder at the Equator? How bad is this? Well, the radius varies by about 33 km; it's small compared to the overall radius, but huge relative to the distance differences between our top airport pairs. So we have to be more precise to get a more definite answer.

Fortunately, serious people already know how to deal with this problem, and model the Earth, not as a sphere, but as an ellipsoid, more precisely an oblate spheroid. An ellipsoid is less scary than it sounds: it is just what you get by rotating an ellipse around the North-South pole axis, rather than a circle (in which case you get a sphere). The most common such model seems to be the reference ellipsoid of WGS-84, the World Geodetic System used by GPS. I wouldn't want to implement geodesic calculations on ellipsoids, but fortunately other people have done it before: I used GeographicLib, CLI bindings of which are packaged for Debian as geographiclib-tools. For instance, the GeodSolve -i tool eats latitude/longitude pairs as input and produces as output two azimuths (that we don't care about) and more importantly the distance of the shortest path from one point to the other.

I wasn't sure about the feasibility of feeding all 32 million airport pairs to this tool to perform complicated computations, so I restricted the study to the 3435 top pairs given in the previous output. This is more than sufficient to be sure not to miss anything: the difference in radii is < 33 km and the margin to the best is > 200 km, which is clearly more than the maximal error we could have made. The computation on these 3435 pairs is essentially instantaneous: the results are here.

Observe that the top pair changed to the one given at the beginning of this post, with its longest distance estimate decreased by 8 km. Overall, the error between the Haversine and ellipsoid estimates is at most 23 km on these 3435 pairs. Of course, discriminating the first and second best pair, with only about 700 m difference, is a bit problematic, as the "position" of an airport is hard to define (and 700 m is less than the length of a runway...). So I picked the first one as a winner, but we start hitting the limits of the definition of our problem.

The surreal method: accounting for altitude

Having reach the limit, let us go even further! The Earth is not flat, is not a sphere, but it is not a perfect ellipsoid either. Notably, it has local variations in radius, a.k.a. mountains. What if the altitude of the airports changed the ranking? Two airports with a large altitude difference could be more distant than a more antipodal pair where both airports are at sea level, because of the need to cover the altitude difference. Further, even if two airports have the same altitude, the distances between both are greater if the altitude is high, because you will travel further from the ellipsoid center. This is not entirely negligible as there are airports with an elevation of 4-5 km, which is larger than our 700 m margin between the top pair and the second one.

Of course, thinking about it will make you realize that we cannot conceivably account for this. If you start taking precise altitude into account, the shortest path between two points may be twisted because of the need to avoid mountains. Further, this question is totally disconnected from reality: commercial flights going from a point to another first ascend to a cruise altitude of about 11-12 km, and then descend to the target, and of course this contributes to the distance travelled, and means that accounting for airport altitude would not be doable.

As a purely theoretical exercise, though, let me continue to model the Earth as the WGS-84 ellipsoid, with no mountains, and let us position the airports at their latitude/longitude and with an altitude that elevates them away from the spheroid's surface. Now, we must bound the length of the shortest path between two given airports. Clearly, the shortest path with altitude is at least the length of the path at sea level, which we computed before. Further, it is at most the sum of both altitudes plus the sea level path, because one possible path is to go down to sea level, travel, and go up again, so this path must be longer than the shortest path. I computed this dumb upper bound, and it seems like, for our 3435 pairs, the bound suffices to show that the path with altitude is never longer that that of the top pair. So taking altitude into account cannot imply that other pairs will beat the PGK-LMC pair.

The subterranean method (bonus)

In all this post I have made the implicit assumption that we are going from an airport to the other while remaining above sea level. However, the literally minded will note that when I talked about the "most distant" airport pair, I never said anything about having to go around the big spherical, er, ellipsoidal, obstacle that bars the way. So, neglecting the existence of planet Earth, what are the two most distant airports by straight line distance in three-dimensional space, going straight from one to the other?

To answer this pressing question, we can again rely on GeographicLib, whose CartConvert program takes latitude, longitude (and even altitude!) and produces Geocentric coordinates, i.e., coordinates in a three-dimensional Cartesian coordinate system. The straight line distance is then computed with the usual Euclidean distance formula:

\(d_{\text{e2}}((x_1, y_1, z_1), (x_2, y_2, z_2)) = \sqrt{(x_1 - x_2)^2 + (y_1 + y_2)^2 + (z_1 - z_2)^2}\)

Computing this only for the restricted pair subset of the previous sections, we obtain this. Amusingly, the most distant pair by this criterion is NVA-PLM, the same one as with our initial method (which also used Euclidean distance but in two-dimensional space). As far as I can tell, this is entirely coincidental. Note that the distance of 12757675 km is slightly more than the average diameter of Earth (though it is less than the maximal diameter, of course): this is explained by the fact that these airports are close to the equator.

Confidence in the result

I now conclude by giving more information about whether my results can be trusted to be accurate.

The GPS coordinates for LMC and PGK in OpenFlights are essentially the same as those given in Wikipedia and are both, according to OpenStreetMap, at some position on each of these airport's main runway (LMC, PGK), which is further confirmed by the satellite imagery of Google Maps (LMC, PGK).

The fact that the distance between these two points is indeed 20001571.135 meters was computed directly by GeographicLib. However, here I must point out that this result seems to disagree with many websites that offer such calculations. This is probably caused by the fact that nearly antipodal points are corner cases for geodesic computations on ellipsoids.

It is conceivable that the sources that overestimate the distance (except Great Circle Mapper, where the difference is too high, and otherwise explained) are using a geoid model, to account for variations in radius beyond what the WGS-84 ellipsoid accounts for. This is beyond the scope of what I did.

That this pair is indeed the most distant depends on whether positions for other airports in OpenFlights are accurate, and whether this source is exhaustive, i.e., contains all airports. I did not really try to check any of these two points.

It's a big world after all: Computing the diameter of the flights graph

What are the two most distant airports on Earth? A natural definition is by geographical distance: see my followup post for this definition. However, an alternative question asks what are the two most distant airports in terms of the number of different flights required to go from one to the other. For the most geographically distant pair, LMC and PGK, the answer appears to be 5 flights, because you must go to BOG, then fly through e.g. FRA and BKK to CGK, and then to PGK. Are there other pairs where most flights are needed?

The only relevant reference I found is this thread. Quora has the awful habit of blocking the content to unregistred users, so I will summarize it: at the time of writing, the only answer estimates that there should be no airport pair requiring more than 5 flights, as you should always be able to fly via a regional hub, a global hub, a global hub, and a regional hub. This post tries to give a more definite answer; as we will show, that intuition is incorrect, and up to eight flights can be needed.

Dataset

While any form of precise timing or pricing information about flights is hard to obtain, one would think that a public database of flights with flight number, company identifier, origin airport, and destination airport, should not be hard to come by. This is not the case: the only public structured database I could find is the OpenFlights route database, which is apparently derived from the one distributed with the Airline Route Mapper, a mysterious proprietary Windows program with no contact information other than a FlyerTalk thread.1 The dataset does not include the flight number, which we will see is extremely problematic.

We can start cleaning up the routes by removing codeshare flights and flights by airlines that the OpenFlights airline database lists as defunct. For flights with an airline unknown to OpenFlights, we generate fresh airline identifiers instead. See the exact code.

Computation

Now we just compute the diameter of this graph, i.e., the maximum over all airport pairs of the length of the shortest path from one airport to the other, using a straightforward C++ program that implements Floyd-Warshall with path reconstruction. This runs in cubic time in the number of airports (around 3100), taking about 20 seconds on my machine.

The problem is that OpenFlights is not perfect, also because of the fuzzy definition of what a flight is. Indeed, the naive result gives the following paths of length 15:

15: XEQ -> YPO: XEQ QUV QFN JNN JJU UAK GOH KEF YYZ YQT YTS YMO YFA ZKE YAT YPO 
15: YPO -> QFN: YPO YAT ZKE YFA YMO YTS YTZ IAD KEF GOH UAK JJU JNN XEQ QUV QFN

However, investigating on the website of Air Greenland2, Tasiusaq Heliport and Nanortalik Heliport are connected by a direct flight (GL9402), with the intermediate stops listed as "technical stops". It a bit problematic to call those changes of flights, as you can probably just remain in the plane. This could be accounted for if only we had the flight number in the dataset, but sadly we do not have it. Likewise, Thunder Airlines seems to be selling the flight from Timmins to Peawanauck as one single flight (flight 105).

So I tried to fix these problems until I found an example that I liked, which required an unbelieve amount of manual intervention. I added a bunch of missing flights by considering stopovers with the same flight number as being free. I had to merge a few airports because some complicated routes were jumping through hoops to get from the domestic airport of a city to its international airport: for instance, to go from Sde Dov to Ben Gurion by plane only, you need to fly to Eilat and back. Last, I had to eliminate many airports for which no flights could actually be bought online from mainstream providers. I only retained flights featured either on Google Flights3 or on Skyscanner. This extremely tedious iterative process took me several hours and I don't know why I did it, but I did, hopefully without introducing grave errors. All paths from length 15 down to length 9 turned out to be spurious according to my criteria.

An eight-flight result

Here is the result for eight fligths, after sufficiently refining the dataset to rule out all longer paths as they were erroneous. Now, the vast majority of these paths are probably crap, but here is an example of one that seems correct to me:

8: AUY -> IOQ: AUY TAH VLI SYD LHR KEF KUS CNP OBY

Indeed, Anatom Airport in Vanuatu seems to be served only by a flight to Tanna, from which it is only useful to connect to Port Vila; Air Vanuatu sells them as separate flights (NF253 and NF239) with different planes, and you can buy the tickets from any online source (beware, however, it seems that the flights only run on Saturdays and Tuesdays). From Port Vila you can fly only to Oceanian hubs, for instance Sydney, so you need three hops in total to get there.

Conversely, to go to Ittoqqortoormiit Heliport in Greenland, the only way is from Nerlerit Inaat Airport (and only on Wednesdays). To go there, no matter what the Wikipedia page says, it seems that there are only two reasonable ways: with Air Greenland from Kulusuk on Wednesday early afternoon (in time for your connection), which you can reach with Air Iceland (in the earlier morning, with a tighter connection) from the domestic Reykjavík Airport; or directly with Air Iceland (and on Wednesdays, but too late to connect) but only from Akureyi in Iceland, which you can then reach from Reykjavík on the same day (the two legs are flown by the exact same plane, TF-JMT, but they are sold with different flight numbers, NY-112 and NY-511). So that's three hops altogether from Reykjavík in either case. (Alternate options are to arrive from Nuuk, which is e.g. reachable from Copenhagen, but they are not competitive.) The path via Kulusuk can be bought easily from many online sources (but not Google Flights), though you may have to buy separately the Air Iceland and Air Greenland parts.

Now, to go from Sydney to Reykjavík, I claim that you need at least one stop. This is clear4 because a direct flight would beat, by far, the current longest scheduled flights. However, some flights from Australia to Europe are marketed as a single flight even though they include a refueling stop. This is particular the case of flights BA16 (via SIN) and QF1 (via DXB). Fortunately, however, no such flight goes directly to Reykjavík, so even though you can go from Sydney to London in one "flight" with this trick, you need an additional flight to reach Reykjavík.5 Hence, following our choice of sticking with flight numbers, that's two flights.6

So we have a total of eight flights. Of course, while many websites offer the individual legs, none of them will manage to sell them simultaneously. Still, eight flights is quite a far cry from the five flights initially conjectured. Of course, the fuzziness of the problem means that this numbers does not mean that much: if you include flights that can only be bought from their own companies, charter flights, count differently stops that involve no change of aircraft, etc., your mileage may vary. In particular, the "flights" from SYD to LHR probably require you to leave the plane, whereas I'm not sure that the "change" in Akureyi would require doing this, so this illustrates a bit the absurdity of the problem if you look too closely.


  1. This leads to an awful dilution in responsibility: you cannot report errors in the data, because OpenFlights says (e.g., in this bug report) that the data comes from a third party, but that third party has no apparent way to report bugs... As of this writing it looks like the discussion is happening on FlyerTalk and there is no way yet to edit the dataset collaboratively... 

  2. By the way, how ironic is it that the website of Air Greenland is one of the neatest and fastest airline websites I ever interacted with? 

  3. I did not keep Google Flights options for which the price had to be checked on the airline website: the redirection usually points to the website's landing page and not to a specific flight, suggesting that Google probably has as little information about the flight as I do. 

  4. A previous version of this post claimed that a direct flight from SYD to KEF would have been impossible because of fuel requirements. This is incorrect, as there have been longer exceptional flights. Thanks to Mc for pointing this out. 

  5. What's slightly more surprising is that using such flights is the only solution, e.g., were it not for that marketing trick, two stops would be required from Sydney to Reykjavík. I guess that this is because KEF is not connected to anything sufficiently close to Australia or Asia that could play the role of the single stop. 

  6. If you plan to schedule a trip, the better option is then probably to leave AUY with the Saturday flights, to reach KEF on Monday evening, and then take the Wednesday flights and hopefully reach OBY on Wednesday evening. I don't know if anyone ever tried to fly from AUY to OBY, but somehow it looks unlikely. In terms of cost, if you agree to give up on the absurdly expensive "direct" flights from SYD to LHR, the price would be around 2500 EUR for a one-way trip, with around 34:30 hours spent in airplanes. 

Arbitrary-length unambiguous buffalo sentences

Consider the well-known sentence:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

As Wikipedia explains, this is a grammatically correct sentence using the ambiguity of Buffalo (the town) and buffalo (either a noun, the animal, or a verb, "to bully"). Let me illustrate the grammatical structure of the sentence, using words from the corresponding parts of speech, and using square brackets to help the parsing:

Boston deer [(that) Seattle doves annoy] pester Austin kittens.

The Wikipedia article points out that buffalo sentences can be created with any number of repetitions of the word buffalo. For instance:

Buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo buffalo buffalo.

One can easily interpret this as, e.g.,

Puppies tickle Chicago unicorns [(that) Dallas ponies [(that) squirrels mock] nag].

In other words, considering squirrels that mock Dallas ponies that nag Chicago unicorns, then there are puppies tickling such unicorns.1 For legibility we can also write this in the following way, using Nouns, Verbs, and Cities:

N V (C N (C N (N V) V))

The problem is that this sentence can equivalently be understood as:

C N (C N (C N V) V) V

In my silly rephrasing language:

Houston birds [(that) Phoenix bunnies [(that) Memphis butterflies bother] peeve] displease.

Note that I'm allowing here an intransitive use of buffalo (or "displease" in the gloss) as a verb, which I guess is OK. (Indeed, Wikipedia claims that Buffalo! is a valid imperative sentence.)

Hence, the above buffalo sentence has the unpleasant aspect that it can be parsed in many different ways. By contrast, the original one seems to have only one reasonable parse, as long as we can rely on case to distinguish Buffalo and buffalo. Hence the simple and tantalizing question: can we construct arbitrary-length buffalo sentences that are unambiguous, i.e., have a single grammatical interpretation?2

In the rest of this post I show the following, for a certain choice of grammar which I hope covers all reasonable interpretations: for any length greater than one, the (unique) case-insensitive buffalo sentence is ambiguous; by contrast, if we are case sensitive, we can build unambiguous buffalo sentences of any length, even if we have case ambiguity on the first word.

Grammar

I will use the following grammar, in simili-BNF:

S    ->   V NPO | NP V NPO
NPO  ->   NP    | C        | ε
NP   ->   NPS   | NPS PP
NPS  ->   N     | C N
PP   ->   NP V
C    ->   Buffalo
V    ->   buffalo
N    ->   buffalo

In other words:

  • A sentence is either an imperative sentence or a declarative sentence, both of which have an object noun phrase.
  • An object noun phrase can be a noun phrase, the empty sequence (corresponding to an intransitive use of the main verb) or a single Buffalo. The last case handles the case of Buffalo buffalo Buffalo understood as "Dragonflies tease New York" (i.e., its inhabitants). I only allow Buffalo to be an object, because my understanding of grammar suggests that sentence's main verb should be "buffaloes" if Buffalo were the subject.
  • A noun phrase is a simple noun phrase optionally followed by a relative proposition: we will be able to nest relatives but importantly we cannot juxtapose them.
  • A simple noun phrase is either buffalo (the noun) or Buffalo buffalo (the city followed by the noun).
  • A relative proposition has a subject noun phrase and a verb, with the object of the verb being the antecedent of the proposition.

This grammar can uniquely parse the example sentence,3 and includes a few additional features not exemplified by that sentence, which I already commented about: buffalo as an imperative and/or intransitive verb, and Buffalo used as an object.

Case-insensitive case

In the fully case-insensitive case, we write everything as buffalo. As we can show, all sentences of length greater than 1 are ambiguous.

The reason is that S -> V NP and S -> NP V are two different possible parses, and it is not hard to see that we can construct NPs of arbitrary length: we can choose any number \(n\) of nested PP in the derivation, and choose either N or C N at each GN, of which there are \(n+1\). So to realize a length of \(k\), choose to create \(n = \lfloor (k-1) / 2 \rfloor\) PP, and create \(k - n\) words in the GN (which is between \(n+1\) and \(2(n+1)\)).

As an example, for buffalo buffalo buffalo buffalo buffalo, parse buffalo buffalo buffalo buffalo as NP, e.g., as, N (C N V), and parse the sentence either as V NP or NP V.

Case-sensitive case

We now assume that we can distinguish Buffalo and buffalo, except maybe for the first word, intuitively because it must carry a capital because it starts the sentence. Let us consider small lengths as examples.

  • For length 1, the sentence Buffalo can only be interpreted as V, namely, the imperative with no complement. My grammar forbids the declarative N, a nominal sentence declaring the existence of buffaloes, but of course this is debatable.
  • For length 2, Buffalo Buffalo has exactly one interpretation because the last Buffalo must clearly be a C, so the unique parse is V C. Again, we are prohibiting the nominal sentence asserting the existence of buffalo in Buffalo.
  • For length 3, considering Buffalo Buffalo buffalo, the first Buffalo cannot be a C as otherwise the next symbol should be buffalo, and it cannot be an N as otherwise the next Buffalo buffalo must start a PP that cannot finish. Hence the first Buffalo is a verb and we have a unique interpretation: V (C N).

For larger lengths, we will first observe that if any sequence of buffalo and Buffalo can be parsed as an PP, then there is only exactly one way to do so. To see why, observe that Buffalo must always be followed by buffalo in which case we always parse both symbols as an NPS. So write b for buffalo and b2 for Buffalo buffalo, and rewrite our sentence (uniquely) as a sequence of b's and b2's. We then see that each b or b2 at the left must pop a b at the right, so that parse fails unless the number of b and b2's is even and the rightmost half contains only b's; conversely, if this condition is satisfied, there is a unique parse. So we have proven our initial observation.

Second, we claim that for any (possibly empty) sequence S, there is at most one way to parse buffalo S or Buffalo buffalo S as an NP. This is clear as we must parse buffalo or Buffalo buffalo as an NPS and the rest as an PP, and we can then use the previous reasoning. It is not so hard to see then that this claim extends to the situation where the first symbol has ambiguous case. Indeed, for any sequence S, at most one of S and buffalo S has a successful parse as a PP (because, rewriting them using b's and b2's, only one rewriting can have an even number of symbols). Hence, for Buffalo buffalo S, only one option succeeds between parsing the initial Buffalo as an N (using case ambiguity) and buffalo S as a PP, or parsing Buffalo buffalo as an NPS and S as a PP. Likewise, for buffalo S, either we parse buffalo as an N and S as a PP, or we parse buffalo as a C (using case ambiguity): in this second case the first symbol of S cannot be Buffalo, so that, writing S as buffalo S', we parse S' as a PP. By the previous reasoning only one of these two options can succeed, and if it does then it succeeds in one unique way. So we have proven our initial claim.

We now show that for any non-empty sequence S, the only way to parse S buffalo Buffalo as a sentence is to parse S as an NP. This is because a final Buffalo is necessarily an NPO. Now, as S buffalo contains at least two symbols, it cannot be V, so it must be NP V, so the buffalo is V and we must parse S as an NP. Hence, combining this with our previous claim, for any sequence S, there is at most one way to parse the following sentences:

  • Buffalo S buffalo Buffalo
  • Buffalo buffalo S buffalo Buffalo

It is now easy to define recursively the language of sequences S that have a successful parse as a PP (or as the empty sequence \(\epsilon\)), abbreviating buffalo and Buffalo as b and B. By our initial observation, such parses are then unique.

  • \(L_0 = \epsilon\)
  • \(L_n = \bigcup_{w \in L_{n-1}} \{b \, w \, b, B \, b \, w \, b\}\)

So using \(L_0\) we can construct buffalo sentences with unique parses for length 3 and 4; and using the first rule in the definition of \(L_n\) we can then reach \(k+2\) for any \(k\) that we can reach. Hence, as 3 is odd, 4 is even, and 1 and 2 were previously covered, we have shown our claim: we can construct uniquely parsable buffalo sentences of any length, even with ambiguity on the case of the first word.

To illustrate, is an example of a sentence of length 42 which we have shown to have a unique parse.4

Buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo Buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo Buffalo buffalo Buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo Buffalo.


  1. For all animal species in the sentence, it seems semantically unclear whether the individuals are existentially or universally quantified. For instance, we do not know whether the sentence applies to all such unicorns or only some of them. 

  2. By "grammatical interpretation" I mean parsing, not meaning; I am not interested in possible doubts about whether the verb buffalo means "intimidate" or "hunting buffaloes", nor do I care about which buffalo species or Buffalo city the speaker is referring to. 

  3. If the initial Buffalo is V, then buffalo is an NP, we must parse a PP, Buffalo buffalo is an NP, we must parse a PP, then either we start unstacking the V and we are left with Buffalo buffalo which we cannot parse, or we further parse buffalo as an NP and then we can never unstack enough V. If the initial Buffalo is N, then for a similar reason we cannot parse what follows as a PP (if we unstack early then we are left with the second Buffalo where the verb should be), so the second buffalo is V, Buffalo buffalo is then NP, and once again we cannot parse a PP. So the initial Buffalo is C, and buffalo is then N, and Buffalo buffalo must then be a PP. We must unstack immediately, we must parse buffalo as the verb, and we have no more choices. 

  4. Of course, this is just an example, and in general unique parse sentences are not themselves unique. We can in fact build unambiguous sequences following entirely different schemes; e.g., we can also show that Buffalo Buffalo buffalo w has a unique parse (with the initial Buffalo being necessarily a V), for any w in the language inductively defined by \(L_1 = \{B \, b \, b\}\) and \(L_n = \bigcup_{w \in F_{n-1}} \{B \, b \, B \, b \, b \, w \, b \, b \, b, B \, b \, w \, b\}\)

Configuring screens with xrandr

There are various GUI tools on Linux to change the screen resolution and configure the various screens of your machine, which is useful to do, e.g., when plugging in a video projector to give talks. From my experience, however, the right tool, the one that does not hide anything from you and allows you to get the job done, is the CLI utility xrandr.

This post is a quick guide to perform common tasks with xrandr. I'm certainly not an X guru, but maybe this can serve as an introduction about how to perform common display configuration tasks with it.

The starting point is always to see the current state with:

xrandr -q

This will print out information about all screens and outputs known to the system. For instance:

Screen 0: minimum 320 x 200, current 3360 x 1080, maximum 16384 x 16384
LVDS connected 1920x1080+1440+0 (normal left inverted right x axis y axis) 382mm x 215mm
   1920x1080     60.01*+
   1680x1050     59.95  
   1400x1050     59.98  
   1280x1024     59.89  
   1440x900      59.89  
   1280x960      59.94  
   1280x854      59.89  
   1280x800      59.81  
   1280x720      59.86  
   1152x768      59.78  
   1024x768      59.92  
   800x600       59.86  
   848x480       59.66  
   720x480       59.71  
   640x480       59.38  
DisplayPort-0 disconnected (normal left inverted right x axis y axis)
DisplayPort-1 disconnected (normal left inverted right x axis y axis)
DisplayPort-2 disconnected (normal left inverted right x axis y axis)
VGA-0 connected 1440x900+0+0 (normal left inverted right x axis y axis) 408mm x 255mm
   1440x900      59.89*+  74.98  
   1280x1024     75.02    60.02  
   1280x800      59.81  
   1152x864      75.00  
   1024x768      75.08    70.07    60.00  
   832x624       74.55  
   800x600       72.19    75.00    60.32    56.25  
   640x480       75.00    72.81    66.67    60.00  
   720x400       70.08

The first thing is to figure out which output corresponds to which of the screens you have in front of you. You can usually determine it from the output names: VGA, HDMI, and DVI usually correspond to the connector type on the motherboard. On the example output above from my work computer, VGA-0 is indeed an external screen plugged in my computer's VGA port. LVDS is usually a laptop's internal screen, which is the case in the example. Sometimes the names are off, e.g., on my home computer, a DVI port is mistakenly labeled as an HDMI port. Alternatively, you can identify screens by their modes if you know which resolutions they support. For instance, above, I know that my main screen supports 1920x1080, so it is necessarily the LVDS one. (If you don't know what a screen resolution is, read on, I'll explain it.)

If you're trying to configure an external screen or video projector, and it doesn't show up in the output of xrandr -q, then you have a problem that's deeper than just persuading X to use the output properly. You should probably investigate whether the kernel or X server can see anything (in case your output is not listed) or check the cable (in case the output is listed but indicated as disconnected). When debugging things, know that all external screen standards that I know of (VGA, DVI, HDMI) are perfectly plug-and-play, meaning that it is safe to play with the cables. However, in some cases (e.g., laptops), it could be the case that an output will not work unless the system has booted with a usable screen on that output, or unless some magic key combination is pressed to enable the output.

If all outputs that you want to use seem to be listed and have a screen connected to them, congrats! You can now configure them. The two simplest commands are the following. The first one enables one output that has a connected screen and sets it to its preferred resolution, which is usually what you want to do. The second one disables one output (any screen connected to it will usually go in power-save mode).

# enable output VGA1 and set its screen to its preferred mode
xrandr --output VGA1 --auto
# disable output VGA1
xrandr --output VGA1 --off

Now, how to decide what to put on each screen? The first possibility is to tile the display across multiple screens. This makes sense when you have a laptop and an external screen, or multiple screens on one machine, and want to show different windows on them. If you are trying to use a video projector, this is reasonable if you are using a fancy presentation tool like pdfpc that can show the slides on one screen and use your laptop screen to show the next slide, notes, etc. Here is an example to illustrate how this is configured1:

# my home setup is as follows, from left to right:
# one VGA screen, one DVI screen, one HDMI screen
xrandr --output VGA1 --left-of DVI1
xrandr --output DVI1 --left-of HDMI1

The second thing that you could want to do is to have two screens that show the same thing. For instance, for presentations using a non-fancy PDF reader, you usually want both your screen and the video projector to show your slides. This is achieved very simply:

# set VGA1 (the projector) to show the same thing as LVDS1 (laptop screen)
xrandr --output VGA1 --same-as LVDS1

However, if your projector is not set to the same resolution as your screen, your presentation will be clipped or mis-centered on one of the displays. Let me explain a bit about this. What a screen displays is made of atomic dots, aka pixels2, and the resolution of a screen is the number of pixels it displays, given as the width times the height, for instance 1024x768 (with the letter 'x' commonly used for the times symbol instead of the Unicode character '×'). Screen resolutions also have names, such as "VGA" (for 640x480), or "FHD" (for Full HD, that is3, 1920x1080). Screen resolution tells you nothing about the actual size of the display, which depends on the size of the physical pixels on the screen, aka pixel density4.

Screens can support multiple resolutions. In the vast majority of cases, the screen is designed to be used at its highest possible resolution, and lower resolutions look a bit ugly because they crudely scale up the lower resolution to the screen's physical resolution (the number of physical pixels it has). Screens can be set at a lower resolution as a poor man's version of zooming, although it is usually far prettier to do the resizing at the software level, e.g., use larger fonts, so that fonts are large but remain crisp. However, lower resolutions are also useful to show the same thing (e.g., your presentation) on two screens with different characteristics.

Doing xrandr -q indicates which modes are supported by the screen connected to each output, with a '+' next to the preferred mode (the one the screen is "designed" to be used in) and a '*' next to the currently used mode. Each mode consists of a screen resolution, given at the left, and a refresh rate5.

Thus, to fix the problem with --same-as, you should determine from the list what is the highest resolution supported by both outputs. (With some luck, it will be the projector's preferred resolution, so only your laptop screen will look a bit blurry; but this is not always the case, especially when aspects ratios do not match.) You should then set the outputs to that resolution; for instance:

xrandr --output VGA1 --mode 1024x768
xrandr --output LVDS1 --mode 1024x768

Once you are done giving your presentation, you will probably want to revert your laptop's screen to its preferred resolution using --auto, as previously explained.

You can do a bunch of other things with xrandr, such as rotating displays, applying transformation matrices to them, software brightness and gamma correction, zooming, etc., see man xrandr to know more. Playing with these options should be safe (although it's true that, in ancient days, there were stories about software ways to destroy CRT screens...). A real risk, however, is that if you make all your screens unusable (e.g., turn them off with --off, make them display garbage with bogus scales, flip them, etc.,), xrandr has no way to allow you to revert to sane settings automatically after a time, as GUI tools often do. You will have to undo your mistake in the blind. If you can't do this, one possible option is to switch to a TTY with Ctrl-Alt-F1, and use xrandr from there; you will probably have to pass it an environment variable so that it can talk to your X server, for instance:

DISPLAY=":0" xrandr --output LDVS1 --auto

Worst comes, of course, you can always restart the machine, as these settings do not persist across reboots.


  1. Experts will note that you can combine this in one invocation to the xrandr command, but for simplicity I will not address this here. 

  2. Pixel is the abbreviation of "picture element" and is supposed to be the smallest displayable unit. However, in fact, on LCD screens, individual pixels are further split in red, green, and blue subpixels, as you can easily see with a magnifying glass (or, e.g., a drop of rain on your smartphone's screen). This fact is commonly used to improve the quality of the display to something finer than what the resolution would suggest, especially for text; this is known as subpixel rendering

  3. It is typically quite annoying to see screens advertised as "Full HD" as if it were some magical statement about the quality of the screen, whereas it is nothing more than a technical statement about the number of pixels it can display. 

  4. In particular, what Apple brands as "retina displays" are displays with a certain high pixel density, which supposedly exceeds the human eye's own density for a certain typical viewing distance. Nothing magical here either. 

  5. The refresh rate is the number of times per second the screen should update, in Hertz. You can select it using --rate. I will not talk about it further, because it usually does not matter. 

Google Hash Code 2015

Today was the onsite final of the Google Hash Code contest organized by Google Paris. The task was to route loons to optimize Internet delivery on target cells by adjusting the loons' altitude and taking advantage from winds: see the problem statement in English and French and the input file, which I will not summarize further here.

With Théotime Grohens, Marc Jeanmougin, and Auguste Olivry, I was part of the "ENS Ulm 1" team who got the first place. This post presents our strategy to solve the problem. It is comparatively simpler than the 2014 contest strategies used during the extended round. We also provide cleaned-up code for our approach; it is about 200 lines long, which is twice more than what I published for the selection round; however, unlike the selection round, this code would have sufficed to reach the first place of the contest (like we did), beating the second best score (698678 by the hackEns team).

Strategy

The basic idea of the strategy is to compute the route of the loons one after the other, considering the route of the other loons as fixed, and starting with the default routes of not moving the loons at all.

To route each loon, we use a dynamic programming approach, which computes the optimal route for this loon given the other routes. We maintain (in covered in the code below) which target cells are already covered at which time by the other loons (whose routes are fixed), to focus at each point in time on the target cells which are not covered. We optimize the total number of covered cells (summed over all time units) for the loon that we are routing: we call this the score. We use the following recursive definition to compute the optimal route:

  • once the time is over, the score for a loon is 0 no matter its position
  • at a time \(0 \leq t < T\), the possible options for a loon at row \(r\), column \(c\), altitude \(a\) are to adjust altitude by -1, 0, or 1 (within acceptable bounds), and the score is the sum of the following quantities:
    • the number of cells that we cover at time \(t+1\) in the cell where the wind has taken us after our move, counting only those that are not otherwise covered
    • the score at time \(t+1\) for the position where the wind has taken us after our move (or 0 if the loon was lost), which we compute recursively

To make this more palatable, we have precomputed overall, for each cell, the cell where the wind leads us (in dest). We also precompute which target cells are covered by a loon in each cell (in cov; and then in left at each iteration for speed). To implement the recursive definition and compute the optimal path (in function route), we then just need to compute a big table for our loon with each possible row, cell, altitude and time unit, filling it from the end of time to the beginning of time. We store in each cell of the table the optimal score (in best), computed according to the above, and the altitude adjustment (in dir) that achieves the best score.

Will this have a reasonable running time? The bounds of the input were the following: 300 rows, 75 columns, 9 altitudes (counting the ground), and 400 time units. We need two tables, each containing one int for each possibility (so 8 bytes total). The total is 648 MB, which just fits. On my workstation (with Intel Core i5-4570 @ 3.20GHz), this computation takes about one second per loon.

Doing this for all loons in sequence gives a score of 680953 with my implementation, which would have sufficed to reach the 7th place. To improve the solution further, we follow a very simple scheme. As we have computed the optimal route for each loon in isolation, once all routes have been computed, we just recompute the routes for each loon in sequence, given the previous routes. This cannot make the solution worse (the loon can always follow its previous route, which the dynamic programming approach cannot miss), and may improve it, if the loon can follow a better route given the new routes of the other loons. As just one additional tweak, to make this optimization more efficient, when two altitude adjustments are tied in the dynamic algorithm, we choose one of them at random rather than deterministically: this means that the dynamic algorithm still returns the optimal route for its loon, but it adds more opportunities to change the route and makes the iterative optimization more effective.

Code

The following C++ code is under 200 lines in total, and produces solutions with increasingly better score, in a file called "output". On my workstation, it takes a bit under 20 minutes to produce a solution with a score that would have sufficed to reach the first place in the contest.

#include <algorithm>
#include <cstdio>
#include <vector>

using namespace std;

#define MAXR 77
#define MAXC 302
#define MAXL 2252
#define MAXA 11
#define MAXT 402
#define MAXB 54

int R, C, A, L, V, B, T, rs, cs;

struct Pt {
  int r, c;
  Pt(int r=0, int c=0) : r(r), c(c) {}
};

// global variables: auto-initalized, can be larger than what fits in stack
vector<int> cov[MAXR][MAXC]; // which targets are covered by a loon at (r, c)
Pt dest[MAXA][MAXR][MAXC]; // where a loon at (a, r, c) flies; (-1, -1) = out
Pt target[MAXL]; // position of target cells
bool covered[MAXT][MAXL]; // at time t, is target cell l covered yet?
// dynamic programming: best score at (t, a, r, c) and best direction
int best[MAXT][MAXA][MAXR][MAXC], dir[MAXT][MAXA][MAXR][MAXC];
int left[MAXT][MAXR][MAXC]; // how many new cells covered at (r, c) at time t
// solution: at time t, altitude adjustment for loon b 
int solution[MAXT][MAXB];

// given by problem statement
int columndist(int c1, int c2) { return min(abs(c1 - c2), C - abs(c1 - c2)); }
// does a loon at (r, c) cover target cell l?
int iscovered(int l, int r, int c) {
  int u = target[l].r, v = target[l].c;
  return ((r - u) * (r - u) + columndist(c, v) * columndist(c, v) <= V*V);
}

// return next move for loon b at time t when at position (a, r, c)...
// ... using the results of dynamic programming
int decide_dyn(int b, int t, int a, int r, int c) { return dir[t][a][r][c]; }
// ... using the computed solution
int decide_sol(int b, int t, int a, int r, int c) { return solution[t][b]; }

int simulate(int b, int (*decide)(int, int, int, int, int)) {
  // compute the run of loon b using decision function decide
  // return the score improvement
  int score = 0;
  int r = rs, c = cs, a = 0;
  for (int t = 0; t < T; t++) {
    int bestda = (*decide)(b, t, a, r, c);
    // store the move
    solution[t][b] = bestda;
    // apply it
    a += bestda;
    Pt next = dest[a][r][c];
    r = next.r;
    c = next.c;
    if (r < 0)
      break; // loon exited
    if (a > 0) {
      // update covered target cells
      for (unsigned int i = 0; i < cov[r][c].size(); i++) {
        int l = cov[r][c][i];
        // score improved if target cell not already covered
        score += covered[t+1][l] ? 0 : 1;
        covered[t+1][l] = true;
      }
    }
  }
  return score;
}

void route(int b) {
  // route loon b given the already covered targets (in covered)

  // save a factor A: pre-count the uncovered targets for each cell
  for (int t = 0; t <= T; t++)
    for (int r = 0; r < R; r++)
      for (int c = 0; c < C; c++) {
        int score = 0;
        for (unsigned int i = 0; i < cov[r][c].size(); i++) {
          int l = cov[r][c][i];
          score += covered[t][l] ? 0 : 1;
        }
        left[t][r][c] = score;
      }

  // dynamic programming, t == T is a sentinel value
  for (int t = T-1; t >= 0; t--)
    for (int a = 0; a <= A; a++)
      for (int r = 0; r < R; r++)
        for (int c = 0; c < C; c++) {
          int bestda = 0, bestv = 0; // best altitude adjustment, best value
          for (int da = -1; da <= 1; da++) {
            if (a <= 1 && da == -1)
              continue; // can't go down when on ground, or back on ground
            if (a + da > A)
              continue; // can't go above max altitude
            // pretend we moved b by da at time t
            Pt next = dest[a + da][r][c];
            if (next.r < 0)
              break; // loon exited, so score remains 0
            // score if going at these new coordinates, computed dynamically
            int rscore = best[t+1][a+da][next.r][next.c];
            // if above ground, score increased by number of covered targets
            if ((a + da) > 0)
              rscore += left[t+1][next.r][next.c];
            // break ties at random
            if (rscore > bestv || (rscore == bestv && (rand() % 2))) {
              // da is the best altitude adjustment
              bestv = rscore;
              bestda = da;
            }
          }
          // store best altitude adjustment, best value
          best[t][a][r][c] = bestv;
          dir[t][a][r][c] = bestda;
        }
}

void print_sol() {
  FILE *f = fopen("output", "w");
  for (int t = 0; t < T; t++) {
    for (int b = 0; b < B; b++)
      fprintf(f, "%d ", solution[t][b]);
    fprintf(f, "\n");
  }
  fclose(f);
}

int main(int argc, char **argv) {
  srand(42); // make it deterministic
  scanf("%d%d%d", &R, &C, &A);
  scanf("%d%d%d%d", &L, &V, &B, &T);
  scanf("%d%d", &rs, &cs);
  for (int l = 0; l < L; l++) {
    int r, c;
    scanf("%d%d", &r, &c);
    target[l] = Pt(r, c);
  }
  for (int r = 0; r < R; r++)
    for (int c = 0; c < C; c++) {
    }
  for (int a = 0; a <= A; a++)
    for (int r = 0; r < R; r++)
      for (int c = 0; c < C; c++)
        if (!a) {
          // wind at altitude 0 does not move loons
          dest[a][r][c] = Pt(r, c);
        } else {
          int dr, dc;
          scanf("%d%d", &dr, &dc);
          // populate dest with the coordinates where the wind takes the loon
          int destr, destc;
          destr = r + dr;
          destc = (c + dc + C) % C;
          if (destr < 0 || destr >= R)
            destr = destc = -1; // loon goes out
          dest[a][r][c] = Pt(destr, destc);
        }

  // compute for each cell the list of targets covered by a loon at this cell
  for (int r = 0; r < R; r++)
    for (int c = 0; c < C; c++)
      for (int l = 0; l < L; l++)
        if (iscovered(l, r, c))
          cov[r][c].push_back(l);

  int orig_score = 0;
  // iteratively compute a route for each loon given other routes
  while (true) {
    for (int br = 0; br < B; br++) {
      // route br given the routes of the other loons
      // first, forget about which cells are covered
      for (int t = 0; t <= T; t++)
        for (int l = 0; l < L; l++)
          covered[t][l] = false;
      // then, simulate the other loons (default routes are 0 ... 0)
      int score = 0;
      for (int b = 0; b < B; b++)
        if (b != br)
          score += simulate(b, &decide_sol);
      // now, compute the route of br with dynamic programming
      route(br);
      score += simulate(br, &decide_dyn);
      fprintf(stderr, "route %d\n", br);
      if (score > orig_score) {
        fprintf(stderr, "score was %d is %d\n", orig_score, score);
        orig_score = score;
        print_sol();
      }
    }
  }

  return 0;
}

Extensions

The above code gets stuck fairly quickly. After reaching 698684 in about 17 minutes on my machine, it painfully and progressively goes up to 698734 in about 30 more minutes, and then stalls and does not seem to improve further (I left it running for about 30 additional minutes). Of course, this may vary given your choice of RNG seed (I hardcoded one in the code for reproducibility), and indeed for some unlucky seeds (one out of 5 that I tested) it seems that the code fails to reach a winning score. In any case, our final score of 700913 was achieved with a slightly more elaborate approach. If anyone is brave enough to want to have a look at our actual code, it is here.

We used the following additional improvement: whenever iterating over loons no longer makes progress, start trying to reroute two loons (or more than two) instead of just one. When doing this, unlike when rerouting only one loon, the solution score may decrease, so the code must rollback to the previous solution if the solution got worse. (If the score is still the same but the solution is different, however, it is good to keep the change, so that further optimizations with just one loon can be more effective.) We alternated in parallel optimization attempts with a single loon and attempts with two or three loons that did not make things worse. To be precise, we also tweaked solutions twice, by moving as many as 4 loons, at the expense of a slight drop in quality, to escape local optima and reach better solutions.

All of this was very hacky and we did not have a clean way to run many explorations in parallel on multiple machines, passing improved solutions to other running instances as they were found, some being more adventurous and others more conservative, and hopefully following what is known about metaheuristics. I trust that such approaches will be explored to unimaginable depths during the extended online contest. ;) I don't know if a more clever approach could yield substantial score improvements (rather than incremental ones); this may also turn up during the extended contest.

In any case, we had lots of fun during the contest, and we are very grateful to the organizers for making it happen. :)