Google Hash Code 2014
A French translation of this post is available. — Une traduction de ce billet en français est disponible.
In April 2014, Google Paris organized an onsite programming contest, the Google Hash Code. The task to solve was quite fun. This post describes the problem, and some of the strategies to solve it that were used by the ENS teams. The reason why this gets published now is because the 2015 edition is currently being organized, which for some reason gave us the impetus to document our work.
ENS scored pretty well in the contest (ENS students were in teams ranked 1st, 2nd, 3rd, 4th, 7th, 10th; mine was "ENS Ulm 1" which ranked 7th), and substantially improved on their solutions afterwards during the Extended Contest, which lasted for some time after the contest.
Problem description
The full task assignment by Google is available.
We are given a map of Paris (from OpenStreetMap data) that includes1 all intersections and streets. As in real life, some of the streets are one-way. The segments are labeled by their length in meters and the time it takes to traverse them. We have 8 Street View cars that start from the Google office and want to photograph as many street meters as possible in a limited amount of time (15 hours). You can traverse the same street as many times as you want, but it only counts once towards your total meters.
The goal is to choose a route for the 8 cars which allows you to traverse as many unique meters of the streets of Paris as possible, within the alloted time.
The brute-force method
Of course, in principle we can solve the problem by enumerating all of the possible routes for the 8 cars, and picking the best one. There are finitely many routes, so this approach is guaranteed to find the optimal solution. This is good enough for a mathematician, but not for a computer scientist: the paths are too numerous and you will never be able to enumerate them in practice.
Imagine there is a single car that reaches every minute an intersection with at least two possible streets, during 10 hours. The number of paths is 210×60, which is about 10180. Even if every atom in the Solar System were an impossibly powerful supercomputer, you wouldn't be able to reach this number before the Sun goes bust.
The greedy method
At the opposite end of the spectrum, the simplest possible approach (beyond purely random approaches) is to choose every street to take, one after the other, based on some greedy optimization criterion.
Basic greedy
For instance, plan a route for each car, one after the other. Whenever the car reaches an intersection that allows it to reach streets that haven't been traversed yet, choose one of them. Maybe pick the one with most meters, or the one with most meters per second. Otherwise, if all reachable streets have been visited, choose one of them at random. Repeat while there is time, then do the same for all cars.
The greedy strategy will yield reasonable results. There are, of course, tweaks to make, for instance planning the route for all cars simultaneously instead of sequentially... Or sending the cars at the beginning to hardcoded start positions which are far apart, to increase coverage.
However, the key drawback of greedy solutions is that they cannot plan things in advance. When choosing among already visited streets, for instance, you are not trying to get closer to unvisited streets. If visiting a new street will trap you in a maze of already visited streets, you will still fall in the trap.
Greedy with lookahead
A simple solution to this problem is to improve the greedy strategy with brute-force that looks ahead for a few steps. Whenever you need to take a decision, don't take it based on what it will bring you now. Rather, consider all possible paths of length, say, 10, that you can do from this point, and see which one is bringing you the best results. Choose your next move to be the first move of the best path. Do the routing for cars in parallel following this idea. The path length that you use is tweakable, and in general using higher path lengths will improve the result: I think we used something like 17 in practice.
This simple idea is easy to implement and yields pretty good results. My team used it with a few tweaks, and to my knowledge many of the top-ranking teams also did.
The clever method
Of course, the point of this post is to describe the clever solution, the one that was implemented after the contest, for the extended round.
Rephrasing the problem as Eulerian paths
The key insight is to notice the similarity between our problem and the Eulerian path problem: computing a path in a graph which visits each edge exactly once. The map of Paris is seen as a graph, where the intersections are the vertices, and the streets are the edges. If we could find an Eulerian path in this graph, then it would be the ideal solution, going through all segments without wasting any time in already visited segments.
Of course, you need to squint a bit to see the reduction. First, we have 8 cars; but for now we will assume that we have only one, and will split the path between the various cars later. Second, we have a time limit, and edges have a value in terms of their number of meters; but we will disregard this. We will pretend that we will be able to visit all edges in the alloted time (which isn't so unrealistic, as we will see ;)), so that we ignore the time limit and the lengths of the streets in meters (we only care about the time needed to traverse them).
Third, and more importantly, our graph is a bit weird, because some edges are undirected, and others (the one-way streets) are directed. Because of the one-way streets, we cannot rely on undirected Eulerian paths, and will be forced to reduce to directed Eulerian path. We will thus need to choose an orientation for the undirected edges.
Fourth: well, obviously, we won't be able to find an Eulerian path. There is a well-known characterization of directed graphs that have Eulerian paths (I call them Eulerian graphs), which is not hard to prove: the in-degree of each vertex (the number of edges that go towards it) must be equal to its out-degree2. Does the input graph satisfy this condition? Not a chance. However, in our case it isn't really forbidden to traverse edges multiple times; we just want to avoid doing it as it is wasteful.
Hence, there will be two steps to our process to make the graph Eulerian. First, we will choose a direction for all undirected edges, in a way that makes the graph "more Eulerian". Second, we will add edges to make it really Eulerian, which materialize the (wasteful) choice of going through edges which are otherwise visited (that is, that have been visited or will be visited at another point in the solution).
Orienting undirected edges
In order to reduce to a directed Eulerian path, we have to assign a direction to all undirected edges. To make the Eulerian path problem easier to solve on the resulting graph, we will try to minimize the sum, across all vertices, of the difference between in-degree and the out-degree: intuitively, the smaller this quantity, the less edges we have to add to make the graph Eulerian.
To introduce some terminology, I will say that a vertex is overloaded (in the original graph, or in a directed graph obtained from it by edge orientation) if its in-degree (counting only directed edges) is more than its out-degree; it is underloaded if it is less, and balanced if both are equal. The load of the vertex is the difference between in-degree and out-degree (which is >0 for overloaded vertices and <0 for underloaded vertices).
How can we choose the best orientation of the edges? Our choice of terminology gives some insight: we would like edge orientations that allow originally overloaded vertices to "transfer" some of their load to the underloaded vertices, following directed paths, which have to be disjoint to avoid double-counting the intermediate edges. It turns out that this can actually be phrased neatly as a flow problem.
Overloaded and underloaded vertices are respectively represented as sources and sinks, with capacity equal to their load. Each undirected edge is represented as a pair of edges going in both directions, with capacity 1. Edges that are already oriented are ignored.
We solve this flow problem using any flow algorithm. By the integral flow theorem, the flow along each edge in the result is either 0 or 1. Hence, the result describes a way to orient some (but not all) of the undirected edges, depending on the direction of the flow going through them.
We still have the problem that some edges in the result are still undirected, namely those which were not used at all in the optimal flow. To take care of this, we simply orient all of the remaining edges at random. Of course, this may make the final sum worse, and to alleviate this we enforce manually the property that the flow algorithm had enforced for us before: for any path of our newly oriented edges, if reversing the direction of the path makes the sum better (because the path origin was underloaded and the path target was overloaded), we do the reversal, which increases the load of the origin by one, decreases the load of the target by one, and does not change the other degrees. To perform this last step efficiently, observe that it suffices to consider all pairs of vertices. Whenever we find a pair of an underloaded vertex and an overloaded vertex, we can search for a path from one to the other, and reverse it if it exists.
Hence, at the end of this process, we have a directed graph where we have tried to minimize the sum of differences between in-degree and out-degree, to make it as Eulerian as possible. We now make it Eulerian for real, by adding edges so that all vertices are balanced.
Adding edges
Intuitively, the edges that we add to make the graph Eulerian correspond to the possibility of adding already traversed edges to the path.
In other words, if we add an edge from vertex i to vertex j, it means that we will be going from i to j through a path of edges that will be visited anyway. All the time that we spend doing such traversals is wasted, and does not contribute towards our objective. Hence, the cost of adding an edge from i to j should be the time needed to go from i to j following any edge path. In fact, we may follow a path that violates our chosen orientation for the edges that used to be undirected, and it does not matter; so we are really talking about following a shortest path in the original graph.
Hence, we start by precomputing the shortest path time between all pairs of vertices in the input graph, using the Floyd-Warshall algorithm.
Now, we model the problem of adding more edges as a minimum weight matching in a weighted bipartite graph. On one part of the bipartite graph, we put all vertices that are overloaded in the result of the previous phase, duplicated as many times as their load. We analogously create the other part3 of the graph with the underloaded vertices. We add edges in the bipartite graph between any two vertices in the two parts, weighted by the time of the shortest path from one to the other.
Now, a perfect matching in this graph represents a way to add the right number of edges to ensure that all vertices are balanced, and if its weight is minimal then it wastes little total time as possible traversing edges that are otherwise visited. We use the Hungarian algorithm to solve this problem, and to complete the oriented graph.
Constructing the Eulerian path
The resulting graph is now directed, and each vertex is balanced, so it has an Eulerian path. To compute it, it is well-known that a simple greedy approach works. (Maybe there would be possible improvements with respect to the later steps, but we didn't look into this.)
From the Eulerian path, expanding the edges that were added in the previous phase to the shortest paths to which they correspond, we obtain an itinerary, which is a fairly good solution for a single car to traverse all the streets of Paris in as little time as possible.
Splitting the path among cars
We now remember that we have 8 cars rather than a single one, so that our route must be split among them.
Here is our approach: we take the first car, and have it run the beginning of the route for one quarter of the alloted time. Then, we look at the rest of the route until the end of the alloted time, and search for the traversed vertex that is the easiest to reach (in terms of shortest path time) from the starting vertex (the Google office). We continue the route of the first car until it reaches this vertex, at which point we end it. Then, the next car starts at the starting vertex, goes to the cut vertex following the shortest path, follows the rest of the route again for one quarter of the alloted time, and continues until the closest point to the starting vertex. We repeat until the last car, which takes care of all the rest of the route.
Of course, the reason why we cut when each car goes close to the starting vertex is to minimize the effort spent by the next car to reach the position where they can resume the route. The solution will still give too much work to the last car, but we will see later how to balance the duration of the routes of all cars. The reason why we ensured that each car ran for at least one quarter of the time is to make this balancing operation easier, by ensuring that the car paths are sufficiently long to intersect many times.
We use this as our initial solution, and now optimize it further.
Optimizations
We now iterate a bunch of optimizations to tweak the paths and make them better, until we get something which we judge is good enough.
Of course, not all optimizations yield the same savings; and the order in which they are applied can make a large difference. That said, I don't remember in which order, or how many times, they were applied; there probably wasn't much logic to it. :)
Pruning
For any subpath of the path of any car that consists only of otherwise visited edges, we replace it by a shortest path between its endpoints (courtesy of the precomputed Floyd-Warshall).
Loop swapping
Assume some car's path includes a move from vertex i to vertex j, then a cycle C, then a move from vertex j to vertex i; and that the edge from i to j is otherwise visited. If we can find any other car that passes through an intersection of C, we can have that other car take care of the edges of C on the way, so that the first car doesn't need to do C and doesn't need to do the steps from i to j and back (so we save twice the time of the i—j edge).
Of course, this can lead to a situation where some cars have a longer path than others, but this will be dealt with later.
Trimming
If the last edge of a car's path is already visited by someone else, just remove it.
Brute-force
For every subpath of every car's path, up to certain length, if the path can be replaced by any shorter path that leaves no street unvisited, then do so. This differs from the pruning optimization in that the subpaths are allowed to contain edges that are not otherwise visited; of course, the replacement subpath will also need to traverse these edges (possibly in a different order).
Balancing
The time limit of 15 hours has been ignored until now, but it has to be obeyed by all cars, so we shouldn't return a solution where one car does all the work and the others do not do much. Further, in case we do manage to traverse all edges (which we did!), the quality of a solution is judged according to its remaining time, which is T−maxc∈1,…,8tc where T is the time limit and tc the time of car c.
Hence, we would ideally want all the cars to follow routes that take the same time. This implies that the busiest cars should try to swap longer parts of their paths with shorter parts of less busy cars.
In practice, look for any two vertices x and y and two cars such that both cars will pass through x before passing through y. Now, in general the paths followed by the two cars between x and y will take a different amount of time; if swapping the paths will decrease the time difference between the two cars' routes, then swap them.
This is a simple greedy scheme that can be iterated as long as we want. In fact, as the paths of cars will tend to intersect very often, it works very well and quickly converges to a state where all cars take the same time to do their routes, no matter the initial routes.
Results
The clever strategy described here managed to traverse all streets on Paris in the alloted time (in fact, in 9 minutes less than the alloted time). It's a pity that this didn't happen during the contest proper, but still, it is nice. :)
Related work
We developed all of this on our own, and we didn't look seriously into the related work of people who already tried to solve similar problems. It turns out that our problem is known as the postman problem, because of course it is also useful to optimize mail delivery.
More specifically, our version is known as the "k-mixed postman problem", because we have multiple cars (here, k=8), and because the graph is mixed: some of the edges are directed while others are undirected. Surprisingly, the directed postman problem and the undirected postman problem (for k=1) can be optimally solved in polynomial time4, whereas the mixed postman problem is NP-hard5.
It would be nice to compare the results of our method to state-of-the-art approaches. We did a preliminary comparison to existing datasets and we have no reason to believe that our algorithms are competitive. But, still, we had fun. :)
Credits
The ideas in this post are not mine: essentially everything was done by other ENS students. All steps of the clever solution except for the post-processing were conceived and implemented by gnomnain (code). The optimizations were implemented by elarnon (code), and Mc and mehdi (code). (All code is purely write-only and there is no documentation of any kind, so don't try to use it.)
This post was re-written from an earlier version authored by Mc. Thanks to Jill-Jênn, elarnon, Armavica and gprano for proofreading drafts and offering feedback. There is another write-up about the topic, in French, by Jill-Jênn.
-
The input also included the geographic coordinates of the intersections, but we won't be using this. That said, to anticipate on the sequel, some teams had nice results with the simple greedy heuristic of taking the leftmost non-visited street at each intersection, to make spiral-like itineraries. ↩
-
Technically, this is the criterion for the existence of a cycle, not a path, but as we would have to force the path at the starting point anyway, it is simpler to look for a cycle, which is just a special case of path. ↩
-
It is clear that both parts have the same size, because the total of in-degrees is the same as the total of out-degrees: it is exactly the number of edges in the directed graph of the previous phase. ↩
-
J. Edmonds, E.L. Johnson. Matching Euler tours and the Chinese postman, Mathematical Programming 5 (1973) 88–124. ↩
-
C.H. Papadimitriou. On the complexity of edge traversing, Journal of the ACM 23 (1976) 544–554. ↩