a3nm's blog

Google Hash Code 2014

A French translation of this post is available. — Une traduction de ce billet en français est disponible.

In April 2014, Google Paris organized an onsite programming contest, the Google Hash Code. The task to solve was quite fun. This post describes the problem, and some of the strategies to solve it that were used by the ENS teams. The reason why this gets published only now is that the 2015 edition is currently being organized, which for some reason gave us the impetus to document our work.

ENS scored pretty well in the contest (ENS students were in teams ranked 1st, 2nd, 3rd, 4th, 7th, 10th; mine was "ENS Ulm 1" which ranked 7th), and substantially improved on their solutions afterwards during the Extended Contest, which lasted for some time after the contest.

Problem description

The full task assignment by Google is available.

We are given a map of Paris (from OpenStreetMap data) that includes [1] all intersections and streets. As in real life, some of the streets are one-way. The segments are labeled by their length in meters and the time it takes to traverse them. We have 8 Street View cars that start from the Google office and want to photograph as many street meters as possible in a limited amount of time (15 hours). You can traverse the same street as many times as you want, but it only counts once towards your total meters.

The goal is to choose a route for the 8 cars which allows you to traverse as many unique meters of the streets of Paris as possible, within the allotted time.

The brute-force method

Of course, in principle we can solve the problem by enumerating all of the possible routes for the 8 cars, and picking the best one. There are finitely many routes, so this approach is guaranteed to find the optimal solution. This is good enough for a mathematician, but not for a computer scientist: the paths are too numerous and you will never be able to enumerate them in practice.

Imagine there is a single car which, for 10 hours, reaches every minute an intersection with at least two possible streets. The number of paths is 2^(10×60) = 2^600, which is about 10^180. Even if every atom in the Solar System were an impossibly powerful supercomputer, you wouldn't be able to reach this number before the Sun goes bust.
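
If you want a quick sanity check of this order of magnitude, two lines of Python suffice (the one-decision-per-minute setting is of course just an illustration):

from math import log10

paths = 2 ** (10 * 60)   # one binary choice per minute, for 10 hours
print(len(str(paths)))   # 181: the number has 181 decimal digits
print(600 * log10(2))    # about 180.6, hence "about 10^180"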

The greedy method

At the opposite end of the spectrum, the simplest possible approach (beyond purely random approaches) is to choose every street to take, one after the other, based on some greedy optimization criterion.

Basic greedy

For instance, plan a route for each car, one after the other. Whenever the car reaches an intersection that allows it to reach streets that haven't been traversed yet, choose one of them. Maybe pick the one with most meters, or the one with most meters per second. Otherwise, if all reachable streets have been visited, choose one of them at random. Repeat while there is time, then do the same for all cars.
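
To make this concrete, here is a minimal Python sketch of one such greedy step, written for this post: the graph and visited structures and the (neighbor, length, time) representation are assumptions of the example, not the format of the actual input.

import random

def greedy_step(position, graph, visited):
    # graph[u]: list of (neighbor, length_in_meters, time_in_seconds) for the
    # streets leaving u; visited: set of (u, v) pairs already photographed
    # (for a two-way street, both (u, v) and (v, u) should be added to it)
    options = graph[position]
    new = [o for o in options if (position, o[0]) not in visited]
    if new:
        # prefer the unvisited street that earns the most meters per second
        return max(new, key=lambda o: o[1] / o[2])[0]
    # all reachable streets are already photographed: pick one at random
    return random.choice(options)[0]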

The greedy strategy will yield reasonable results. There are, of course, tweaks to make: for instance, planning the route for all cars simultaneously instead of sequentially, or sending the cars at the beginning to hardcoded start positions which are far apart, to increase coverage.

However, the key drawback of greedy solutions is that they cannot plan things in advance. When choosing among already visited streets, for instance, you are not trying to get closer to unvisited streets. If visiting a new street will trap you in a maze of already visited streets, you will still fall into the trap.

Greedy with lookahead

A simple solution to this problem is to improve the greedy strategy with a brute-force search that looks ahead for a few steps. Whenever you need to make a decision, don't make it based on what it will bring you now. Rather, consider all possible paths of length, say, 10, that you can take from this point, and see which one brings you the best results. Choose your next move to be the first move of the best path. Do the routing for the cars in parallel following this idea. The lookahead length is tweakable, and in general higher lengths will improve the result: I think we used something like 17 in practice.
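
A possible implementation of the lookahead, as a sketch under the same assumed representation as above (naive recursion, exponential in the depth, so reaching depths like 17 certainly requires a more careful implementation than this):

def best_move(position, graph, visited, time_left, depth=10):
    # score a candidate move by the best number of new meters that can still
    # be collected within `depth` edges, then return the best-scoring move
    def lookahead(pos, seen, budget, d):
        if d == 0:
            return 0
        best = 0
        for (v, length, cost) in graph[pos]:
            if cost > budget:
                continue
            gain = 0 if (pos, v) in seen else length
            best = max(best, gain + lookahead(v, seen | {(pos, v)},
                                              budget - cost, d - 1))
        return best
    candidates = [(v, length, cost) for (v, length, cost) in graph[position]
                  if cost <= time_left]
    if not candidates:
        return None
    def score(move):
        v, length, cost = move
        gain = 0 if (position, v) in visited else length
        return gain + lookahead(v, visited | {(position, v)},
                                time_left - cost, depth - 1)
    return max(candidates, key=score)[0]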

This simple idea is easy to implement and yields pretty good results. My team used it with a few tweaks, and to my knowledge many of the top-ranking teams also did.

The clever method

Of course, the point of this post is to describe the clever solution, the one that was implemented after the contest, for the extended round.

Rephrasing the problem as Eulerian paths

The key insight is to notice the similarity between our problem and the Eulerian path problem: computing a path in a graph which visits each edge exactly once. The map of Paris is seen as a graph, where the intersections are the vertices, and the streets are the edges. If we could find an Eulerian path in this graph, then it would be the ideal solution, going through all segments without wasting any time in already visited segments.

Of course, you need to squint a bit to see the reduction. First, we have 8 cars; but for now we will assume that we have only one, and will split the path between the various cars later. Second, we have a time limit, and edges have a value in terms of their number of meters; but we will disregard this. We will pretend that we will be able to visit all edges in the allotted time (which isn't so unrealistic, as we will see ;)), so that we ignore the time limit and the lengths of the streets in meters (we only care about the time needed to traverse them).

Third, and more importantly, our graph is a bit weird, because some edges are undirected, and others (the one-way streets) are directed. Because of the one-way streets, we cannot rely on undirected Eulerian paths, and are forced to reduce to directed Eulerian paths. We will thus need to choose an orientation for the undirected edges.

Fourth: well, obviously, we won't be able to find an Eulerian path. There is a well-known characterization of directed graphs that have Eulerian paths (I call them Eulerian graphs), which is not hard to prove: the in-degree of each vertex (the number of edges that go towards it) must be equal to its out-degree [2]. Does the input graph satisfy this condition? Not a chance. However, in our case it isn't really forbidden to traverse edges multiple times; we just want to avoid doing it as it is wasteful.

Hence, there will be two steps to our process to make the graph Eulerian. First, we will choose a direction for all undirected edges, in a way that makes the graph "more Eulerian". Second, we will add edges to make it really Eulerian, which materialize the (wasteful) choice of going through edges which are otherwise visited (that is, that have been visited or will be visited at another point in the solution).

Orienting undirected edges

In order to reduce to a directed Eulerian path, we have to assign a direction to all undirected edges. To make the Eulerian path problem easier to solve on the resulting graph, we will try to minimize the sum, across all vertices, of the difference between in-degree and out-degree: intuitively, the smaller this quantity, the fewer edges we have to add to make the graph Eulerian.

To introduce some terminology, I will say that a vertex is overloaded (in the original graph, or in a directed graph obtained from it by edge orientation) if its in-degree (counting only directed edges) is more than its out-degree; it is underloaded if it is less, and balanced if both are equal. The load of the vertex is the difference between in-degree and out-degree (which is >0 for overloaded vertices and <0 for underloaded vertices).

How can we choose the best orientation of the edges? Our choice of terminology gives some insight: we would like edge orientations that allow originally overloaded vertices to "transfer" some of their load to the underloaded vertices, following directed paths, which have to be disjoint to avoid double-counting the intermediate edges. It turns out that this can actually be phrased neatly as a flow problem.

Overloaded and underloaded vertices are respectively represented as sources and sinks, with capacity equal to their load. Each undirected edge is represented as a pair of edges going in both directions, with capacity 1. Edges that are already oriented are ignored.

We solve this flow problem using any flow algorithm. By the integral flow theorem, the flow along each edge in the result is either 0 or 1. Hence, the result describes a way to orient some (but not all) of the undirected edges, depending on the direction of the flow going through them.
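
For illustration, here is how this flow instance could be set up with networkx (a sketch written for this post; the SRC and SINK super-vertices and the load dictionary are artifacts of the example, not part of the original data):

import networkx as nx

def orient_by_flow(undirected_edges, load):
    # load[v]: in-degree minus out-degree of v, counting only directed edges;
    # undirected_edges: list of (u, v) pairs; returns edges to orient as u -> v
    G = nx.DiGraph()
    for (u, v) in undirected_edges:    # one edge in each direction, capacity 1
        G.add_edge(u, v, capacity=1)
        G.add_edge(v, u, capacity=1)
    for v, l in load.items():
        if l > 0:                      # overloaded: source of l units of load
            G.add_edge("SRC", v, capacity=l)
        elif l < 0:                    # underloaded: sink of -l units of load
            G.add_edge(v, "SINK", capacity=-l)
    _, flow = nx.maximum_flow(G, "SRC", "SINK")
    oriented = set()
    for (u, v) in undirected_edges:    # integral flow: 0 or 1 on each edge
        if flow[u][v] > flow[v][u]:
            oriented.add((u, v))
        elif flow[v][u] > flow[u][v]:
            oriented.add((v, u))
    return oriented                    # unused edges stay undirected for now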

We are left with the problem that some edges are still undirected, namely those which were not used at all in the optimal flow. To take care of this, we simply orient all of the remaining edges at random. Of course, this may make the final sum worse, so to alleviate this we enforce manually the property that the flow algorithm had enforced for us before: for any path of newly oriented edges, if reversing the direction of the path makes the sum better (because the path origin was underloaded and the path target was overloaded), we do the reversal, which increases the load of the origin, decreases the load of the target, and does not change the load of the other vertices. To perform this last step efficiently, observe that it suffices to consider all pairs of vertices: whenever we find a pair of an underloaded vertex and an overloaded vertex, we search for a path of newly oriented edges from one to the other, and reverse it if it exists.

Hence, at the end of this process, we have a directed graph where we have tried to minimize the sum of differences between in-degree and out-degree, to make it as Eulerian as possible. We now make it Eulerian for real, by adding edges so that all vertices are balanced.

Adding edges

Intuitively, the edges that we add to make the graph Eulerian correspond to the possibility of adding already traversed edges to the path.

In other words, if we add an edge from vertex i to vertex j, it means that we will be going from i to j through a path of edges that will be visited anyway. All the time that we spend doing such traversals is wasted, and does not contribute towards our objective. Hence, the cost of adding an edge from i to j should be the time needed to go from i to j following any edge path. In fact, we may follow a path that violates our chosen orientation for the edges that used to be undirected, and it does not matter; so we are really talking about following a shortest path in the original graph.

Hence, we start by precomputing the shortest path time between all pairs of vertices in the input graph, using the Floyd-Warshall algorithm.
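
This is the textbook cubic algorithm; a Python sketch, assuming the directed traversal times are given as a dictionary edge_time[(u, v)]:

def floyd_warshall(vertices, edge_time):
    # dist[u][v]: time of a shortest path from u to v (float("inf") if none)
    INF = float("inf")
    dist = {u: {v: (0 if u == v else INF) for v in vertices} for u in vertices}
    for (u, v), t in edge_time.items():
        dist[u][v] = min(dist[u][v], t)
    for k in vertices:
        for i in vertices:
            dik = dist[i][k]
            for j in vertices:
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist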

Now, we model the problem of adding more edges as a minimum weight matching in a weighted bipartite graph. On one part of the bipartite graph, we put all vertices that are overloaded in the result of the previous phase, duplicated as many times as their load. We analogously create the other part [3] of the graph with the underloaded vertices. We add edges in the bipartite graph between any two vertices in the two parts, weighted by the time of the shortest path from one to the other.

Now, a perfect matching in this graph represents a way to add the right number of edges to ensure that all vertices are balanced, and if its weight is minimal then it wastes as little total time as possible traversing edges that are otherwise visited. We use the Hungarian algorithm to solve this problem, and to complete the oriented graph.
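
As an illustration, the matching itself could be delegated to an off-the-shelf solver such as scipy's linear_sum_assignment (a sketch written for this post; the function and variable names are made up for the example):

import numpy as np
from scipy.optimize import linear_sum_assignment

def extra_edges(overloaded, underloaded, dist):
    # overloaded / underloaded: lists of vertices, each vertex repeated as many
    # times as its (absolute) load, so both lists have the same length
    cost = np.array([[dist[u][v] for v in underloaded] for u in overloaded])
    rows, cols = linear_sum_assignment(cost)   # minimum-weight perfect matching
    # each matched pair becomes an added edge from an overloaded vertex to an
    # underloaded one, standing for a shortest path of otherwise-visited streets
    return [(overloaded[r], underloaded[c]) for r, c in zip(rows, cols)]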

Constructing the Eulerian path

The resulting graph is now directed, and each vertex is balanced, so it has an Eulerian path. It is well-known that a simple greedy approach suffices to compute it. (Maybe a smarter choice of Eulerian path could help the later steps, but we didn't look into this.)
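
One standard way of doing this greedy construction is Hierholzer's algorithm; here is a minimal sketch, assuming the balanced graph is given as adjacency lists with one entry per edge (the lists are consumed as the circuit is built):

def eulerian_circuit(adj, start):
    # adj: dict mapping each vertex to a list of successors, one per edge;
    # every vertex is assumed balanced, so a circuit through `start` exists
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())   # follow an edge that was not used yet
        else:
            circuit.append(stack.pop())  # no edge left here: back up and record
    return circuit[::-1]                 # sequence of vertices of the circuit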

From the Eulerian path, expanding the edges that were added in the previous phase to the shortest paths to which they correspond, we obtain an itinerary, which is a fairly good solution for a single car to traverse all the streets of Paris in as little time as possible.

Splitting the path among cars

We now remember that we have 8 cars rather than a single one, so that our route must be split among them.

Here is our approach: we take the first car, and have it run the beginning of the route for one quarter of the allotted time. Then, we look at the rest of the route until the end of the allotted time, and search for the traversed vertex that is the easiest to reach (in terms of shortest path time) from the starting vertex (the Google office). We continue the route of the first car until it reaches this vertex, at which point we end it. Then, the next car starts at the starting vertex, goes to the cut vertex following the shortest path, follows the rest of the route again for one quarter of the allotted time, and continues until the closest point to the starting vertex. We repeat until the last car, which takes care of all the rest of the route.
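
Here is a rough sketch of this cutting procedure (much simplified: the route is a list of vertices, edge_time gives the duration of each edge of the route, dist is the Floyd-Warshall table, and the shortest path that each car takes from the office to its resume point is kept implicit):

def split_route(route, start, edge_time, dist, T, n_cars=8):
    cuts = [0]
    i = 0
    for _ in range(n_cars - 1):
        t = dist[start][route[i]]      # reach the resume point from the office
        # follow the route for at least a quarter of the allotted time
        while i + 1 < len(route) and t < T / 4:
            t += edge_time(route[i], route[i + 1])
            i += 1
        # among the vertices reachable before the time limit runs out, cut at
        # the one that is cheapest to reach again from the office
        j, tj, best = i, t, i
        while j + 1 < len(route) and tj < T:
            tj += edge_time(route[j], route[j + 1])
            j += 1
            if dist[start][route[j]] < dist[start][route[best]]:
                best = j
        i = best
        cuts.append(i)
    pieces = [route[a:b + 1] for a, b in zip(cuts, cuts[1:])]
    pieces.append(route[cuts[-1]:])    # the last car takes all the rest
    return pieces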

Of course, the reason why we cut when each car gets close to the starting vertex is to minimize the effort spent by the next car to reach the position where it can resume the route. The solution will still give too much work to the last car, but we will see later how to balance the duration of the routes of all cars. The reason why we ensured that each car ran for at least one quarter of the time is to make this balancing operation easier, by ensuring that the car paths are sufficiently long to intersect many times.

We use this as our initial solution, and now optimize it further.

Optimizations

We now iterate a bunch of optimizations to tweak the paths and make them better, until we get something which we judge is good enough.

Of course, not all optimizations yield the same savings; and the order in which they are applied can make a large difference. That said, I don't remember in which order, or how many times, they were applied; there probably wasn't much logic to it. :)

Pruning

For any subpath of the path of any car that consists only of otherwise visited edges, we replace it by a shortest path between its endpoints (courtesy of the precomputed Floyd-Warshall).
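
In code, this could look like the following sketch, where covered_elsewhere is the set of directed edges traversed at some other point of the solution, and shortest_path is a hypothetical helper returning the vertex list of a shortest path (reconstructed, for instance, from a predecessor table kept alongside the Floyd-Warshall distances):

def prune(path, covered_elsewhere, shortest_path):
    # replace every maximal run of otherwise-visited edges by a shortest path
    out = [path[0]]
    i = 0
    while i + 1 < len(path):
        if (path[i], path[i + 1]) in covered_elsewhere:
            j = i
            while j + 1 < len(path) and (path[j], path[j + 1]) in covered_elsewhere:
                j += 1
            out.extend(shortest_path(path[i], path[j])[1:])
            i = j
        else:
            out.append(path[i + 1])
            i += 1
    return out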

Loop swapping

Assume some car's path includes a move from vertex i to vertex j, then a cycle C, then a move from vertex j to vertex i; and that the edge from i to j is otherwise visited. If we can find any other car that passes through an intersection of C, we can have that other car take care of the edges of C on the way, so that the first car doesn't need to do C and doesn't need to do the steps from i to j and back (so we save twice the time of the edge between i and j).

Of course, this can lead to a situation where some cars have a longer path than others, but this will be dealt with later.

Trimming

If the last edge of a car's path is already visited by someone else, just remove it.

Brute-force

For every subpath of every car's path, up to a certain length, if the subpath can be replaced by a shorter path that leaves no street unvisited, then do so. This differs from the pruning optimization in that the subpaths are allowed to contain edges that are not otherwise visited; of course, the replacement subpath will also need to traverse these edges (possibly in a different order).

Balancing

The time limit of 15 hours has been ignored until now, but it has to be obeyed by all cars, so we shouldn't return a solution where one car does all the work and the others do not do much. Further, in case we do manage to traverse all edges (which we did!), the quality of a solution is judged according to its remaining time, which is T − max_{c ∈ {1,…,8}} t_c, where T is the time limit and t_c the time of car c.

Hence, we would ideally want all the cars to follow routes that take the same time. This implies that the busiest cars should try to swap longer parts of their paths with shorter parts of less busy cars.

In practice, look for any two vertices x and y and two cars such that both cars will pass through x before passing through y. Now, in general the paths followed by the two cars between x and y will take a different amount of time; if swapping the paths will decrease the time difference between the two cars' routes, then swap them.

This is a simple greedy scheme that can be iterated as long as we want. In fact, as the paths of cars will tend to intersect very often, it works very well and quickly converges to a state where all cars take the same time to do their routes, no matter the initial routes.
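
A naive sketch of a single balancing pass (quadratic and unoptimized, just to fix ideas; paths is the list of the 8 vertex sequences and edge_time gives the duration of each edge):

def balance_once(paths, edge_time):
    # if two cars both pass through x and then y, and exchanging their x..y
    # segments reduces the gap between their total times, exchange them
    def total(p):
        return sum(edge_time(u, v) for u, v in zip(p, p[1:]))
    for a in range(len(paths)):
        for b in range(a + 1, len(paths)):
            pa, pb = paths[a], paths[b]
            for ia, x in enumerate(pa):
                if x not in pb:
                    continue
                ib = pb.index(x)
                for ja in range(ia + 1, len(pa)):
                    y = pa[ja]
                    if y not in pb[ib:]:
                        continue
                    jb = ib + pb[ib:].index(y)
                    na = pa[:ia] + pb[ib:jb + 1] + pa[ja + 1:]
                    nb = pb[:ib] + pa[ia:ja + 1] + pb[jb + 1:]
                    if abs(total(na) - total(nb)) < abs(total(pa) - total(pb)):
                        paths[a], paths[b] = na, nb
                        return True        # one improvement applied; call again
    return False                           # no improving swap was found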

Results

The clever strategy described here managed to traverse all streets of Paris in the allotted time (in fact, in 9 minutes less than the allotted time). It's a pity that this didn't happen during the contest proper, but still, it is nice. :)

Related work

We developed all of this on our own, and we didn't look seriously into the related work of people who already tried to solve similar problems. It turns out that our problem is known as the postman problem, because of course it is also useful to optimize mail delivery.

More specifically, our version is known as the "k-mixed postman problem", because we have multiple cars (here, k=8), and because the graph is mixed: some of the edges are directed while others are undirected. Surprisingly, the directed postman problem and the undirected postman problem (for k=1) can be optimally solved in polynomial time [4], whereas the mixed postman problem is NP-hard [5].

It would be nice to compare the results of our method to state-of-the-art approaches. We did a preliminary comparison on existing datasets, and we have no reason to believe that our algorithms are competitive. But, still, we had fun. :)

Credits

The ideas in this post are not mine: essentially everything was done by other ENS students. All steps of the clever solution except for the post-processing were conceived and implemented by gnomnain (code). The optimizations were implemented by elarnon (code), and Mc and mehdi (code). (All code is purely write-only and there is no documentation of any kind, so don't try to use it.)

This post was re-written from an earlier version authored by Mc. Thanks to Jill-Jênn, elarnon, Armavica and gprano for proofreading drafts and offering feedback. There is another write-up about the topic, in French, by Jill-Jênn.


  1. The input also included the geographic coordinates of the intersections, but we won't be using this. That said, to anticipate a bit on what follows, some teams had nice results with the simple greedy heuristic of taking the leftmost non-visited street at each intersection, to make spiral-like itineraries. 

  2. Technically, this is the criterion for the existence of a cycle, not a path, but as we would have to force the path at the starting point anyway, it is simpler to look for a cycle, which is just a special case of path. 

  3. It is clear that both parts have the same size, because the total of in-degrees is the same as the total of out-degrees: it is exactly the number of edges in the directed graph of the previous phase. 

  4. J. Edmonds, E.L. Johnson. Matching Euler tours and the Chinese postman, Mathematical Programming 5 (1973) 88–124. 

  5. C.H. Papadimitriou. On the complexity of edge traversing, Journal of the ACM 23 (1976) 544–554. 

Compiling and installing a recent version of Mesa on Debian

This post is about how (and why) I recently produced my own Debian packages for a recent upstream release of Mesa to work around a bug on my own machine. Though the specifics are not particularly interesting, you may extrapolate my story to a more general guide of what to do when you need to run an upstream software release that's too recent to be packaged by Debian, and too complex to be easy to install by hand with make install.

I recently tried to figure out the source of weird crashes on my machine, which runs Debian testing. I understood from /var/log/syslog that I was probably affected by Bug #771045. From the linked discussions it seemed that a patch in Mesa had a good chance of fixing the problem. The patch was part of version 10.3.4 of Mesa, but Debian testing only had 10.3.2-1, judging from the output of:

dpkg-query -l | grep -i mesa

So, if I wanted the patch, I needed to manually install a newer version of Mesa.

A first thing to check was whether it was not packaged in Debian unstable. I have the unstable repository set up and pinned at a lower priority with the following in /etc/apt/preferences.d/prefs:

Package: *
Pin: release a=testing
Pin-Priority: 650

Package: *
Pin: release a=unstable
Pin-Priority: 600

I use this to manually install a recent version of iceweasel from unstable. Could it be the case that a recent Mesa was packaged in unstable? Alas, no, checking on a random mesa package among those at version 10.3.2-1:

apt-cache policy libegl1-mesa

libegl1-mesa:   
  Installed: 10.3.2-1
  Candidate: 10.3.2-1
  Version table:
 *** 10.3.2-1 0
        650 http://debian.proxad.net/debian/ testing/main amd64 Packages
        600 http://cdn.debian.net/debian/ unstable/main amd64 Packages
        100 /var/lib/dpkg/status

So here is how I manually compiled a more recent version. I was surprised to see how simple it turned out to be, in fact: precious little manual intervention is required.

Of course, a word of warning applies: make sure you will know how to fix your system if it gets broken by the home-made packages that you installed or by anything else you did.

I'm assuming that your machine's architecture is amd64, otherwise what I say later about compiling for i386 may make no sense.

You may need to start by installing some packages such as build-essential if you don't have them. Then, issue:

sudo apt-get build-dep libegl1-mesa

This will install [1] the packages needed to compile Mesa.

Then, go to a fresh folder and download the sources for Mesa:

mkdir -p ~/mesa
cd ~/mesa
apt-get source libegl1-mesa

The source for the packaged version (10.3.2-1) should be downloaded. You could then go in the source tree and run uscan from the devscripts package:

cd mesa-10.3.2
uscan

This would automatically search for a newer release on the Mesa website and download it. However, it would pick the latest version (10.4.1 as of this writing), for which the packaging wouldn't work as is. So instead, download Mesa 10.3.4, the version that we want, directly from the official website, and symlink it to match the naming conventions used by the Debian tools.

cd ~/mesa
wget ftp://ftp.freedesktop.org/pub/mesa/older-versions/10.x/10.3.4/MesaLib-10.3.4.tar.gz
ln -s MesaLib-10.3.4.tar.gz mesa_10.3.4.orig.tar.gz

Now we can go back in the source tree for the old version, and use uupdate to unpack the new version and migrate the Debian-specific stuff. Then go to the new source tree thus created:

cd mesa-10.3.2
uupdate ../mesa_10.3.4.orig.tar.gz
cd ../mesa-10.3.4

Now we are ready to start the compilation process:

debuild -us -uc

This should compile everything and take 10-30 minutes. If a problem occurs, you should have copious output to try to understand what went wrong. Otherwise, if all goes well, all the deb packages for the new version will have been generated in the parent folder. That was easy! [2] :)

That's all nice and well, but then I realized that I needed the i386 version of the packages, because I use Skype, whose "multiarch" package is actually i386. How to cross-compile the packages? In fact, it's quite easy. We will use pbuilder to create a chroot where the compilation can take place, but everything is automatic:

sudo apt-get install pbuilder
sudo pbuilder --create --architecture i386

Now go to the source tree and issue [3]:

sudo pbuilder --build ../mesa_10.3.4-1.dsc

This should take care of everything. The packages will be produced in /var/cache/pbuilder/result.

Now you just have to figure out which packages you need to install (using dpkg-query -l) and install them with sudo dpkg -i, then correct any dependency problems with sudo apt-get -f install. Alternatively, you can just issue sudo debi -u from the source tree, which should do those steps automatically. However, I haven't tested it. Of course to install the i386 packages in this way you will need to move them, or let debi know where to find them.

Note that to avoid problems it looks like it is safer to install all packages at once. If you run into file conflicts between the i386 and amd64 packages, a solution may be to apt-get remove all the old versions and then install the new ones. Of course, this will probably remove a bunch of useful packages on your system that depend on Mesa; if you do this, you should make sure that you have kept track of the packages that you will want to reinstall afterwards, and also that you are confident using a tty or ssh if your graphical environment breaks.

And, well, that's it. If you have installed the packages successfully, you can just reboot. Then you can check in dpkg-query that you are indeed using the newer version of the packages, and double-check using:

glxinfo | grep -i version

Of course, if something went bad, maybe your system will not boot (or at least not boot graphically) and then it will be up to you to investigate. Hopefully removing the new packages and installing back the Debian ones should make things work again. :)

One last thing: hopefully, as the version numbers of your home-made Mesa packages correspond to the actual version number of the release, apt-get upgrade should move you to the more recent versions as soon as they start getting packaged by Debian.


  1. Note that the packages installed by apt-get build-dep will be marked as "manually installed" and clutter the output of apt-mark showmanual. If you care, you may want to save it before proceeding. 

  2. If you're wondering how this automated process could take care of things like updating the changelog, have a look at debian/changelog in the source folder. A dummy entry will have been generated. 

  3. If you don't have the file ../mesa_10.3.4-1.dsc, it should have been produced when compiling for amd64 before. 

A local copy of Wikipedia with Kiwix

— updated

I am a fond user of Wikipedia, so I was interested to find out how to use it when I have no Internet connection. On my mobile phone, I already use the free and open-source Aard Dictionary, but for space reasons I only have the French Wikipedia, without images, and with some formatting glitches. On my computers, where I have more space and where it is easier to set things up, I wanted to have more, so that I can have a usable Wikipedia when I'm on the train, in a plane, or in a place with a crappy Internet connection.

This blog post explains how to set up a local Wikipedia mirror using Kiwix. Start by downloading Kiwix and unpacking it. Kiwix ships with a special GUI to browse Wikipedia offline, but I prefer to use my usual Web browser. Fortunately, Kiwix also includes the kiwix-serve program that can serve dumps as a regular Web server.

Next, download the dumps for the projects that you want. Ideally I would be interested in generating my own dumps, but I haven't looked into this yet. I used Wikipedia and Wiktionary in French and English, totalling about 60 GB.

Next, to be able to perform full-text search, you must index the dumps. Maybe the pre-indexed dumps could be used, but I haven't tried; I indexed them manually instead. For each file a.zim, I did kiwix-index -v a.zim a.zim.idx, where kiwix-index is provided by Kiwix. The process takes a lot of time (10-30 hours or so) but does not require any interaction. The indexes take another 30 GB.

To serve all dumps with kiwix-serve, you need to build a library. First, move the .zim and the .zim.idx files (or symlink them) to have shorter names; this will make the URLs shorter afterwards. I use wen.zim, wfr.zim, wtfr.zim and wten.zim. Now run, for each file a.zim in the working directory: kiwix-manage `pwd`/wiki.xml add `pwd`/a.zim --indexPath=`pwd`/a.zim.idx, where `pwd`/wiki.xml is the name of the library file that will be created. It is safer to use absolute paths throughout.

This should have created the library file wiki.xml. Now, to start the server, choose a port number (say 4242) and run kiwix-serve --port=4242 --library /where/you/put/wiki.xml. Test it by browsing to http://localhost:4242 and checking that it works. If it does, you probably want to arrange for this command to be run at startup.

Please note that Kiwix will also be available from other machines, not just localhost. I couldn't find a way to change this behavior. For now, I use iptables to filter incoming connections from other hosts:

sudo iptables -A INPUT -p tcp --dport 4242 -s 127.0.0.0/8 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 4242 -j REJECT
sudo ip6tables -A INPUT -p tcp --dport 4242 -s ::1 -j ACCEPT
sudo ip6tables -A INPUT -p tcp --dport 4242 -j REJECT
sudo iptables-save | sudo tee /etc/iptables/rules.v4
sudo ip6tables-save | sudo tee /etc/iptables/rules.v6

The last step, if you use Firefox, is to use Smart Keywords to be able to reach your local dump efficiently. To do this, for each of the dumps that you have on localhost:4242, right-click on the search text field and add a keyword for it. As I use Firefox Sync, those bookmarks are synchronized across my different machines.

These smart keywords work for full-text search in the dumps. If you want to have bookmarks that directly reach articles and will never perform full-text search (because it is faster), you can edit the bookmarks in Firefox to set "Location" to, e.g., "http://localhost:23552/wen/A/%s.html". However, this technique does not work with all dumps (it depends on the URL structure), and in this case you must remember that the first letter of the search term must be uppercase.

In terms of formatting, the Kiwix dumps are fairly OK. There are occasional glitches (e.g., with <span>) but not many of them, and certainly a lot fewer than with Aard Dictionary. Equations are supported, and images are there if you pick the right dumps. Most templates, formatting, etc., are fine. The HTML interface added by kiwix-serve is not perfect but it's mostly unobtrusive.

Managing passwords with pass

I have recently migrated my passwords to pass and so far I've been really happy about it. The previous system was to have them all in a huge text file, which wasn't especially convenient or secure [1], and wasn't shared between my various machines. Here is some info about pass.

pass has been packaged for Debian since Jessie, so installing it is as simple as sudo apt-get install pass. However, it's just a shell script of a bit over 600 lines, so it is really easy to review, and to install manually if you need to.

The way pass manages passwords is dead simple: a hierarchy of gpg-encrypted files. The assumption is that each file corresponds to a website, or machine, or other authentication realm, and contains the password. The use of gpg provides a layer of security, so that your gpg key and passphrase serve as a master password. Of course, it is nice to have a properly configured gpg-agent(1) to avoid having to enter the passphrase multiple times.

The basic commands of pass are pass init KEYID which sets up the store for gpg key KEYID (by default in ~/.password-store), pass FILE which decrypts and shows FILE, and pass edit FILE, which decrypts FILE to a secure temporary location in /dev/shm, edits it, and encrypts it back. You can also use pass ls (which shows a nice output using tree), pass find to search for files using find, pass grep to search in the decrypted password files using grep, and pass rm, pass mv, pass cp. Of course, you can also mess around in the password store by hand.

As pass has this very nice CLI interface, migrating my passwords from my custom system was very easy, although it seems like the Debian package also installs a bunch of scripts to migrate from other password managers.

Beyond the generic commands I have presented, pass obviously offers commands tailored for password management. You have pass insert FILE which creates FILE with the password you provide (and turns off echo and makes you enter it twice for confirmation). You have pass -c FILE which copies the password in FILE to the clipboard, so you can input the password where you need it, and automatically clears it after 45 seconds (which is a reasonable thing to do). You have pass generate FILE LENGTH which generates a password of LENGTH chars in FILE and displays it (or copies it to the clipboard with -c); what is very nice is that pass itself does not include password generation logic, but entrusts pwgen(1) with the task.

Icing on the cake: pass is designed to be used with git, and provides pass git to call git commands. If you use git, all the pass commands will automatically git commit what's needed. This makes it very easy to share passwords between different machines. Of course, as the files are encrypted, git cannot be expected to solve conflicts within files, but it can nicely merge changes across various files. You can also use this setup to share passwords between different people, as pass supports encrypting for multiple keys.

For once, I find it hard to find something to dislike about pass. Eventually I may want to tweak password generation so that it generates passwords the way I'm used to, but this would be easy to do. I'm also missing support for usernames, as I use different usernames on different websites; but pass allows you to store anything in the password file (and only the first line is taken into account for pass -c and others), so I can just add the username as the second line if needed. The only catch is that I will have to retrieve it by hand, or script something that does what I want. Other than that, I'm very happy to have a convenient, lightweight, and secure way to manage my passwords and share them across machines using git.


  1. My home partition is encrypted, but there was no security whatsoever if the machine was ever compromised. 

Managing installed packages in Debian

This post is about how I manage the apt packages that are installed on my Debian systems.

Left to myself, I tend to apt-get install various things every now and then: to try them out, or because I temporarily need them. In most cases it turns out I never really use them, but of course I will never remember to clean them up. Eventually my / partition fills up and I have to waste time tracking down which packages are useless. (Of course, accumulating useless packages is also a bad idea in terms of security, performance, etc.)

If I want to back up my current selection of packages, I could dump the output of apt-mark showmanual somewhere, but that's not really satisfactory; the list of packages that I use should be stored as a first-class citizen in my config, not just obtained from the current system state.

If I need to set up a new machine, I can install all the packages that were installed on the previous one, but this will end up downloading gigs of packages including lots that I don't really care about. Of course, as soon as I start using several different machines, it is necessary to install on all machines the new packages that I need, so the installed packages must be kept in sync somehow. Otherwise, I have to waste time watching apt-get installing stuff every time I run a command on a host and realize that a package is missing. (However, remember that most installed packages are not really important and probably don't need to be synchronized at all.) All of this is made worse by the fact that not all machines are equal (I don't want to install graphical stuff on servers, laptop tools should only go on laptops, etc.).

My current answer to this is to maintain in my public configuration repository a bunch of program lists for various kinds of uses. This list is synchronized between machines using git, as is the rest of my configuration. It contains only the packages that I really have some use for, rather than all the random crap I have ever installed and don't remember the point of [1]. (This being said, I don't care about it being minimalistic, and I'm OK with including large tools that I use rarely, as long as there are reasonable odds I will use them again.)

I have a private file that lists, for my various hosts, which program selections it needs from that list, for instance:

my-laptop-1 minimal laptop server util
my-server minimal server

New packages that I install don't go to this list by default, but my crontab on each host mails me every month the diff between the currently manually selected packages on that host and the ones which should be installed according to the host's list. Here is the script: check-packages.sh.
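
For the curious, here is a rough idea of what such a check can do (a Python sketch written for this post, not the actual check-packages.sh; the hosts.txt and pkg-lists names are made up):

#!/usr/bin/env python3
# Sketch: compare apt-mark showmanual against the packages expected on this host.
import socket
import subprocess

def expected_packages(host, hosts_file="hosts.txt", lists_dir="pkg-lists"):
    # hosts_file lines look like: "my-laptop-1 minimal laptop server util"
    with open(hosts_file) as f:
        for line in f:
            name, *selections = line.split()
            if name == host:
                pkgs = set()
                for sel in selections:
                    with open(f"{lists_dir}/{sel}") as g:
                        pkgs.update(g.read().split())
                return pkgs
    raise SystemExit(f"no package selections defined for host {host}")

manual = set(subprocess.run(["apt-mark", "showmanual"], capture_output=True,
                            text=True, check=True).stdout.split())
wanted = expected_packages(socket.gethostname())
print("installed but not in any list:", *sorted(manual - wanted))
print("in the lists but not installed:", *sorted(wanted - manual))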

From this monthly report, I can then update the list by removing the new installed packages that I don't need after all, putting the ones that I do need in the right selection, and installing the ones which should be installed. (And, of course, postponing the ones I haven't yet made up my mind about.)

The system is of course fairly rudimentary: there is no dependency between package selections, the dependencies of every piece of software that I compile myself have to be listed in a separate file, the sync process is manual, etc. Yet I am happy that I have, at last, one central list of the packages that I consider useful to have on my systems; as a bonus I can even share it with the world.


  1. Of course, the hard part was to clean up the currently installed packages so as to come up with this list in the first place.