a3nm's blog

What's wrong with academia?

I have just finished writing up a long document that tries to give a comprehensive list of problems affecting academic research.

Writing this document is something that I had been meaning to do for a very long time, almost since I got started in academia in 2012 with my master's internship. Many academic practices already made no sense to me at the time, e.g., hiding research articles behind paywalls rather than simply hosting them online. I tried to ignore these concerns for a while, and did my PhD without questioning the order of things too much: for some practices I eventually saw a justification, but for many others I did not, and they made me more and more uneasy. So I thought that I should eventually come back to these problems, to re-examine my beliefs about the way academia works. Hence this long list of all the problems that annoy me, which I will try to keep up-to-date as time passes and my experience evolves.

Of course, complaining is always easy, so I have also tried to give some thought in the document about ways to fix these problems. The document does not mention my own initiatives in this direction (e.g., refusing to review for closed-access venues), which I will eventually write up separately.

So I encourage you to have a look at the document, What's wrong with academia?, and share with me any feedback that you may have!

Finding the members of the theoretical database community with DBLP


The DBLP service is a great bibliographical tool for computer science research. In this post, I explain how to use it to prepare the list of members of a research community. I will be using the theoretical database community, whose two conferences are PODS and ICDT.

The list of publications for one edition of a conference can be found on DBLP as XML, e.g., for ICDT'18. It is then easy to use xmlstarlet to find the list of people who have published at that conference:

curl -s 'https://dblp.uni-trier.de/db/conf/icdt/icdt2018.xml' |
  xmlstarlet sel -T -t -m "//inproceedings/author" -m . -c '.' -n |
  sort | uniq
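For readers who prefer to do this step without xmlstarlet, here is a minimal Python sketch of the same selection. The sample XML is a made-up fragment with invented author names; only the inproceedings/author nesting is taken from the command above, and the real DBLP pages contain much more.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abridged stand-in for a DBLP proceedings page (invented names).
SAMPLE = """<proceedings-page>
  <inproceedings key="example/1">
    <author>Alice Example</author>
    <author>Bob Example</author>
    <title>First paper</title>
  </inproceedings>
  <inproceedings key="example/2">
    <author>Bob Example</author>
    <title>Second paper</title>
  </inproceedings>
</proceedings-page>"""


def authors(xml_text):
    """Return the sorted, de-duplicated author names (like `sort | uniq`)."""
    root = ET.fromstring(xml_text)
    # Same selection as the XPath //inproceedings/author used above.
    names = {"".join(a.itertext()).strip()
             for a in root.findall(".//inproceedings/author")}
    return sorted(names)


print(authors(SAMPLE))
```

To run it on real data, the XML text would be downloaded first (e.g., with curl as above) and passed to authors().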

For each person in the list, we can obtain detailed XML information, including their homepage, ORCID, etc., using the DBLP API again. (This also gives us a canonical form for the name, which may appear in different ways in various inproceedings entries.) This is just a bit more complicated than it should be, because of a limitation of the DBLP search API: when queried with a name, the API sometimes inexplicably favors non-exact matches even when an exact match exists. So we must filter the matches ourselves to use an exact match if one exists, and a non-exact match otherwise. Of course, independently of this problem, you may be getting the wrong author, in particular because of homonyms, so these results should be taken with a grain of salt.

NAME="Antoine Amarilli"
# Percent-encode spaces for the query string (other special characters are not handled)
ENAME=$(echo "$NAME" | sed 's/ /%20/g')
curl -s "https://dblp.org/search/author/api?h=1000&q=$ENAME" > matches.xml
# Prefer a hit whose author name matches exactly
URL=$(xmlstarlet sel -T -t -m "/result/hits/hit/info[author='$NAME']" \
    -c url -n < matches.xml | head -1)
# Otherwise, fall back to the first hit
if [[ -z "$URL" ]]
then
  URL=$(xmlstarlet sel -T -t -m /result/hits/hit/info/url \
      -c . -n < matches.xml | head -1)
fi
curl -L "${URL}.xml"
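The "exact match first, otherwise first hit" logic can also be sketched in Python. The response below is an invented sample that only mirrors the /result/hits/hit/info structure used in the XPath expressions above; the example.org URLs and the misspelled name are placeholders, not real DBLP data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Invented sample response: the non-exact hit deliberately comes first.
SAMPLE = """<result><hits>
  <hit><info><author>Antoine Amarillo</author><url>https://example.org/pid/1</url></info></hit>
  <hit><info><author>Antoine Amarilli</author><url>https://example.org/pid/2</url></info></hit>
</hits></result>"""


def search_url(name):
    """Build the search URL, percent-encoding the whole name (not only spaces)."""
    return "https://dblp.org/search/author/api?h=1000&q=" + quote(name)


def pick_author_url(xml_text, name):
    """Return the URL of an exact match if there is one, else of the first hit."""
    root = ET.fromstring(xml_text)  # the root element is <result>
    infos = root.findall("./hits/hit/info")
    exact = [info for info in infos if info.findtext("author") == name]
    chosen = exact or infos
    return chosen[0].findtext("url") if chosen else None


print(search_url("Antoine Amarilli"))
print(pick_author_url(SAMPLE, "Antoine Amarilli"))
```

Using quote() instead of a sed substitution also covers names with accents or other special characters, which the shell version does not handle.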

From there, we can prepare a list of community members. Of course, any criterion for inclusion is completely arbitrary... My criterion to get a list of "active community members" is to select those who have published in three different years, with at least one publication in 2015 or later. This gives:

Click to see the list...
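As a rough sketch, the "active" criterion can be written as a small Python filter. It assumes that the publication years of each author have already been collected into a dictionary (the names and years below are invented), and it reads "three different years" as "at least three".

```python
def active_members(years_by_author, min_years=3, recent_since=2015):
    """Authors with >= min_years distinct publication years, one of them recent."""
    return sorted(
        name
        for name, years in years_by_author.items()
        if years
        and len(set(years)) >= min_years
        and max(years) >= recent_since
    )


# Invented data: author name -> years of their PODS/ICDT publications.
SAMPLE_YEARS = {
    "Alice Example": [2010, 2012, 2016],   # 3 distinct years, recent: kept
    "Bob Example": [2016, 2016, 2017],     # only 2 distinct years: dropped
    "Carol Example": [2001, 2003, 2005],   # nothing since 2015: dropped
}

print(active_members(SAMPLE_YEARS))
```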

Another inclusion criterion, for a "historical" list, would be to select people who are not necessarily still active but have published over a long period, say, in 10 different (not necessarily contiguous) years. Here is the resulting list, sorted by the year in which the person last published in ICDT or PODS.

Click to see the list...
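The "historical" criterion is a similar filter followed by a sort. In this sketch the input dictionary is again invented, and the ascending sort order is an assumption, since the text only says that the list is sorted by last publication year.

```python
def historical_members(years_by_author, min_years=10):
    """Authors with >= min_years distinct years, sorted by last publication year."""
    rows = [
        (max(years), name)
        for name, years in years_by_author.items()
        if len(set(years)) >= min_years
    ]
    # Ascending by last year (assumption), ties broken by name.
    return [name for _last_year, name in sorted(rows)]


# Invented data: author name -> years of their PODS/ICDT publications.
SAMPLE_YEARS = {
    "Dana Example": list(range(1990, 2000)),     # 10 contiguous years, last 1999
    "Eli Example": list(range(1995, 2015, 2)),   # 10 non-contiguous years, last 2013
    "Fay Example": [2015, 2016],                 # too few years
}

print(historical_members(SAMPLE_YEARS))
```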

Another kind of statistic that can be computed in this way is the list of "neighboring" conferences, i.e., the other conferences where members of the community have published. Here is the list of the top neighboring conferences of PODS and ICDT, sorted by the number of active community members who have published at least once there since 2015 (with hyperlinks and descriptions added manually):

  • 38: SIGMOD Conference, the practical database conference held jointly with PODS
  • 34: AMW, the database theory workshop held in honor of Alberto O. Mendelzon (whom you may remember from the previous list)
  • 29: IJCAI, an AI conference
  • 28: ICALP, a theoretical CS conference on logics and automata
  • 26: LICS, another theoretical CS conference about logics
  • 20: SODA, a theoretical CS conference on algorithms
  • 19: AAAI, another AI conference
  • 19: EDBT, the practical database conference held jointly with ICDT
  • 18: WWW, a conference about the World Wide Web
  • 15: Description Logics, the workshop on description logics
  • 15: ICDE, a practical data management conference
  • 13: CIKM, an information and knowledge management conference
  • 12: SEBD, the Italian conference on databases
  • 12: STOC, a general-purpose theoretical computer science conference
  • 11: KR, a conference on knowledge representation and reasoning
  • 11: FOCS, another general-purpose theoretical computer science conference
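The counting behind such a list can be sketched as follows. The sketch assumes that, for each active member, we already have a list of (venue, year) pairs extracted from their DBLP page; the member names and publications are invented. Each member counts at most once per venue, as in the list above.

```python
from collections import Counter


def neighbor_venues(pubs_by_member, own=("PODS", "ICDT"), since=2015):
    """Count, per venue, the members who published there at least once since `since`."""
    counts = Counter()
    for pubs in pubs_by_member.values():
        # A set, so that several papers at one venue count only once per member.
        venues = {venue for venue, year in pubs
                  if year >= since and venue not in own}
        counts.update(venues)
    return counts.most_common()


# Invented data: member -> list of (venue, year) pairs.
SAMPLE_PUBS = {
    "Alice Example": [("SIGMOD Conference", 2016), ("SIGMOD Conference", 2017),
                      ("LICS", 2014)],
    "Bob Example": [("SIGMOD Conference", 2018), ("LICS", 2016)],
    "Carol Example": [("PODS", 2017)],
}

print(neighbor_venues(SAMPLE_PUBS))
```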

It would be interesting to visualize this data differently, e.g., visualize a world map with the community members, but sadly the affiliation information in DBLP is too sparse for this to work.

Indexing encrypted email with notmuch


Since version 0.26, the mail indexing tool that I use, notmuch, makes it easy to index encrypted mail.

The original behavior was that notmuch did not index the contents of encrypted emails, as they were encrypted and it couldn't access them. This meant that you couldn't search inside encrypted emails (except for headers, e.g., the subject, recipient, etc.).

Now, notmuch is able to use gpg (and gpg-agent) to read and index the cleartext of encrypted emails. Of course, this means that notmuch's index can now be used to reconstruct encrypted emails; in particular, as notmuch stores the session keys for messages in its index, this means that any attacker who can access the index can decrypt the messages [1]. For my use case, I think that this security risk is acceptable: I essentially see GPG as a tool to ensure that messages are not altered between the sender and recipient, my notmuch index is stored on an encrypted partition anyway, and my GPG passphrase is usually cached by gpg-agent so an attacker who has control over my machine would be able to access the plaintext of encrypted messages quite easily.

So, if you also use notmuch, if you also have your passphrase cached by gpg-agent at least part of the time, and if you want notmuch to index the cleartext of your encrypted emails, here is what you should do. First, you should make sure that you have notmuch 0.26 or a more recent version. Second, you should tell notmuch that you want it to index the cleartext of encrypted email:

notmuch config set index.decrypt true

Beware, this configuration flag lives only in the database, not in the config file; hence, e.g., it will not be synchronized across multiple machines if you synchronize your config files from one machine to another.

Then you should reindex all encrypted email that notmuch knows about but hasn't indexed yet (this took around 15 mins in my case):

notmuch reindex tag:encrypted and not property:index.decryption=success

Of course, you will be prompted for your GPG passphrase if it isn't cached (and also possibly for the passphrase of other keys that you have used in the past). Once this has completed, you should check the encrypted messages that notmuch was still unable to index:

notmuch search tag:encrypted and not property:index.decryption=success

In my case, there were only a few that couldn't be indexed -- and usually it was because they hadn't been encrypted for my key because the sender had made some mistake.

From now on, notmuch new should automatically index the cleartext of incoming messages when your GPG passphrase is cached by gpg-agent. The last step is the following: if your passphrase is not cached all the time, then you should arrange for the notmuch reindex command above to be executed regularly, so that encrypted messages will eventually be indexed.

The setup described in this post led to unpleasant side effects where GPG invocations would hang, probably because notmuch tried to ask for a passphrase. To avoid this, I had to ensure that the notmuch reindex command, when run regularly, never tried to ask for a passphrase if it wasn't currently cached by the agent. I did this by setting PINENTRY_USER_DATA=none and modifying my custom pinentry script to handle this value properly. (Of course, this means that encrypted messages will not be correctly indexed when the GPG agent hasn't cached the passphrase, but the hope is that they will eventually be indexed.)

Another problem that I had is that notmuch reindex would waste CPU time by trying again, on each run, to reindex the emails on which it had previously failed. To avoid this, I manually reviewed the mails that couldn't be indexed, tagged them with a special tag, and then excluded mails with that tag from the notmuch reindex command. I also added a crontab entry to periodically review the emails where indexing failed, so that I can tag them appropriately if the failure is expected. A more elaborate idea would be to exclude from the notmuch reindex command the emails that are too old; or maybe to script things so that, when all GPG keys are available in the agent but notmuch cannot index a message, the message is tagged so that indexing is not attempted again.
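Here is a minimal sketch of what such a periodic job could look like as a Python wrapper. The tag name decrypt-failed is hypothetical (pick your own), and setting PINENTRY_USER_DATA=none only has an effect if your pinentry script is written to honor it, as described above.

```python
import os
import subprocess

SKIP_TAG = "decrypt-failed"  # hypothetical name for the "known failure" tag


def reindex_query(skip_tag=SKIP_TAG):
    """Search terms: encrypted, not yet decrypted, and not tagged as a known failure."""
    return ["tag:encrypted",
            "and", "not", "property:index.decryption=success",
            "and", "not", "tag:" + skip_tag]


def reindex_env(base=None):
    """Copy of the environment asking the (custom) pinentry never to prompt."""
    env = dict(os.environ if base is None else base)
    env["PINENTRY_USER_DATA"] = "none"
    return env


def run_reindex():
    """Run notmuch reindex; meant to be called from cron, not at import time."""
    command = ["notmuch", "reindex"] + reindex_query()
    return subprocess.run(command, env=reindex_env(), check=False).returncode


print(" ".join(["notmuch", "reindex"] + reindex_query()))
```

The script only prints the command that it would run; calling run_reindex() from a cron job performs the actual reindexing.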

  1. In fact, the historical workaround to index encrypted email with notmuch was simply to arrange for it to be decrypted when it arrives. I would also be OK with the security implications of this, but I have never set it up, because it's complicated to do right, especially because my GPG passphrase isn't always available in gpg-agent's cache. Besides, I prefer to keep an original copy of the email that I receive, so I think it's cleaner to keep the encrypted messages as-is and have notmuch store in its index the additional information that it needs.

Migrating from cgit to stagit

I serve my git repositories over HTTP for people who want to browse them without having to clone them. I used to do this with cgit, which is a server-side dynamic solution written in C. It worked nicely, but lately some bots have been busy crawling these git repositories, and I regularly ran into trouble where the cgit.cgi processes ended up in a busy loop, eating 100% of CPU for unclear reasons. More generally, I had always been anxious about using a dynamic solution to serve these repositories: all the rest of my website is static, which I think is more elegant and more reassuring in terms of security.

The natural approach would be to turn cgit into a static solution by precompiling all pages whenever a git repository is updated. However, this is not reasonable: cgit allows you, e.g., to see the status of every file at every commit, or to diff any pair of commits, which would be too expensive to precompute. These features are not very useful, so I considered precompiling anyway and tweaking cgit's output to suppress the useless parts; but this would have been tedious.

Fortunately, there is a better way: the stagit tool is a minimalistic variant of cgit, also written in C, which is designed to be static. So I have just removed cgit from my server and installed stagit instead. Obviously it's too early for me to say whether stagit is a perfect solution, but I'm happy with what I have seen so far. Here are some quick and messy notes about how I did it and what surprised me, in case you are considering doing the same.

Stagit is not packaged for Debian yet, but it's easy to compile and install (and the source code is rather short if you want to hack on it). You will need libgit2-dev, which is packaged by Debian. I edited the source a bit to suit my needs; cf. my local fork: I changed the HTML a bit, fixed the CSS to work better on mobile displays, renamed some files, etc. It's a bit ugly to have HTML boilerplate hardcoded in the C code, but it works, and if it starts misbehaving it will be easier for me to investigate.

Stagit provides one command stagit to generate the HTML for a repository, and one command stagit-index to generate an index of the various repositories. The README is rather clear (you can also look at the manpages in the repo). Of course, you need to re-run stagit whenever a git repository is updated, so you'll need a post-receive hook like the one they provide, which I adapted to my needs. One concern is that running stagit is synchronous, i.e., when doing a git push, you must wait for stagit to complete. However, it seems to run instantly on my repositories, so that's no big deal.

To get a nice index of the repositories, you need to edit two files in each git repository: description, which should contain a description, and url, which should contain the clone URL. There is also support for an owner field, but I removed this from the generated HTML as I'm the owner of all the repos I host. As the setup of a new git repository had become a bit tedious, I wrote a script for that, too.
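The metadata part of such a setup script could look like the sketch below. It is not my actual script: it only assumes that description and url are plain text files at the top level of the bare repository, and the repository description and URL used in the demonstration are placeholders.

```python
import tempfile
from pathlib import Path


def set_repo_metadata(repo_dir, description, clone_url):
    """Write the two files that the repository index reads for this repository."""
    repo = Path(repo_dir)
    (repo / "description").write_text(description + "\n")
    (repo / "url").write_text(clone_url + "\n")


# Demonstration on a throwaway directory standing in for a bare repository.
with tempfile.TemporaryDirectory() as tmp:
    set_repo_metadata(tmp, "Example repository", "https://example.com/git/example")
    print((Path(tmp) / "description").read_text().strip())
    print((Path(tmp) / "url").read_text().strip())
```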

About the url: you should know that stagit does not take care of allowing people to clone your repository. One solution is to run a git server for that (which the official stagit repository seems to do), but I didn't want that because it's not static. Instead, I intend people to clone my repositories using the dumb HTTP protocol: it only requires you to serve your git repositories with your Web server, and to run git update-server-info, as can be done easily using the post-update.sample hook. So for each repository you will have the stagit version and the bare repository. However, this means that the git clone URL will be different from the stagit URL, which is a bit jarring. So I cheated using some lighttpd mod_rewrite rules to transparently do the redirection. (Note that git clone will still point out the existence of this redirect when doing the cloning, so it's not completely transparent.) Here are the rules, following this page (thanks to immae for suggesting an improvement):

  "^/git/([^/.]*)/HEAD$" => "/git/$1.git/HEAD",
  "^/git/([^/.]*)/info/(.*)$" => "/git/$1.git/info/$2",
  "^/git/([^/.]*)/objects/(.*)$" => "/git/$1.git/objects/$2",
  "^/git/([^/.]*)/git-upload-pack$" => "/git/$1.git/git-upload-pack",
  "^/git/([^/.]*)/git-receive-pack$" => "/git/$1.git/git-receive-pack",
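To check what these rules do to a given path, they can be replayed in Python. This is only a sketch: it assumes that the first matching rule wins and that Python's regular expressions behave like lighttpd's on these patterns; myrepo is a placeholder repository name.

```python
import re

CLONE_RULES = [
    (r"^/git/([^/.]*)/HEAD$", "/git/$1.git/HEAD"),
    (r"^/git/([^/.]*)/info/(.*)$", "/git/$1.git/info/$2"),
    (r"^/git/([^/.]*)/objects/(.*)$", "/git/$1.git/objects/$2"),
    (r"^/git/([^/.]*)/git-upload-pack$", "/git/$1.git/git-upload-pack"),
    (r"^/git/([^/.]*)/git-receive-pack$", "/git/$1.git/git-receive-pack"),
]


def apply_rules(path, rules):
    """Return the target of the first matching rule, or the path unchanged."""
    for pattern, target in rules:
        match = re.match(pattern, path)
        if match:
            # Replace $1, $2, ... by the corresponding captured groups.
            return re.sub(r"\$(\d)",
                          lambda ref: match.group(int(ref.group(1))) or "",
                          target)
    return path


for path in ["/git/myrepo/HEAD", "/git/myrepo/info/refs", "/git/myrepo.git/HEAD"]:
    print(path, "=>", apply_rules(path, CLONE_RULES))
```

Note that the repository name pattern excludes dots, so a path that already points into myrepo.git is left alone; presumably this is what keeps the redirection from looping.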

One last thing about the migration to stagit is that I didn't want to break all the cgit URLs that used to work before. Of course, not all cgit pages have a stagit counterpart, but most of the important ones do, although their names are a bit different. Again, this is not very robust, but here goes:

  "^/git/([^/.]*)/commit/\?id=(.*)$" => "/git/$1/commit/$2.html",
  "^/git/([^/.]*)/about(/.*)?$" => "/git/$1/file/README.html",
  "^/git/([^/.]*)/log(/.*)?$" => "/git/$1/index.html",
  "^/git/([^/.]*)/refs(/.*)?$" => "/git/$1/refs.html",
  "^/git/([^/.]*)/tree/?(\?.*)?$" => "/git/$1/files.html",
  "^/git/([^/.]*)/tree/([^?]*)(\?.*)?$" => "/git/$1/file/$2.html",
  "^/git/([^/.]*)/plain/([^?]*)(\?.*)?$" => "/git/$1/file/$2.html",
  "^/git/([^.?]*)\?.*$" => "/git/$1",
  "^/git/([^/.]*)/([^?]*)\?.*$" => "/git/$1",
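The effect of these legacy rules on a few typical cgit URLs can be checked with a small Python replay. As before, this is a sketch assuming that rules are tried in order and that the first match wins; myrepo and the example paths are placeholders.

```python
import re

LEGACY_RULES = [
    (r"^/git/([^/.]*)/commit/\?id=(.*)$", "/git/$1/commit/$2.html"),
    (r"^/git/([^/.]*)/about(/.*)?$", "/git/$1/file/README.html"),
    (r"^/git/([^/.]*)/log(/.*)?$", "/git/$1/index.html"),
    (r"^/git/([^/.]*)/refs(/.*)?$", "/git/$1/refs.html"),
    (r"^/git/([^/.]*)/tree/?(\?.*)?$", "/git/$1/files.html"),
    (r"^/git/([^/.]*)/tree/([^?]*)(\?.*)?$", "/git/$1/file/$2.html"),
    (r"^/git/([^/.]*)/plain/([^?]*)(\?.*)?$", "/git/$1/file/$2.html"),
    (r"^/git/([^.?]*)\?.*$", "/git/$1"),
    (r"^/git/([^/.]*)/([^?]*)\?.*$", "/git/$1"),
]


def apply_rules(path, rules):
    """Return the target of the first matching rule, or the path unchanged."""
    for pattern, target in rules:
        match = re.match(pattern, path)
        if match:
            # Replace $1, $2, ... by the corresponding captured groups.
            return re.sub(r"\$(\d)",
                          lambda ref: match.group(int(ref.group(1))) or "",
                          target)
    return path


for old in ["/git/myrepo/commit/?id=abc123",
            "/git/myrepo/about/",
            "/git/myrepo/tree/",
            "/git/myrepo/tree/src/main.c?h=master"]:
    print(old, "=>", apply_rules(old, LEGACY_RULES))
```

The order matters here: the rule for the bare tree/ page must come before the more general tree/... rule, otherwise the file listing would never be reached.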

So there you have it: a completely static web version of my git repositories that can also be used to clone them with the dumb HTTP transport, a hook to update the web version, a script to create a new repository, and no more problems or possible security vulnerabilities with cgit!

An update on CalDAV and CardDAV with Radicale

This is a quick update to a previous post where I explained how to self-host your calendar and contacts using the Radicale CalDAV and CardDAV server, and how to access them on Android devices with DAVdroid.

Three years later, I am still using this setup. I only use my Android phone to access the calendar and contacts, so the Radicale server is essentially a way to back the contacts and calendars up, although I have also tried accessing them, e.g., with Evolution. Over these three years, DAVdroid has evolved and gotten a bit more user-friendly and stable, though I have had a few problems (e.g., duplicated calendar events). Radicale has evolved too; I'm currently at version 1.1.1, which is the one provided by Debian even though it is really outdated. (Also, as of this writing, Radicale is not available in the Debian testing repos, see here, but it can be installed from Debian stable.)

The main change that I made is on the server. In the old guide, I explained how to set up Radicale so that it listens on port 5232, manages authentication and encryption, and DAVdroid connects to it directly. I have changed this setup so that DAVdroid now connects to Apache2, which manages authentication and encryption, and talks to Radicale using WSGI. This has a number of advantages:

  • You can encrypt the connection with SSL managed by Apache, e.g., using Let's Encrypt, without self-signed certificates or other ad-hoc setup; and you don't need to trust Radicale to do the encryption correctly.
  • The server listens on the standard HTTPS port (443) rather than the custom Radicale port (5232) so the connections aren't blocked on unfriendly networks.
  • You can use vhosts, e.g., to host it on a subdomain.
  • Authentication is managed by Apache, not Radicale. This is somewhat reassuring: even if Radicale has a massive security flaw, only users that correctly authenticated with Apache can talk to it at all.
  • The most important point: with the old setup, Radicale would inexplicably hang every now and then, presumably when the phone disconnected messily from it. (I think it is this bug). With the new setup, this does not happen. (Maybe the bug has been fixed in more recent Radicale versions anyway, I don't know.)

Of course, the downside of this new setup is that you need Apache just to route requests to Radicale. As I needed Apache for other purposes, though, I didn't mind.

The setup

I didn't document this setup while I did it, so here is a hopefully complete description of what I currently have.

You need to install Apache, and enable the SSL and WSGI and auth_basic modules (run as root a2enmod ssl and a2enmod wsgi and a2enmod auth_basic and service apache2 restart). Of course, basic HTTP authentication may sound insecure, but we will only be doing it over HTTPS.
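To see why basic authentication should only be used over HTTPS, here is a small Python illustration of what the client sends: the credentials are merely base64-encoded in the Authorization header, so anyone who can read the request can recover them. The user name and password below are placeholders.

```python
import base64


def basic_auth_header(user, password):
    """Value of the Authorization header for HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic_auth(header):
    """Recover (user, password) from the header value: no secret is needed."""
    _scheme, token = header.split(" ", 1)
    user, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return user, password


header = basic_auth_header("youruser", "secret")
print(header)
print(decode_basic_auth(header))
```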

You should set up Let's Encrypt certificates (e.g., with certbot), something I mentioned in this previous guide.

Of course you need to install radicale. We are going to put all radicale-related stuff in /srv/radicale, but of course this can be changed. The files in this directory should be readable and writable by the Web server.

You then need to create a file in /etc/apache2/sites-enabled whose contents look as follows:

<IfModule mod_ssl.c>
<VirtualHost *:443>
        ServerName dav.example.com

        ServerAdmin youremail@example.com
        DocumentRoot /var/www/html/

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

        WSGIDaemonProcess radicale user=www-data group=www-data threads=1
        WSGIScriptAlias / /srv/radicale/radicale.wsgi

        <Directory /srv/radicale/>
            WSGIProcessGroup radicale
            WSGIApplicationGroup %{GLOBAL}
            AllowOverride None
            AuthType basic
            AuthName "dav.example.com"
            AuthUserFile /srv/radicale/passwd
            Require user youruser
        </Directory>

SSLCertificateFile /etc/letsencrypt/live/example.com/fullchain.pem
SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>
</IfModule>

The file /srv/radicale/passwd contains the usernames and passwords of the users who can access the server, managed as usual with the htpasswd utility. The file /srv/radicale/radicale.wsgi contains the invocation to run Radicale and points to the config file, as follows:

import radicale
configuration = radicale.config.read(["/srv/radicale/config"])
application = radicale.Application()

To create the config file, you can, e.g., write the following in /srv/radicale/config:

[encoding]
request = utf-8
stock = utf-8

[rights]
type = owner_only

[storage]
type = filesystem
filesystem_folder = /srv/radicale/collections

[logging]
config = /srv/radicale/logging

In this file, /srv/radicale/collections contains the Radicale collections as in the old guide. The file /srv/radicale/logging contains the radicale logging configuration. Here is mine:

# inspired by https://github.com/Kozea/Radicale/issues/266#issuecomment-121170414
[loggers]
keys = root

[handlers]
keys = file

[formatters]
keys = full

[logger_root]
level = DEBUG
handlers = file

[handler_file]
args = ('/srv/radicale/logs/radicale.log','a',32768,3)
level = INFO
class = handlers.RotatingFileHandler
formatter = full

[formatter_full]
format = %(asctime)s - %(levelname)s: %(message)s

In the above, /srv/radicale/logs is where you want radicale to write its log files. You probably need to specify it manually, because radicale is run by the Web server, which may not have the right to log, e.g., in /var/log/radicale as the default configuration would do.
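The logging file uses the format of Python's standard logging.config.fileConfig, so it can be tried out without Radicale. The sketch below assumes that this is how Radicale loads it; it writes a copy of the configuration with section headers to a temporary directory (instead of /srv/radicale), logs two messages, and returns the contents of the log file.

```python
import logging
import logging.config
import os
import tempfile

TEMPLATE = """[loggers]
keys = root

[handlers]
keys = file

[formatters]
keys = full

[logger_root]
level = DEBUG
handlers = file

[handler_file]
args = ({log_path!r}, 'a', 32768, 3)
level = INFO
class = handlers.RotatingFileHandler
formatter = full

[formatter_full]
format = %(asctime)s - %(levelname)s: %(message)s
"""


def demo():
    """Load the configuration, log two messages, and return the log contents."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "radicale.log")
        cfg_path = os.path.join(tmp, "logging")
        with open(cfg_path, "w") as cfg:
            cfg.write(TEMPLATE.format(log_path=log_path))
        logging.config.fileConfig(cfg_path, disable_existing_loggers=False)
        root = logging.getLogger()
        root.debug("dropped: the file handler only accepts INFO and above")
        root.info("kept")
        # Close and detach the handler before the directory is removed.
        for handler in list(root.handlers):
            handler.close()
            root.removeHandler(handler)
        with open(log_path) as log:
            return log.read()


print(demo())
```

The handler arguments are the log file path, the open mode, the maximum size in bytes before rotation, and the number of rotated files to keep.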