a3nm's blog

A local copy of Wikipedia with Kiwix


I use Wikipedia a lot, so I wanted to find out how to use it when I have no Internet connection. On my mobile phone, I already use the free and open-source Aard Dictionary, but for space reasons I only have the French Wikipedia, without images, and with some formatting glitches. On my computers, where I have more space and where it is easier to set things up, I wanted more, so that I have a usable Wikipedia when I'm on the train, on a plane, or in a place with a crappy Internet connection.

This blog post explains how to set up a local Wikipedia mirror using Kiwix. Start by downloading Kiwix and unpacking it. Kiwix ships with a special GUI to browse Wikipedia offline, but I prefer to use my usual Web browser. Fortunately, Kiwix also includes the kiwix-serve program that can serve dumps as a regular Web server.

Next, download the dumps for the projects that you want. Ideally I would like to generate my own dumps, but I haven't looked into this yet. I used the French and English editions of Wikipedia and Wiktionary, totalling about 60 GB.

Next, to be able to perform full-text search, you must index the dumps. The pre-indexed dumps may also work, but I haven't tried them; I indexed the dumps manually instead. For each file a.zim, I ran kiwix-index -v a.zim a.zim.idx, where kiwix-index is provided by Kiwix. The process takes a long time (10-30 hours or so) but does not require any interaction. The indexes take another 30 GB.
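Since indexing takes hours per dump, it can be convenient to queue all dumps in one unattended run. The following is only a sketch: the function name index_all_dumps and the KIWIX_INDEX override variable are my own inventions for illustration, not part of Kiwix; the only Kiwix command used is the kiwix-index invocation given above.

```shell
# Index every .zim file in a directory (default: the current one).
# KIWIX_INDEX is a hypothetical override, handy for a dry run;
# by default the kiwix-index program shipped with Kiwix is used.
index_all_dumps() {
    dir=${1:-.}
    for zim in "$dir"/*.zim; do
        # If there are no .zim files, the glob stays literal: skip it.
        [ -e "$zim" ] || continue
        # Do not redo dumps that already have an index next to them.
        if [ -e "$zim.idx" ]; then
            echo "skipping $zim (index exists)"
            continue
        fi
        "${KIWIX_INDEX:-kiwix-index}" -v "$zim" "$zim.idx"
    done
}
```

Calling index_all_dumps /path/to/dumps then processes the dumps one after the other. Note that an interrupted run may leave a partial a.zim.idx behind, which this sketch would wrongly skip; delete it by hand before rerunning.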

To serve all dumps with kiwix-serve, you need to build a library. First, move the .zim and the .zim.idx files (or symlink them) to have shorter names; this will make the URLs shorter afterwards. I use wen.zim, wfr.zim, wtfr.zim and wten.zim. Now run, for each file a.zim in the working directory: kiwix-manage `pwd`/wiki.xml add `pwd`/a.zim --indexPath=`pwd`/a.zim.idx, where `pwd`/wiki.xml is the name of the library file that will be created. It is safer to use absolute paths throughout.
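The per-file kiwix-manage calls can be wrapped in a loop. This is a sketch under the same assumptions as the text: the library is called wiki.xml and lives next to the dumps, and each a.zim has its index in a.zim.idx. The function name build_library and the KIWIX_MANAGE override variable are hypothetical, introduced only for this example.

```shell
# Add every .zim file in a directory to the library file wiki.xml.
# KIWIX_MANAGE is a hypothetical override for dry runs;
# by default the kiwix-manage program shipped with Kiwix is used.
build_library() {
    # Resolve the directory to an absolute path, as advised above.
    dir=$(cd "${1:-.}" && pwd) || return 1
    library=$dir/wiki.xml
    for zim in "$dir"/*.zim; do
        # If there are no .zim files, the glob stays literal: skip it.
        [ -e "$zim" ] || continue
        "${KIWIX_MANAGE:-kiwix-manage}" "$library" add "$zim" --indexPath="$zim.idx"
    done
}
```

Running build_library in the directory that holds wen.zim, wfr.zim, etc. should be equivalent to typing the command above once per file.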

This should have created the library file wiki.xml. Now, to start the server, choose a port number (say 4242) and run kiwix-serve --port=4242 --library /where/you/put/wiki.xml. Test it by browsing to http://localhost:4242 and checking that it works. If it does, you probably want to arrange for this command to be run at startup.
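One possible way to run the server at startup, assuming your cron implementation supports the @reboot shortcut (Vixie cron and cronie do), is a crontab entry. The helper below only prints such an entry; its name, kiwix_cron_line, is made up for this example, and the kiwix-serve options are the ones used above.

```shell
# Print a crontab line that starts kiwix-serve at boot.
#   $1: port number
#   $2: absolute path to the library file
#   $3: path to the kiwix-serve binary (optional; cron runs with a
#       minimal PATH, so an absolute path is safer)
kiwix_cron_line() {
    printf '@reboot %s --port=%s --library %s\n' "${3:-kiwix-serve}" "$1" "$2"
}
```

To install the entry, append it to your existing crontab, for instance with (crontab -l 2>/dev/null; kiwix_cron_line 4242 /where/you/put/wiki.xml) | crontab -, and check the result with crontab -l.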

Please note that kiwix-serve will also be reachable from other machines, not just from localhost. I couldn't find a way to change this behavior. For now, I use iptables to reject incoming connections from other hosts:

sudo iptables -A INPUT -p tcp --dport 4242 -s 127.0.0.1 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 4242 -j REJECT
sudo ip6tables -A INPUT -p tcp --dport 4242 -s ::1 -j ACCEPT
sudo ip6tables -A INPUT -p tcp --dport 4242 -j REJECT
sudo iptables-save | sudo tee /etc/iptables/rules.v4
sudo ip6tables-save | sudo tee /etc/iptables/rules.v6

The last step, if you use Firefox, is to use Smart Keywords to reach your local dumps efficiently. To do this, for each of the dumps that you have on localhost:4242, right-click on the search text field and add a keyword for it. As I use Firefox Sync, those bookmarks are synchronized across my different machines.

These smart keywords perform a full-text search in the dumps. If you want bookmarks that go straight to an article without a full-text search (which is faster), you can edit the bookmarks in Firefox and set "Location" to, e.g., "http://localhost:4242/wen/A/%s.html". However, this technique does not work with all dumps (it depends on the URL structure), and where it does work, you must remember that the first letter of the search term must be uppercase.
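The same direct URLs can be built from the command line, which avoids having to remember the uppercase rule. This is a rough sketch: the function name kiwix_article_url is invented for the example, the /A/ URL layout is the one from the bookmark above and will not hold for every dump, and the uppercasing only handles ASCII first letters.

```shell
# Print the direct URL of an article in a dump served by kiwix-serve.
#   $1: short name of the dump (e.g. wen)
#   $2: article title
#   $3: port number (optional, defaults to 4242)
kiwix_article_url() {
    book=$1
    title=$2
    port=${3:-4242}
    # Uppercase the first character only (ASCII titles assumed).
    first=$(printf '%s' "$title" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$title" | cut -c2-)
    printf 'http://localhost:%s/%s/A/%s%s.html\n' "$port" "$book" "$first" "$rest"
}
```

For example, kiwix_article_url wen paris prints http://localhost:4242/wen/A/Paris.html, which you can pass to your browser.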

In terms of formatting, the Kiwix dumps are fairly OK. There are occasional glitches (e.g., with <span>) but not many of them, and certainly far fewer than with Aard Dictionary. Equations are supported, and images are there if you pick the right dumps. Most templates, formatting, etc., are fine. The HTML interface added by kiwix-serve is not perfect but it's mostly unobtrusive.

comments welcome at a3nm<REMOVETHIS>@a3nm.net