A local copy of Wikipedia with Kiwix
I am an avid user of Wikipedia, so I was interested to find out how to use it when I have no Internet connection. On my mobile phone, I already use the free and open-source Aard Dictionary, but for space reasons I only have the French Wikipedia, without images, and with some formatting glitches. On my computers, where I have more space and where it is easier to set things up, I wanted to have more, so that I can have a usable Wikipedia when I'm on the train, in a plane, or in a place with a crappy Internet connection.
This blog post explains how to set up a local Wikipedia mirror using Kiwix. Start by downloading Kiwix and unpacking it. Kiwix ships with a special GUI to browse Wikipedia offline, but I prefer to use my usual Web browser. Fortunately, Kiwix also includes the kiwix-serve program that can serve dumps as a regular Web server.
Next, download the dumps for the projects that you want. Ideally, I would like to generate my own dumps, but I haven't looked into this yet. I used the French and English editions of Wikipedia and Wiktionary, totalling about 60 GB.
Next, to be able to perform full-text search, you must index the dumps. The pre-indexed dumps could perhaps be used instead, but I haven't tried them; I indexed mine manually. For each file a.zim, I ran kiwix-index -v a.zim a.zim.idx, where kiwix-index is provided by Kiwix. The process takes a long time (10-30 hours or so) but does not require any interaction. The indexes take another 30 GB.
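Since indexing runs unattended, it is convenient to loop over all the dumps in one go. Here is a minimal sketch that only wraps the kiwix-index command given above. The function name index_all and the KIWIX_INDEX variable are my own inventions for illustration (the variable just lets you substitute another command, e.g. echo for a dry run); they are not part of Kiwix.

```shell
#!/bin/sh
# Index every .zim file in a directory, skipping dumps that already
# have an index next to them.
# index_all and KIWIX_INDEX are hypothetical names, not part of Kiwix.
index_all() {
    dir=${1:-.}
    for zim in "$dir"/*.zim; do
        # If the directory has no .zim file, the glob stays unexpanded.
        [ -e "$zim" ] || continue
        if [ -e "$zim.idx" ]; then
            echo "skipping $zim (index exists)"
            continue
        fi
        # Same invocation as in the text: kiwix-index -v a.zim a.zim.idx
        "${KIWIX_INDEX:-kiwix-index}" -v "$zim" "$zim.idx"
    done
}
```

Call it as index_all /path/to/dumps. Note that the "index exists" test cannot tell a complete index from one left behind by an interrupted run, so remove any partial a.zim.idx by hand before restarting.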
To serve all dumps with kiwix-serve, you need to build a library. First, rename the .zim and .zim.idx files (or symlink them) to shorter names; this will make the URLs shorter afterwards. I use wen.zim, wfr.zim, wtfr.zim and wten.zim. Now run, for each file a.zim in the working directory: kiwix-manage `pwd`/wiki.xml add `pwd`/a.zim --indexPath=`pwd`/a.zim.idx, where `pwd`/wiki.xml is the name of the library file that will be created. It is safer to use absolute paths throughout.
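The per-file kiwix-manage call can be wrapped in a loop as well. This is only a sketch built around the exact command above; build_library and KIWIX_MANAGE are names I made up for the example (the variable lets you substitute echo to preview the commands), and it assumes each a.zim has its index at a.zim.idx in the same directory.

```shell
#!/bin/sh
# Register every .zim file of a directory in the library file wiki.xml,
# using absolute paths throughout.
# build_library and KIWIX_MANAGE are hypothetical names, not part of Kiwix.
build_library() {
    # Resolve the directory to an absolute path, as `pwd` does in the text.
    dir=$(cd "${1:-.}" && pwd) || return 1
    for zim in "$dir"/*.zim; do
        [ -e "$zim" ] || continue   # no .zim file: the glob did not match
        "${KIWIX_MANAGE:-kiwix-manage}" "$dir/wiki.xml" add "$zim" \
            --indexPath="$zim.idx"
    done
}
```

Running KIWIX_MANAGE=echo before calling build_library on your dump directory prints the commands instead of running them, which is a cheap way to check the paths first.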
This should have created the library file wiki.xml.
Now, to start the server, choose a port number (say 4242) and run kiwix-serve --port=4242 --library /where/you/put/wiki.xml. Test it by browsing to http://localhost:4242 and checking that it works. If it does, you probably want to arrange for this command to be run at startup.
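One possible way to run it at startup is a small wrapper script that checks the library file before launching the server. The sketch below is just that, a sketch: start_kiwix and KIWIX_SERVE are hypothetical names of mine, and the only Kiwix invocation it contains is the one given above.

```shell
#!/bin/sh
# Start kiwix-serve on a given library file and port (default 4242),
# refusing to start if the library file cannot be read.
# start_kiwix and KIWIX_SERVE are hypothetical names, not part of Kiwix.
start_kiwix() {
    library=$1
    port=${2:-4242}
    if [ ! -r "$library" ]; then
        echo "library file not readable: $library" >&2
        return 1
    fi
    # Same invocation as in the text.
    "${KIWIX_SERVE:-kiwix-serve}" --port="$port" --library "$library"
}
```

A script that defines this function and ends with start_kiwix /where/you/put/wiki.xml could then be launched from, for instance, an @reboot line in your crontab, assuming your cron implementation supports that.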
Please note that Kiwix will also be available from other machines, not just localhost. I couldn't find a way to change this behavior. For now, I use iptables to filter incoming connections from other hosts:
sudo iptables -A INPUT -p tcp --dport 4242 -s 127.0.0.0/8 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 4242 -j REJECT
sudo ip6tables -A INPUT -p tcp --dport 4242 -s ::1 -j ACCEPT
sudo ip6tables -A INPUT -p tcp --dport 4242 -j REJECT
sudo iptables-save | sudo tee /etc/iptables/rules.v4
sudo ip6tables-save | sudo tee /etc/iptables/rules.v6
The last step, if you use Firefox, is to use Smart Keywords to reach your local dumps efficiently. To do this, for each of the dumps that you have on localhost:4242, right-click on the search text field and add a keyword for it. As I use Firefox Sync, those bookmarks are synchronized across my different machines.
These smart keywords work for full-text search in the dumps. If you want bookmarks that reach articles directly and never perform a full-text search (because it is faster), you can edit the bookmarks in Firefox to set "Location" to, e.g., "http://localhost:4242/wen/A/%s.html". However, this technique does not work with all dumps (it depends on the URL structure), and with it you must remember to capitalize the first letter of the search term.
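To make the direct-article URL scheme concrete, here is a small sketch that builds such a URL from a search term on the command line. The function article_url is hypothetical, and replacing spaces with underscores is my assumption about how article names appear in the dump URLs, modeled on Wikipedia's own convention; check it against your dumps. It only capitalizes the first letter, and only handles ASCII there.

```shell
#!/bin/sh
# Build the direct URL of an article from a dump name and a search term.
# article_url is a hypothetical helper, not part of Kiwix.
article_url() {
    book=$1
    term=$2
    port=${3:-4242}
    # Uppercase the first character (ASCII only: cut -c may split
    # multi-byte characters in some implementations).
    first=$(printf '%s' "$term" | cut -c1 | tr '[:lower:]' '[:upper:]')
    # ASSUMPTION: spaces in titles become underscores in the URL.
    rest=$(printf '%s' "$term" | cut -c2- | tr ' ' '_')
    printf 'http://localhost:%s/%s/A/%s%s.html\n' \
        "$port" "$book" "$first" "$rest"
}
```

For example, article_url wen linux prints http://localhost:4242/wen/A/Linux.html. Letters after the first are left untouched, so a term like "alan turing" will not reach an article whose title is capitalized differently.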
In terms of formatting, the Kiwix dumps are fairly OK. There are occasional glitches (e.g., with <span>) but not many of them, and certainly far fewer than with Aard Dictionary. Equations are supported, and images are there if you pick the right dumps. Most templates, formatting, etc., are fine. The HTML interface added by kiwix-serve is not perfect but it's mostly unobtrusive.