Experiments about a better locate using grep
I have a lot of files and I'm not fond of sorting them intelligently, so I usually give them long and descriptive file names and rely on locate to find them quickly. I was always a bit annoyed to see that locate was not instantaneous, even though intuition suggested that it should be, but I was shocked to find out that locate is actually an antique, and that reimplementing a better version of it seems trivial. So I decided to investigate further and try to replace locate with something that works better.
Current setup
I am currently using locate, by which I mean mlocate, not slocate, which is not packaged for Debian, and not the old locate from findutils.
My database in /var/lib/mlocate/mlocate.db takes 300 MB and indexes about 10M files. Indexing runs every night; I don't know how long it takes, and mlocate may be smart about which files it examines, but I don't really care about that. (To be honest, I am a bit concerned about power consumption and hard drive and SSD wear, but that's rather hard to estimate.) Hence, I will focus on the performance of query evaluation only, not indexing. To estimate performance for typical use cases, I grepped my history file to find the latest locate commands that I ran:
grep ';locate' ~/.history |
cut -d'|' -f1 | cut -d';' -f2 | grep -v '^grep' |
cut -d ' ' -f2- > workload
That's 650 invocations (over a bit more than 100k lines of history). I filtered them to remove those that did not run, and to remove the very rare cases where I used a locate-specific feature (there was one -0 and a few cases where I was searching on multiple patterns simultaneously; otherwise the only flag was -i), so that's exactly 646 queries (of which 527 are unique). Let's time all those commands, running them 3 times to avoid outliers, with no outstanding IO or CPU activity:
for a in 1 2 3; do
time sed 's/^/locate /' workload | sh > /dev/null;
done
This takes around 1 hour, so about 5.5 seconds per query.[^1] In terms of hardware, I should point out that the machine I am using has an SSD, specifically a 120 GB Samsung SSD 840, which hosts the system (plus some of the data; the rest is on hard drives), and of course I always store the indexes on the SSD. The CPU is an Intel Core i5-4570 at 3.20 GHz with 4 cores, and there is 8 GB of RAM.
For completeness, I also timed locate.findutils (that is, locate from the GNU findutils), and it is marginally faster, taking about 50 minutes. Its database is much more compact, however, at only 75 MB. I did not bother benchmarking slocate, as Debian does not package it.
Making it faster
Let's try to replace locate with the simplest possible approach, optimizing for pure speed.
sudo find / > files
This takes some time, but I don't care. The result is about 900 MB. Let's time the workload on it:
for a in 1 2 3; do
time sed 's/^/grep /;s/$/ files/' workload | sh > /dev/null;
done
This takes about 19 minutes, so it's already about 2.5 to 3 times faster than locate. (Yes, ideally I should have checked that it returns exactly the same results, or quantified the differences, but I didn't bother. I just verified for some example queries that the results made sense.)
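For what it's worth, here is a minimal sketch of how one could quantify the differences on a single query (hypothetical; "somepattern" is a placeholder, and some discrepancies are expected since locate applies permission checks and its database may be older than the files listing):
# count the results returned by only one of the two approaches
comm -3 <(locate somepattern | sort) <(LC_ALL=C grep somepattern files | sort) | wc -l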
It turns out one easy way to make grep faster is to set LC_ALL=C. This is unsafe with respect to multibyte characters, but I don't use those very often in file names and never in queries, and my understanding is that, in this case, this option can only add false positives with low probability, which is OK for me.
for a in 1 2 3; do
time sed 's/^/LC_ALL=C grep /;s/$/ files/' workload | sh > /dev/null;
done
This takes about 7 and a half minutes, so faster than the vanilla grep solution by a factor of 2.5 to 3.
How to make this even faster? It turns out that grep is CPU-bound in this case, so with a multicore CPU it is tempting to parallelize it. This is easy to do: we can clearly split the indexed files into buckets that we index and query separately, merging the results. I use 8 buckets here.
split files files. -nr/8
This splits files into files.aa, ..., files.ah, in a round-robin fashion.[^2]
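To illustrate what the round-robin distribution does, here is a hypothetical toy example (not part of the actual setup): splitting eight numbered lines into three buckets should send line 1 to the first bucket, line 2 to the second, and so on, wrapping around.
seq 8 | split -nr/3 - demo.
head demo.a?
# demo.aa should contain 1, 4, 7; demo.ab should contain 2, 5, 8; demo.ac should contain 3, 6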
Let's now create a script grep.sh to grep the files in parallel, using the parallel command from GNU parallel. It is slightly confusing, but we must re-quote the arguments to the grep.sh script in this strange way so that we can pass them on to the command invoked by parallel:
#!/bin/bash
ARGS=$(printf " %q" "$@")
ls files.* | parallel 'bash -c "LC_ALL=C grep '$ARGS' {}"'
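To see what the re-quoting does: printf's %q format re-escapes each argument so that it survives being pasted into the inner command line run by bash -c. A quick sketch with made-up arguments:
printf " %q" "some pattern" -i; echo
# should print (with a leading space): some\ pattern -i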
The parallel command will spawn as many parallel jobs as there are CPUs (i.e., cores, so that should be 4 in my case). Note that it would not be safe to use xargs or parallel from moreutils: they could garble the output of two parallel processes if they tried to write at the same moment; GNU parallel, by contrast, prevents this (with --group, which is implied by default).
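If you prefer to pin the number of simultaneous jobs explicitly instead of relying on the one-job-per-core default, GNU parallel's -j option can do that; a hypothetical variant of the second line of grep.sh, with one job per bucket, would be:
ls files.* | parallel -j 8 'bash -c "LC_ALL=C grep '$ARGS' {}"'
The timings below use the original script with the default job count.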
Now let's time this:
for a in 1 2 3; do
time sed 's/^/.\/grep.sh /' workload | sh > /dev/null;
done
This takes around 3 minutes 30 seconds, so about 2 times faster than the non-parallel version. This amounts to about a third of a second per request (plus the overhead of displaying the results, which is not accounted for here), which I think is good enough for my needs.
Compressing the index
One respect in which this grep-based approach is worse than locate is that its index is three times larger than that of locate. The original post about locate conjectured that this explained why locate's performance is worse than a naive grep. However, as we will see, it is easy to compress the index and still be much faster than locate. First, let's simply compress the files with gzip:
split --additional-suffix=.gz \
  --filter="bash -c 'gzip > \$FILE'" files files. -nr/8
The resulting index is 105 MB in total, so roughly 8 times smaller than the uncompressed index and still about 3 times smaller than that of locate. We can then decompress these files when querying them, by replacing our search script with:
#!/bin/bash
ARGS=$(printf " %q" "$@")
ls files.*.gz | parallel 'bash -c "gunzip -dc {} | LC_ALL=C grep '$ARGS'"'
The timing of this is 14 minutes, which is 4 times slower than the uncompressed version, but still 4 times faster than locate.
In fact, we can still make this much more efficient. One easy way is to use the pigz command (packaged for Debian) to decompress, which spawns additional threads and is faster. Replacing gunzip with pigz in the above, we get a timing of 11 minutes for the same index size.
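Concretely, the second line of the query script would become something like the following (a sketch, keeping the same files.*.gz buckets and the ARGS variable defined above):
ls files.*.gz | parallel 'bash -c "pigz -dc {} | LC_ALL=C grep '$ARGS'"'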
Of course, if we want to achieve different tradeoffs between index size and speed, we can look at other compression and decompression algorithms optimized for speed. One possibility is lz4, packaged as liblz4-tool in Debian. The index is 160 MB in total, which is 50% more than with gzip but half the size of the locate index; replacing gzip with lz4 and timing, we get 8 minutes. So this last solution, which assembles off-the-shelf tools with around 5 lines of bash, is about 7.5 times faster than locate while using an index that is half the size.
Result summary
Approach | Time/query (s) | Index size (MB)
---|---|---
mlocate | 5.5 | 300
GNU locate | 4.6 | 75
grep | 1.7 | 890
LC_ALL=C grep | 0.7 | 890
LC_ALL=C grep (parallel) | 0.3 | 890
LC_ALL=C grep (parallel), gzip | 1.3 | 105
LC_ALL=C grep (parallel), pigz | 1.0 | 105
LC_ALL=C grep (parallel), lz4 | 0.7 | 160
Concluding
I am considering replacing locate with the fastest of these approaches (the one with parallelization and no compression) for my own needs, but I just intend to do it as a personal hack; doing it in a reusable way would be much harder.
Of course, this post is just scratching the surface; the list of file names is not preprocessed in any way to make grep faster, whereas in principle it could be. However, I could not find any standard Unix utility that could do index file > file.idx and search --index=file.idx pattern file, such that file.idx is not too large compared to file and the search operation is much faster than a grep (at least for patterns that are fixed strings).[^3] It would be fun to implement such a tool, based, e.g., on a suffix tree, but I probably won't bother, as the grep approach is fast enough.
In addition to choosing the best algorithms, a good replacement for locate should also implement the various options of locate, so as to be a drop-in replacement for it. In fact, it would probably be relevant to index more metadata in addition to file names, to optimize searches, e.g., for recently modified files. I am quite surprised that such a tool does not seem to exist. To be precise, it does exist in a stronger form in all the graphical desktop search tools, so this could alternatively be implemented as a usable CLI interface to such tools, intended as a locate replacement.[^4]
[^1]: All the results I am presenting in this post were obtained by this code, the raw output of which is available (I'm not publishing the workload for privacy reasons). I did other things on the machine while timing, though, so these are just estimations. If you're using these numbers to do anything important, you're insane, as they say. (Also, I'm aware of the crc32 mismatch that was logged at one point, but I have no clue about what caused it and I was not able to reproduce it. Scary...)

[^2]: Ideally I would prefer a round-robin that works in blocks of more than one line: it is often useful for the output of a search to have the order somewhat preserved, and there are often runs of contiguous results that would probably be kept in order within their bucket if blocks were longer than one line. However, it doesn't look like split supports this. Another approach would be to use the suitable options of GNU parallel to read one large file and do the round-robin itself, but this seems slightly less efficient and wouldn't work well with compression. Fortunately, I often want the results of my locate invocations to be small, so sorting them back as postprocessing should be affordable in practice.

[^3]: The only relevant tools seem to be full-fledged and heavy indexing solutions like Xapian, Lucene, or Sphinx, or tools for document indexing (Recoll), or tools specific to DNA tasks (Bowtie). I tried libdivsufsort, which seems to be (or to have been) the state of the art, and it comes with some example CLI bindings, but the indexes are roughly five times larger than the input files, so having to read them means that it is not faster than a grep-based solution. I also did a lot of apt-cache search, to no avail.

[^4]: To be precise, what existing desktop search tools seem to be missing to be a suitable locate replacement is: having a CLI interface installable separately from the graphical dependencies, so as to be usable on headless servers; having reasonable documentation for that interface; having options to index only file metadata and not content (or being explicit about scalability: will you be able to efficiently maintain an index for my 5 TB of data?); ideally, having a CLI interface designed to be a direct replacement for locate or (more ambitiously) find; and preferably having a way to update the index with a cron job rather than with mysterious daemons sitting around.