a3nm's blog

Experiments about a better locate using grep

I have a lot of files and I'm not fond of sorting them intelligently, so I usually give them long and descriptive file names and rely on locate to find them quickly. I was always a bit annoyed to see that locate was not instantaneous even though intuition suggested that it should be, but I was shocked to find out that locate is actually an antique, and that reimplementing a better version of it seems trivial. So I thought I had to investigate more, to replace locate with something that worked better.

Current setup

I am currently using locate, by which I mean mlocate: not slocate, which is not packaged for Debian, and not the old locate from findutils.

My database in /var/lib/mlocate/mlocate.db takes 300 MB and indexes about 10M files. Indexing is run every night; I don't know how long it takes (mlocate may be smart about which files it re-examines), but I don't really care about that. (To be honest, I am a bit concerned about power consumption and hard drive and SSD wear, but that's rather hard to estimate.) Hence, I will focus on the performance of query evaluation only, not indexing. To estimate performance for typical use cases, I grepped my history file for the latest locate commands that I ran:

grep ';locate' ~/.history |
  cut -d'|' -f1 | cut -d';' -f2 | grep -v '^grep' |
  cut -d ' ' -f2- > workload

That's 650 invocations (out of a bit over 100k lines of history). I filtered them to remove those that did not actually run, as well as the very rare cases where I used a locate-specific feature (there was one -0 and a few searches on multiple patterns at once; otherwise the only flag was -i), which leaves exactly 646 queries (of which 527 are unique). Let's time all those commands, running them 3 times to avoid outliers, with no other significant IO or CPU activity:

for a in 1 2 3; do
  time sed 's/^/locate /' workload | sh > /dev/null;
done

Each of the three runs takes around 1 hour, so about 5.5 seconds per query.1 In terms of hardware, the machine that I am using has an SSD, specifically a 120 GB Samsung SSD 840, which hosts the system (plus some of the data; the rest is on hard drives), and of course I always store the indexes on the SSD. The CPU is an Intel Core i5-4570 at 3.20 GHz with 4 cores, and there is 8 GB of RAM.

For completeness, I also timed locate.findutils (i.e., locate from GNU findutils); it is marginally faster, taking about 50 minutes per run. Its database is much more compact, however, taking only 75 MB. I did not bother benchmarking slocate, as Debian does not package it.

Making it faster

Let's try to replace locate by the simplest possible approach to optimize for pure speed.

sudo find / > files

This takes some time, but I don't care. The result is about 900 MB. Let's time the workload on it:

for a in 1 2 3; do
  time sed 's/^/grep /;s/$/ files/' workload | sh > /dev/null;
done

This takes about 19 minutes per run, so it's already about 2.5 to 3 times faster than locate. (Yes, ideally, I should have checked that it returns exactly the same results, or quantified the differences, but I didn't bother. I just verified for some example queries that the results made sense.)
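For what it's worth, a quick way to spot-check a single pattern would be something like the following sketch (PATTERN is a placeholder; the sort is needed because the two tools do not list results in the same order, and some differences can be legitimate, e.g., because of mlocate's permission filtering or because the snapshots were taken at different times):

# hypothetical spot check for one pattern
PATTERN=somestring
diff <(locate "$PATTERN" | sort) <(grep "$PATTERN" files | sort) | head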

It turns out that one easy way to make grep faster is to set LC_ALL=C. This is unsafe with respect to multibyte characters, but I don't use those very often in file names and never in queries, and my understanding is that, in this case, this option can only add false positives (with low probability), which is OK for me.

for a in 1 2 3; do
  time sed 's/^/LC_ALL=C grep /;s/$/ files/' workload | sh > /dev/null;
done

This takes about 7 and a half minutes, so faster than the vanilla grep solution by a factor of 2.5 to 3.

How can we make this even faster? It turns out that grep is CPU-bound in this case, so with a multicore CPU it is tempting to parallelize it. This is easy to do: we can simply split the list of indexed files into buckets that we store and query separately, merging the results. I use 8 buckets here.

split -n r/8 files files.

This splits files into files.aa, ..., files.ah, in a round-robin fashion.2

Let's now create a script grep.sh to grep the files in parallel, using the parallel command from GNU parallel. It is slightly confusing, but we must re-quote the arguments to the grep.sh script in this strange way so that they can be passed safely to the command invoked by parallel:

#!/bin/bash
ARGS=$(printf " %q" "$@")   # re-quote the arguments to survive the extra shell below
ls files.* | parallel 'bash -c "LC_ALL=C grep '$ARGS' {}"'

The parallel command will spawn as many parallel jobs as there are CPUs (i.e., cores, so 4 in my case). Note that it would not be safe to use xargs or the parallel from moreutils: they could garble the output of two parallel processes writing at the same moment; GNU parallel, by contrast, prevents this (with --group, which is implied by default). Now let's time this:

for a in 1 2 3; do
  time sed 's/^/.\/grep.sh /' workload | sh > /dev/null;
done

This takes around 3 minutes 30 seconds, so about 2 times faster than the non-parallel version. This amounts to about a third of a second per query (plus the overhead of displaying the results, which is not accounted for here), which I think is good enough for my needs.

Compressing the index

One respect in which this grep-based approach is worse than locate is that its index is three times larger than that of locate. The original post about locate conjectured that this explained why locate's performance is worse than that of a naive grep. However, as we will see, it is easy to compress the index and still be much faster than locate. First, let's simply compress the buckets with gzip as we create them:

split -n r/8 --additional-suffix=.gz \
  --filter='gzip > $FILE' files files.

The resulting index is 105 MB in total, so about 8 times smaller than the uncompressed index, and still about 3 times smaller than that of locate. We can then decompress the buckets when querying them, by replacing our search script with:

#!/bin/bash
ARGS=$(printf " %q" "$@")
ls files.*.gz | parallel 'bash -c "gunzip -dc {} | LC_ALL=C grep '$ARGS'"'

The timing of this is 14 minutes, which is 4 times slower than the uncompressed version, but still 4 times faster than locate.

In fact, we can still make this much more efficient. One easy way is to decompress with the pigz command (which is packaged for Debian): it spawns additional threads and is faster. Replacing gunzip with pigz in the above, we get a timing of 11 minutes for the same index size.
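Concretely, only the decompression command in the search script changes; a sketch of the modified line, assuming the same files.*.gz buckets as above:

ls files.*.gz | parallel 'bash -c "pigz -dc {} | LC_ALL=C grep '$ARGS'"'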

Of course, if we want to achieve different tradeoffs between index size and speed, we can look at other compression and decompression algorithms optimized for speed. One possibility is lz4, packaged as liblz4-tool in Debian. The index is then 160 MB in total, which is 50% more than with gzip but half the size of the locate index; replacing gzip with lz4 and timing, we get 8 minutes. So this last solution, which assembles off-the-shelf tools with around 5 lines of bash, is about 7.5 times faster than locate while using an index half the size.
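For reference, here is a sketch of the lz4 variant (assuming the lz4 command from liblz4-tool, which compresses stdin to stdout much like gzip; the .lz4 suffix is my own choice):

# build the compressed buckets
split -n r/8 --additional-suffix=.lz4 \
  --filter='lz4 -q > $FILE' files files.

# search them, as before
ARGS=$(printf " %q" "$@")
ls files.*.lz4 | parallel 'bash -c "lz4 -dcq {} | LC_ALL=C grep '$ARGS'"'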

Result summary

Approach                       Time/query (s)   Index size (MB)
mlocate                        5.5              300
GNU locate                     4.6              75
grep                           1.7              890
LC_ALL grep                    0.7              890
LC_ALL grep (parallel)         0.3              890
LC_ALL grep (parallel) gzip    1.3              105
LC_ALL grep (parallel) pigz    1.0              105
LC_ALL grep (parallel) lz4     0.7              160

Concluding

I am considering replacing locate by the fastest of these approaches (the one with parallelization and no compression) for my own needs, but I just intend to do it as a personal hack; doing it in a reusable way would be much harder.
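For concreteness, here is a minimal sketch of what the indexing side of such a hack could look like; the index directory, script name, and crontab line are arbitrary choices for this sketch (not an existing tool), and the query side would simply be the grep.sh script above, pointed at that directory:

#!/bin/bash
# update-index.sh: rebuild the buckets, meant to be run nightly, e.g. from a
# root crontab line like:  0 4 * * * /usr/local/sbin/update-index.sh
# (unlike mlocate, this does no permission filtering: the index lists all paths)
set -e
DB=/var/cache/grep-locate
mkdir -p "$DB"
find / > "$DB/files.new" 2>/dev/null || true   # tolerate unreadable directories
split -n r/8 "$DB/files.new" "$DB/files."      # overwrite the 8 buckets
rm "$DB/files.new"

A more careful version would write the new buckets to a temporary location and move them into place, so that queries running during the rebuild do not see a partial index.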

Of course, this post is just scratching the surface; the list of file names is not preprocessed in any way to make grep faster, whereas in principle it could be. However, I could not find any standard Unix utility that can do index file > file.idx and search --index=file.idx pattern file, such that file.idx is not too large compared to file and the search operation is much faster than a grep (at least for patterns that are fixed strings).3 It would be fun to implement such a tool, based, e.g., on a suffix tree, but I probably won't bother, as the grep approach is fast enough.

In addition to choosing the best algorithms, a good replacement for locate should also implement the various options of locate, so as to be a drop-in replacement for it. In fact, it would probably be relevant to index more metadata in addition to file names, to optimize searches for, e.g., recently modified files. I am quite surprised that such a tool does not seem to exist. To be precise, it does exist in a stronger form in all the graphical desktop search tools, so this could alternatively be implemented as a usable CLI interface to such tools, intended as a locate replacement.4
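As for indexing more metadata, here is a sketch of how the same pipeline could store a bit of it alongside the names, using find's -printf (the fields and their order are arbitrary choices for this sketch); recently modified files can then be found by sorting on the first field:

# store mtime (epoch seconds) and size next to each path
sudo find / -printf '%T@ %s %p\n' > files
# e.g., the 20 most recently modified indexed files:
sort -rn files | head -n 20 | cut -d' ' -f3-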


  1. All the results I am presenting in this post were obtained by this code, the raw output of which is available (I'm not publishing the workload for privacy reasons). I did other things on the machine while timing, though, so these are just estimates. If you're using these numbers to do anything important, you're insane, as they say. (Also, I'm aware of the crc32 mismatch that was logged at one point, but I have no clue about what caused it and I was not able to reproduce it. Scary...) 

  2. Ideally I would prefer a round-robin that works in blocks of more than one line: it is often useful for the output of a search to have its order somewhat preserved, and there are often runs of contiguous results that would probably be kept in order within their bucket if blocks were longer than one line. However, it doesn't look like split supports this. Another approach would be to use the suitable options of GNU parallel to read one large file and do the round-robin, but this seems slightly less efficient and wouldn't work well with compression. Fortunately, the results of my locate invocations are usually small, so sorting them back as a postprocessing step should be affordable in practice. 

  3. The only relevant tools seem to be full-fledged and heavy indexing solutions like Xapian, Lucene, or Sphinx, tools for document indexing (Recoll), or tools specific to DNA tasks (Bowtie). I tried libdivsufsort, which seems to be (or have been) the state of the art, and which provides some example CLI bindings, but the indexes are roughly five times larger than the input files, so that having to read them means that it is not faster than a grep-based solution. I also did a lot of apt-cache search, to no avail. 

  4. To be precise about what existing desktop search tools seem to be missing to be a suitable locate replacement: having a CLI interface installable separately from the graphical dependencies, so as to be usable on headless servers; having reasonable documentation for that interface; having options to index only file metadata and not content (or being explicit about scalability: will you be able to efficiently maintain an index for my 5 TB of data?); ideally, having a CLI interface designed to be a direct replacement for locate or (more ambitiously) find; and preferably, having a way to update the index with a cron job rather than with mysterious daemons sitting around. 

comments welcome at a3nm<REMOVETHIS>@a3nm.net