a3nm's blog

Managing disk image files, and compression with lrzip

— updated

Here is a random collection of tricks I figured out when managing disk image files and compressing them (and other stuff) with lrzip.

Getting disk image files

When I stop using a machine and think I have migrated everything important, or when a hard drive seems about to die, or when a machine dies and I salvage its hard drive, the easy solution is to just take an image of the drive and keep it as a file on some other medium.

The standard tool for this is dd. It is very powerful, though its syntax is arcane and entirely different from that of any other command I know of, probably for historical reasons. Interesting things about it:

  • Its default operation is quite slow, but it can be made faster by specifying a different block size. However, if you make the size too large, it may become slower again. Apparently the only way to figure out the fastest size is to test different possibilities, which I think is also what gparted does, but from experience it is often around 1M (see the example after this list). Beware that all parameters like count=, seek=, skip= are always given in blocks (not as an absolute size). The size of input and output blocks may be different.

  • While dd does not display any progress information, it can be made to display some by sending it SIGUSR1, as in pkill -USR1 dd. I often use watch to issue this command periodically to see how the operation progresses. [For simpler tasks, an alternative way to get nice progress information is to just use pv, e.g., pv /dev/sda > sda.img. Thanks Mc for the suggestion.]
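
For concreteness, here is a sketch of how this looks in practice (the device name /dev/sdX and the output file name are placeholders to adapt):

dd if=/dev/sdX of=sdX.img bs=1M   # 1M blocks are usually a reasonable speed tradeoff
watch -n 10 'pkill -USR1 dd'      # in another terminal: make dd print its progress every 10 seconds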

When you are copying information from a faulty or failing drive, a better tool is GNU ddrescue. It will attempt to copy the data, but it will skip the sectors that cannot be accessed, to get as much data as possible; it will then progressively refine the missing sectors. Don't miss the possibility of using a logfile as a third argument, to save the progress information and be able to resume the operation later (see manpage).
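
As a sketch (device and file names are again placeholders), a ddrescue invocation with a mapfile looks like this; running the same command again later resumes from where it left off and retries the missing sectors:

ddrescue /dev/sdX sdX.img sdX.map   # the third argument stores the progress information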

When taking disk images, you can either retrieve a partition (e.g., /dev/sda1) or the entire drive (e.g., /dev/sda). Partition image files can be mounted, e.g., with sudo mount -o loop sda1.img /mnt/sda; you can also add -o ro to mount read-only. Whole device files can be examined with, e.g., fdisk -l sda.img, to see the partitions. You can also mount each partition manually:

LOOP=$(losetup -f)                    # find the first unused loop device
fdisk -l sda.img                      # note the partition's start sector and the sector size
losetup --offset N "$LOOP" sda.img    # attach the partition at byte offset N
mount "$LOOP" /mnt/mountpoint

where N is the start offset of the partition in sectors, as indicated by fdisk, multiplied by the sector size (also indicated by fdisk, usually 512 bytes).
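
For instance, assuming fdisk reports that the partition starts at sector 2048 and that the sector size is 512 bytes (hypothetical values), you would compute and use the offset as follows:

N=$((2048 * 512))   # 1048576 bytes
losetup --offset "$N" "$LOOP" sda.img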

Shrinking disk image files

When you are keeping image files for archival purposes, you want to make them as small as possible. For instance, if the file has free space, you want to resize it to the smallest possible size. Note that if you do not do this, the free space will be filled with random garbage, so you cannot even hope that it compresses well. (Of course, before doing this, you must be OK with modifying the image file, and you must be sure that you wouldn't want to recover anything from this space, e.g., a deleted file on the partition.)

For ext2, ext3, and ext4 filesystems, the standard filesystems on GNU/Linux, it turns out that resize2fs works just fine on image files, and has a convenient option to resize to the smallest size: -M. It can also show progress with -p. It requires e2fsck to be run first, which you should run with -C 0 to see progress on stdout, -tt if you want more timing information, -f to force checking even when it seems unnecessary, and -p to automatically fix errors when it is safe to do so.
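
As a sketch, for a hypothetical ext4 partition image sda1.img, this gives:

e2fsck -f -p -C 0 sda1.img   # check the filesystem first (required by resize2fs), showing progress
resize2fs -p -M sda1.img     # then shrink the filesystem to its minimal size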

For FAT filesystems, you should first run fsck.vfat -p, and then use fatresize with -p and with -s giving the target size (to be obtained with fatresize -i IMAGE | grep '^Min'). Afterwards, you need to truncate the file (which fatresize does not do) with truncate -s SIZE, where SIZE is obtained from fatresize -i IMAGE | grep '^Size'.
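
Putting this together, here is a sketch for a hypothetical FAT image fat.img; note that the way of extracting the numbers from the fatresize -i output (taking the last field of the matching lines) is an assumption about its output format:

fsck.vfat -p fat.img
MIN=$(fatresize -i fat.img | grep '^Min' | awk '{print $NF}')    # assumed: last field is the minimal size
fatresize -p -s "$MIN" fat.img
SIZE=$(fatresize -i fat.img | grep '^Size' | awk '{print $NF}')  # assumed: last field is the resulting size
truncate -s "$SIZE" fat.img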

All of the above applies to image files of single partitions; to do the same on full device images, you can operate on a loop device created as explained above, though of course the image file itself will not be shrunk in that case.

In some cases it is not safe to resize partitions. For instance, if you took an image of a full device of an embedded system, it would be possible that resizing partitions and moving them around would confuse the bootloader. In this case, an alternative is to zero out the unneeded space, so that the file is not smaller but at least it will compress better. For extN systems, you can use zerofree to do this.
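
For instance, here is a sketch of zeroing out the free space of an ext partition inside a full-device image, with N the byte offset computed as above; the filesystem must not be mounted (or only mounted read-only) while zerofree runs:

LOOP=$(losetup -f)
losetup --offset N "$LOOP" sda.img
zerofree -v "$LOOP"    # zero out the unused blocks of the filesystem
losetup -d "$LOOP"     # detach the loop device afterwards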

I wrote a script, shrinkimg, to automate such tasks (also compression and decompression). I do not guarantee that it will work for you, but you can always adapt it.

Compressing files

Another important thing to do when archiving files is to compress them. First, using tar without any compression option, you can always bundle your files into a single .tar file to be compressed. There is a tradeoff between compressing each file individually, which makes it easier to access each of them, and tar-ing multiple files together to compress them jointly, which yields a smaller result (redundancies across files can be exploited). My heuristic is to tar together files that are related (e.g., different versions of the same codebase).
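
As a small sketch (the file names are hypothetical), this amounts to something like:

tar -cf codebase-versions.tar codebase-v1/ codebase-v2/   # bundle related files, without compressing yet
lrzip codebase-versions.tar                               # compress the archive, producing codebase-versions.tar.lrz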

Not all compression algorithms are alike. The common ones I have used are the following (from fastest, oldest, and least efficient, to slowest, most recent, and most efficient):

  • gzip, using DEFLATE, so essentially like most ZIP files
  • bzip2, similar but with some improvements, e.g., the BWT
  • xz, using LZMA, essentially the same as 7-zip
  • lrzip, the most recent, which I will describe in more detail.

Unfortunately, lrzip used to have a bug where it would crash if you did not have enough memory, yielding a bogus file, printing no error, and not returning a non-zero exit code. This bug is presumably fixed in later lrzip versions. There is also now a fork of the original lrzip (which I have not tested). In any case, if you plan on using lrzip, I'd strongly recommend testing that the compressed file can be decompressed and correctly returns the original file, and possibly making an independent note of a checksum and of the precise version of the software that you used.
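
For instance, here is a sketch of such a check, assuming the lrzcat helper shipped with lrzip to decompress to stdout:

sha1sum sda.img > sda.img.sha1   # record a checksum of the original file
lrzip sda.img                    # compress, producing sda.img.lrz (the original is kept)
lrzcat sda.img.lrz | sha1sum     # decompress on the fly and compare against the recorded checksum
lrzip -V                         # note the exact version used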

lrzip is packaged for Debian, and performs two independent steps. First, it uses a large sliding window to find and eliminate redundancies in the file even when the occurrences are far apart. Second, it compresses the result using a configurable algorithm, the default one being LZMA, the most efficient but slowest being ZPAQ. It is multithreaded by default at the expense of a small loss of compression efficiency.
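
For reference, here is a sketch of the corresponding invocations, with the flags used in the benchmark below:

lrzip sda.img            # default: LZMA backend, default compression level
lrzip -L 9 sda.img       # increased compression level
lrzip -z -L 9 sda.img    # ZPAQ backend, increased level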

lrzip provides its own benchmarks (see also here), but I wanted to test it on my own data. I used a 19 GiB image file. I tried xz and lrzip with the default compression level or the increased one (-L 9), and with the default LZMA algorithm or with ZPAQ (-z). Take these timings with a grain of salt, as the machine was often busy doing other things; they are wall clock timings (so in particular lrzip can take advantage of parallelization).

Command        Time (s)    Filesize (bytes)
xz -9          10588.75    7300161660
lrzip           4871.58    5734885539
lrzip -L9       5980.53    5693722686
lrzip -z       17852.32    5544631223
lrzip -z -L9   39809.40    5332451600

The default lrzip version took around 20 minutes to decompress; the version obtained with lrzip -L9 -z took around 12 hours (!). I checked that the file decompressed fine, with the same SHA1 checksum, to verify that no corruption had occurred.

So, in brief, lrzip is noticeably more efficient than xz, at the expense of being slower if you use the advanced settings. I think it is worth using on very large images where one can expect long-distance redundancy.

The main drawback of lrzip is that it is RAM-hungry. It tries to allocate only a sensible amount, but you definitely shouldn't run multiple copies in parallel unless you know what you are doing: it may get killed by the OOM killer, starve the machine, or fail to allocate with a message like Failed to malloc ckbuf in hash_search2.

I should also point out an impressive feat of lrzip, where it did amazingly well when compressing a tar archive of a bunch of database backups (each of them being mostly a prefix of the latest one). The .tar was 1.6 GiB, gzip compressed it to 518 MiB, xz to 112 MiB, and lrzip with advanced settings to just 2.5 MiB. In this case lrzip was much faster than the others, because the redundancy elimination went fast, and the size of the relevant data to be compressed was very small indeed!

Thanks to Pablo Rauzy for proofreading this.

comments welcome at a3nm<REMOVETHIS>@a3nm.net