Managing disk image files, and compression with lrzip
Here is a random collection of tricks I figured out while trying to manage disk image files and compress them (and other things) with lrzip.
Getting disk image files
When I stop using a machine and think I have migrated everything important, or when a hard drive seems about to die, or when a machine dies and I salvage its hard drive, the easy solution is to take an image of the drive and keep it as a file on some other medium.
The standard tool for this is dd. It is very powerful, though its syntax is arcane and entirely different from that of any other command I know of, probably for historical reasons. Interesting things about it:
- Its default operation is quite slow, but it can be made faster by specifying a different block size. However, if you make the size too large, it may become slower again. Apparently the only way to figure out the fastest size is to test different possibilities (which I think is also what gparted does), but from experience it is often around 1M. Beware that parameters like count=, seek=, and skip= are always given in blocks (not as an absolute size), and that the input and output block sizes may differ.
- While dd does not display any progress information, it can be made to display some by sending it SIGUSR1, as in pkill -USR1 dd. I often use watch to issue this command periodically and see how the operation progresses (see the sketch after this list). [For simpler tasks, an alternative way to get nice progress information is to just use pv, e.g., pv /dev/sda > sda.img. Thanks Mc for the suggestion.]
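To make this concrete, here is a minimal sketch of the commands above; the device and file names are placeholders, and the -x flag simply restricts pkill to processes named exactly dd:

# In one terminal: image the whole drive with a 1M block size
dd if=/dev/sda of=sda.img bs=1M
# In another terminal: ask dd to report its progress every minute
watch -n 60 'pkill -USR1 -x dd'
# Simpler alternative with built-in progress display
pv /dev/sda > sda.img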
When you are copying information from a faulty or failing drive, a better tool
is GNU ddrescue. It will attempt to
copy the data, but it will skip the sectors that cannot be accessed, to get as much data as
possible; it will then progressively refine the missing sectors. Don't miss the
possibility of using a logfile
as a third argument, to save the progress
information and be able to resume the operation later (see manpage).
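As a hedged sketch (the device, image, and logfile names are placeholders), a typical two-pass ddrescue run with a logfile could look like this:

# First pass: copy everything that can be read without problems
ddrescue /dev/sdb sdb.img sdb.log
# Second pass: retry the sectors that failed, up to 3 times
ddrescue -r3 /dev/sdb sdb.img sdb.log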
When taking disk images, you can either retrieve a single partition (e.g., /dev/sda1) or the entire drive (e.g., /dev/sda). Partition image files can be mounted with, e.g., sudo mount -o loop sda1.img /mnt/sda; you can also use -o ro for read-only access. Whole-device image files can be examined with, e.g., fdisk -l sda.img, to see the partitions. You can also mount each partition manually:
LOOP=$(losetup -f)                    # find a free loop device
fdisk -l sda.img                      # note the partition's start offset and the block size
losetup --offset N "$LOOP" sda.img    # attach the image at byte offset N (see below)
mount "$LOOP" /mnt/mountpoint
where N is the offset of the partition in blocks as indicated by fdisk, multiplied by the block size (also indicated by fdisk, usually 512 bytes).
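For instance, assuming fdisk reports a partition starting at sector 2048 with 512-byte sectors (these numbers are just an illustration), mounting it read-only could look like:

LOOP=$(losetup -f)
losetup --offset $((2048 * 512)) "$LOOP" sda.img   # 2048 blocks of 512 bytes each
mount -o ro "$LOOP" /mnt/mountpoint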
Shrinking disk image files
When you are keeping image files for archival purposes, you want to make them as small as possible. For instance, if the filesystem has free space, you want to resize it to the smallest possible size. Note that if you do not do this, the free space will be filled with random garbage, so you cannot even hope that it compresses well. (Of course, before doing this, you must be OK with modifying the image file, and you must be sure that you will not want to recover anything from this space, e.g., a deleted file on the partition.)
For ext2, ext3, and ext4 filesystems, the standard filesystems on GNU/Linux, it turns out that resize2fs works just fine on image files, and has a convenient option to resize to the smallest size: -M. It can also show progress with -p. It requires e2fsck to be run first, which you should run with -C 0 to see progress on stdout, -tt if you want more timing information, -f to force checking even when it seems unnecessary, and -p to automatically fix errors when it is safe to do so.
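Putting these flags together, a sketch of the whole sequence on a partition image (the file name is a placeholder):

e2fsck -f -p -C 0 -tt sda1.img   # check the filesystem first, with progress and timings
resize2fs -p -M sda1.img         # then shrink the filesystem to its minimal size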
For FAT filesystems, you should first run fsck.vfat -p, and you can then use fatresize, with -p, and with -s giving the size (to be obtained with fatresize -i IMAGE | grep '^Min'). Afterwards, you need to truncate the file (which fatresize does not do) with truncate -s SIZE, with the SIZE obtained from fatresize -i IMAGE | grep '^Size'.
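As a sketch (the file name and sizes are placeholders; check the exact output of fatresize -i on your system, as I am not sure the labels are identical across versions):

fsck.vfat -p sda1.img             # repair the FAT filesystem first
fatresize -i sda1.img             # note the minimal size and the total size
fatresize -p -s MINSIZE sda1.img  # shrink the filesystem to the minimal size
truncate -s SIZE sda1.img         # then truncate the file, which fatresize does not do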
All of the above applies to image files of single partitions; for full-device images, you can perform the same operations on a loop device created as explained above, though of course the image file itself will not be shrunk in that case.
In some cases it is not safe to resize partitions. For instance, if you took a full-device image of an embedded system, resizing partitions and moving them around could confuse the bootloader. In this case, an alternative is to zero out the unneeded space: the file does not get smaller, but at least it will compress better. For extN filesystems, you can use zerofree to do this.
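For instance (the file name is a placeholder; the filesystem must not be mounted while you do this):

zerofree -v sda1.img   # write zeros over the unallocated blocks of an extN image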
I wrote a script, shrinkimg, to automate such tasks (also compression and decompression). I do not guarantee that it will work for you, but you can always adapt it.
Compressing files
Another important thing to do when archiving files is to compress them.
First, using tar without any compression option, you can always reduce your files to a single .tar file to be compressed. There is a tradeoff between compressing each file individually, which makes it easier to access each of them, and tar-ing multiple files together to compress them jointly, which will yield a smaller result (redundancies across files can be exploited). My heuristic is to tar together files that are related (e.g., different versions of the same codebase).
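For instance, with placeholder names, grouping related directories before compressing them as one unit:

tar -cf myproject-versions.tar myproject-v1/ myproject-v2/   # archive without compression
xz -9 myproject-versions.tar                                 # compress the whole archive at once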
Not all compression algorithms are alike. The common ones I have used are the following (from fastest, oldest, and least efficient, to slowest, most recent, and most efficient):
- gzip, using DEFLATE, so essentially like most ZIP files
- bzip2, similar but with some improvements, e.g., the BWT
- xz, using LZMA, essentially the same as 7-zip
- lrzip, the most recent, which I will describe in more detail.
Unfortunately, lrzip used to have a bug where it would crash if you did not have enough memory, yielding a bogus file, printing no error, and not returning a non-zero exit code. This bug is presumably fixed in later lrzip versions. There is also now a fork of the original lrzip (which I did not test). In any case, if you plan on using lrzip, I would strongly recommend testing that the compressed file can be decompressed and correctly returns the original file, and possibly making an independent note of a checksum and of the precise version of the software that you used.
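A minimal sketch of such a round-trip check, with placeholder file names and assuming the -o and -f options of lrunzip behave as they do on my machine:

sha1sum sda.img                           # note the checksum of the original file
lrzip sda.img                             # compress to sda.img.lrz, keeping the original
lrunzip -f -o sda.img.check sda.img.lrz   # decompress to a separate file
sha1sum sda.img.check                     # should print the same checksum as above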
lrzip is packaged for Debian, and performs two independent steps. First, it uses a large sliding window to find and eliminate redundancies in the file even when the occurrences are far apart. Second, it compresses the result using a configurable algorithm, the default one being LZMA, the most efficient but slowest being ZPAQ. It is multithreaded by default at the expense of a small loss of compression efficiency.
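For reference, the basic invocations discussed below (the file name is a placeholder):

lrzip sda.img            # redundancy elimination, then LZMA (the default back end)
lrzip -L 9 sda.img       # same, with the maximum compression level
lrzip -z -L 9 sda.img    # use ZPAQ instead of LZMA: smaller output, much slower
lrunzip sda.img.lrz      # decompress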
lrzip provides its own benchmarks (see also here), but I wanted to test it on my own data. I used a 19 GiB image file. I tried xz and lrzip, with the default compression level or the increased one (-L 9), and with the default LZMA algorithm or with ZPAQ (-z). Take these timings with a grain of salt, as the machine was often busy doing other things; they are wall-clock timings (so in particular lrzip can take advantage of parallelization).
Command | Time (s) | Filesize (bytes)
---|---|---
xz -9 | 10588.75 | 7300161660
lrzip | 4871.58 | 5734885539
lrzip -L9 | 5980.53 | 5693722686
lrzip -z | 17852.32 | 5544631223
lrzip -z -L9 | 39809.40 | 5332451600
The default lrzip version took around 20 minutes to decompress; the version obtained with lrzip -L9 -z took around 12 hours (!). I checked that the files decompressed fine and yielded the same SHA1 checksum as the original image, to verify that no corruption had occurred.
So, in brief, lrzip is noticeably more efficient than xz, at the expense of being slower if you use the advanced settings. I think it is worth using on very large images where one can expect long-distance redundancy.
The main drawback of lrzip is that it is RAM-hungry. It tries to allocate only a sensible amount, but you definitely should not run multiple copies in parallel unless you know what you are doing: they may get killed by the OOM killer, starve the machine, or fail to allocate memory with a message like Failed to malloc ckbuf in hash_search2.
I should also point out an impressive feat of lrzip
, where it did amazingly
well when compressing a tar
archive of a bunch of database backups (each of
them being mostly a prefix of the latest one). The .tar
was 1.6 GiB, gzip
compressed it to 518 MiB, xz
to 112 MiB, and lrzip
with advanced settings to
just 2.5 MiB. In this case lrzip
was much faster than the others, because the
redundancy elimination went fast, and the size of the relevant data to be
compressed was very small indeed!
Thanks to Pablo Rauzy for proofreading this.