Here is a random collection of tricks I figured out for managing disk image files and compressing them (and other things) with lrzip.
Getting disk image files
When I stop using a machine and think I have migrated everything important, or when a hard drive seems about to die, or when a machine dies and I salvage its hard drive, the easiest solution is to take an image of the drive and keep it as a file on some other medium.
The standard tool for this is dd. It is very powerful, though its syntax is arcane and entirely different from any other command I know of, probably for historical reasons. Interesting things about it:
Its default operation is quite slow, but it can be made faster by specifying a larger block size. However, if you make the blocks too large, it may become slower again. Apparently the only way to find the fastest size is to test different possibilities (which I think is also what gparted does), but from experience it is often around 1M. Beware that parameters like skip= are always given in blocks (not as an absolute size). The sizes of input and output blocks may differ.
dd does not display any progress information, but it can be made to print some by sending it SIGUSR1, as in pkill -USR1 dd. I often use watch to issue this command periodically and see how the operation progresses. [For simpler tasks, an alternative way to get nice progress information is to just use pv /dev/sda > sda.img. Thanks Mc for the suggestion.]
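As a sketch of the block-size effect, you can benchmark dd on a throwaway file (the file name and sizes below are arbitrary examples):

```shell
# Create a 64 MiB test file, then compare dd throughput with a small and a
# large block size; GNU dd reports the transfer rate on stderr when done.
dd if=/dev/zero of=testfile bs=1M count=64 2>/dev/null
dd if=testfile of=/dev/null bs=512   # many small reads: slower
dd if=testfile of=/dev/null bs=1M    # fewer, larger reads: usually faster
# Recent GNU dd can also show live progress without SIGUSR1:
# dd if=/dev/sda of=sda.img bs=1M status=progress
```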
When you are copying data from a faulty or failing drive, a better tool is GNU ddrescue. It will attempt to copy the data but skip the sectors that cannot be accessed, to salvage as much data as possible; it will then progressively refine the missing sectors. Don't miss the possibility of passing a logfile as a third argument, to save the progress information and be able to resume the operation later (see the manpage).
When taking disk images, you can either retrieve a single partition (e.g., /dev/sda1) or the entire drive (e.g., /dev/sda). Partition image files can be mounted, e.g., with sudo mount -o loop sda1.img /mnt/sda; you can also add ro for read-only. Whole-device image files can be examined with, e.g., fdisk -l sda.img, to see the partitions. You can also mount each partition manually:
LOOP=$(losetup -f)
fdisk -l sda.img
losetup --offset N "$LOOP" sda.img
mount "$LOOP" /mnt/mountpoint
N is the offset of the partition in blocks, as indicated by fdisk, multiplied by the block size (also indicated by fdisk, usually 512 bytes).
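For instance (the start block and block size below are example values you would read off the fdisk -l output):

```shell
# Example: fdisk -l reports the partition starting at block 2048,
# with 512-byte blocks (both values are typical but illustrative).
START_BLOCK=2048
BLOCK_SIZE=512
OFFSET=$((START_BLOCK * BLOCK_SIZE))
echo "$OFFSET"   # 1048576: the byte offset to pass to losetup --offset
# sudo losetup --offset "$OFFSET" "$(losetup -f)" sda.img   # needs root
```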
Shrinking disk image files
When you are keeping image files for archival purposes, you want to make them as small as possible. For instance, if the filesystem in the image has free space, you want to resize it to the smallest possible size. Note that if you do not do this, the free space will be filled with leftover garbage, so you cannot even hope that it compresses well. (Of course, before doing this, you must be OK with modifying the image file, and you must be sure that you wouldn't want to recover anything from this space, e.g., a deleted file on the partition.)
For ext2, ext3, and ext4 filesystems, the standard filesystems on GNU/Linux, it turns out that resize2fs works just fine on image files, and has a convenient option to resize to the smallest possible size: -M. It can also show progress with -p. It requires e2fsck to be run first, which you should run with -C 0 to see progress, -tt if you want more timing information, -f to force checking even when it seems unnecessary, and -p to automatically fix errors when it is safe to do so.
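A minimal end-to-end sketch of this (mkfs.ext2 is only used here to fabricate a small test image; names and sizes are arbitrary):

```shell
# Fabricate a small ext2 partition image to experiment on.
dd if=/dev/zero of=part.img bs=1M count=8 2>/dev/null
mkfs.ext2 -F -q part.img
# Check first (required by resize2fs), then shrink to the minimum size.
e2fsck -f -p part.img
resize2fs -M part.img
# Note: resize2fs shrinks the filesystem, not the file; the image can then
# be truncated to the new filesystem size.
```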
For FAT filesystems, you should first run fsck.vfat -p, and can then use fatresize, with -p, and with -s giving the size (to be obtained with fatresize -i IMAGE | grep '^Min'). Afterwards, you need to truncate the file (which fatresize does not do) with truncate -s SIZE, with the SIZE obtained from fatresize -i IMAGE | grep '^Size'.
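For example, assuming the reported size were 134217728 bytes (a made-up value for illustration):

```shell
# Shrink the image file itself down to the size the filesystem occupies.
# 134217728 (128 MiB) is an example value read off the fatresize -i output.
truncate -s 134217728 fat.img
ls -l fat.img   # the file is now exactly 134217728 bytes
```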
All of the above applies to image files of single partitions; to do the same on full-device images, you can operate on a loop device created as explained above, though of course the image file itself will not be shrunk then.
In some cases it is not safe to resize partitions. For instance, if you took an image of the full device of an embedded system, resizing partitions and moving them around might confuse the bootloader. In this case, an alternative is to zero out the unneeded space, so that the file is not smaller but at least it will compress better. For extN filesystems, you can use a tool such as zerofree to do this.
I wrote a script, shrinkimg, to automate such tasks (also compression and decompression). I do not guarantee that it will work for you, but you can always adapt it.
Another important thing to do when archiving files is to compress them. Using tar without any compression option, you can always reduce your files to a single .tar file to be compressed. There is a tradeoff between compressing each file individually, which makes it easier to access each of them, and tar-ing multiple files together to compress them jointly, which will yield a smaller result (redundancies across files can be exploited). My heuristic is to tar together files that are related (e.g., different versions of the same codebase).
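The effect is easy to reproduce. In this sketch (synthetic data, with xz standing in for any compressor whose window is large enough to span both files), two near-identical files compress to little more than one once tarred together:

```shell
# Two "versions" of a file that are almost identical.
seq 1 20000 > v1.txt
{ seq 1 20000; echo one-extra-line; } > v2.txt
# Compress each file individually...
xz -k9 v1.txt v2.txt
# ...versus tar-ing them together first: xz's large window lets it encode
# the second copy mostly as back-references to the first.
tar -cf versions.tar v1.txt v2.txt
xz -9 versions.tar
ls -l v1.txt.xz v2.txt.xz versions.tar.xz
```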
Not all compression algorithms are alike. The common ones I have used are the following (from fastest, oldest, and least efficient, to slowest, most recent, and most efficient):
- gzip, using DEFLATE, so essentially like most ZIP files
- bzip2, similar but with some improvements, e.g., the BWT
- xz, using LZMA, essentially the same as 7-zip
- lrzip, the most recent, which I will describe in more detail.
Unfortunately, lrzip used to have a bug where it would crash when it did not have enough memory, yielding a bogus file, printing no error, and returning a zero exit code. This bug is presumably fixed in later lrzip versions. There is also now a fork of the original lrzip (which I did not test). In any case, if you plan on using lrzip, I strongly recommend testing that the compressed file can be decompressed and correctly returns the original file, and possibly making an independent note of a checksum and of the precise version of the software you used.
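The verification routine I mean is just the following (sketched with gzip so it runs anywhere; with lrzip you would substitute lrzip and lrunzip, and note lrzip --version alongside the checksum):

```shell
seq 1 100000 > data.img            # stand-in for a real disk image
sha1sum data.img > data.img.sha1   # independent checksum, kept aside
gzip -k data.img                   # compress (with lrzip: lrzip data.img)
# Later: decompress to a stream and compare against the recorded checksum.
gzip -dc data.img.gz | sha1sum | awk '{print $1}' > roundtrip.sha1
awk '{print $1}' data.img.sha1 | cmp -s - roundtrip.sha1 && echo "archive OK"
```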
lrzip is packaged for Debian, and performs two independent steps. First, it uses a large sliding window to find and eliminate redundancies in the file even when the occurrences are far apart. Second, it compresses the result using a configurable algorithm, the default one being LZMA, the most efficient but slowest being ZPAQ. It is multithreaded by default at the expense of a small loss of compression efficiency.
lrzip provides its own benchmarks, but I wanted to test it on my own data. I used a 19 GiB image file. I tried the default compression level and the increased one (-L 9), with the default LZMA algorithm and with ZPAQ (-z). Take these timings with a grain of salt, as the machine was often busy doing other things; they are wall-clock timings (so in particular lrzip can take advantage of parallelization).
| Command | Time (s) | Filesize (bytes) |
|---------|----------|------------------|
The plain lrzip version took around 20 minutes to decompress, while the lrzip -L9 -z version took around 12 hours (!). I checked that the files decompressed fine, with the same SHA1 as the original image, to verify that no corruption had occurred.
So, in brief, lrzip is noticeably more efficient than xz, at the expense of being slower if you use the advanced settings. I think it is worth using on very large images where one can expect long-distance redundancy.
The main drawback of lrzip is that it is RAM-hungry. It tries to allocate only a sensible amount, but you definitely shouldn't run multiple copies in parallel unless you know what you are doing: they may get killed by the OOM killer, starve the machine, or fail to allocate memory with a message like Failed to malloc ckbuf.
I should also point out an impressive feat of lrzip: it did amazingly well when compressing a tar archive of a bunch of database backups (each of them being mostly a prefix of the latest one). The .tar was 1.6 GiB; gzip compressed it to 518 MiB, xz to 112 MiB, and lrzip with advanced settings to just 2.5 MiB. In this case lrzip was also much faster than the others, because the redundancy elimination went fast, and the amount of relevant data left to compress was very small indeed!
Thanks to Pablo Rauzy for proofreading this.