The Dropship
affair is, to my knowledge, the first example of an interesting
attack on deduplication mechanisms, so I thought I'd present it here
for people who might not want to read about the whole story. Some of the
ideas in this post come from comments on
Hacker News, especially from this comment by
limmeau and the replies by alex1 and 0x0.
Data
deduplication is a natural and well-known idea: when you're storing
redundant data, you want to implement something at the filesystem level
so that multiple copies of the same data are stored only once, saving
disk space. This is especially true for personal hosting services like
Dropbox, where some big files will be uploaded by a lot of users
(redundancy) and seldom modified (efficiency, because you only have to
actually copy the data when a user creates a modified version). [I know
that deduplication is done on blocks rather than
files, but let's simplify.]
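To make this concrete, here is a minimal sketch of a deduplicating store
(a toy model of my own, with SHA-256 playing the role of the secure hash;
real systems deduplicate blocks and persist to disk):

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical contents are kept only once."""

    def __init__(self):
        self.blobs = {}   # hash -> file contents
        self.files = {}   # (user, name) -> hash

    def put(self, user, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # store the bytes only once
        self.files[(user, name)] = digest

    def get(self, user, name):
        return self.blobs[self.files[(user, name)]]
```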
Now, a natural extension of the scheme for personal hosting services
is to do the deduplication on the client side. When the user wants to
upload something, you ask them to give you a hash of the content first,
and check if you already have a file with that hash. If you do, you
assume it's the same file (collisions would be both unlikely and
infeasible to force with the
right sort of hash function, and we will indeed assume that the
hash used is secure). This scheme both saves space on the server (you
only keep one copy of the file) and saves bandwidth (because you don't
even have to transfer the file). Sounds great, doesn't it?
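In code, the client side of this scheme might look as follows (a sketch;
has_blob, link, and store are hypothetical names for whatever the real
protocol messages are, not Dropbox's actual API):

```python
import hashlib

def client_upload(server, data):
    digest = hashlib.sha256(data).hexdigest()
    if server.has_blob(digest):
        server.link(digest)          # claim the existing copy: no transfer
    else:
        server.store(digest, data)   # only unknown content crosses the wire
```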
Well, the slight problem is that, given a hash, this allows the
client to know whether any user on the server has a file with this hash,
and to get a copy of it if there is one (since the server will happily
link the existing file to your account and let you download it). That is,
assuming that you know how
the client and server communicate; but unless you rely on security by
obscurity, this should be assumed anyway, since the client runs on
untrusted hardware. (As a side note, the Dropbox client is actually
closed-source, but this didn't protect them, though I guess Dropbox has
better things to do than trying to obfuscate things.)
Hence, we can imagine at least two attacks:
- File sharing: Instead of sharing files, you can just share hashes and
get the file from Dropbox, which is more convenient and might make
Dropbox liable if you use it for illegal purposes such as copyright
infringement.
- Spying: You can try to guess whether someone is hosting a file on
Dropbox by testing if the hash is known. In particular, if you have a
redacted document for which you can generate possible values for the
redacted content, and for which you think an unredacted version is on
Dropbox, you can test all the values and see which one Dropbox knows
(see the sketch after this list).
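Here is a sketch of the spying attack's brute-force loop (has_blob stands
for the hypothetical "is this hash known?" query from before, and the
template and candidate values are whatever you can guess about the
document):

```python
import hashlib

def unredact(template, candidates, has_blob):
    """Brute-force a redacted field using the dedup oracle: the candidate
    whose completed document hashes to a known blob is very likely the truth."""
    for value in candidates:
        document = template.replace(b"[REDACTED]", value)
        if has_blob(hashlib.sha256(document).hexdigest()):
            return value
    return None

# e.g., guessing a redacted salary over a plausible range:
# unredact(contract, (str(n).encode() for n in range(20000, 200001)), has_blob)
```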
Neither attack is truly novel, because public file sharing sites like
Megaupload will let you exchange files by just transmitting URLs anyway,
and because you shouldn't put sensitive data on Dropbox to start with
(yes, they encrypt your
files, but I think it's done on the server, and the client is
proprietary anyway). However, I find these attacks quite elegant.
As a purely useless exercise, I will now concentrate on the file
sharing attack, forgetting that public file sharing services exist.
limmeau's comment suggests the following protection against the attack:
along with requesting the hash, the server can ask the client for the
bytes of the file at random offsets of its choosing. It seems to me
that, if the client only holds a constant-size amount of information
about the file, the probability of the client managing to pass the
challenge decreases as the file size grows (of course, usual disclaimers
apply). What's more, the overhead in bandwidth and computation for
this scheme (both on client and server) is quite small.
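A sketch of this challenge in the same toy setting (the value of k, the
message shapes, and the use of a CSPRNG are my choices, not part of the
original suggestion):

```python
import secrets

def make_challenge(stored_data, k=16):
    """Server side: pick k random offsets into the copy it already stores.
    (A CSPRNG, so the client cannot predict the offsets.)"""
    offsets = [secrets.randbelow(len(stored_data)) for _ in range(k)]
    expected = bytes(stored_data[o] for o in offsets)
    return offsets, expected

def answer_challenge(data, offsets):
    """Client side: answerable only by someone who holds the whole file."""
    return bytes(data[o] for o in offsets)
```

If the answer doesn't match, the server simply demands a full upload.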
A possible refinement of the attack (alex1's comment) would be to
have a network where attackers can forward the server's challenges to
people who actually have the file and relay back their responses. We
have added the constraint that the exchange must take place once the
server challenge is known, but the file can still be retrieved from the
server with the attacker exchanging only a constant-size amount of
information with a file-holding accomplice.
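The relay itself is trivial to sketch: nothing file-sized ever travels
between attacker and accomplice, only the challenge and its answer:

```python
# The attacker claims to own the file, then forwards the k offsets to an
# accomplice who really holds it, and relays the k bytes back.
def relayed_answer(offsets, ask_accomplice):
    return ask_accomplice(offsets)   # constant-size traffic, whatever the file size
```

Can we prevent that?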
The answer is yes, if the server and client share some amount of
information unknown to the accomplices. For instance, the server could
require the client to keep an entropy pool generated from the client's
uploads to the server (and bounded by a size C, because the
hosting service isn't very useful otherwise). The server could then
build challenges requiring knowledge of the entropy pool, like
requesting the checksum of the file using the pool as salt, which would
require the attacker to exchange more information with the accomplice:
what we would hope to achieve, if the pool is full, is to force the
exchange to require min(C, N) bytes, with N the size
of the file.
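A sketch of one such scheme (the pool construction here is the simplest
thing I could write down, not a vetted design; the nonce prevents
replaying an old answer):

```python
import hashlib

def update_pool(pool, uploaded_data, C=1 << 20):
    """Toy pool: keep the last C bytes of uploaded content as shared state.
    The server can recompute this from the uploads it already stores."""
    return (pool + uploaded_data)[-C:]

def salted_checksum(nonce, pool, data):
    """Challenge response: a checksum of the file keyed by the shared pool.
    An accomplice who has the file but not the pool can't compute it, so
    relaying now costs up to C extra bytes (shipping the pool over)."""
    return hashlib.sha256(nonce + pool + data).hexdigest()
```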
There are still a lot of questions we could ask about that, though.
What happens if we take into account the fact that there are multiple
transactions? Also, can we do this in a way which is more efficient for
the server? Checking random bytes of the file was cheap; computing a
checksum over the whole file using the entropy pool as salt is
expensive. Keeping a copy
of the pool takes up some space too, though it should be possible to
store it efficiently since it is computed from uploads which actually
reside on the server.