Deduplication attacks

The Dropship affair is, to my knowledge, the first example of an interesting attack on deduplication mechanisms which I thought I'd present it here for people who might not want to read about the whole story. Some of the ideas in this post come from comments on Hackernews, especially from this comment by limmeau and replies by alex1 and 0x0.

Data deduplication is a natural and well-known idea: when you're storing redundant data, you want to implement something at the filesystem level to store multiple copies of the same data at the same place and save disk space. This is especially true for personal hosting services like Dropbox, where some big files will be uploaded multiple times by a lot of users (redundancy) and seldom modified (better efficiency because you have to actually copy the data to create the modified version when this happens). [I know that deduplication is done on blocks rather than files, but let's simplify.]

Now, a natural extension of the scheme for personal hosting services is to do the deduplication on the client side. When the user wants to upload something, you ask them to give you a hash of the content first, and check if you already have a file with that hash. If you do, you assume it's the same file (collisions would be both unlikely and impossible to force if you use the right sort of hash functions, and we will indeed assume that the hash used is secure). This scheme both saves space on the server (you only keep one copy of the file) and saves bandwidth (because you don't even have to transfer the file). Sounds great, doesn't it?

Well, the slight problem is that, given a hash, this allows the client to know if any user on the server has a file with this hash, and to get a copy of it if there is one. That is, assuming that you know how the client and server communicate, but unless you rely on security by obscurity, this should be assumed anyway since the client runs on untrusted hardware. (As a side note, the Dropbox client is actually closed source, but this didn't protect them, though I guess Dropbox has better things to do than trying to obfuscate things.)

Hence, we can imagine at least two attacks:

File sharing: Instead of sharing files, you can just share hashes and get the file from Dropbox, which is more convenient and might make Dropbox liable if you use this for illegal purposes such as copyright infringement.
Spying: You can try to guess if someone is hosting a file on Dropbox by testing if the hash is known. In particular, if you have a redacted document for which you can generate possible values for the redacted content and for which you think an unredacted version is on Dropbox, you can test all values and see which one Dropbox knows.

These are not unique attacks, because public file sharing sites like Megaupload will let you exchange files by just transmitting URLs anyway, and because you shouldn't put sensitive data on Dropbox to start with (yes, they encrypt your files, but I think it's done on the server, and the client is proprietary anyway). However, I find this quite elegant.

As a purely useless exercise, I will now concentrate on the file sharing attack, forgetting that public file sharing services exist. limmeau's comment suggests the following protection against the attack: the server can ask the client for bytes of the file at random offsets of its choosing along with requesting the hash. It seems to me that, if the client only has constant size information about the file, the probability of the client managing to pass the challenge decreases as the file size grows (of course, usual disclaimers apply). What's more, the overhead in bandwidth and computation for this scheme (both on client and server) is quite small.

A possible refinement of the attack (alex1's comment) would be to have a network where attackers can forward the server challenges and get the responses from people who actually have the file. We have added the constraint that the exchange must take place when the server challenge is known, but the file can still be retrieved from the server with the attacker exchanging a constant size amount of information with a file-holding accomplice. Can we prevent that?

The answer is yes, if the server and client share some amount of information unknown to the accomplices. For instance, the server could require the client to keep an entropy pool generated from the client's uploads to the server (and bounded by a size C, because the hosting service isn't very useful otherwise). The server could then build challenges requiring knowledge of the entropy pool, like requesting the checksum of the file using the pool as salt, which would require the attacker to exchange more information with the accomplice: what we would hope to achieve if the pool is full would be to force the exchange to require min(C, N) with N the size of the file.

There are still a lot of questions we could ask about that, though. What happens if we take into account the fact that there are multiple transactions? Also, can we do this in a way which is more efficient for the server? Checking random bytes of the file was easy; computing a checksum over it using the entropy pool as salt is hard. Keeping a copy of the pool takes up some space too, though it should be possible to store it efficiently since it is computed from uploads which actually reside on the server.