a3nm's blog

Which Creative Commons license?

Why can't people just stop writing "under a Creative Commons license" and actually say which one they have in mind? "Under a CC license" only indicates that attribution is required and that verbatim non-commercial reproduction is allowed, but does nothing to identify where you fall between the almost all-permissive CC-BY and the much more restrictive CC-BY-NC-ND.

Deduplication attacks

— updated

The Dropship affair is, to my knowledge, the first example of an interesting attack on deduplication mechanisms which I thought I'd present it here for people who might not want to read about the whole story. Some of the ideas in this post come from comments on Hackernews, especially from this comment by limmeau and replies by alex1 and 0x0.

Data deduplication is a natural and well-known idea: when you're storing redundant data, you want to implement something at the filesystem level to store multiple copies of the same data at the same place and save disk space. This is especially true for personal hosting services like Dropbox, where some big files will be uploaded multiple times by a lot of users (redundancy) and seldom modified (better efficiency because you have to actually copy the data to create the modified version when this happens). [I know that deduplication is done on blocks rather than files, but let's simplify.]

Now, a natural extension of the scheme for personal hosting services is to do the deduplication on the client side. When the user wants to upload something, you ask them to give you a hash of the content first, and check if you already have a file with that hash. If you do, you assume it's the same file (collisions would be both unlikely and impossible to force if you use the right sort of hash functions, and we will indeed assume that the hash used is secure). This scheme both saves space on the server (you only keep one copy of the file) and saves bandwidth (because you don't even have to transfer the file). Sounds great, doesn't it?

Well, the slight problem is that, given a hash, this allows the client to know if any user on the server has a file with this hash, and to get a copy of it if there is one. That is, assuming that you know how the client and server communicate, but unless you rely on security by obscurity, this should be assumed anyway since the client runs on untrusted hardware. (As a side note, the Dropbox client is actually closed source, but this didn't protect them, though I guess Dropbox has better things to do than trying to obfuscate things.)

Hence, we can imagine at least two attacks:

File sharing
Instead of sharing files, you can just share hashes and get the file from Dropbox, which is more convenient and might make Dropbox liable if you use this for illegal purposes such as copyright infringement.
Spying
You can try to guess if someone is hosting a file on Dropbox by testing if the hash is known. In particular, if you have a redacted document for which you can generate possible values for the redacted content and for which you think an unredacted version is on Dropbox, you can test all values and see which one Dropbox knows.

These are not unique attacks, because public file sharing sites like Megaupload will let you exchange files by just transmitting URLs anyway, and because you shouldn't put sensitive data on Dropbox to start with (yes, they encrypt your files, but I think it's done on the server, and the client is proprietary anyway). However, I find this quite elegant.

As a purely useless exercise, I will now concentrate on the file sharing attack, forgetting that public file sharing services exist. limmeau's comment suggests the following protection against the attack: the server can ask the client for bytes of the file at random offsets of its choosing along with requesting the hash. It seems to me that, if the client only has constant size information about the file, the probability of the client managing to pass the challenge decreases as the file size grows (of course, usual disclaimers apply). What's more, the overhead in bandwidth and computation for this scheme (both on client and server) is quite small.

A possible refinement of the attack (alex1's comment) would be to have a network where attackers can forward the server challenges and get the responses from people who actually have the file. We have added the constraint that the exchange must take place when the server challenge is known, but the file can still be retrieved from the server with the attacker exchanging a constant size amount of information with a file-holding accomplice. Can we prevent that?

The answer is yes, if the server and client share some amount of information unknown to the accomplices. For instance, the server could require the client to keep an entropy pool generated from the client's uploads to the server (and bounded by a size C, because the hosting service isn't very useful otherwise). The server could then build challenges requiring knowledge of the entropy pool, like requesting the checksum of the file using the pool as salt, which would require the attacker to exchange more information with the accomplice: what we would hope to achieve if the pool is full would be to force the exchange to require min(C, N) with N the size of the file.

There are still a lot of questions we could ask about that, though. What happens if we take into account the fact that there are multiple transactions? Also, can we do this in a way which is more efficient for the server? Checking random bytes of the file was easy; computing a checksum over it using the entropy pool as salt is hard. Keeping a copy of the pool takes up some space too, though it should be possible to store it efficiently since it is computed from uploads which actually reside on the server.

Representing fuel efficiency

— updated

In Europe, fuel efficiency is usually expressed as fuel consumption, which is the number of liters you use for every 100 kilometers you travel (in French, "litres au 100").

The funny thing is that, from the point of view of physics, "litres au 100" are homogeneous to a surface (because liters are a unit for volume and kilometers a unit for distance). For example, if you take a typical consumption of 10 liters per 100 km, this means a consumption of 10-2 cubic meters for 106 meters, ie 10-8 square meters (or 0.01 square millimeters, ie. a square 0.1 mm by 0.1 mm).

The most incredible thing, however, is that this way of imagining things lends itself to a very clear visual interpretation which makes it possible to see the quantity of fuel used when travelling a given distance. Imagine that the surface representing the consumption is some kind of antenna over the car. When the car is running, the quantity of petrol used is the volume traversed by the surface.

In the US, however, you don't use fuel consumption that much; you talk about fuel economy, expressed as miles per gallon. Not only do Americans use of imperial units rather than SI units, but they're also thinking the other way round: whereas the French indicate the number of liters to use for a given distance, Americans indicate how far you can go with a given quantity of petrol. This means that the unit of fuel economy is the inverse of a surface. Can we get a visual interpretation out of that too?

The answer is yes, though the interpretation isn't quite as straightforward as in the previous case. Assuming that we have a given volume of fuel and that we know the value of the fuel economy, we would like to see the distance which we can travel. The idea is to see the petrol tank as a cylinder with a base surface S and a height h, and imagine your fuel efficiency of x (in m-2) as indicating that you get x "efficiency points" per square meter. (This is quite logical: a frequency of 42 Hz, ie. 42 s-1, means that you encounter 42 "beats" per second.)

Now, imagine a plane which sweeps through the tank while staying parallel to the base. The intersection between the tank and the plane contains a constant number of efficiency points (because it is always of surface S), and you can see them as travelling along with the plane. It happens that the total distance travelled by all of them (ie. h times the number of efficiency points in surface S) is the distance which you can travel.

Of course, if the number of points is fractional, just count the fractional part as appearing only for that fraction of the sweeping, ie. if you have 2.5 points, then count 2.5 times h. Another remark: in fact, you could arrange your fuel in any shape you want, and integrate the number of efficiency points over the (changing) size of the intersection of the sweeping plane with the tank. The assumption that it is arranged in a cylindrical volume is only there for simplicity's sake.

It feels a bit annoying that this second visualization seems more complicated than the first one although it seems to be more or less the same idea. I don't know if this could count as a proof that European way to do things is better than the American one? :-P

Thanks to Claude Perdigou for help in finding the second visualization.

Some more explanations on xkcd What If? and Physics.SE.

Wikileaks is so depressingly unsecure

— updated
Wikileaks is the first concrete realisation of the crypto-anarchist dream: completely anonymous leaking, dealing blows to tyranny. (Source.)

I can't help but feel a bit depressed whenever I read something like that. Yes, Wikileaks is probably following the spirit of crypto-anarchy, but it is surprisingly lacking as far as the crypto part is concerned. You'd think that it would be using technologies such as Freenet to ensure that submitters cannot be traced and the Wikileaks site cannot be censored, or GPG keys to authenticate their releases, or HTTPS to serve their website (it's not working). Instead, Wikileaks is using ordinary technologies, and is trying to solve problems as they can. The most incredible thing is that it's working despite all this. (Of course, you still have to trust them when they say that they aren't tracing submitters, because they can't really prove that...)

(Yes, I know that using the right tools would have prevented Wikileaks from reaching a wide audience, and that this would probably have reduced their impact. I'm not saying that they should have used Freenet or anything; I'm just sad to notice that using the wrong tools seems to be good enough.)

Microinstitutions

Among the numerous social structures which you can think of, there are some which seem quite easy to create (essentially associations and companies). However, how are trade unions created? Banks? Insurance companies? Towns? Universities? NGOs?

The one big institutions about which people fantasize are countries: for some reason, people love to create micronations (probably because countries are supposed to be sovereign entities). Yet, that's not the easiest way to start, because it's kind of hard to get recognized as a country, whereas it is probably possible to create most of the other structures (assuming you live in a free country). How come nobody seems to be trying to create this kind of things and report on the result?