Learning the gender of French nouns

The gender of French nouns is a pain for foreigners and even occasionally for native speakers. Learners of French usually rely (besides rote learning) on rules that classify words as masculine or feminine depending on their ending. In this post, I present what happens if you try to derive the minimal set of rules to determine the gender of a French noun from its ending. (In brief: it doesn't give a very compact set of rules because there are too many exceptions.)

The problem that we will study is: given a French noun, determine its gender. Let us start by taking the database from Lexique and keep the words that:

are not "derived" forms (e.g., plurals);
are nouns;
are either masculine or feminine (there's not much I can say about nouns that can be both, except that most of them are words like "journaliste" or "enfant" that are used to refer to people so the gender to choose is usually clear depending on the person you're referring to);
do not contain spaces or hyphens (because the gender of such words is usually determined from the component words, so a strategy that looks at their ending will not work well)
do not contain dots (remove pesky abbreviations)

Here is the code I use (see lexique.org to obtain lexique, and note that I use a custom version with some errors fixed by hand so your result may differ slightly).

cut -f1,4,5,14 lexique |
  grep '1$' |
  cut -f1,2,3 |
  grep NOM |
  grep  '[mf]$' |
  cut -f1,3 |
  grep -v ' ' |
  grep -v -- '[-\.]' > nouns.txt

Now, we need to find rules to predict the label ('m' or 'f') of nouns in the input list, in a manner that is as concise as possible. To do so, we said that we would try to determine the label by reading the noun starting from its ending. I will describe what we want to do with an example. Suppose we get an unknown noun and start reading it. The last letter is 'e'. At this point we don't know the gender, so we continue. The two last letters are 've'. We must continue still. The three last letters are 'uve'. This narrows down the set of possible nouns, but it can still be either masculine or feminine (think "fauve" vs "guimauve"). The last four letters are "luve". At this point, we know that all nouns ending in "luve" are masculine (there is only "effluve"), so we answer 'm'.

According to this example, we want a set of rules that says, for every possible suffix read so far, whether we can decide 'm' or 'f' or must continue reading. The set of rules should be minimal, which means that it should decide 'm' or 'f' as soon as possible (i.e., as soon as all nouns ending with this suffix have the same gender). Such classification strategies look a lot like deterministic finite automata, except they are acyclic. A more standard term is trie. With such strategies, you can determine the gender of all nouns of the list, and (hopefully) do a reasonable job for unknown nouns by answering given on the longest common prefix.

Now, it turns out I already wrote some code to generate tries from examples, for my project about determining if an initial 'h' in a French word is aspirated or not. Let us reuse that.

So, let us reverse those nouns, and pass them to programs from the haspirater suite to compile the trie and obtain the leaves of the trie, namely, the suffixes at which a decision is taken (and sort them nicely).

rev nouns.txt |
  buildtrie.py |
  compresstrie.py |
  leavestrie.py -1 |
  rev |
  LC_ALL=C sort -k1,1 |
  rev > leaves.txt

Following our previous example, observe that the leaves.txt file contains a line for "luve" (line 4,417). This means that "luve" is a suffix at which we decide 'm', but all shorter suffixes ("uve", "ve", "e", "") were still ambiguous. An initial space in a word indicates "beginning of word" (when we read "rive" we don't know yet between, say, "dérive" and "drive", but if the full word is "rive" then we should decide 'f'). To determine the gender of a noun using this list, look at the line containing the longest suffix of the noun, and the first field of the line should be its gender. Note that the longest leaves in this file are "patriarche" and "matriarche", for which reading "atriarche" is still insufficient to decide (that illustrates that sometimes the relevant info isn't at the end of words...).

The leaves file has 7,032 lines, to be compared to the 24,839 initial nouns in the example list. Thus, the strategy of looking at word endings gives classification rules that are shorter than the full example list, but not by much. In a way, this result illustrates that rules telling you "words in -tion are feminine" and such will always lead to mistakes, unless you have a large number of them.

To see how bad this is, I tested a strategy which reads words from the beginning instead of from the end, which seems to be a worse idea: it has 20,607 leaves, so reading from the end is definitely a better idea than reading from the beginning. Maybe different rules would be more helpful to classify (maybe using general decision trees without restricting the order of choices by saying "read from the end" or "read from the beginning"), but it doesn't seem that obvious to me.

If you ever learnt this list by heart (for instance using a spaced repetition system), you would know the gender of every French noun (except the ones with hyphens, except the ones missing from Lexique, except the ones in which both genders are possible depending on meaning, and accounting for possible errors in Lexique). I wouldn't recommend it, though, because of those caveats, and also because it still seems too long so there has to be a better way than what I did. If you still wanted to do it, though, it might be more convenient to use this file, in which I replaced the suffixes by one noun that matches this suffix (the one with the highest registered frequency in Lexique). So, if you know this last file by heart, your intuition for gender will be flawless, modulo the caveats and modulo the big assumption that your intuition proceeds by matching the longest suffix of the unknown word with a word that you know.

[Further work: looking at pronunciation instead of spelling (or in addition to it), give weights to the rules and rank them by weight, have a richer rule language (e.g., allow to give a fixed list of exceptions for each rule, which would seriously cut down the impact of pesky words like "cation")...]

Portrait of a hacker

Version française de cet article sur le blog de F.. Translations in other languages much welcome!

In this post, I present the world view, philosophy and thought system of a hypothetic person, John Hacker. Of course, what I am going to say about John Hacker also applies to me to some extent, as well as to a certain type of people which are familiar with computers in a certain way; however, because I did not want to give the impression that all of those people share all of John Hacker's beliefs, and because I am not sure anymore to what extent I myself agree with him, I will ascribe this system to this fictitious persona.

Given Internet access and sufficient time, you can learn any intellectual skill.

John Hacker is aware of the mind-boggling quantity of information available online, and knows how to find what he is looking for. Because he is largely self-educated, he is confident in his intellectual capacities and in the plasticity of his brain. He does not think that missing skills like drawing, math, music, are forever out of his reach; he could learn them if he wanted to.

John Hacker is sometimes too enthusiastic; he underestimates the wealth of information that is not on the Web (e.g. the vast majority of books) and forgets that the Web isn't the best reference about all topics yet. He sometimes fails to remember that learning from a real teacher in the real world can be more efficient: you can ask questions whenever you want, get spontaneous feedback, and influence your memory more because going to a class, as opposed to sitting in front of your computer, means moving to a specific place with specific people in front of a specific teacher. He has to remember that he cannot necessarily become the best at any skill... because a lot of other people are usually trying as hard as him and the Copernican principle means some of them will probably beat him.

Computers and the Internet are an extension of your brain.

John Hacker takes it for granted that he will have access to his computer and to the Internet. He knows that his memory is limited and unreliable. For this reason, he sees little value in memorizing things rather than storing them on a computer and being able to look them up when needed. He stores photos, emails, IRC transcripts as a way to archive memories about events, discussions, relationships. He treats his brain memory as a cache for the things he needs to think about, and serializes the results of his thoughts in writeups so that he can forget them and load them back as needed. He specializes in remembering where information is located, and how to access it, rather than memorizing the information itself. For this reason, John Hacker may seem mentally crippled whenever he has no access to a computer or when Wikipedia is down, but may seem uncannily smart whenever he interacts with someone through the Internet because no one sees him look up words and search for documents and sift through archives.

Use abstractions to tame complexity and work around your mind's shortcomings.

An abstraction is something that you can use without having to understand how it works. John Hacker is intelligent, but he is lazy, and values abstractions because they can be used to forget about unnecessary complexity. He is thus eager to abstract things away, as long as the abstraction is not treacherous, is not too leaky, and can be broken down and unfolded if the need arises. Conversely, John Hacker will be eager to peer into things that try to hide how they work (such as proprietary software or hobbyist-unfriendly electronics). John Hacker has no trouble building hierarchies of abstractions. In the real world, he will spontaneously build models of how complicated things (such as human beings) work, and will be confused whenever his abstractions leak or break down.

A computer is almost all you need.

Many people are stuck with a world view in which the technical means to get your work published, or to manufacture a product, or to have access to arcane knowledge, were only available to a tiny minority. Now, these powers are still restricted to the minory of people that have a computer and Internet access and sufficient free time, but that's a much larger group of people than before. Anyone with a computer has the tools necessary to write and publish an opinion piece that will change the world. Anyone with a computer has the material possibility to design a program that will make them rich and famous and change the life of millions of people. The cost of entry that must be paid to build and host something cool is ridiculously low (counted in thousands of dollars), and the material headstart that a Google engineer would have over a teenager in his parents' basement is not really discriminating. This feeling can make John Hacker very unhappy whenever he is unable to be productive: he could create the next big thing, so he'd rather die trying.

Good ideas aren't hard to come by.

People have this stereotype that the great inventors of the past were acting out of some great Vision that had been Imparted to them, and that the ideas of today that will define the computing of tomorrow are highly confidential trade secrets in underground vaults at Google. John Hacker, however, knows that great ideas and good ideas are hard to tell apart before the fact, and also knows that good ideas aren't a scarce resource: he has lots of them stored in todo files, ideas of things that ought to be built, that would certainly be cool, and that, maybe, could even be successful — who knows? He also has dozens of side projects inspired by such ideas, but most of them are unfinished, and those that are finished would appeal to no one except (maybe) fellow hackers. John Hacker thus understands that true limiting resources aren't the good ideas themselves, but:

Time: sometimes you would like to work on a cool idea but you are kept busy by other things.
Motivation: sometimes you have time to work on a cool idea but you can't manage to do it and just waste your time instead for no apparent reason.
Quality execution: building something that doesn't suck is hard.
Marketing: even when you have built something, you need to polish it so that people will want to use it, and you may even need to go out there and promote it to potential users.
Brand appeal: no matter the intrinsic quality of what you do, a lot of people will take it more seriously if you are a big IT company, so it may be a bit harder to get things to take off on your own.
Luck: lots of technologies were worse than their competitors when they came out, were neither marketed in a clever way nor backed by the right people, but just stuck for entirely random reasons.

On the Internet, what matters is what you do, not who or what you are.

Though most people on the Internet go by their real name, John Hacker knows that it is possible to have a pseudonymous identity or multiple identities. If you do something great, its quality can be appreciated without anyone knowing who you are. Your sex, age, race, nationality, are not relevant. It doesn't even matter if you're a robot. On the Internet, nobody knows you're a dog: that's a feature, not a bug. John Hacker knows the extent to which interpersonal relations in the real world are affected by the physical traits of the people who are interacting; he is more secure with pure disembodied information exchange.

Failure and success should preferably have an unambiguous definition.

When you work with a computer, you will often use it as a neutral and unambiguous judge of whether you achieved a certain goal or not. It is true that some goals, like "writing a bug-free program", cannot be checked by a computer, but others, like "getting a program to compile" or "getting a unit test to pass", are. For this reason, John Hacker often relies on having a clear definition of success and failure when he tries to do something. He will therefore prefer creative activities that require him to follow (mostly) unambiguous constraints (e.g. writing verse, or writing lipograms) rather than those that rely on a subjective or social feeling of beauty or quality.

It is not intrinsically bad to use things in unexpected ways.

Most things that you buy are marketed as having a certain purpose. Most people will not think of using them for some other purpose, or will assume it's a bad idea to try to do so. However, John Hacker is familiar with agnostic tools such as programming languages that can be bent to do things that were never envisioned by their designers. For this reason, when he tries to do something in the real world, he will see objects for what they are rather than for what they were designed to do, so he may take them apart (and void the warranty) or maybe just find new creative ways to use them. When he does so, he follows his own judgement, without the safety net of the designer's invisible hand. Sometimes, of course, he is wrong, and he will break things or his contraptions will miserably fail. Still, John Hacker thinks that it's always better to think out of the box in this way, and believes that submitting exclusively to a designer's impression of how their object should be used is nothing short of intellectual slavery.

Decentralized systems are more robust than centralized systems.

John Hacker designs computer systems, so he can identify possible points of failure — the minimal set of things that can bring the whole system down if they break. He knows a centralized system has one single point of failure (its center), whereas a decentralized system is more robust because it achieves better redundancy. John Hacker is aware of the history of technology and knows that centralized technologies (that depend on a single manufacturer) disappear as time passes and companies go bankrupt and the markets change, but that decentralized technologies (the Web, Internet, email) tend to stick around as long as there are sufficiently many people that use them, because there is no single entity that has the power to kill them and make everyone switch to something else. Of course, decentralization also implies disadvantages (like increased complexity, or trust management issues), but decentralized systems that managed to get moderately successful are usually there to stay. For this reason, John Hacker is suspicious towards organizations that have only one leader, towards political structures whose central offices have the power to bring the whole system down through incompetence or (more rarely) malice, and towards anything that depends too much on a small designated group of people or institutions. He is worried when too many things depend on single companies that get too big to fail, and would insist that crucial organizations such as the State should be as decentralized as possible.

Distinguish between the formal and the informal.

John Hacker spends his time formalizing processes, that is, explaining them in a language so basic and unambiguous that a mechanical machine will be able to perform them. For this reason, John Hacker is very good at spotting informality — those parts of a process that appeal to human intuition and judgement and that will be impossible to automatize. For instance, sorting forms in drawers depending on the first letter of their last name is a formal process, but whenever a secretary decides that someone inverted their first and last name and sorts one file according to the first name, then the process suddenly turns into an informal one. John Hacker believes that appeals to human intuition are best avoided, or should at least be clearly labeled so that you're aware whenever you need to use this black box.

Repetitive processes should be automatized.

Computers are machines that can perform repetitive tasks if you formalize them. For this reason, when you use a computer, the difficulty of accomplishing something is usually the difficulty of describing it rather than performing it. For instance, when asking a computer to count from 0 to 999, the difficulty is to describe what "to count" means, not the fact that 999 is a large number. In the real world, where repetitive things are often much harder to automatize, John Hacker will sometimes be frustrated and try to automatize them anyway. When he builds a machine for, say, fold T-shirts automatically, it's because of the conviction that a repetitive task like T-shirt folding should be automatized for once so that you can then fold as many T-shirts as you want with no effort — even though, in the real world, it is much harder to make a machine fold T-shirts in a reproducible fashion, and the long-term benefits of such a machine are not worth the investment.

Society would work better if people were technologically competent.

Bureaucracies usually work with paper, and many secretaries spend tremendous amounts of time doing repetitive things that could be automatized. If people had sufficient computer skills to automatize the bulk of their work and only leave to a human those parts which require intelligence and good judgement, then organizations could be made much more efficient while employing less people. John Hacker feels physical pain for people that spend their working day doing things that he could automatize in one week so that they could be done by a computer in a split second. Of course, he knows that you cannot just train everyone to be a computer expert, but he would say that society is still lacking in basic technological literacy.

John Hacker is annoyed whenever his government's administration reminds him (usually by requiring him to perform paperwork on actual paper, or talk directly to humans through a phone, or move physically in meatspace) that they do not understand computers and that it will take decades for them to get it.

Information should be preserved.

In the real world, you cannot keep everything, because it would take up too much room. Old buildings must be destroyed to make way for new ones: though you would want to preserve some important ones, it is clear that a compromise has to be found and that you cannot preserve everything. With computers, though, you can store information at absurdly high and continuously increasing densities. For this reason, the default strategy is to keep, and whenever you run out of space, you only delete the very largest things ... or buy a new hard drive. John Hacker thinks it a crime to deliberately remove potentially useful information from a computer, or to let valuable information decay. He experiences the same awe towards full hard disk drives or datacenters that normal people do in front of full bookshelves. There are no logs of what happens to him in meatspace, so he compensates by logging everything he does in cyberspace, and preserving and redundantly archiving his valuable records, in a way that may seem futile to outside observers. As for real life possessions, John Hacker will sometimes be a packrat and keep everything because of the memories attached to objects, or, alternatively, throw them away whenever the need arises... after having taken pictures of them so that the information survives somewhere.

Information should be free.

John Hacker knows that, with access to the Internet and with cryptography, you can communicate in a secure manner. Hence, any attempts to censor information will seem misguided to him, because he knows that such measures can always be circumvented in an demonstrably safe way. To John Hacker, information is neutral, and sharing information should never be a crime; the law should punish action, not communication. Sharing data may infringe on someone's copyright, but enforcing copyright is certainly less important than having the freedom to share. Data may have been private, but once it's been leaked, there's no point in trying to put the genie back in the bottle. Data may be obscene, but no one forces you to look at it, and it is no one's business to impose their moral standards on other people. Data may be wrong, or subtly misleading, but the solution is not to censor it: let people read it and make their own opinion. Of course, all of this only applies to data thas has been made public (or that should be public, like most State records); John Hacker would never give anyone access to the his private data, or store it unencrypted on untrusted third-party servers.

Information should be dematerialized.

Meatspace has a lot of annoying defects, and cyberspace has a lot of pleasant properties, so information should belong to cyberspace, not meatspace. Of course, in the end, information is always stored on physical media such as hard drives, but John Hacker abstracts this away (in much the same way as you abstract away the working of your organs when you think about your body). In particular, he sees single-purpose sub-optimal physical media as a heresy. He dislikes books (replace them by ebooks), CDs and DVDs (it's stupid to load and unload the media by hand, just digitize them and store them all on an hard drive or on a remote machine or something), paper mail and postcards (emails are a much more efficient way to move information around), paper forms and certificates and authentic documents (you should send digital versions with a cryptographic signature instead), etc.

Human interaction is limited to information transfer through language.

Computers communicate to transmit information on well-defined channels following specific languages carefully engineered to suit this purpose. John Hacker expects human interaction to serve the same goals and work according to the same rules. He dislikes side-channel non-verbal information transfer such as tone or body language. He might insist on saying only things which are literally true: this means that he will not lie and will expect people not to lie, but that he will indulge in cheap fun by saying things that are logically correct but useless or intuitively misleading (e.g. "Do you want tea or coffee?" "Yes."). He might dislike communication such as small talk whose aim is not in the information transferred but in the act of communication itself. He may insist on consistency in language use, may take an extremely prescriptivist approach to language and grammar, and will not hesitate to use language constructions that are grammatically correct but hard to parse for the human brain (long sentences, nested propositions, etc.). He will insist on proper use of quotation marks ("bananas" is a funny word, bananas are delicious fruits), and will be tempted to model human communication as formal languages that are simpler to parse for a computer.

Of course, John Hacker's model of people as fully autonomous systems that exchange information isn't just inaccurate because of his simplistic picture of communication itself. First, it does not account for the basic fact that the vast majority of people cannot survive in isolation and just need some form of human contact — maybe this even applies to him and he hasn't realized yet. Second, it neglects other important aspects of human interpersonal relationships, such as helping others out, influencing others, caring for others, letting oneself be influenced by others, sometimes unhealthily so — and, of course, loving and being loved. John Hacker is usually puzzled and uncomfortable about all this complicated mess; for these reasons, he may sometimes seem blunt, asocial, careless, or cold towards people, or avoid human interaction.

Rationality is the only acceptable framework.

John Hacker thinks that in the real world, like in cyberspace, events have causes and obey general laws. He is aware of the uncanny complexity and chaos-like behavior that can arise from the interaction of extremely rich rules and from the use of randomness, but knows that, in principle, everything could be explained. He rejects the paranormal. He is not superstitious or religious. He rejects non-falsifiable theories, because they do not make predictions that are practically useful. He avoids discussing or thinking about the unknowable, such as the existence of God or the nature of death, because he knows that no definite answer can be given about them and that they have no influence on his life so they are a waste of time. He sometimes speculates about the future or about philosophical questions, but that's mostly because of the influence of science-fiction and he doesn't take this too seriously. He tries to be consistent and to steer clear of contradictions. He tries never to indulge in wishful thinking. He tries to remain critical towards his own beliefs, and to stay in control of the influence of others on them. When person X asserts fact Y, he remembers that the implied fact is not "Y" but "X asserts Y", no matter how much he trusts X or would like to believe Y.

Reality is imperfect, cyberspace is more important.

John Hacker is used to working in cyberspace, so whenever he operates in meatspace he sees what is missing. Why isn't there an undo or a save/restore function? Why isn't there an archive of what happened or what was said? Why do you need to move to go somewhere? Why doesn't this dead tree book have a search function? In some cases, because he is used to building his own virtual universe that behaves like he wants it to, John Hacker will try to adapt reality to his liking. In other cases, he will just give up, neglect his body, clothing, and dwelling, and remain mostly indifferent towards money or physical belongings. His focus on productivity in cyberspace and his disdain for low-level happiness in meatspace could almost be confused with religious moral standards by normal people.

Your brain is a primitive version of a computer.

John Hacker sees his brain as a device that carries a lot of evolution-mandated low level support for some specific tasks (e.g. distinguishing faces) but has extremely primitive support for abstract tasks (e.g. checking if the string of parentheses "(((())()())()))" is correctly balanced, or computing 269 times 42, or reasoning about geometry in higher dimensions), only carries limited support for introspection (most thought processes, like deciding to remember or forget something, or to focus on something, cannot really be controlled consciously) and suffers from documented bugs. He knows the brain was designed through evolution and only achieved a primitive form of higher-level intelligence as a byproduct of trying to satisfy the evolutionary drive of getting genes to reproduce. John Hacker hopes that the human race will eventually achieve true intelligence from this bootstrap (by improving the interface between brain and cyberspace, or maybe hacking the brain itself) and transcend the evolutionary drive (going from genetics to memetics). He believes that his conscious goals (such as the search for beauty, truth, or interestingness) are an attempt to reach for this notion of "true intelligence".

John Hacker feels that this "true intelligence" is a universal property of nature. He believes that if we met intelligent aliens, they should be studying essentially the same mathematics as us: that notions such as integers or prime numbers are universally fundamental, not that they were "invented" by humans or are only interesting to human minds. This belief is grounded in John Hacker's day-to-day interaction with non-human intelligence, namely computers: though they were designed by humans, they seem to reason about abstract things like we do (or would like to do). This belief is also grounded in John Hacker's experience of having his mind hijacked ("nerdsniped") by problems that are natural, beautiful and deep — John Hacker cannot believe that his profound fascination for such problems could have anything to do with his human nature.

This vision can make John Hacker forget about his human nature, because he does not see it as a crucial part of himself. John Hacker sees himself as a proto-intelligent being first and foremost; belonging to the human species is a distant second, and being a man or woman an even more distant third. This can lead him to wishful thinking, namely, believing that his human nature has no hold on his thought processes because he would like things to be so. He will do his best to try to get closer to this ideal state through self-control, but he sometimes fails to realize that it doesn't work completely. Whenever he forgets about primitive urges such as the need for food, sleep, physical exercise, human contact and love, he discovers that those unsatisfied urges can be an obstacle to the proper function of his intelligence.

The soul is information.

Because he is so used to working with information, John Hacker considers that the soul of a person is the information contained in their brain and that death is the destruction of this information. He considers that if you could transplant someone's brain in a different (possibly artificial) body then it would remain the same person; that a brain in a vat would be a full person; that the seat of consciousness and personhood is the brain and that the rest of the body is simply a supporting device for the brain (and that there is no immaterial "soul" or token of a person's existence); that cryonics is probably a good idea on paper, flawed though its current implementations may be; that if you could retrieve the information contained in the brain and store it in a computer then the person would still be alive. He is conscious of the numerous paradoxes left unanswered by such views, but can see no better position. He does not think that mankind has an ethical duty to stay true to its biological nature instead of trying to improve it; quite the contrary. He considers the technological singularity as interesting speculation, though he does not hope to see something of this kind occur within his lifetime.

Related reading: the Jargon File has an appendix called "A Portrait of J. Random Hacker". It is more about how (the file's notion of) hackers live, rather than what they think. Thanks to Pablo Rauzy for pointing it out to me.

A fundamental problem with OpenID

In this post I describe what I think is a fundamental problem with OpenID, and how I think a decentralized authentication scheme should work to avoid this problem. I assume that you are familiar with OpenID, DNS, asymmetric crypto and X.509 client certificates. Just a refresher about the terminology: with OpenID, you log in at a relying party by specifying a URL, the relying party queries this URL to find out who is the provider, and the provider takes care of identification.

This post is not about the practical problems of getting people to understand and adopt OpenID. I do not intend to complain about the many websites where you can only log in with a closed list of proprietary providers and not with the OpenID URL of your choice even though it probably wouldn't be harder to support. I do not intend to complain about the fact that people don't get OpenID, or discuss whether the cause is OpenID's complexity or the general public's cluelessness. My complaint is about the core design of OpenID. It is the following:

OpenID uses URLs to identify people, and URLs rely on the DNS system.

Why is that bad?

DNS is centralized: If you are your OpenID provider, you need to get a domain name at a DNS registrar. This process costs money, and depends on the DNS system, which is centralized.
URLs can change: Do you want to change your OpenID URL? Well, do it at every relying party if they have an option for that. Want to do it globally? Well, you can't. So, you might get stuck paying for that old domain name forever after all.
Domain names are reassigned: If you accidentally let your domain name expire, then you are locked out of your accounts at relying parties. Worse, anyone can buy it. At which point, they get full control of your OpenID URL, and you have no way to stop them. Woops.
DNS is insecure: Securing DNS is an afterthought, so it's a bad idea to assume it's secure unless there's no better way.

It is true that the last two points can be mitigated if you instruct the relying party to use HTTPS (by indicating "https://" explicitly in your OpenID URL), and if you have an HTTPS certificate that relying parties will trust, and if an attacker doesn't (obviously, if a relying party accepts your server's self-signed HTTPS certificate, then it would gladly accept an attacker's).

To summarize: OpenID is decentralized but relies on a system (DNS) which is not very secure, centralized, and in which being your own provider requires you to commit to a domain name which you basically have to renew and pay for indefinitely. Hmm. Can we do better?

The fundamental principle of OpenID is that you provide the identifier of a resource (the URL) and prove that you control this resource. You actually exercise your control over the resource by having it point to an OpenID provider whom you trust to identify you in a suitable way, but the basic idea remains: to log in, you name something that you control.

It turns out that there are other resources that you can name and prove that you control, and that don't need to be registered in a centralized system like DNS. I'm thinking about public keys. You could provide your public key (or its fingerprint under a secure hash function) to relying parties, which would associate you to this public key (a URI of sorts) rather than associating you to a URL. You would then demonstrate ownership of the private part of this key when logging in. Problem solved. This is how ssh public-key authentication works, but this has also been applied to the web: X.509 client certificates work precisely like this. Sadly, no one uses them, because people do not want to generate keys and configure their browsers and such.

All hope is not lost, though. What about the following scheme:

You give a URL to the relying party.
This URL points to a document indicating both a public key and an OpenID provider which controls the associated private key.
The provider demonstrates to the relying party that it owns the private key associated to the public key.
You authenticate with the provider like in vanilla OpenID.
The relying party does not associate your account with the URL you provided, but with the public key.

This indirection would change nothing for casual users who would just register with any OpenID provider and log in with their OpenID URL without caring at all about "their" key which would be entirely managed by their provider. The behavior of relying parties, and the interaction between relying parties and providers, would be different, but none of this would be visible to the user.

However, power users would own their key without being tied to a specific URL, and could configure an endpoint with this key at any domain of their choice. Benefits:

If you want to change your domain name, just put the key at the new domain and identify with a URL at the new domain. The URL is just a pointer, the underlying key is still the same, so you're still the same person to the relying party even though the URL changed.
If your domain name expires, no big deal, someone who buys it will not get your key, and you can just host the key elsewhere.
If you want to stop being your own provider, just entrust your private key to an existing provider and you can start logging in with a URL pointing to this provider.
If you are paranoid and do not trust DNS, it would be easy to extend the protocol a bit so that, when you give your URL, you can also optionally specify your public key to the relying party. In this case, the relying party would be required to check that the public key used by the provider matches the one you specified at login.
If you don't want to have anything to do with DNS, there's nothing stopping you from using the IP of your server as URL. Hell, if you're not browsing from behind a firewall and your machine can accept incoming HTTP connections, you can have your key on your machine, and just give the IP of your machine -- you don't even need a server.

Hopefully I managed to convince you that identifying users (as opposed to locating their provider) using URLs is a bad idea. Of course, this point is not specifically targeted against OpenID. Mozilla Persona (aka. BrowserID) is using email addresses as identities, which also depend on DNS. I'm still looking for a decentralized authentication scheme which understands that you should just use URLs as pointers and use public keys as identifiers.

Addenum: It turns out that OpenID 2.0 is supposed to support XRIs which (according to the last paragraph of this section) can be used to mitigate the problems I'm talking about. However, after spending some time trying to understand what XRIs are and if someone is using them, I'm not convinced that it is really an elegant and practical way to solve this problem, so I think the point still stands.

Shortcomings of the real world

Here is a list of fundamental differences between reality and idealized models of the world. It can provide guidelines when designing virtual worlds, or serve as a checklist when trying to reason about the real world:

Irreversibility.: Some things are more easily done than undone (building something versus destroying it, cleaning something up rather than making it dirty, etc.), and some cannot be undone at all (killing people, losing information, wasting resources, etc.). This means that a small number of wrongdoers can have a disproportionate impact because undoing their mess take so much time, and this implies that preventive measures are needed to limit the occurrence of irreversible bad things. This is in contrast to virtual places like Wikipedia where reverting edits isn't substantially harder than making them, and where you can benefit from the fact that vandals are a small minority.
Low dimensionality.: The world has a small number of spatial dimensions: only two are really usable, the third one is harder to use because of gravity. Because of this, the possibility of interaction is limited: you cannot have a high number of things acceptably close to each other. This holds both for groups of people (large groups of people cannot interact meaningfully in real life, which is an obstacle to large-scale collaboration) and for cities (to have everything close to everything, you need absurdly high density).
Imperfect coordination.: Even with arbitrarily good communication technology, large groups of people are harder to coordinate than small groups, because of cognitive limits. For this reason, whenever two groups have contrary interests and must hold one against the other, the larger group will be disadvantaged and have much higher risk of defection. This is a factor explaining why the masses have a hard time coordinating, even though they are numerous by definition.
Non-autonomy of children.: While the harm principle dictates that consenting adults in isolation can be simplified out of the moral equation, this does not work with children: adults in isolation can have children, and those children will not be able to legally consent to everything their parents might do to them. For this reason, society has to keep an eye on how parents raise their children, and find some compromise between the parents' rights and the child's.
Necessary infrastructure.: Long-distance communication is not a given but depends on artificial infrastructure which is not free, can fail or can be controlled by malicious parties. You cannot assume that everyone has access to the Internet in the same way that anyone has access to air.
Unbounded vital needs.: If the vital needs of people could be bounded, there would be some hope of managing to satisfy the needs of everyone and assuming that the survival of every human being is ensured. Sadly, people can have arbitrarily complex health problems and could need arbitrarily involved and expensive treatment. This can be dealt with through an insurance system, but complicates things because some people who have simple needs will want to opt out of such a system, making it unsustainable.
Critical mind.: To achieve the independent thought and critical spirit required to be a free, autonomous agent, education is required. People who are not given this education cannot be considered as individuals and it might not make sense to consider that they are responsible for their actions. Yet, they need to be dealt with in some way or other.
Inheritance.: Assume that money represents some measure of social utility, and that people who earned money should be allowed to use it as they like. In this setting, it is a major problem that most people will want to give their money to their offspring, because the money that the offspring will thus inherit is not linked to their social value. The problem is that the individual interests of the donor ("benefit my offspring") are at odds with the interests of society ("allocate money to people who produce value"). There are no solutions except restricting the freedom of people to use their money or increasing inequality at birth because of the parents' wealth.
Physical encounters.: It is not possible to assume that people live autonomously in isolation from each other and only communicate by exchanging of information. People desire friendships, close relationships, and physical relationships. For this reason, they have to meet in real life.
Public-private continuum.: You cannot divide the world in public places and private places and say that there should be no expectation of privacy in public places, because you need to go through public places to travel from one private place to another. Besides, private conversations will often take place in public space with some expectation of privacy between the speakers. I tried to think more about this point.
Repetitive work.: In reality, repetitive tasks have to be carried out. If you want to do the same things multiple times, you will have to do so, and it will usually be complicated to build a robot to perform the task for you. This is in contrast to the virtual universe where things are usually much easier to formalize and automatize, and where the effort required by a task is much closer to its Kolmogorov complexity.
No records.: Even if there is no expectation of privacy somewhere, there is usually no complete perpetual record of what took place there. Hence, there cannot always be an objective assessment of the truth of factual statements involving public data. Note that the problem is not that records are not reliable and can be tampered with, but the fact that they are not complete or numerous enough: the higher the number of independent records, the harder it gets to engineer consistent fabrications. This is in contrast to virtual space where there is usually abundant evidence available because recording something is often easier than not recording it.

[I just wrote this list quickly to dump some ideas I had in the back of my head, it might not make much sense.]

Recording all your terminal sessions

I love to log as much information as I can about what I do on my computer. (Of course, I never send those logs to third-party services.) I log all of my keystrokes, I religiously keep all of my command history, all of my email and IRC logs, and so on.

However, something that I didn't log so far is what appears in my terminals. This was a shame: since terminals display text, you would expect that you could log everything which appears on them without using up so much space after all. Logging this information could be useful to reconstitute what you were doing at a particular point in time, to understand how you ended up making a certain mistake or doing a certain thing, to show to someone how to do something, to recover the output of any particular command of your history, and so on.

There is a tool called ttyrec which can be used to log what happens in your terminals (including timing information), but I didn't use it systematically so far because of one simple issue: if you run cat large_file, then ttyrec will happily put all the content of large_file in its log, even though you probably didn't care about it. Just a few accidents like this and your log files can become huge.

The point of this post is to advertise ttyrex, a slight modification of ttyrec which adds an option to cap the quantity of data logged every second. The point of this is that when doing cat large_file, you can just log a small quantity of the file every second and skip the rest, and you will get a reasonable approximation of what you saw on the terminal without using up too much space.

I have been starting ttyrex systematically with urxvt for some time now, compressing logs that are older than two weeks (this saves a tremendous amount of space), and the last two week's worth of logs use up a quantity of disk space which I think is reasonable by today's standards (less that 1 GB). I have also tweaked zsh to store the start time and stop time of the recorded sessions, the start time of ongoing sessions, and the command history of each session: I have then written a command to replay what happened at any point in time (i.e., open one replay terminal for each terminal that was open at some timestamp, and jump at the correct position in each of the replays), and a command to take a line of the command history and open a replay of the right session at the right time to see when the command was entered and which results it gave.

I haven't found any use for all of this yet except playing around, but it's pretty fun (having terminals which replay what I did in the past feels a lot like time travel).