Learning the gender of French nouns
The gender of French nouns is a pain for foreigners and even occasionally for native speakers. Learners of French usually rely (besides rote learning) on rules that classify words as masculine or feminine depending on their ending. In this post, I present what happens if you try to derive the minimal set of rules to determine the gender of a French noun from its ending. (In brief: it doesn't give a very compact set of rules because there are too many exceptions.)
The problem that we will study is: given a French noun, determine its gender. Let us start by taking the database from Lexique and keeping the words that:
- are not "derived" forms (e.g., plurals);
- are nouns;
- are either masculine or feminine (there's not much I can say about nouns that can be both, except that most of them are words like "journaliste" or "enfant" that are used to refer to people so the gender to choose is usually clear depending on the person you're referring to);
- do not contain spaces or hyphens (because the gender of such words is usually determined from the component words, so a strategy that looks at their ending will not work well);
- do not contain dots (to remove pesky abbreviations).
Here is the code I use (see lexique.org to obtain Lexique, and note that I use a custom version with some errors fixed by hand, so your result may differ slightly).
cut -f1,4,5,14 lexique | grep '1$' | cut -f1,2,3 | grep NOM | grep '[mf]$' | cut -f1,3 | grep -v ' ' | grep -v -- '[-\.]' > nouns.txt
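For readers who prefer not to chain cut and grep, here is a rough Python sketch of the same filter. It assumes the column positions used by the pipeline above (word in column 1, category in column 4, gender in column 5, and the "is a lemma" flag in column 14 of a tab-separated file); the function name filter_nouns is mine. It compares the category field exactly, so it is slightly stricter than grep NOM, which matches the string anywhere in the line.

```python
import csv

def filter_nouns(lines):
    """Yield (word, gender) pairs for the nouns we want to keep.

    `lines` is any iterable of tab-separated lines. Column positions are
    an assumption taken from the `cut -f1,4,5,14` call in the shell version.
    """
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 14:
            continue  # skip malformed or short lines
        word, category, gender, is_lemma = row[0], row[3], row[4], row[13]
        if is_lemma != "1":            # drop derived forms (e.g., plurals)
            continue
        if category != "NOM":          # keep nouns only
            continue
        if gender not in ("m", "f"):   # drop nouns without a single gender
            continue
        if any(c in word for c in " -."):  # drop compounds and abbreviations
            continue
        yield word, gender
```

With a file at hand, something like `with open("lexique", encoding="utf-8") as f: pairs = list(filter_nouns(f))` should produce the same kind of list as nouns.txt, assuming the file is UTF-8.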
Now, we need to find rules to predict the label ('m' or 'f') of nouns in the input list, in a manner that is as concise as possible. To do so, we said that we would try to determine the label by reading the noun starting from its ending. I will describe what we want to do with an example. Suppose we get an unknown noun and start reading it from the end. The last letter is "e". At this point we don't know the gender, so we continue. The last two letters are "ve". We must continue still. The last three letters are "uve". This narrows down the set of possible nouns, but it can still be either masculine or feminine (think "fauve" vs "guimauve"). The last four letters are "luve". At this point, we know that all nouns ending in "luve" are masculine (there is only "effluve"), so we answer 'm'.
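The walk-through above can be sketched naively in Python. This is only an illustration on a four-word toy lexicon that I made up from the examples (plus "cuve"), not on the real noun list, and the function name decide_by_suffix is hypothetical. It rescans the whole lexicon for each suffix, which is slow but makes the logic easy to follow.

```python
# Toy lexicon, NOT the real data: just enough words to replay the example.
TOY = {"fauve": "m", "guimauve": "f", "effluve": "m", "cuve": "f"}

def decide_by_suffix(noun, lexicon):
    """Read `noun` from its end and stop at the first unambiguous suffix.

    Returns (suffix, gender); gender is None if no known noun has the suffix.
    """
    for k in range(1, len(noun) + 1):
        suffix = noun[-k:]
        genders = {g for w, g in lexicon.items() if w.endswith(suffix)}
        if not genders:
            return suffix, None          # nothing known ends this way
        if len(genders) == 1:
            return suffix, genders.pop()  # all matching nouns agree
    # Whole word read and still ambiguous: fall back to an exact match.
    return noun, lexicon.get(noun)

print(decide_by_suffix("effluve", TOY))  # ('luve', 'm')
```

On this toy lexicon, "e", "ve" and "uve" are all ambiguous, and "luve" is the first suffix on which every matching noun agrees.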
According to this example, we want a set of rules that says, for every possible suffix read so far, whether we can decide 'm' or 'f' or must continue reading. The set of rules should be minimal, which means that it should decide 'm' or 'f' as soon as possible (i.e., as soon as all nouns ending with this suffix have the same gender). Such classification strategies look a lot like deterministic finite automata, except that they are acyclic; the more standard term is a trie (here, a trie over the reversed nouns). With such a strategy, you can determine the gender of all nouns in the list, and (hopefully) do a reasonable job for unknown nouns by answering based on the longest suffix that they share with known nouns.
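To make the idea of a minimal rule set concrete, here is a small sketch of a trie over reversed nouns, where each node remembers which genders occur below it and a branch is cut as soon as only one gender remains. This is not the haspirater code, just an illustration under my own naming (build_trie, leaves); the leading space used as a "beginning of word" marker is an assumption that mirrors the convention of the leaves file.

```python
def build_trie(nouns):
    """Build a trie over reversed nouns; `nouns` maps word -> 'm' or 'f'.

    Each node stores its children and the set of genders seen below it.
    """
    root = {"children": {}, "genders": set()}
    for word, gender in nouns.items():
        node = root
        node["genders"].add(gender)
        # A leading space marks the beginning of the word, so that a full
        # word can be told apart from longer words that end with it.
        for ch in reversed(" " + word):
            node = node["children"].setdefault(
                ch, {"children": {}, "genders": set()}
            )
            node["genders"].add(gender)
    return root

def leaves(node, suffix=""):
    """Yield (suffix, gender) at the shortest suffixes that are unambiguous."""
    if len(node["genders"]) == 1:
        yield suffix, next(iter(node["genders"]))
        return  # decide as soon as possible: do not descend further
    for ch, child in sorted(node["children"].items()):
        yield from leaves(child, ch + suffix)

toy = {"rive": "f", "dérive": "f", "drive": "m"}
print(dict(leaves(build_trie(toy))))
```

On this three-word toy input, the suffix "rive" is still ambiguous, so the rules have to read one more character: the beginning-of-word marker, "d", or "é".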
Now, it turns out I already wrote some code to generate tries from examples, for my project about determining if an initial 'h' in a French word is aspirated or not. Let us reuse that.
So, let us reverse those nouns, and pass them to programs from the haspirater suite to compile the trie and obtain the leaves of the trie, namely, the suffixes at which a decision is taken (and sort them nicely).
rev nouns.txt | buildtrie.py | compresstrie.py | leavestrie.py -1 | rev | LC_ALL=C sort -k1,1 | rev > leaves.txt
Following our previous example, observe that the leaves.txt file contains a line for "luve" (line 4,417). This means that "luve" is a suffix at which we decide 'm', but all shorter suffixes ("uve", "ve", "e", "") were still ambiguous. An initial space in a word indicates "beginning of word" (when we have read "rive" we cannot yet distinguish between, say, "dérive" and "drive", but if the full word is "rive" then we should decide 'f'). To determine the gender of a noun using this list, look at the line containing the longest suffix of the noun, and the first field of the line should be its gender. Note that the longest leaves in this file are "patriarche" and "matriarche", for which reading "atriarche" is still insufficient to decide (which illustrates that sometimes the relevant information isn't at the end of words...).
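The lookup described above can be sketched as follows. The LEAVES dictionary here is a hand-written stand-in containing only the suffixes mentioned in this post (with the genders of the example words), not the parsed contents of leaves.txt, and guess_gender is a hypothetical name. The sketch returns None when no leaf matches, which can happen for nouns absent from the original list.

```python
# Hand-written stand-in for a few leaves; the real file has thousands.
# A leading space means "beginning of word".
LEAVES = {"luve": "m", " rive": "f", "drive": "m", "érive": "f"}

def guess_gender(noun, leaves):
    """Return the gender attached to the longest suffix of `noun` in `leaves`."""
    padded = " " + noun  # so that a full-word leaf like " rive" can match
    for start in range(len(padded)):  # longest suffix first
        gender = leaves.get(padded[start:])
        if gender is not None:
            return gender
    return None  # no leaf matches this noun

for word in ("effluve", "rive", "dérive", "drive"):
    print(word, guess_gender(word, LEAVES))
```

Because the leaves of a trie are never suffixes of one another, at most one of them can match a given noun, so trying the longest suffix first is only a matter of convention here.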
The leaves file has 7,032 lines, compared to the 24,839 nouns in the initial list. Thus, the strategy of looking at word endings gives classification rules that are shorter than the full example list, but not by much. In a way, this result illustrates that rules telling you "words in -tion are feminine" and such will always lead to mistakes, unless you have a large number of them.
To see how bad this is, I tested a strategy that reads words from the beginning instead of from the end, which should intuitively be worse. It is: it yields 20,607 leaves, so reading from the end is definitely the better of the two. Maybe different rules would classify better (for instance general decision trees, which do not restrict the order of choices to "read from the end" or "read from the beginning"), but it doesn't seem that obvious to me.
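The comparison between the two reading directions can be reproduced in miniature. The sketch below counts leaves without building an explicit trie: a leaf is a prefix (of the word, or of the reversed word) under which one gender remains while its parent still had two. The five-word toy list and the name count_leaves are mine, so the numbers only illustrate the method and say nothing about the real figures.

```python
from collections import defaultdict

def count_leaves(nouns, from_end=True):
    """Count the decision points needed to classify `nouns` (word -> gender)."""
    genders_under = defaultdict(set)
    for word, gender in nouns.items():
        # "\0" is an end marker so that a full word differs from longer ones.
        s = (word[::-1] if from_end else word) + "\0"
        for k in range(len(s) + 1):
            genders_under[s[:k]].add(gender)
    # A leaf is unambiguous itself, and either is the root or has an
    # ambiguous parent (otherwise the decision was already taken above it).
    return sum(
        1
        for prefix, genders in genders_under.items()
        if len(genders) == 1
        and (prefix == "" or len(genders_under[prefix[:-1]]) == 2)
    )

toy = {"nation": "f", "notion": "f", "potion": "f", "piston": "m", "poisson": "m"}
print(count_leaves(toy, from_end=True))   # 3
print(count_leaves(toy, from_end=False))  # 4
```

On this toy list, reading from the end needs the three rules "ion", "ton" and "son", whereas reading from the beginning needs four.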
If you ever learnt this list by heart (for instance using a spaced repetition system), you would know the gender of every French noun (except the ones with hyphens, except the ones missing from Lexique, except the ones in which both genders are possible depending on meaning, and accounting for possible errors in Lexique). I wouldn't recommend it, though, because of those caveats, and also because it still seems too long so there has to be a better way than what I did. If you still wanted to do it, though, it might be more convenient to use this file, in which I replaced the suffixes by one noun that matches this suffix (the one with the highest registered frequency in Lexique). So, if you know this last file by heart, your intuition for gender will be flawless, modulo the caveats and modulo the big assumption that your intuition proceeds by matching the longest suffix of the unknown word with a word that you know.
[Further work: looking at pronunciation instead of spelling (or in addition to it), give weights to the rules and rank them by weight, have a richer rule language (e.g., allow to give a fixed list of exceptions for each rule, which would seriously cut down the impact of pesky words like "cation")...]