Antoine Amarilli's blog

The strangest and least strange French words according to n-grams


English version

The (multi)set of (character-level) n-grams of a word consists of its sequences of n consecutive characters. For instance, the 2-grams of "gram" are "gr", "ra", and "am". Duplicates are counted, e.g., for "toto" the 2-grams are "to", "ot", "to". Given the dictionary of all words in a language, we can compute the multiset of all n-grams.
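As a minimal sketch of this definition (the helper names `ngrams` and `dictionary_ngrams` are mine, chosen for illustration, and not taken from any existing script), the n-grams of a word and of a whole dictionary could be computed as follows:

```python
from collections import Counter

def ngrams(word, n):
    """Return the list of character n-grams of `word`, duplicates included."""
    # A word shorter than n has no n-gram: the range is then empty.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def dictionary_ngrams(words, n):
    """Count every n-gram occurrence over all the words of a dictionary."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts

print(ngrams("gram", 2))  # ['gr', 'ra', 'am']
print(ngrams("toto", 2))  # ['to', 'ot', 'to']
```

A `Counter` is a natural fit for a multiset here, since it stores each distinct n-gram together with its number of occurrences.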

It turns out that this multiset is quite characteristic of the language. For instance, to identify the language of a piece of text, it is often enough to compute its n-grams, normalize them into a frequency distribution, and compare it to the distributions of known languages: usually the closest distribution is that of the language in which the text is written.
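Here is a rough sketch of that language identification idea. The text does not say how distributions are compared, so the L1 distance used here is an assumption on my part, and the function names and the `profiles` dictionary (mapping a language name to its reference distribution) are hypothetical:

```python
from collections import Counter

def ngram_distribution(text, n=2):
    """Return the n-grams of `text` as a frequency distribution (sums to 1)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def l1_distance(p, q):
    """Sum of absolute frequency differences (one possible choice of distance)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))

def guess_language(text, profiles, n=2):
    """Return the language whose reference distribution is closest to the text's."""
    dist = ngram_distribution(text, n)
    return min(profiles, key=lambda lang: l1_distance(dist, profiles[lang]))

# Toy reference profiles; real ones would be built from much larger corpora.
profiles = {
    "fr": ngram_distribution("le chat mange la souris et le chien dort"),
    "en": ngram_distribution("the cat eats the mouse and the dog sleeps"),
}
print(guess_language("the dog eats the mouse", profiles))
```

Other ways of comparing distributions would work too; the only point is that the nearest reference distribution is taken as the guess.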

Here I explore a related idea: compute the n-grams of a French dictionary obtained from Lexique, then, for each word of the dictionary, compute the average, over the n-grams of the word, of the number of times each n-gram occurs in the dictionary. The higher this number, the more common the n-grams of the word.
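The score can be sketched as below. This is my own reading of the definition and not the code of the actual script. Skipping words shorter than n (which have no n-gram to average over) and sorting ties alphabetically are assumptions, as are all the function names:

```python
from collections import Counter

def ngrams(word, n):
    """Character n-grams of `word`, duplicates included."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_counts(words, n):
    """Number of occurrences of each n-gram over the whole dictionary."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts

def score(word, counts, n):
    """Average dictionary count of the word's n-grams (None if it has none)."""
    grams = ngrams(word, n)
    if not grams:
        return None  # assumption: words shorter than n are left unscored
    return sum(counts[g] for g in grams) / len(grams)

def rank(words, n):
    """(score, word) pairs by ascending score, ties in alphabetical order."""
    counts = build_counts(words, n)
    scored = [(score(w, counts, n), w) for w in words]
    return sorted((s, w) for s, w in scored if s is not None)

# Toy dictionary: "ubac" only has 2-grams seen once, so it ranks as strangest.
for s, w in rank(["toto", "tata", "ubac"], 2):
    print(w, s)
```

In this toy dictionary, "toto" and "tata" each average (2 + 1 + 2) / 3, because "to" and "ta" occur twice, whereas every 2-gram of "ubac" occurs exactly once.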

And indeed, it turns out this approach is quite good at identifying words that look strange or that look common. Let us first prepare a dictionary of French words that are not derived forms, removing those containing strange characters or consisting only of consonants:

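cut -f1,14 lexique |       # keep field 1 (the word) and field 14
  sed 1d |                 # drop the first line (presumably the header)
  grep '1$' |              # keep rows where field 14 ends in 1
  cut -f 1 |               # keep only the word
  uniq |                   # collapse adjacent duplicates
  grep -v "['. -]" |       # drop words with an apostrophe, dot, space, or hyphen
  grep -v '^[bcdfghjklmnpqrstvwxzç]*$' > words.txt  # drop consonant-only words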
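For readers less at ease with grep, the last two filters of this pipeline can be restated as a small Python predicate. This is only an illustrative equivalent (the name `keep_word` is made up), and it leaves out the parsing of the Lexique file itself:

```python
import re

# Same character classes as in the two `grep -v` commands above.
FORBIDDEN = re.compile(r"['. -]")
CONSONANTS_ONLY = re.compile(r"[bcdfghjklmnpqrstvwxzç]*")

def keep_word(word):
    """Return True if `word` would survive both `grep -v` filters."""
    if FORBIDDEN.search(word):
        return False  # contains an apostrophe, dot, space, or hyphen
    if CONSONANTS_ONLY.fullmatch(word):
        return False  # consists only of the listed consonants (or is empty)
    return True

print([w for w in ["ubac", "jusqu'", "pst", "y"] if keep_word(w)])
```

Note that "y" is not in the consonant class, so a word like "y" is kept.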

Now run the script ubac.py (originally written by Armavica). Here are the 100 weirdest words, for n = 1, 2, 3, 4, sorted by ascending score (following the definition above) and breaking ties by alphabetical order:

n=1     n=2     n=3      n=4
à       fy      acmé     abbé
çà      où      acné     abcès
ô       éwé     adp      abyme
jà      jà      bey      acmé
y       çà      cañon    acné
fy      dû      dey      afghan
dû      aa      deçà     ailé
by      là      doña     aimé
gy      sûr     dyke     aixois
déjà    mûr     déçu     ajax
khôl    eh      fez      alcôve
âgé     yéyé    fox      alezan
jazzy   fût     foëne    alizé
yéyé    khôl    guzla    allô
gym     oh      hâve     almée
fâché   tôt     ibm      aloès
jazz    jèze    khôl     amibe
bâché   uht     kiwi     amok
éwé     by      kohl     ankh
évêché  gy      lek      apax
buggy   pèze    new      apex
husky   ok      ovni     area
pêché   rhô     oïl      arkose
fêlé    rôt     più      arum
dès     dès     reçu     arôme
bombyx  kohl    rhô      asdic
là      kiwi    señor    aulx
hobby   tweed   tek      awacs
hé      ça      uht      axel
hâlé    oïl     wax      azulejo
bobby   dyke    wigwam   azéri
zébu    kawa    yue      aède
ès      août    yéyé     aéré
haïk    têt     zeb      aïkido
jubé    ako     zob      aïoli
chômé   aïe     âgé      baht
hugh    vu      éwé      baie
junky   wax     zemstvo  banjo
bé      web     muon     baou
bébé    ghât    noël     barzoï
jugé    ès      soja     basé
puff    lès     yeah     beagle
déçu    yakuza  foehn    berk
guppy   oxo     vodka    besef
bédé    zozo    abc      beuh
whisky  zizi    bof      bezef
époxy   mât     bye      bief
puy     jazzy   cèpe     binz
psy     kazakh  dao      birbe
fox     oka     djemââ   bizou
bouzy   ski     dès      bled
ghât    bât     faf      blob
ptyx    axe     grâce    bobsleigh
qu      aïd     jeûne    bock
funky   rêvé    kawa     bodega
body    ska     kif      boer
dé      gym     kot      boghei
zob     skaï    lez      bohême
lâché   kiki    nok      bolge
dévoyé  skunks  nèpe     boskoop
job     zeb     piu      boxon
bluff   aï      puzzle   bozo
pédégé  lynx    sikh     bref
câblé   ah      zicmu    brize
flux    zoo     zip      brol
wax     bye     âge      brook
fût     jazz    île      bufo
psyché  kaki    nikkei   buggy
bug     je      bézef    bunraku
zouk    ajax    taïga    buée
hypo    yak     trèpe    bâté
box     bézef   dîme     bébé
vu      yoyo    naevus   bédé
zéphyr  ka      ptyx     béer
zélé    kayak   râpé     bénef
pédé    jojo    york     bésef
décès   zef     abîme    bévue
boy     jaja    bezef    bézef
mêlé    jeep    shôgun   bôme
haïku   hugh    fax      cajun
yak     junky   geez     canyon
djemââ  ptyx    glèbe    casoar
dégât   buggy   gêné     catgut
goy     funky   ils      cavum
sulky   aya     lao      caïd
bât     âgé     laïus    caïman
haïkaï  ya      lès      cañon
jusqu   excès   moho     ceci
phylum  lys     nez      cheap
péché   oye     ouïe     chez
off     râpé    ouïr     ciao
jèze    foehn   paf      ciré
hadj    psy     puy      city
kazakh  rugby   puîné    clac
pépé    zébu    rez      clef
décédé  bezef   saké     coati
bégum   skip    soft     cobra
huppé   puy     sûr      cohue
baby    yeah    web      coir
humbug  bé      yin      coke

And the 100 least strange, by decreasing score:

n=1          n=2           n=3               n=4
e            ter           ion               mention
erre         enter         que               ration
tee          mer           mention           ossement
errer        on            entement          salement
ire          renter        ente              vilement
terre        tenter        cent              râlement
reine        inter         menteur           pâlement
rire         en            lentement         bêlement
are          intenter      vent              cation
terrer       tinter        dent              gisement
renier       entier        lent              nation
enterrer     conter        gent              entement
terrier      contenter     mentalement       mentalement
retirer      rentier       enter             cillement
te           entente       mentir            rationnement
et           ente          entente           virement
rare         enterrer      mental            vitement
raie         entrer        mentez            parement
aire         errer         menton            finement
ente         rentrer       ossement          ornement
ne           rente         rationnement      durement
en           lent          cillement         purement
raire        tente         entablement       bêtement
terne        terrer        mentionner        vêtement
rente        content       mentor            rarement
entre        anti          virement          âprement
enter        ponter        dément            âcrement
retraire     linter        chiquement        sûrement
retenter     lente         sagement          mûrement
reinette     monter        enterrement       vivement
renne        inti          ciment            uniquement
tertre       mener         piment            sagement
ter          sentier       connement         tique
retreinte    venter        gisement          lavement
renter       ante          finement          tellement
entrer       canter        uniquement        chiquement
inerte       retenter      entassement       pavement
entier       intention     salement          bellement
area         menterie      menthe            rudement
enserrer     entement      jument            fadement
rentrer      contrer       ration            logement
serre        rater         moment            jugement
terrine      tiser         identiquement     armement
terrien      contentement  durement          gaiement
retenir      contention    purement          paiement
rentier      tirer         agent             tapement
inertie      tintin        parement          mêmement
tire         lentement     dûment            sapement
rite         tentant       vilement          lapement
relire       mention       vitement          gréement
relier       dentier       sentiment         payement
lierre       miserere      tellement         fixement
serrer       hanter        entichement       cassement
erreur       seriner       menterie          passement
rien         interner      gentiment         bassement
rein         enterrement   cation            tassement
nier         senti         entêtement        mental
trier        conte         pratiquement      mentir
tirer        serrer        bellement         connement
terri        tintement     stationnement     nettement
araire       continent     nation            bonnement
retraite     pinter        contentement      mollement
entente      rentrant      cassement         follement
ternaire     entrant       râlement          oralement
irriter      sente         pâlement          avalement
tertiaire    remonter      bêlement          également
entretien    vanter        passement         nullement
entretenir   entre         entendement       stationnement
ore          attenter      bassement         étalement
oie          intimer       logement          noblement
terrestre    centrer       éventrement       mentez
tette        trente        tassement         roulement
retenue      orienter      actionnement      stablement
tare         montrer       rarement          logiquement
rate         reconter      amendement        utilement
taie         alerter       entièrement       seulement
enterreur    entourer      serment           isolement
rater        der           sentimentalement  feulement
tirette      rencontrer    mollement         amplement
tiare        renier        âprement          diablement
taire        ratier        mens              parlement
verrerie     menton        bonnement         vrillement
tente        patienter     lamentablement    ululement
traire       contenant     ornement          amusement
ratier       cantiner      logiquement       dément
trente       raconter      follement         raclement
tenter       fermenter     entre             versement
teinte       cranter       alitement         drôlement
trentenaire  merisier      tente             ciment
teinter      tin           âcrement          règlement
retentir     serin         rente             frôlement
narrer       teinter       vivement          hurlement
aine         terrier       fadement          piment
arien        errant        tentation         giclement
rainer       lier          comiquement       actionnement
recette      rentamer      mûrement          bâillement
tienne       ventiler      sûrement          menton
interner     riser         lavement          baisement
interne      te            tintement         tintement
resserrer    ralenti       nullement         mentor

Note how, for 1-grams, the least weird word is "e", which consists of the single most common letter in French; by contrast, for 4-grams, the least weird words are those which contain common 4-letter sequences.

You can download the complete results with the scores, for n = 1, 2, 3, 4. In case you wonder why the script is called ubac.py: this is just because Armavica and I got started by stumbling on the word "ubac" and wondering how to quantify the fact that it looks weird. Indeed, it is around position 1500 out of around 40k words for n < 4, and, being four letters long, it is itself a 4-gram that occurs only once.

Version française

Le (multi)ensemble des n-grammes d'un mot (au niveau des caractères) est formé de ses séquences de n caractères contigus. Par exemple, les 2-grammes de "gramme" sont "gr", "ra", "am", "mm", "me". Les doublons sont comptés : par exemple, pour "toto", les 2-grammes sont "to", "ot" et "to". Étant donné le dictionnaire de tous les mots d'une langue, on peut calculer facilement le multiensemble de tous les n-grammes.

En fait, ce multiensemble est assez caractéristique de la langue. Par exemple, pour identifier la langue dans laquelle un texte est écrit, il suffit généralement de calculer ses n-grammes, normaliser le résultat, et comparer la distribution ainsi obtenue à celle de langues connues : généralement, la distribution la plus proche est celle de la langue dans laquelle le texte est écrit.

Je m'intéresse ici à une idée voisine : calculer les n-grammes pour un dictionnaire de la langue française obtenu à partir de Lexique, puis prendre chaque mot du dictionnaire et calculer la moyenne, sur les n-grammes du mot, du nombre d'occurrences de chaque n-gramme dans le dictionnaire. Plus ce nombre est grand, plus les n-grammes du mot sont répandus.

De fait, cette approche fonctionne plutôt bien pour identifier des mots qui ont l'air bizarres ou qui ont l'air normaux. Préparons tout d'abord un dictionnaire des mots français qui ne sont pas des formes dérivées, en supprimant ceux qui contiennent des caractères bizarres et ceux qui ne contiennent que des consonnes :

cut -f1,14 lexique |
  sed 1d |
  grep '1$' |
  cut -f 1 |
  uniq |
  grep -v "['. -]" |
  grep -v '^[bcdfghjklmnpqrstvwxzç]*$' > words.txt

Passons à présent ce résultat au script ubac.py (dont la version initiale a été écrite par Armavica). Voici les 100 mots les plus bizarres, pour n allant de 1 à 4, triés par score croissant suivant la définition ci-dessus, puis par ordre alphabétique en cas d'égalité :

n=1     n=2     n=3      n=4
à       fy      acmé     abbé
çà      où      acné     abcès
ô       éwé     adp      abyme
jà      jà      bey      acmé
y       çà      cañon    acné
fy      dû      dey      afghan
dû      aa      deçà     ailé
by      là      doña     aimé
gy      sûr     dyke     aixois
déjà    mûr     déçu     ajax
khôl    eh      fez      alcôve
âgé     yéyé    fox      alezan
jazzy   fût     foëne    alizé
yéyé    khôl    guzla    allô
gym     oh      hâve     almée
fâché   tôt     ibm      aloès
jazz    jèze    khôl     amibe
bâché   uht     kiwi     amok
éwé     by      kohl     ankh
évêché  gy      lek      apax
buggy   pèze    new      apex
husky   ok      ovni     area
pêché   rhô     oïl      arkose
fêlé    rôt     più      arum
dès     dès     reçu     arôme
bombyx  kohl    rhô      asdic
là      kiwi    señor    aulx
hobby   tweed   tek      awacs
hé      ça      uht      axel
hâlé    oïl     wax      azulejo
bobby   dyke    wigwam   azéri
zébu    kawa    yue      aède
ès      août    yéyé     aéré
haïk    têt     zeb      aïkido
jubé    ako     zob      aïoli
chômé   aïe     âgé      baht
hugh    vu      éwé      baie
junky   wax     zemstvo  banjo
bé      web     muon     baou
bébé    ghât    noël     barzoï
jugé    ès      soja     basé
puff    lès     yeah     beagle
déçu    yakuza  foehn    berk
guppy   oxo     vodka    besef
bédé    zozo    abc      beuh
whisky  zizi    bof      bezef
époxy   mât     bye      bief
puy     jazzy   cèpe     binz
psy     kazakh  dao      birbe
fox     oka     djemââ   bizou
bouzy   ski     dès      bled
ghât    bât     faf      blob
ptyx    axe     grâce    bobsleigh
qu      aïd     jeûne    bock
funky   rêvé    kawa     bodega
body    ska     kif      boer
dé      gym     kot      boghei
zob     skaï    lez      bohême
lâché   kiki    nok      bolge
dévoyé  skunks  nèpe     boskoop
job     zeb     piu      boxon
bluff   aï      puzzle   bozo
pédégé  lynx    sikh     bref
câblé   ah      zicmu    brize
flux    zoo     zip      brol
wax     bye     âge      brook
fût     jazz    île      bufo
psyché  kaki    nikkei   buggy
bug     je      bézef    bunraku
zouk    ajax    taïga    buée
hypo    yak     trèpe    bâté
box     bézef   dîme     bébé
vu      yoyo    naevus   bédé
zéphyr  ka      ptyx     béer
zélé    kayak   râpé     bénef
pédé    jojo    york     bésef
décès   zef     abîme    bévue
boy     jaja    bezef    bézef
mêlé    jeep    shôgun   bôme
haïku   hugh    fax      cajun
yak     junky   geez     canyon
djemââ  ptyx    glèbe    casoar
dégât   buggy   gêné     catgut
goy     funky   ils      cavum
sulky   aya     lao      caïd
bât     âgé     laïus    caïman
haïkaï  ya      lès      cañon
jusqu   excès   moho     ceci
phylum  lys     nez      cheap
péché   oye     ouïe     chez
off     râpé    ouïr     ciao
jèze    foehn   paf      ciré
hadj    psy     puy      city
kazakh  rugby   puîné    clac
pépé    zébu    rez      clef
décédé  bezef   saké     coati
bégum   skip    soft     cobra
huppé   puy     sûr      cohue
baby    yeah    web      coir
humbug  bé      yin      coke

Voici les 100 mots les moins étranges, par score décroissant :

n=1          n=2           n=3               n=4
e            ter           ion               mention
erre         enter         que               ration
tee          mer           mention           ossement
errer        on            entement          salement
ire          renter        ente              vilement
terre        tenter        cent              râlement
reine        inter         menteur           pâlement
rire         en            lentement         bêlement
are          intenter      vent              cation
terrer       tinter        dent              gisement
renier       entier        lent              nation
enterrer     conter        gent              entement
terrier      contenter     mentalement       mentalement
retirer      rentier       enter             cillement
te           entente       mentir            rationnement
et           ente          entente           virement
rare         enterrer      mental            vitement
raie         entrer        mentez            parement
aire         errer         menton            finement
ente         rentrer       ossement          ornement
ne           rente         rationnement      durement
en           lent          cillement         purement
raire        tente         entablement       bêtement
terne        terrer        mentionner        vêtement
rente        content       mentor            rarement
entre        anti          virement          âprement
enter        ponter        dément            âcrement
retraire     linter        chiquement        sûrement
retenter     lente         sagement          mûrement
reinette     monter        enterrement       vivement
renne        inti          ciment            uniquement
tertre       mener         piment            sagement
ter          sentier       connement         tique
retreinte    venter        gisement          lavement
renter       ante          finement          tellement
entrer       canter        uniquement        chiquement
inerte       retenter      entassement       pavement
entier       intention     salement          bellement
area         menterie      menthe            rudement
enserrer     entement      jument            fadement
rentrer      contrer       ration            logement
serre        rater         moment            jugement
terrine      tiser         identiquement     armement
terrien      contentement  durement          gaiement
retenir      contention    purement          paiement
rentier      tirer         agent             tapement
inertie      tintin        parement          mêmement
tire         lentement     dûment            sapement
rite         tentant       vilement          lapement
relire       mention       vitement          gréement
relier       dentier       sentiment         payement
lierre       miserere      tellement         fixement
serrer       hanter        entichement       cassement
erreur       seriner       menterie          passement
rien         interner      gentiment         bassement
rein         enterrement   cation            tassement
nier         senti         entêtement        mental
trier        conte         pratiquement      mentir
tirer        serrer        bellement         connement
terri        tintement     stationnement     nettement
araire       continent     nation            bonnement
retraite     pinter        contentement      mollement
entente      rentrant      cassement         follement
ternaire     entrant       râlement          oralement
irriter      sente         pâlement          avalement
tertiaire    remonter      bêlement          également
entretien    vanter        passement         nullement
entretenir   entre         entendement       stationnement
ore          attenter      bassement         étalement
oie          intimer       logement          noblement
terrestre    centrer       éventrement       mentez
tette        trente        tassement         roulement
retenue      orienter      actionnement      stablement
tare         montrer       rarement          logiquement
rate         reconter      amendement        utilement
taie         alerter       entièrement       seulement
enterreur    entourer      serment           isolement
rater        der           sentimentalement  feulement
tirette      rencontrer    mollement         amplement
tiare        renier        âprement          diablement
taire        ratier        mens              parlement
verrerie     menton        bonnement         vrillement
tente        patienter     lamentablement    ululement
traire       contenant     ornement          amusement
ratier       cantiner      logiquement       dément
trente       raconter      follement         raclement
tenter       fermenter     entre             versement
teinte       cranter       alitement         drôlement
trentenaire  merisier      tente             ciment
teinter      tin           âcrement          règlement
retentir     serin         rente             frôlement
narrer       teinter       vivement          hurlement
aine         terrier       fadement          piment
arien        errant        tentation         giclement
rainer       lier          comiquement       actionnement
recette      rentamer      mûrement          bâillement
tienne       ventiler      sûrement          menton
interner     riser         lavement          baisement
interne      te            tintement         tintement
resserrer    ralenti       nullement         mentor

Remarquez comme le mot le moins étrange en termes de 1-grammes est "e", qui consiste de la seule lettre "e" qui est la plus courante en français. Pour les 4-grammes, en revanche, les mots les moins étranges sont ceux qui se composent de séquences de 4 caractères qui sont fréquentes.

Vous pouvez télécharger les résultats complets avec le score, pour n valant 1 2 3 4. Pour l'anecdote, le script s'appelle ubac.py pour la raison suivante : Armavica et moi étions tombés sur le mot "ubac" et nous demandions comment quantifier le fait qu'il a l'air bizarre. De fait, le mot se classe vers la position 1500 (sur 40000 mots environ) pour n < 4, et c'est un 4-gramme unique.

High priority free software projects


The Free Software Foundation just sent out an email asking for feedback about their list of High Priority Free Software Projects. As they said, "we encourage you to publish your thoughts independently (e.g., on your blog) and send a us a link". So I thought I'd answer. I'm an FSF member (just to support them; I'm not actively involved in anything), but of course my answer does not carry any special weight.

The points of this list are not really given in any specific order. The first three points cover most of the situations where I use proprietary software, so it's a bit geared towards my own needs and those of people around me. It excludes things like video games, because I feel that video games are works of art and not just software: it seldom makes sense to replace them with a free alternative, just as you wouldn't replace a copyrighted movie with an "alternative" to that movie. The last two points are less about software proper, and more about general issues.

Developing an alternative to Skype

I use Skype to communicate with remote colleagues (aka coauthors, in researcher parlance). It is the only solution used by people around me for conference calls, except when they can't get Skype to work, in which case they sometimes switch to Google Hangouts, which is not really better as it may require a non-free browser plugin and in any case requires a Google account (so you cannot self-host it).

As Skype is not federated or interoperable either, you must use the Skype client to talk to Skype users, which makes it especially hard to replace Skype with a free alternative. (Back in the days of text messaging, I was able to switch from MSN Messenger to alternative clients because the protocol had been reverse-engineered, but it seems like this has not been done for Skype.)

To replace Skype, I would need a drop-in replacement that just works and that I could try to advertise. I tried to find such a solution some time ago, and tested various SIP clients in this hope, and couldn't find anything suitable. Skype is not even perfect, in fact, and sometimes does not work, but it works much more often than the alternatives.

Of course, as SIP is federated, SIP clients need you to open a SIP account before you can use them (some of them facilitate this process); this is an additional complication as compared to Skype, but OK, it is unavoidable, and you have to register for Skype as well. But then, the interface of these clients is ugly and ridiculously hard to understand even for experienced computer users. Worse, even when both parties are using the same client, it usually doesn't work: the call doesn't go through, there is no sound or distorted sound, the connection fails, etc. Worse still, there are often no helpful error messages, logs, or other debug information to troubleshoot the problem.

There are a lot of hard problems to address to make a solution that works. If both users are behind NAT, you need NAT traversal. Sadly, with existing solutions, it doesn't work, even if you are root on a public server that you could use to relay the traffic. Also, if the connection is flaky, you want to avoid gaps in the audio. I suspect that Skype is fairly clever behind the scenes, as I sometimes hear it buffer speech during network hiccups and then send it and make up for the lag by speeding up speech and skipping blanks, so you don't miss anything. All of this is not so trivial.

Relying on Skype means relying on proprietary software, but it also means relying on a non-federated and non-interoperable protocol (and thus contributing to the network effect), and it also poses a security issue because there is no secrecy on the contents of Skype chats (as opposed to, say, SIP with ZRTP). In fact, the CNRS, the largest French governmental research organisation, which is related to my employer, recommends against the use of Skype. Yet such recommendations are ignored by all researchers I know, even when they are personally committed to the principles of free software: they need to work with other researchers who use Skype, and have no usable alternative to advertise.

What's needed: Developing a Skype alternative that works, both with a usable interface and the required backend features. Maybe WebRTC is a promising direction.

Developing and promoting an alternative to Dropbox

Some of my coworkers (not the majority) are starting to use Dropbox to collaborate on documents, instead of version control systems (VCSes), because it is easier. So I use Dropbox when I need to collaborate with other people who do, and I have no drop-in self-hosted replacement to offer.

Using Dropbox in this context is essentially laziness, as the complexity of setting up a VCS usually pays off in terms of added power: computing and reviewing diffs, using blame, having commit messages, branching, proper conflict handling, etc. Yet, in some contexts, the hassle of updating, committing, and pushing manually is not worth it, and it is better to automatically sync changes because there will often be no conflicts and little need to review old versions.

I would also need such a system to synchronize data between my machines. Manual rsync (or cronjobs) often suffices, sometimes VCSes are the right tool, yet sometimes automatic sync is all you want, especially on phones where manual solutions are impractical. So far I have resisted the temptation of using Dropbox for this, but I would be interested in a tool that would have the convenience of Dropbox.

This is not so obvious to engineer. It should rely on inotify or offer a FUSE FS to avoid polling for changes and sync with no noticeable lag (which excludes Unison). It should offer file access when you are offline (which excludes sshfs-based solutions). It should use bandwidth sparingly and avoid retransferring existing content (even doing this across files with different names, which is fine in theory but not done, e.g., by rsync). It should also run on Android so I can sync data with my phone. It should ideally rely on a VCS to resolve conflicts automatically when possible, and to offer sensible tools for manual resolution when needed (although the situation should occur rarely: only, e.g., when changing the same file on two different machines with one of them offline). I want a tool that solves only this problem and solves it well (which excludes ownCloud). Also, it should have good documentation and tutorials, which excludes git-annex (I still haven't understood whether it does what I want or not).

What's needed: Developing a Dropbox alternative that is easy to use, and documenting how to use it.

Making mobile phones usable with free software

Most Android applications are proprietary and distributed through the proprietary Play Store. The only libre alternative that I know of is F-Droid, and I use it exclusively, but it has about 1,300 apps listed as of this writing, whereas Google Play has a thousand times more. What is more, many proprietary apps are hard to replace, because they are designed to access the backend of specific services (train companies, banks, postal services, even public institutions, etc.): there are many such services, and they often don't have a public API.

Even forgetting about the apps, it is essentially impossible to avoid proprietary software on mobile phones. At the operating system (OS) level, Android used to be touted as open. Now the visible face of Android is riddled with proprietary Google applications. The pure open-source side of Android, AOSP, is barely keeping up. I use CyanogenMod to have Android without the proprietary Google stuff, but it has many limitations. For instance, the TextSecure SMS encryption service cannot be used without the proprietary Google Mobile Services, even though it is itself open-source (and bundled with CyanogenMod).

In any case, CyanogenMod does not solve the problem of proprietary drivers and firmware. The Replicant project does, but currently supports only 12 devices, and only partially (for none of them are both telephony and GPS supported, for instance). At least it's a start, but hard to use for now. Then there is the issue of the radio firmware: to my knowledge the only libre implementation is OsmocomBB, with very narrow device support and no 3G. This is especially worrying: the radio firmware has unrestricted access to the entire phone and is constantly connected to an untrusted network. Replicant developers have found suspicious things there, and more may be lurking.

What's needed: supporting the development of libre radio firmware alternatives, and of Replicant (or maybe the solution lies with other phone OSes entirely, such as Firefox OS). Support the development of F-Droid. Campaign so that people understand that not everyone is using Android or iOS, and that it is important to provide an alternative to non-portable apps (e.g., a good website).

Fighting back OS bundling and warranty exclusions

In France, as in many countries, most PCs are bundled with a Microsoft Windows license, even though it is illegal to bundle the sale of a service (the license) with a good (the hardware). Thus, when buying my Lenovo Yoga 13, I could not avoid buying a Windows license with it, and could not get it refunded from the manufacturer (Lenovo). (I had looked for comparable OS-free machines that I could buy instead, but the choice was too limited.) Of course, bundling does not apply only to Windows, but also to Apple computers. You can buy Mac OS separately, so it is not a free service, and it is bundled with the machine (even though it is leased to you by the same company that is selling you the hardware and does not expect you to run a different OS).

Going to the courts is not a realistic option: a Linux user sued Lenovo France in 2007 and the case is still open. The latest episode is a February 2014 ruling by the Cour de Cassation (France's court of last resort) which cancelled a previous verdict and requested that the affair be judged a third time.

Advocates of bundling contend that customers cannot install operating systems themselves (and of course everyone uses Windows), so bundling is justified. The obvious counter-argument is that you could sell the hardware and sell separately an activation code for the preinstalled OS. In fact, now that Windows licenses need to be activated with Microsoft through the Internet, this solution would be essentially trivial to implement. Yet nothing has changed.

A related problem is the question of warranty. PC manufacturers tried for some time to avoid offering hardware warranty to users who had installed a different OS. Now mobile phone manufacturers will argue that jailbreaking, or installing a different Android, will void your warranty. I believe there should be a clear separation between the hardware and the software, and warranty should be offered on the hardware no matter which software is run. (For cases where buggy software could damage the hardware: if the hardware cannot withstand what the software asks it to do, it is its job to avoid getting damaged.)

Last, installing alternative OSes on computers and phones is blocked by technical measures: UEFI for computers, various locking mechanisms for phones...

What's needed: Legal support to consumers seeking reimbursement of bundled software, especially through collective means such as class action whenever possible. (There is a French support group but I'm not sure how active it is. At any rate, I could not get definite information about whether Lenovo France offers refunds for Windows 8 licenses (they don't).) Lobby for investigations into this practice using competition laws, like the Windows Media Player case in the EU. Help consumers who were denied warranty claims for a hardware problem because of the software that they run. Lobby against technical restrictions and document ways to circumvent them.

Proprietary software and centralized computing

This is not at all a software project, just a closing comment about a battle that needs to be fought, which is distinct from the question of free software, but at least as important, in my opinion.

With the advent of customer broadband connections, it became clear that people would be tempted to use remote servers to perform their computing and store their data, even at the expense of their freedom. This switch back from personal computers to dumb terminals is well underway, and shows no sign of stopping. In this light, the original concerns against proprietary software seem quite tame in comparison with cloud computing. Yes, you cannot know what a proprietary program does, it could have a backdoor and give evil people access to your data, it may save your files in a proprietary format that nothing else can read, so you can't switch and your colleagues have to use the same program. However, in the cloud, you don't even know which software is running or when it changes, the data is already in the hands of others, and you may not even be able to retrieve it at all, so you are stuck and anyone working with you is stuck as well.

Ironically, with newer devices such as phones and tablets and Chromebooks, more people are using free software, without realizing it. Of course, there are proprietary vendor-specific programs and drivers, but the core is often free. Sometimes there are technical restrictions against tinkering with the device, but sometimes there aren't. Yet people do not know or care, because what matters to them is not the device, but the remote services.

To users, this is normal: they are entrusting their computing to specialists, like they do with their electricity, their food, their housing, and so many aspects of their life. Still this is not right: people are still expected to have their own opinions, their own vote, their privacy, and people's electronic trails are more and more entangled with their private thoughts and actions... Yes, computing is less vital, but it is a finer-grained power mechanism, so it is really dangerous to give it up.

Proprietary software, and centralized computing, are two different threats at two different levels, but I find the second one maybe more worrying than the first. For the first one, free software has scored some big victories. For the second one it's not even clear which alternative model we can propose...

The absurdity of software patents: a proof by example

— updated

When I learnt about free software some time ago, I came to believe that software patents were a very insidious threat. It was dangerous that someone could claim exclusivity over an algorithm: it could stifle innovation, it could be an obstacle to independent free software implementations of useful standards, etc. I very much hoped that Europe would keep them at bay, and still remember when the corresponding directive was rejected in 2005, giving us the (relative) security that we now enjoy in the EU.

Since then, some of my work about speech recognition while interning at Google New York has been turned into a US patent application that was recently published. I was paid a patent bonus by Google and helped them fix problems in drafts of the application. This process has given me more elements to form an opinion about software patents in the US, and I have changed my mind accordingly. I no longer believe that they are an insidious threat. I now believe them to be a blatant absurdity.

I invite anyone doubting this fact to skim through the application and see for themselves. Just for fun, here are some selected bits.

Our tour begins with a wall of boilerplate on pages 2 to 4 that describes mundane banalities about computers in a stilted prose that makes a surreal attempt at complete and exhaustive generality. (Remember that the topic is speech recognition, with printers and light bulbs being key components of such systems, as we all know.)

[0058] User interface 304 may function to allow client device 300 to interact with a human or non-human user, such as to receive input from a user and to provide output to the user. Thus, user interface 304 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, still camera and/or video camera. User interface 304 may also include one or more output components such as a display screen (which, for example, may be combined with a presence-sensitive panel), CRT, LCD, LED, a display using DLP technology, printer, light bulb, and/or other similar devices, now known or later developed. User interface 304 may also be configured to generate audible output(s), via a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices, now known or later developed. In some embodiments, user interface 304 may include software, circuitry, or another form of logic that can transmit data to and/or receive data from external user input/output devices. Additionally or alternatively, client device 300 may support remote access from another device, via communication interface 302 or via another physical interface (not shown).

It seems that you have to describe the overall context of the invention from scratch, even though it implies you have to repeat the same platitudes in every single application. In case you are wondering about those numbers peppered throughout, they are actually pointers to a helpful figure that wouldn't look out of place in a 1990 monograph about computer architecture:

Figure 3

Let us now move on to the gruesome sight of mathematical language being contorted and twisted to fit this strange mode of expression. The bulk of the application reads like a scientific paper eerily addressed to lawyers from the past. Here is an example from page 11:

[0162] As noted above, the search graph may be prepared by pushing the weights associated with some transitions toward the initial state such that the total weight of each path is unchanged. In some embodiments, this operation provides that for every state, the sum of all outgoing edges (including the final weight, which can be seen as an E-transition to a super-final state) is equal to 1.

Confused? Here is the beginning of the conclusion, that wraps things up in a surprisingly inconclusive way. I think the point is to make a desperate appeal to the reader to use their common sense, but I'm not sure.

[0192] The above detailed description describes various features and functions of the disclosed systems, devices, and methods with reference to the accompanying figures. In the figures, similar symbols typically identify similar components, unless context indicates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.

The forced marriage between legalese and theoretical computer science jargon reaches its peak in the claim list. Reproducing it in full here would be both inconvenient and intolerable, but here is the general flavor:

An article of manufacture including a computer-readable medium, having stored thereon program instructions that, upon execution by a computing device, cause the computing device to perform operations comprising: selecting n hypothesis-space transcriptions of an utterance from a search graph that includes t>n transcriptions of the utterance...

It might just be me, but I find this outlandish prose pretty fascinating. Now, consider that this 32-page double-column document only describes a few simple ideas from a four-month undergraduate internship. Yet, it took over a year to produce it, and then two additional years to get the application published. (God knows when the patent will be granted, if ever.) Isn't it quite obvious now? Whatever this system is trying to achieve, it is being hacked. It can no longer keep up with the way it is being used. It would be funny if it didn't have dire consequences for innovators. It should be fixed. Really.

I am happy that this patent application was published, because at least it provides some public documentation about things I did while at Google. However, my opinion about software patents is still that they should not be given any legal credit. I am advertising this document because it describes work that I did; the point of this post is to stress that I am not endorsing the US software patent system. Also, my point here is not to criticise the people at Google who wrote this document (and are probably just following the rules of the game), or to blame Google for filing such applications (they presumably do it because everyone else does).

I hope that no one will run into trouble because of the existence of this patent. I fortunately think it is quite unlikely. If someone does, though, I put the blame on US law for ascribing any meaning whatsoever to this farce.

Why research feels really hard at first

— updated

I have tried to do research for about two and a half years now. I haven't quite figured it out yet, but I have understood a few things that I want to write down, before they seem so obvious that I don't even remember I had to notice them.

The point of this post is to explain for which reasons doing research is difficult, especially when you are getting started in the field: say, doing a research internship, or just starting a PhD. So, if your experience is not entirely positive, this may help you identify possible causes, and understand how much they may improve with time. Of course, everyone's perception is different, and maybe you legitimately dislike research, so it is difficult to tell how much my advice applies to your specific case; still, I hope it gives you food for thought. Of course, parts of what I write apply to any activity where you are getting started, not just research, but even the general points may not be immediately obvious if you aren't warned about them.

Research is intrinsically hard

You need to keep this in mind: research is actually difficult, so it is normal that you feel that way. It is normal that things progress slowly, in fits and starts, that you sometimes get discouraged, and often procrastinate. There are several reasons why real research is hard:

  • It must be new, so you need to find ideas that haven't been tried out, and solutions that no one else saw yet.
  • No one knows how to solve your problems, so no one can really help you.
  • No one knows whether things will work, so there is a fair chance your hard work will not pay off no matter what, and it usually takes many iterations to solve anything.
  • No one directly needs you to solve the problem. No one's directly looking at you and waiting for you to do the job. If you are not making progress, probably no one cares, except maybe your advisor, and then only because they are your advisor, probably not because they care intrinsically about the problem.

You may feel really bad if you think of research as just another job, and compare yourself to someone with a "normal job" who does not face the same challenges. When your job is to do something that you're proficient at, that you've done many times and that everyone understands, and where people directly need you and you just can't not do it, it's a lot easier, and you can't procrastinate as much.

It may sound trite, but I remember it wasn't obvious to me at first: researchers are essentially like artists. The only difference is that research looks more like an office job, both in terms of having an actual office, and of being paid a fixed salary. However, if you think of it as art, it seems more normal that you are not continuously productive for 8 hours every single day. You need inspiration, it takes a lot of energy because you are creating something new, and there are false starts and ideas that turn out not to work out well in practice.

You are doing only research

As a young researcher, you probably do not have a lot of diverse things to do. You don't have a backlog of email to reply to, things to read, things to write about, papers to review. More significantly, you probably do not have many administrative obligations or teaching duties. In fact, the very moment where you have the most time to devote to actual research is paradoxically when you're the least qualified to do it, while the competent researchers are swamped by other duties. It makes little sense, but that's life.

It may seem strange to claim that filling administrative forms and teaching classes can make research easier, but I think it is true, up to a certain proportion. It is good to have some part of your job where tasks are gratifying because they can be completed without turning your brain on (trivial tasks) or they make you feel useful to others (teaching). If your entire professional life is research, you feel bad if you lost one week because you messed up on something or didn't get any inspiration; but if you have also managed to do other things that aren't research, it is more palatable. Likewise, if you did research for 4 hours and goofed off for 4 hours on a given day, you feel bad. If you did 4 hours of research and spent 4 hours mindlessly replying to tons of trivial email, you feel extra productive! So you may feel better once you get to spend more time being efficient on professional things that are not hard research.

You have only one project

Not only are you doing mostly research, you probably have only one project, for example, your internship or PhD topic. Of course having too many things to do can put more pressure on you and make you lose time and energy in mental context switches. But on the other hand, having multiple projects allows you to switch from one to the other when you get bored, and more importantly allows you to move on temporarily to something else while your current project is stuck (while you wait for your collaborators to make progress, wait for your unconscious mind to come up with a new idea, etc.).

Devoting all your energy to a single thing makes you extremely vulnerable if it does not go well. In research as in life in general, it is a bad idea to put all your eggs in one basket, and you will feel more stable if your self-esteem depends on multiple independent things.

You haven't chosen your project wisely

Not only are you probably working on one project, you are also probably working on your first project. Another huge perk of having been around for longer is that you had more time to discover what you like, and do more of it, and less of stuff that you don't like. As time passes you will figure out which tasks you prefer, and which themes. You will switch to different collaborators (or, maybe, supervisors) if things aren't working out well with the current ones. Your first project is only an entry point, so on average what you will do afterwards will tend to suit you more, so it will probably be more pleasant, which will make it easier to be productive on it.

You are mostly working alone

Except for occasional supervision by your advisor, you may be the only person involved in your project. If so, be aware that this is not the usual way to do research. In my field at least, the vast majority of research happens as collaborations within small groups of people.

When you are working on a project with other people (as peers, not as supervisors), it's much harder to lose steam, because there are multiple people pushing the project forward, there is a lower chance that they are all discouraged simultaneously, and you do not want to let them down. (The energy that you get from someone supervising you is not the same as that of someone who works with you on an equal level; and feeling responsible for your commitments to peers is not the same thing as vertical accountability to your supervisor.)

The advantages of a successful collaboration are that it mathematically splits the workload, and gives you access to the skills of your collaborators for areas you may be unfamiliar with; but more importantly, it sets up a situation where the collaborators are all relying on each other, so the project moves forward.

Another aspect is that some tasks are easier to perform with other people. I find it much easier to make progress on complex issues by discussing them with someone else: of course I also need to think on my own, but I find that discussions help to flesh ideas out, and sometimes lead to new insights (also, the rubber duck effect applies). For writing, it's easier to proofread someone else's prose than your own because it is hard to spot your own mistakes, and it's easier to write if you know that someone else will tidy things up behind you. A scheme that works well is to have multiple people perform "passes" to successively write and edit the text: start with a very rough draft (already hard to spit out, but easier because it can really be crap), and then correct, correct, until you converge. You can emulate this on your own but it's harder because you need to go back to your own prose and improve something that you wrote just before.

People have invested less in you

If you are "just" an intern who recently got started, your advisor has not "invested" a lot in you yet, in comparison with more important time and money investments like ongoing PhDs or collaborations which are already successful. It is also for this reason that you are probably working on your own: no one has committed to working with you yet.

Your advisor may not be doing conscious accounting about the importance of their commitments, but, as they accumulate, those that will get sacrificed are usually the ones with the lowest sunk costs, the most distant prospects of payoff, and the smallest embarrassment in case of failure. It is sad and wasteful to inadequately supervise an intern; it is a professional failure and possible future embarrassment if you let down your PhD student, whom you've supervised for three years, and who may be staying in the field later.

There is little cure for this except time and work. As time passes and you stick around, you build up a reputation for reliability. There is a "rich get richer" effect: the more successful people are more reputable, so get more commitments, so get better chances to build up their skills, knowledge and contacts, which make them more successful still.

You don't know existing work

Another hard thing in initial research is that you have to get acquainted with all that's already been done in your field. In some areas, this can mean that you need to spend months or years reading up the relevant literature, before you can hope to contribute anything meaningful or new to the discussion. Remember that the discussion is between people who have been reading each other's work for many years. If you are a newcomer, you have a lot of catching up to do.

More locally, if your research project has been started before you, you have to understand what people have been doing before you within that specific project, which is also hard. For instance, in computer science, you may have to get acquainted with other people's code, which is much harder and duller than starting out on your own cool new project from scratch.

There are advantages to being a newcomer, though: as your perspective is shaped more by the current state of things and less by the whole history of the discussion, your mind is freer to contribute new insights. So don't worry, you will not always remain in the shade of the people who were here before you.

Your skill and confidence will improve

A very general thing: when you start doing research, you are less skilled at it, so it feels even harder; when you become more skilled, it feels less hard. Now, you can never be entirely proficient at research: as soon as you really master something, you have to move on to something else. Yet you eventually practice meta-skills such as understanding things, thinking about them, organizing your time, etc. You accumulate a culture about your field (see previous point). Last, you exercise a bunch of tangential skills: mastering your tools, your computer, LaTeX, etc. So you become more skilled as time passes, and the tasks become less hard, and less daunting.

Also, as you get started, you are not so confident that you will be able to achieve anything in research. As time passes, you achieve a few things, and so you start becoming more confident, because you have objective proof that you are competent. This point is even more general and a little bit meta, but it can be a huge deal, because lack of confidence and impostor syndrome are huge problems for a lot of people in research. It's normal, it usually gets better, and in any case you normally get used to it after a while.

Privacy in public space

— updated

We can roughly divide the world into two regions: public space, where anyone can enter and watch what happens, and private space, where only selected people can get in and see what takes place inside. Of course, there are borderline conditions, such as space being either private or public depending on time of day (e.g., parks that close at night), some space being "public" but technically private property subject to house regulations (e.g., inside a shop), and some space being "private" but with the technical possibility for outsiders to look inside (e.g., through windows, into a private home or vehicle, with convoluted mechanisms to explain why this is still a violation of privacy). Still, by and large, the distinction holds.

It seems that, by definition, there can be no expectation of privacy in public space, because, in principle, everything that happens there can be witnessed and recorded by anyone (and possibly shared publicly, although personality rights may apply). Yet, people rely on it to some extent. First, in terms of proximity: when you are with someone else (e.g., in a park) and there is no one around, you assume that your conversations are private, and (except in crowded environments) a third party cannot intrude and listen (because of the fuzzy assumption that, as long as they can sit elsewhere and enjoy an equivalent portion of the space, you should be entitled to your own spot; and that, when you talk and there is no one around, the information will not be recorded from far away). Second, in terms of unrelatedness, as people will sometimes indulge in a conversation with third parties within earshot under the assumption that they are not concerned by what is said (and politeness would require them not to pay attention). Third, in terms of continuity: you assume that no one knows your whereabouts at all times, because to do so they would have to stay close to you always, that is, follow you, and there is again an assumption that others' use of public space should not be "guided" by yours. Fourth, in terms of ephemerality: even if someone were to see or hear, you may assume that they will not retain a record of it forever, and that it is not possible to look up public space information about the past, because it was not archived.

My point is to focus on the last two privacy expectations, and show that they can break down when the notion of "public space" becomes altered by technology.

Today, a variety of entities (shops, police, etc.) have already installed CCTV cameras (within their private property or with applicable permits) to monitor public space. Cameras tend to become more and more widespread, so that more and more public space is filmed. The resulting trail of data is not eternal, for practical and legal reasons; but both limitations tend to disappear as time passes. So eventually we may assume that an uncoordinated bunch of actors will store traces which, together, could be used to reconstitute the entire history of everything visible in public space.

Now consider a second step which is currently starting to happen: CCTV cameras that upload their recordings to the cloud rather than storing them locally. This seems natural as more and more computing and storage is centralized in datacenters rather than on individual devices. Now, as I see no reason why cloud providers should not remain an oligopoly (or even become a monopoly), suddenly a growing proportion of the acquired data (in raw form) is available to a small number of actors. Incidentally, wiretapping ensures that various secret agencies also get access to the data.

Add a third step where the storage space, processing power, and algorithmic sophistication of the cloud providers go to infinity. Suddenly, all those actors have a different kind of access to public space, which is not limited by the notion of presence which intuitively applies for humans. They can know everything that happens everywhere, or happened at any point in time. I call this total public space access. This marks the collapse of the two privacy expectations I mentioned.

Of course, this has far-reaching consequences. Organizations with total public space access can know where everyone is located and the history of everywhere they went. This is problematic because of all the private information (love affairs, political organization, etc.) which is revealed by location information. (There are currently easier ways to retrieve this information, in a less precise manner, for people carrying mobile phones, but with CCTV opting out becomes much harder.) Note that this also implies you cannot privately go from point A to point B through public space, even if A and B are private... A tentative workaround would be to cover your face so that you are not recognizable, but this may be illegal, and does not suffice: people tend to usually return to their private dwellings, so that total access to public space is sufficient to establish a continuous trail for them, and thus identify them even if their appearances are indistinguishable.

Of course, this is not the only way in which unrestricted public space access challenges usual privacy expectations. Consider names on doorbells. To my knowledge, there is currently no database harvested from them that provides all addresses where a certain name appears, and people therefore do not consider that putting their real name on the doorbell divulges the information in that direction, from their name to the address. Yet this is all information available in public space, so I am not sure about the general legal framework that would prohibit the construction of a reverse database as I described.

The disappearance of privacy in public space is not necessarily a bad thing in itself: unrestricted public space access is a power, so it can be used for good, or for evil. It can be used to fight crime: while it cannot ensure that crime is altogether prevented, it ensures that crimes committed in the public space always leave a trace that can be investigated. Under the (non-obvious) assumption that this trace cannot be tampered with, it means that the objective truth of any claim about public space can be assessed. It implies that criminals can no longer run away (assuming interference powers from the police to extract criminals from a hideout in private space, and assuming that private space regions are not well-connected, as is the case in real life).

It is not clear that the provability of public space crime would make it impractical, because some criminals may not care if they will get caught; but assuming that it does, the benefit for society is not just the crimes that are no longer committed, it is much higher: it means that precautions to prevent the crimes are no longer needed (bikes, doors no longer need to be locked, stuff can be left in public space without risk), and also that some efficient rental schemes become practically applicable (if, e.g., there is no longer a risk that the rented good is not returned). Beyond crime, unrestricted access to public space gives opportunity for smarter decisions in terms of traffic, queues, shops being opened or closed, bus schedules, etc. Indeed, a lot of practical inefficiencies are the result of insufficient knowledge of public space, which (currently, and assuming that algorithms are not a problem) is usually caused by insufficient available data.

I have claimed that total public space access, under the assumptions that I outlined, will eventually become a technological possibility, and the default situation would be that a small number of organisations get it and the general public doesn't. What should be done about this?

A first option would be to legally prohibit total access to public space, or make it impossible. A good umbrella term (coined, to my knowledge, by Louis Jachiet) is that of indiscriminate data acquisition in public space. The rationale is that while people taking pictures, tourists filming monuments, etc., are acquiring information in a targeted manner, total public space access would result from CCTV, Google cars, and other technologies which perform such broad captures. Such acquisition should not necessarily be prohibited, but should become a target for regulation.

A second option would be to ensure that the resulting public space archive is available to everyone under the same terms. Indeed, much of the reason why total public space access is scary is the asymmetry between those who have it and those who don't. It means that certain companies and secret services can know anything about you (and could, e.g., prosecute you for any minor offense you commit), and yet protect themselves so that others do not know anything about them (especially, their wrongdoings would remain unpunished). Of course, organizations with more means will always stand a better chance of finding something to use against you, but society could try to ensure that citizens can at least access the data and organize to scavenge it.

In this second case, I am not sure whether the resulting society would be a good one. The panopticon is usually thought of as a bad thing, but, in another way, the fact that you have non-total visibility and memory of public space seems to me like a bug that should be fixed, not a feature. I wonder what the best compromise is.