Managing passwords with pass

I have recently migrated my passwords to pass and so far I've been really happy about it. The previous system was to have them all in a huge text file, which wasn't especially convenient or secure¹, and wasn't shared between my various machines. Here is some info about pass.

pass has been packaged for Debian since Jessie, so installing it is as simple as sudo apt-get install pass. However, it's just a shell script of a little over 600 lines, so it is really easy to review, and to install manually if you need to.

The way pass manages passwords is dead simple: a hierarchy of gpg-encrypted files. The assumption is that each file corresponds to a website, or machine, or other authentication realm, and contains the password. The use of gpg provides a layer of security, so that your gpg key and passphrase serve as a master password. Of course, it is nice to have a properly configured gpg-agent(1) to avoid having to enter the passphrase multiple times.

The basic commands of pass are pass init KEYID which sets up the store for gpg key KEYID (by default in ~/.password-store), pass FILE which decrypts and shows FILE, and pass edit FILE, which decrypts FILE to a secure temporary location in /dev/shm, edits it, and encrypts it back. You can also use pass ls (which shows a nice output using tree), pass find to search for files using find, pass grep to search in the decrypted password files using grep, and pass rm, pass mv, pass cp. Of course, you can also mess around in the password store by hand.

As pass has this very nice command-line interface, migrating my passwords from my custom system was very easy, although it seems like the Debian package also installs a bunch of scripts to migrate from other password managers.

Beyond the generic commands I have presented, pass obviously offers commands tailored for password management. You have pass insert FILE which creates FILE with the password you provide (and turns off echo and makes you enter it twice for confirmation). You have pass -c FILE which copies the password in FILE to the clipboard, so you can input the password where you need it, and automatically clears it after 45 seconds (which is a reasonable thing to do). You have pass generate FILE LENGTH which generates a password of LENGTH chars in FILE and displays it (or copies it to the clipboard with -c); what is very nice is that pass itself does not include password generation logic, but entrusts pwgen(1) with the task.

Icing on the cake: pass is designed to be used with git, and provides pass git to call git commands. If you use git, all the pass commands will automatically git commit what's needed. This makes it very easy to share passwords between different machines. Of course, as the files are encrypted, git cannot be expected to resolve conflicts within files, but it can nicely merge changes across various files. You can also use this setup to share passwords between different people, as pass supports encrypting for multiple keys.

For once, I struggle to find anything to dislike about pass. Eventually I may want to tweak password generation so that it generates passwords the way I'm used to, but this would be easy to do. I'm also missing support for usernames, as I use different usernames on different websites. However, pass allows you to store anything in the password file (and only the first line is taken into account for pass -c and others), so I can just add the username as the second line if needed; I will just have to retrieve it by hand, or script something that does what I want. Other than that, I'm very happy to have a convenient, lightweight, and secure way to manage my passwords and share them across machines using git.
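Here is a minimal sketch of what "scripting something" for usernames could look like, assuming the convention described above: the password on the first line and the username on the second. The parse_entry helper, the Entry type, and the sample text are all hypothetical; in practice the text would be whatever pass FILE prints after decryption.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    password: str
    username: Optional[str]


def parse_entry(decrypted: str) -> Entry:
    """Split a decrypted entry into its password and optional username.

    Assumed convention (not enforced by pass): line 1 is the password,
    line 2, if present and non-empty, is the username.
    """
    lines = decrypted.splitlines()
    password = lines[0] if lines else ""
    second = lines[1].strip() if len(lines) > 1 else ""
    return Entry(password, second or None)


# Hypothetical entry content, standing in for the output of `pass FILE`.
sample = "correct horse battery staple\nalice\n"
entry = parse_entry(sample)
print(entry.username)
```

Keeping the password on the first line means that pass -c keeps working unchanged; only the username needs the extra script.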

¹ My home partition is encrypted, but there was no security whatsoever if the machine was ever compromised.

Managing installed packages in Debian

This post is about how I manage the apt packages that are installed on my Debian systems.

Left to myself, I tend to apt-get install various things every now and then: to try them out, or because I temporarily need them. In most cases it turns out I never really use them, but of course I will never remember to clean them up. Eventually my / partition fills up and I have to waste time tracking down which packages are useless. (Of course, accumulating useless packages is also a bad idea in terms of security, performance, etc.)

If I want to back up my current selection of packages, I could dump the output of apt-mark showmanual somewhere, but that's not really satisfactory; the list of packages that I use should be stored as a first-class citizen in my config, not just obtained from the current system state.

If I need to set up a new machine, I can install all the packages that were installed on the previous one, but this will end up downloading gigs of packages including lots that I don't really care about. Of course, as soon as I start using several different machines, it is necessary to install on all machines the new packages that I need, so the installed packages must be kept in sync somehow. Otherwise, I have to waste time watching apt-get installing stuff every time I run a command on a host and realize that a package is missing. (However, remember that most installed packages are not really important and probably don't need to be synchronized at all.) All of this is made worse by the fact that not all machines are equal (I don't want to install graphical stuff on servers, laptop tools should only go on laptops, etc.).

My current answer to this is to maintain in my public configuration repository a bunch of program lists for various kinds of uses. These lists are synchronized between machines using git, as is the rest of my configuration. They contain only the packages that I really have some use for, rather than all the random crap I have ever installed and don't remember the point of¹. (This being said, I don't care about them being minimalistic, and I'm OK with including large tools that I use rarely, as long as there are reasonable odds I will use them again.)

I have a private file that lists, for each of my hosts, which program selections it needs from these lists, for instance:

my-laptop-1 minimal laptop server util
my-server minimal server

New packages that I install don't go to this list by default, but my crontab on each host mails me every month the diff between the currently manually selected packages on that host and the ones which should be installed according to the host's list. Here is the script: check-packages.sh.

From this monthly report, I can then update the list by removing the new installed packages that I don't need after all, putting the ones that I do need in the right selection, and installing the ones which should be installed. (And, of course, postponing the ones I haven't yet made up my mind about.)
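To make the comparison concrete, here is a rough sketch of the logic behind such a report. It is not check-packages.sh itself: the function names, the selection contents, and the installed set are hypothetical, and the installed set is assumed to have been parsed from something like the output of apt-mark showmanual.

```python
def expected_packages(host, host_selections, selections):
    """Union of the packages in every selection assigned to `host`."""
    expected = set()
    for name in host_selections[host]:
        expected |= selections[name]
    return expected


def report(installed, expected):
    """Return (missing, unexpected) as sorted lists of package names."""
    missing = sorted(expected - installed)      # listed but not installed
    unexpected = sorted(installed - expected)   # installed but not listed
    return missing, unexpected


# Hypothetical data, for illustration only.
selections = {
    "minimal": {"git", "vim"},
    "server": {"nginx"},
    "laptop": {"powertop"},
}
host_selections = {"my-server": ["minimal", "server"]}
installed = {"git", "vim", "cowsay"}  # assumed parsed from `apt-mark showmanual`

expected = expected_packages("my-server", host_selections, selections)
missing, unexpected = report(installed, expected)
print("to install:", missing)
print("to review:", unexpected)
```

The "to review" list is the interesting one: each of its packages either gets removed or gets promoted to one of the selections.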

The system is of course fairly rudimentary: there is no dependency between package selections, the dependencies of every piece of software that I compile myself have to be listed in a separate file, the sync process is manual, etc. Yet I am happy that I have, at last, one central list of the packages that I consider useful to have on my systems; as a bonus I can even share it with the world.

¹ Of course, the hard part was to clean up the currently installed packages so as to come up with this list in the first place.

Migrating to pelican

This blog used to be generated by Fugitive, a static blog engine¹ backed by git. Fugitive is very neat and minimalistic, but it is also written in pure sh, and its design makes it spawn a lot of processes, which makes it unacceptably slow for me.

Rather than rewriting a similar engine with a different design, I chose to migrate to Pelican, a more common static blog engine in Python. I hoped that this would be simpler than rolling my own. It still took several hours to set everything up correctly, rewrite the templates, understand the way it works, etc. However, everything should be fine now, and very little should have changed relative to the fugitive version. I hope nothing subtle broke in the process.

In terms of performance, pelican takes about 2 seconds to recompile the entire blog, as opposed to around 30 seconds for fugitive to recompile one page (most of which is spent compiling the archives page), and far more to recompile all pages when templates have changed. (This being on fairly recent machines.) It also looks like pelican's processing time comes from the constant overhead of loading its own files, rather than from the blog size; I guess time will tell. The new setup also neatly separates the git versioning of the blog source from the tools used to generate it, rather than relying on fugitive's hooks, which I always found a bit too brittle for my liking.

As an added benefit, I have set up some elaborate things which I had never taken the time to configure with fugitive: using markdown to write blogposts, having syntax highlighting in code snippets with Pygments, math rendering with MathJax, footnotes, etc. Categories, tags and languages should also be easy to set up if needed.

Pelican does have a few downsides: it is heavy and a bit complex to understand, and it requires separate plugins for various things, which you also have to install. Most of its complexity comes from being too generic for my needs: its config tries to abstract away things which I'm OK with hardcoding (e.g., in the templates). Also, pelican does not interact with git at all, and it is a bit silly that article creation and modification times must be recorded as metadata in the articles even though they are available from git. I intend to rely on simple scripts to save me from having to add this kind of info to files by hand.

¹ A static blog engine is one where the entire blog is generated on my side to HTML files, and the files are then uploaded to a vanilla HTTP server; the HTTP server is unaware of the logic of the blog. I prefer this because it saves me from having to configure a fancy Web server, and allows me to write blog articles on my own machine with a real text editor, rather than writing them in-browser on the server as is typical with dynamic blog engines.

The strangest and least strange French words according to n-grams

English version

The (multi)set of (character-level) n-grams of a word consists of its sequences of n consecutive characters. For instance, the 2-grams of "gram" are "gr", "ra", and "am". Duplicates are counted, e.g., for "toto" the 2-grams are "to", "ot", "to". Given the dictionary of all words in a language, we can compute the multiset of all n-grams.
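As a small illustration of this definition, here is a sketch in Python. The function names ngrams and dictionary_ngrams are mine, and the two-word dictionary is only a toy example.

```python
from collections import Counter


def ngrams(word, n):
    """List the n-grams of `word` in order, keeping duplicates."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dictionary_ngrams(words, n):
    """Multiset (as a Counter) of the n-grams of all words in `words`."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


print(ngrams("gram", 2))   # ['gr', 'ra', 'am']
print(ngrams("toto", 2))   # ['to', 'ot', 'to']
print(dictionary_ngrams(["gram", "toto"], 2)["to"])  # 2
```

Note that a word shorter than n simply has no n-gram, so ngrams returns an empty list for it.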

It turns out that this multiset is quite characteristic of the language. For instance, to identify the language of a piece of text, it is often enough to compute its n-grams, normalize them into a frequency distribution, and compare it to the distributions of known languages: usually the closest distribution is that of the language in which the text is written.
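Here is a toy sketch of this language identification idea. Several things are assumptions of mine rather than part of the description above: the use of the L1 distance to compare distributions (other distances would work too), the choice of n = 2, and the two tiny reference sentences, which stand in for real corpora.

```python
from collections import Counter


def profile(text, n=2):
    """Frequency distribution of the character n-grams of `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(grams).items()}


def distance(p, q):
    """L1 distance between two distributions (one possible choice)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def guess_language(text, known, n=2):
    """Return the key of `known` whose profile is closest to that of `text`."""
    p = profile(text, n)
    return min(known, key=lambda lang: distance(p, known[lang]))


# Tiny hypothetical reference texts; real profiles need far more data.
known = {
    "fr": profile("le chat mange la souris et le chien dort dans la maison"),
    "en": profile("the cat eats the mouse and the dog sleeps in the house"),
}
print(guess_language("la maison du chien", known))
```

With reference texts this small the result is fragile; the approach only becomes reliable once the profiles are built from a reasonably large amount of text.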

Here I explore a related idea: compute the n-grams of a French dictionary obtained from Lexique, and then take each word of the dictionary and compute the average, over the n-grams of the word, of the number of times each n-gram was seen in the dictionary. The higher this number, the more common the n-grams of the word.
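The score can be sketched as follows. This is my own minimal reading of the definition, not the actual ubac.py script: the function names are hypothetical, the three-word dictionary is a toy, and skipping words that are shorter than n (and thus have no n-gram) is an assumption.

```python
from collections import Counter


def ngrams(word, n):
    """List the n-grams of `word` in order, keeping duplicates."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def scores(words, n):
    """Map each word to the mean dictionary count of its n-grams."""
    counts = Counter(g for w in words for g in ngrams(w, n))
    result = {}
    for w in words:
        grams = ngrams(w, n)
        if grams:  # assumption: words with no n-gram get no score
            result[w] = sum(counts[g] for g in grams) / len(grams)
    return result


# Toy dictionary, for illustration only.
words = ["terre", "errer", "ubac"]
# Ascending score, ties broken alphabetically: weirdest words first.
ranked = sorted(scores(words, 2).items(), key=lambda kv: (kv[1], kv[0]))
for word, score in ranked:
    print(word, score)
```

On this toy dictionary, "ubac" comes first because each of its 2-grams occurs only once, whereas "errer" comes last because it is made of the 2-grams that are most frequent overall.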

And indeed, it turns out this approach is quite good at identifying words that look strange or that look common. Let us first prepare a dictionary of French words that are not derived forms, removing those containing strange characters or consisting only of consonants:

cut -f1,14 lexique |
  sed 1d |
  grep '1$' |
  cut -f 1 |
  uniq |
  grep -v "['. -]" |
  grep -v '^[bcdfghjklmnpqrstvwxzç]*$' > words.txt

Now run the script ubac.py (originally written by Armavica). Here are the 100 weirdest words, for n = 1, 2, 3, 4, sorted by ascending score (following the definition above) and breaking ties by alphabetical order:

à       fy      acmé     abbé
çà      où      acné     abcès
ô       éwé     adp      abyme
jà      jà      bey      acmé
y       çà      cañon    acné
fy      dû      dey      afghan
dû      aa      deçà     ailé
by      là      doña     aimé
gy      sûr     dyke     aixois
déjà    mûr     déçu     ajax
khôl    eh      fez      alcôve
âgé     yéyé    fox      alezan
jazzy   fût     foëne    alizé
yéyé    khôl    guzla    allô
gym     oh      hâve     almée
fâché   tôt     ibm      aloès
jazz    jèze    khôl     amibe
bâché   uht     kiwi     amok
éwé     by      kohl     ankh
évêché  gy      lek      apax
buggy   pèze    new      apex
husky   ok      ovni     area
pêché   rhô     oïl      arkose
fêlé    rôt     più      arum
dès     dès     reçu     arôme
bombyx  kohl    rhô      asdic
là      kiwi    señor    aulx
hobby   tweed   tek      awacs
hé      ça      uht      axel
hâlé    oïl     wax      azulejo
bobby   dyke    wigwam   azéri
zébu    kawa    yue      aède
ès      août    yéyé     aéré
haïk    têt     zeb      aïkido
jubé    ako     zob      aïoli
chômé   aïe     âgé      baht
hugh    vu      éwé      baie
junky   wax     zemstvo  banjo
bé      web     muon     baou
bébé    ghât    noël     barzoï
jugé    ès      soja     basé
puff    lès     yeah     beagle
déçu    yakuza  foehn    berk
guppy   oxo     vodka    besef
bédé    zozo    abc      beuh
whisky  zizi    bof      bezef
époxy   mât     bye      bief
puy     jazzy   cèpe     binz
psy     kazakh  dao      birbe
fox     oka     djemââ   bizou
bouzy   ski     dès      bled
ghât    bât     faf      blob
ptyx    axe     grâce    bobsleigh
qu      aïd     jeûne    bock
funky   rêvé    kawa     bodega
body    ska     kif      boer
dé      gym     kot      boghei
zob     skaï    lez      bohême
lâché   kiki    nok      bolge
dévoyé  skunks  nèpe     boskoop
job     zeb     piu      boxon
bluff   aï      puzzle   bozo
pédégé  lynx    sikh     bref
câblé   ah      zicmu    brize
flux    zoo     zip      brol
wax     bye     âge      brook
fût     jazz    île      bufo
psyché  kaki    nikkei   buggy
bug     je      bézef    bunraku
zouk    ajax    taïga    buée
hypo    yak     trèpe    bâté
box     bézef   dîme     bébé
vu      yoyo    naevus   bédé
zéphyr  ka      ptyx     béer
zélé    kayak   râpé     bénef
pédé    jojo    york     bésef
décès   zef     abîme    bévue
boy     jaja    bezef    bézef
mêlé    jeep    shôgun   bôme
haïku   hugh    fax      cajun
yak     junky   geez     canyon
djemââ  ptyx    glèbe    casoar
dégât   buggy   gêné     catgut
goy     funky   ils      cavum
sulky   aya     lao      caïd
bât     âgé     laïus    caïman
haïkaï  ya      lès      cañon
jusqu   excès   moho     ceci
phylum  lys     nez      cheap
péché   oye     ouïe     chez
off     râpé    ouïr     ciao
jèze    foehn   paf      ciré
hadj    psy     puy      city
kazakh  rugby   puîné    clac
pépé    zébu    rez      clef
décédé  bezef   saké     coati
bégum   skip    soft     cobra
huppé   puy     sûr      cohue
baby    yeah    web      coir
humbug  bé      yin      coke

And the 100 least strange, by decreasing score:

e            ter           ion               mention
erre         enter         que               ration
tee          mer           mention           ossement
errer        on            entement          salement
ire          renter        ente              vilement
terre        tenter        cent              râlement
reine        inter         menteur           pâlement
rire         en            lentement         bêlement
are          intenter      vent              cation
terrer       tinter        dent              gisement
renier       entier        lent              nation
enterrer     conter        gent              entement
terrier      contenter     mentalement       mentalement
retirer      rentier       enter             cillement
te           entente       mentir            rationnement
et           ente          entente           virement
rare         enterrer      mental            vitement
raie         entrer        mentez            parement
aire         errer         menton            finement
ente         rentrer       ossement          ornement
ne           rente         rationnement      durement
en           lent          cillement         purement
raire        tente         entablement       bêtement
terne        terrer        mentionner        vêtement
rente        content       mentor            rarement
entre        anti          virement          âprement
enter        ponter        dément            âcrement
retraire     linter        chiquement        sûrement
retenter     lente         sagement          mûrement
reinette     monter        enterrement       vivement
renne        inti          ciment            uniquement
tertre       mener         piment            sagement
ter          sentier       connement         tique
retreinte    venter        gisement          lavement
renter       ante          finement          tellement
entrer       canter        uniquement        chiquement
inerte       retenter      entassement       pavement
entier       intention     salement          bellement
area         menterie      menthe            rudement
enserrer     entement      jument            fadement
rentrer      contrer       ration            logement
serre        rater         moment            jugement
terrine      tiser         identiquement     armement
terrien      contentement  durement          gaiement
retenir      contention    purement          paiement
rentier      tirer         agent             tapement
inertie      tintin        parement          mêmement
tire         lentement     dûment            sapement
rite         tentant       vilement          lapement
relire       mention       vitement          gréement
relier       dentier       sentiment         payement
lierre       miserere      tellement         fixement
serrer       hanter        entichement       cassement
erreur       seriner       menterie          passement
rien         interner      gentiment         bassement
rein         enterrement   cation            tassement
nier         senti         entêtement        mental
trier        conte         pratiquement      mentir
tirer        serrer        bellement         connement
terri        tintement     stationnement     nettement
araire       continent     nation            bonnement
retraite     pinter        contentement      mollement
entente      rentrant      cassement         follement
ternaire     entrant       râlement          oralement
irriter      sente         pâlement          avalement
tertiaire    remonter      bêlement          également
entretien    vanter        passement         nullement
entretenir   entre         entendement       stationnement
ore          attenter      bassement         étalement
oie          intimer       logement          noblement
terrestre    centrer       éventrement       mentez
tette        trente        tassement         roulement
retenue      orienter      actionnement      stablement
tare         montrer       rarement          logiquement
rate         reconter      amendement        utilement
taie         alerter       entièrement       seulement
enterreur    entourer      serment           isolement
rater        der           sentimentalement  feulement
tirette      rencontrer    mollement         amplement
tiare        renier        âprement          diablement
taire        ratier        mens              parlement
verrerie     menton        bonnement         vrillement
tente        patienter     lamentablement    ululement
traire       contenant     ornement          amusement
ratier       cantiner      logiquement       dément
trente       raconter      follement         raclement
tenter       fermenter     entre             versement
teinte       cranter       alitement         drôlement
trentenaire  merisier      tente             ciment
teinter      tin           âcrement          règlement
retentir     serin         rente             frôlement
narrer       teinter       vivement          hurlement
aine         terrier       fadement          piment
arien        errant        tentation         giclement
rainer       lier          comiquement       actionnement
recette      rentamer      mûrement          bâillement
tienne       ventiler      sûrement          menton
interner     riser         lavement          baisement
interne      te            tintement         tintement
resserrer    ralenti       nullement         mentor

Note how for 1-grams the least weird word is "e" consisting of the single most common letter in French; by contrast, for 4-grams, common words are those which contain common 4-letter sequences.

You can download the complete results with the scores, for n = 1, 2, 3, 4. In case you wonder why the script is called ubac.py: this is just because Armavica and I got started by stumbling on the word "ubac" and wondering how to quantify the fact that it looks weird. Indeed, it is around position 1500 out of around 40k words for n < 4, and it is a unique 4-gram.

French version

The (multi)set of n-grams of a word (at the character level) consists of its sequences of n contiguous characters. For instance, the 2-grams of "gramme" are "gr", "ra", "am", "mm", "me". Duplicates are counted: for instance, for "toto", the 2-grams are "to", "ot", and "to". Given the dictionary of all words of a language, we can easily compute the multiset of all n-grams.

In fact, this multiset is quite characteristic of the language. For instance, to identify the language in which a text is written, it is usually enough to compute its n-grams, normalize the result, and compare the resulting distribution to that of known languages: usually, the closest distribution is that of the language in which the text is written.

Here I am interested in a related idea: compute the n-grams for a dictionary of the French language obtained from Lexique, then take each word of the dictionary and compute the average, over the n-grams of the word, of the number of occurrences of each n-gram in the dictionary. The higher this number, the more widespread the n-grams of the word.

Indeed, this approach works rather well to identify words that look weird or that look normal. Let us first prepare a dictionary of French words that are not derived forms, removing those that contain strange characters and those that contain only consonants:

cut -f1,14 lexique |
  sed 1d |
  grep '1$' |
  cut -f 1 |
  uniq |
  grep -v "['. -]" |
  grep -v '^[bcdfghjklmnpqrstvwxzç]*$' > words.txt

Let us now feed this result to the script ubac.py (whose initial version was written by Armavica). Here are the 100 weirdest words, for n ranging from 1 to 4, sorted by ascending score following the definition above, then alphabetically in case of ties:

à       fy      acmé     abbé
çà      où      acné     abcès
ô       éwé     adp      abyme
jà      jà      bey      acmé
y       çà      cañon    acné
fy      dû      dey      afghan
dû      aa      deçà     ailé
by      là      doña     aimé
gy      sûr     dyke     aixois
déjà    mûr     déçu     ajax
khôl    eh      fez      alcôve
âgé     yéyé    fox      alezan
jazzy   fût     foëne    alizé
yéyé    khôl    guzla    allô
gym     oh      hâve     almée
fâché   tôt     ibm      aloès
jazz    jèze    khôl     amibe
bâché   uht     kiwi     amok
éwé     by      kohl     ankh
évêché  gy      lek      apax
buggy   pèze    new      apex
husky   ok      ovni     area
pêché   rhô     oïl      arkose
fêlé    rôt     più      arum
dès     dès     reçu     arôme
bombyx  kohl    rhô      asdic
là      kiwi    señor    aulx
hobby   tweed   tek      awacs
hé      ça      uht      axel
hâlé    oïl     wax      azulejo
bobby   dyke    wigwam   azéri
zébu    kawa    yue      aède
ès      août    yéyé     aéré
haïk    têt     zeb      aïkido
jubé    ako     zob      aïoli
chômé   aïe     âgé      baht
hugh    vu      éwé      baie
junky   wax     zemstvo  banjo
bé      web     muon     baou
bébé    ghât    noël     barzoï
jugé    ès      soja     basé
puff    lès     yeah     beagle
déçu    yakuza  foehn    berk
guppy   oxo     vodka    besef
bédé    zozo    abc      beuh
whisky  zizi    bof      bezef
époxy   mât     bye      bief
puy     jazzy   cèpe     binz
psy     kazakh  dao      birbe
fox     oka     djemââ   bizou
bouzy   ski     dès      bled
ghât    bât     faf      blob
ptyx    axe     grâce    bobsleigh
qu      aïd     jeûne    bock
funky   rêvé    kawa     bodega
body    ska     kif      boer
dé      gym     kot      boghei
zob     skaï    lez      bohême
lâché   kiki    nok      bolge
dévoyé  skunks  nèpe     boskoop
job     zeb     piu      boxon
bluff   aï      puzzle   bozo
pédégé  lynx    sikh     bref
câblé   ah      zicmu    brize
flux    zoo     zip      brol
wax     bye     âge      brook
fût     jazz    île      bufo
psyché  kaki    nikkei   buggy
bug     je      bézef    bunraku
zouk    ajax    taïga    buée
hypo    yak     trèpe    bâté
box     bézef   dîme     bébé
vu      yoyo    naevus   bédé
zéphyr  ka      ptyx     béer
zélé    kayak   râpé     bénef
pédé    jojo    york     bésef
décès   zef     abîme    bévue
boy     jaja    bezef    bézef
mêlé    jeep    shôgun   bôme
haïku   hugh    fax      cajun
yak     junky   geez     canyon
djemââ  ptyx    glèbe    casoar
dégât   buggy   gêné     catgut
goy     funky   ils      cavum
sulky   aya     lao      caïd
bât     âgé     laïus    caïman
haïkaï  ya      lès      cañon
jusqu   excès   moho     ceci
phylum  lys     nez      cheap
péché   oye     ouïe     chez
off     râpé    ouïr     ciao
jèze    foehn   paf      ciré
hadj    psy     puy      city
kazakh  rugby   puîné    clac
pépé    zébu    rez      clef
décédé  bezef   saké     coati
bégum   skip    soft     cobra
huppé   puy     sûr      cohue
baby    yeah    web      coir
humbug  bé      yin      coke

Here are the 100 least strange words, by decreasing score:

e            ter           ion               mention
erre         enter         que               ration
tee          mer           mention           ossement
errer        on            entement          salement
ire          renter        ente              vilement
terre        tenter        cent              râlement
reine        inter         menteur           pâlement
rire         en            lentement         bêlement
are          intenter      vent              cation
terrer       tinter        dent              gisement
renier       entier        lent              nation
enterrer     conter        gent              entement
terrier      contenter     mentalement       mentalement
retirer      rentier       enter             cillement
te           entente       mentir            rationnement
et           ente          entente           virement
rare         enterrer      mental            vitement
raie         entrer        mentez            parement
aire         errer         menton            finement
ente         rentrer       ossement          ornement
ne           rente         rationnement      durement
en           lent          cillement         purement
raire        tente         entablement       bêtement
terne        terrer        mentionner        vêtement
rente        content       mentor            rarement
entre        anti          virement          âprement
enter        ponter        dément            âcrement
retraire     linter        chiquement        sûrement
retenter     lente         sagement          mûrement
reinette     monter        enterrement       vivement
renne        inti          ciment            uniquement
tertre       mener         piment            sagement
ter          sentier       connement         tique
retreinte    venter        gisement          lavement
renter       ante          finement          tellement
entrer       canter        uniquement        chiquement
inerte       retenter      entassement       pavement
entier       intention     salement          bellement
area         menterie      menthe            rudement
enserrer     entement      jument            fadement
rentrer      contrer       ration            logement
serre        rater         moment            jugement
terrine      tiser         identiquement     armement
terrien      contentement  durement          gaiement
retenir      contention    purement          paiement
rentier      tirer         agent             tapement
inertie      tintin        parement          mêmement
tire         lentement     dûment            sapement
rite         tentant       vilement          lapement
relire       mention       vitement          gréement
relier       dentier       sentiment         payement
lierre       miserere      tellement         fixement
serrer       hanter        entichement       cassement
erreur       seriner       menterie          passement
rien         interner      gentiment         bassement
rein         enterrement   cation            tassement
nier         senti         entêtement        mental
trier        conte         pratiquement      mentir
tirer        serrer        bellement         connement
terri        tintement     stationnement     nettement
araire       continent     nation            bonnement
retraite     pinter        contentement      mollement
entente      rentrant      cassement         follement
ternaire     entrant       râlement          oralement
irriter      sente         pâlement          avalement
tertiaire    remonter      bêlement          également
entretien    vanter        passement         nullement
entretenir   entre         entendement       stationnement
ore          attenter      bassement         étalement
oie          intimer       logement          noblement
terrestre    centrer       éventrement       mentez
tette        trente        tassement         roulement
retenue      orienter      actionnement      stablement
tare         montrer       rarement          logiquement
rate         reconter      amendement        utilement
taie         alerter       entièrement       seulement
enterreur    entourer      serment           isolement
rater        der           sentimentalement  feulement
tirette      rencontrer    mollement         amplement
tiare        renier        âprement          diablement
taire        ratier        mens              parlement
verrerie     menton        bonnement         vrillement
tente        patienter     lamentablement    ululement
traire       contenant     ornement          amusement
ratier       cantiner      logiquement       dément
trente       raconter      follement         raclement
tenter       fermenter     entre             versement
teinte       cranter       alitement         drôlement
trentenaire  merisier      tente             ciment
teinter      tin           âcrement          règlement
retentir     serin         rente             frôlement
narrer       teinter       vivement          hurlement
aine         terrier       fadement          piment
arien        errant        tentation         giclement
rainer       lier          comiquement       actionnement
recette      rentamer      mûrement          bâillement
tienne       ventiler      sûrement          menton
interner     riser         lavement          baisement
interne      te            tintement         tintement
resserrer    ralenti       nullement         mentor

Note how the least strange word in terms of 1-grams is "e", which consists of the single letter "e", the most common one in French. For 4-grams, by contrast, the least strange words are those made up of frequent sequences of 4 characters.

You can download the complete results with the scores, for n equal to 1, 2, 3, 4. As an anecdote, the script is called ubac.py for the following reason: Armavica and I had stumbled on the word "ubac" and were wondering how to quantify the fact that it looks weird. Indeed, the word ranks around position 1500 (out of about 40,000 words) for n < 4, and it is a unique 4-gram.

High priority free software projects

The Free Software Foundation just sent out an email asking for feedback about their list of High Priority Free Software Projects. As they said, "we encourage you to publish your thoughts independently (e.g., on your blog) and send a us a link". So I thought I'd answer. I'm an FSF member (just to support them; I'm not actively involved in anything), but of course my answer does not carry any special weight.

The points of this list are not given in any specific order. The first three points cover most of the situations where I use proprietary software, so the list is somewhat geared towards my own needs and those of people around me. It excludes things like video games, because I feel that video games are works of art and not just software: it seldom makes sense to replace them with a free alternative, just as you wouldn't replace a copyrighted movie with an "alternative" to that movie. The last two points are less about software proper and more about general issues.

Developing an alternative to Skype

I use Skype to communicate with remote colleagues (aka coauthors, in researcher parlance). It is the only solution used by people around me for conference calls, except when they can't get Skype to work, in which case they sometimes switch to Google Hangouts, which is not really better as it may require a non-free browser plugin and in any case requires a Google account (so you cannot self-host it).

As Skype is not federated or interoperable either, you must use the Skype client to talk to Skype users, which makes it especially hard to replace Skype with a free alternative. (Back in the days of text messaging, I was able to switch from MSN Messenger to alternative clients because the protocol had been reverse-engineered, but it seems like this has not been done for Skype.)

To replace Skype, I would need a drop-in replacement that just works and that I could try to advertise. I tried to find such a solution some time ago and tested various SIP clients in this hope, but couldn't find anything suitable. Skype is not perfect either and sometimes does not work, but it works much more often than the alternatives.

Of course, as SIP is federated, SIP clients need you to open a SIP account before you can use them (some of them facilitate this process); this is an additional complication compared to Skype, but it is unavoidable, and you have to register for Skype as well. But then, the interface of these clients is ugly and ridiculously hard to understand even for experienced computer users. Worse, even when both parties are using the same client, it usually doesn't work: the call doesn't go through, there is no sound or distorted sound, the connection fails, etc. Worse still, there are often no helpful error messages, logs, or other debug information to troubleshoot the problem.

There are a lot of hard problems to address to make a solution that works. If both users are behind NAT, you need NAT traversal. Sadly, with existing solutions, it doesn't work, even if you are root on a public server that you could use to relay the traffic. Also, if the connection is flaky, you want to avoid gaps in the audio. I suspect that Skype is fairly clever behind the scenes, as I sometimes hear it buffer speech during network hiccups and then send it and make up for the lag by speeding up speech and skipping blanks, so you don't miss anything. All of this is not so trivial.

Relying on Skype means relying on proprietary software, but it also means relying on a non-federated and non-interoperable protocol (and thus contributing to the network effect), and it also poses a security issue because there is no secrecy for the contents of Skype chats (as opposed to, say, SIP with ZRTP). In fact, the CNRS, the largest French governmental research organisation, which is related to my employer, recommends against the use of Skype. Yet such recommendations are ignored by all researchers I know, even when they are personally committed to the principles of free software: they need to work with other researchers who use Skype, and have no usable alternative to advertise.

What's needed: Developing a Skype alternative that works, both with a usable interface and the required backend features. Maybe WebRTC is a promising direction.

Developing and promoting an alternative to Dropbox

Some of my coworkers (not the majority) are starting to use Dropbox to collaborate on documents, instead of version control systems (VCSes), because it is easier. So I use Dropbox when I need to collaborate with other people who do, and I have no drop-in self-hosted replacement to offer.

Using Dropbox in this context is essentially laziness, as the complexity of setting up a VCS usually pays off in terms of added power: computing and reviewing diffs, using blame, having commit messages, branching, proper conflict handling, etc. Yet, in some contexts, the hassle of updating, committing, and pushing manually is not worth it, and it is better to automatically sync changes because there will often be no conflicts and little need to review old versions.

I would also need such a system to synchronize data between my machines. Manual rsync (or cronjobs) often suffice, sometimes VCSes are the right tool, yet sometimes automatic sync is all you want; especially on phones where manual solutions are impractical. So far I have resisted the temptation of using Dropbox for this, but I would be interested in a tool that would have the convenience of Dropbox.

This is not so obvious to engineer. It should rely on inotify or offer a FUSE FS to avoid polling for changes and sync with no noticeable lag (which excludes Unison). It should offer file access when you are offline (which excludes sshfs-based solutions). It should use bandwidth sparingly and avoid retransferring existing content (even doing this across files with different names, which is fine in theory but not done, e.g., by rsync). It should also run on Android so I can sync data with my phone. It should ideally rely on a VCS to resolve conflicts automatically when possible, and offer sensible tools for manual resolution when needed (although the situation should occur rarely: only, e.g., when changing the same file on two different machines with one of them offline). I want a tool that solves only this problem and solves it well (which excludes ownCloud). Also, it should have good documentation and tutorials, which excludes git-annex (I still haven't understood whether it does what I want or not).
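As an illustration of the requirement about not retransferring existing content, even across files with different names, here is a minimal sketch of content-addressed deduplication. It is only one possible approach under simple assumptions (whole-file hashing, everything in memory); the function names and the `remote_digests` set are hypothetical and do not describe any existing tool.

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Identify content by its SHA-256 digest, independently of the file name."""
    return hashlib.sha256(data).hexdigest()

def plan_transfer(local_files, remote_digests):
    """Decide what actually needs to be sent to the remote side.

    local_files: dict mapping path -> file content (bytes)
    remote_digests: set of digests the remote already stores (assumed known)
    Returns (uploads, references):
      uploads maps digest -> content that must be transferred once,
      references maps path -> digest so the remote can rebuild every file.
    """
    uploads = {}
    references = {}
    for path, data in local_files.items():
        digest = file_digest(data)
        references[path] = digest
        # Send content only if neither the remote nor this batch has it yet.
        if digest not in remote_digests and digest not in uploads:
            uploads[digest] = data
    return uploads, references

# Usage: two files with the same content, one file the remote already has.
local = {"a.txt": b"hello", "copy-of-a.txt": b"hello", "b.txt": b"world"}
remote = {file_digest(b"world")}
uploads, references = plan_transfer(local, remote)
```

In this example only the content of "a.txt" is uploaded, once; the copy and "b.txt" are sent as references. A real tool would probably hash chunks instead of whole files, so that a small edit to a large file does not force a full retransfer.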

What's needed: Developing a Dropbox alternative that is easy to use, and documenting how to use it.

Making mobile phones usable with free software

Most Android applications are proprietary and distributed through the proprietary Play Store. The only libre alternative that I know of is F-Droid, and I use it exclusively, but it has about 1,300 apps listed as of this writing, whereas Google Play has a thousand times more. What is more, many proprietary apps are hard to replace, because they are designed to access the backend of specific services (train companies, banks, postal services, even public institutions, etc.): there are many such services, and they often don't have a public API.

Even forgetting about the apps, it is essentially impossible to avoid proprietary software on mobile phones. At the operating system (OS) level, Android used to be touted as open. Now the visible face of Android is riddled with proprietary Google applications. The pure open-source side of Android, AOSP, is barely keeping up. I use CyanogenMod to have Android without the proprietary Google stuff, but it has many limitations. For instance, the TextSecure SMS encryption service cannot be used without the proprietary Google Mobile Services, even though it is itself open-source (and bundled with CyanogenMod).

In any case, CyanogenMod does not solve the problem of proprietary drivers and firmware. The Replicant project does, but it currently supports only 12 devices, and only partially (for none of them are both telephony and GPS supported, for instance). It's a start, but it is hard to use for now. Then there is the issue of the radio firmware: to my knowledge the only libre implementation is OsmocomBB, with very narrow device support and no 3G. This is especially worrying: the radio firmware has unrestricted access to the entire phone and is constantly connected to an untrusted network. Replicant developers have found suspicious things there, and more may be lurking.

What's needed: Supporting the development of libre radio firmware alternatives, and of Replicant (or maybe the solution lies with other phone OSes entirely, such as Firefox OS). Supporting the development of F-Droid. Campaigning so that people understand that not everyone is using Android or iOS, and that it is important to provide an alternative to non-portable apps (e.g., a good website).

Fighting back OS bundling and warranty exclusions

In France, as in many countries, most PCs are bundled with a Microsoft Windows license, even though it is illegal to bundle the sale of a service (the license) with a good (the hardware). Thus, when buying my Lenovo Yoga 13, I could not avoid buying a Windows license with it, and could not get it refunded from the manufacturer (Lenovo). (I had looked for comparable OS-free machines that I could buy instead, but the choice was too limited.) Of course, bundling does not apply only to Windows, but also to Apple computers. You can buy Mac OS separately, so it is not a free service, and it is bundled with the machine (even though it is leased to you by the same company that is selling you the hardware and does not expect you to run a different OS).

Going to the courts is not a realistic option: a Linux user sued Lenovo France in 2007 and the case is still open. The latest episode is a February 2014 ruling by the Cour de Cassation (France's court of last resort) which cancelled a previous verdict and requested that the affair be judged a third time.

Advocates of bundling contend that customers cannot install operating systems themselves (and of course everyone uses Windows), so bundling is justified. The obvious counter-argument is that you could sell the hardware and sell separately an activation code for the preinstalled OS. In fact, now that Windows licenses need to be activated with Microsoft through the Internet, this solution would be essentially trivial to implement. Yet nothing has changed.

A related problem is the question of warranty. PC manufacturers tried for some time to avoid offering hardware warranty to users who had installed a different OS. Now mobile phone manufacturers will argue that jailbreaking, or installing a different Android, will void your warranty. I believe there should be a clear separation between the hardware and the software, and warranty should be offered on the hardware no matter which software is run. (For cases where buggy software could damage the hardware: if the hardware cannot withstand what the software asks it to do, it is the hardware's job to avoid getting damaged.)

Last, installing alternative OSes on computers and phones is blocked by technical measures: UEFI for computers, various locking mechanisms for phones...

What's needed: Legal support to consumers seeking reimbursement of bundled software, especially through collective means such as class action whenever possible. (There is a French support group but I'm not sure how active it is. At any rate, I could not get definite information about whether Lenovo France offers refunds for Windows 8 licenses (they don't).) Lobby for investigations into this practice using competition laws, like the Windows Media Player case in the EU. Help consumers who were denied warranty claims for a hardware problem because of the software that they run. Lobby against technical restrictions and document ways to circumvent them.

Proprietary software and centralized computing

This is not at all a software project, just a closing comment about a battle that needs to be fought, which is distinct from the question of free software, but at least as important, in my opinion.

With the advent of consumer broadband connections, it became clear that people would be tempted to use remote servers to perform their computing and store their data, even at the expense of their freedom. This switch back from personal computers to dumb terminals is well underway, and does not seem to be stopping. In this light, the original concerns about proprietary software seem quite tame in comparison with cloud computing. Yes, you cannot know what a proprietary program does, it could have a backdoor and give evil people access to your data, it may save your files in a proprietary format that nothing else can read, so you can't switch and your colleagues have to use the same program. However, in the cloud, you don't even know which software is running or when it changes, the data is already in the hands of others, and you may not even be able to retrieve it at all, so you are stuck and anyone working with you is stuck as well.

Ironically, with newer devices such as phones and tablets and Chromebooks, more people are using free software, without realizing it. Of course, there are proprietary vendor-specific programs and drivers, but the core is often free. Sometimes there are technical restrictions against tinkering with the device, but sometimes there aren't. Yet people do not know or care, because what matters to them is not the device, but the remote services.

To users, this is normal: they are entrusting their computing to specialists, like they do with their electricity, their food, their housing, and so many aspects of their life. Still this is not right: people are still expected to have their own opinions, their own vote, their privacy, and people's electronic trails are more and more entangled with their private thoughts and actions... Yes, computing is less vital, but it is a finer-grained power mechanism, so it is really dangerous to give it up.

Proprietary software and centralized computing are two different threats at two different levels, but I find the second one perhaps more worrying than the first. For the first one, free software has scored some big victories. For the second one, it's not even clear which alternative model we can propose...