a3nm's blog

Notifications

There are basically two ways to receive information you care about. Either periodically poll the information sources (check your inbox, RSS feeds, refresh pages, etc.), or get notified whenever a new item arrives.

Intuitively, notifications seem better. They ensure that no time is wasted in useless polling, and that you receive information in a timely manner. They enable you to get those information sources out of your head and get things done, and only get notified when action is required on your part.

Of course, this means that you get distracted from what you were doing if a notification arrives, which is dangerous because of the time it might take to get back to the right mind configuration once you're done dealing with a piece of information. Much has been said about this problem, and about the fact that you should avoid notifications if you want to be productive.

In my opinion, avoiding notifications entirely is a bad solution, because checking information sources compulsively (and more often than needed) is just too tempting a way to procrastinate. I think a better solution is to have notifications which you can mute (i.e. enqueue incoming notifications) and unmute (i.e. dequeue the backlog), so that you can either stay connected when you're not doing anything that requires high concentration, or get in a state of flow and be sure that you won't have missed anything when you turn the notifications back on. Obviously, the right way wouldn't be to have "on" and "off" states, but to have an importance threshold (with different sources having different priorities, along with more elaborate filtering).
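
To make the mute/unmute idea a bit more concrete, here is a minimal sketch in Python of a notifier with an importance threshold. Everything in it (the `Notifier` class, the integer priorities, the method names) is made up for illustration; it is not an existing tool.

```python
from collections import deque


class Notifier:
    """Deliver notifications at or above a threshold, hold back the rest."""

    def __init__(self, threshold=0):
        self.threshold = threshold
        self.backlog = deque()   # held (priority, message) pairs, oldest first
        self.delivered = []      # stands in for actually showing a popup

    def notify(self, message, priority):
        if priority >= self.threshold:
            self.delivered.append(message)
        else:
            self.backlog.append((priority, message))

    def set_threshold(self, threshold):
        """Change the threshold and flush whatever the new one lets through."""
        self.threshold = threshold
        still_held = deque()
        for priority, message in self.backlog:
            if priority >= threshold:
                self.delivered.append(message)
            else:
                still_held.append((priority, message))
        self.backlog = still_held


inbox = Notifier(threshold=5)      # concentrating: only urgent things get through
inbox.notify("newsletter", 1)      # held in the backlog
inbox.notify("server down", 9)     # delivered immediately
inbox.set_threshold(0)             # done concentrating: the backlog is flushed
print(inbox.delivered)             # ['server down', 'newsletter']
```

"Mute" and "unmute" are then just the two extreme threshold values, and per-source priorities would go in whatever computes the `priority` argument.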

Quick plot of the mtgox bitcoin exchange rate

I thought I would share the following snippet to display a graph of the current BTC-USD exchange rate on mtgox (with a resolution of one second and a range of one minute; adapting to other values is trivial).

# Fetch the last price once per second and stream it to a live plot.
while true; do
  curl -s https://mtgox.com/code/data/ticker.php | jshon -e ticker -e last
  sleep 1
done | feedGnuplot --lines --stream --xlen 60 --xlabel "s" --ylabel "USD/BTC"

The aim isn't to sing the praises of Bitcoin (or of mtgox), but to advertise two useful tools which weren't so easy to find: jshon (take a JSON stream on stdin and extract information to stdout), and feedGnuplot (take data points on stdin and produce a plot using gnuplot).
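
For readers who have never used jshon, here is a rough Python equivalent of what the `-e ticker -e last` chain does: descend into the `ticker` object, then into its `last` field. The sample payload is invented for the example (only the `ticker` and `last` keys come from the command above), and `extract` is a hypothetical helper name.

```python
import json


def extract(document, *keys):
    """Follow a chain of object keys, like successive `jshon -e KEY` steps."""
    value = json.loads(document)
    for key in keys:
        value = value[key]
    return value


# Made-up payload: the real ticker had more fields and other values.
sample = '{"ticker": {"last": 5.2, "high": 5.5}}'
print(extract(sample, "ticker", "last"))  # 5.2
```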

Wikimedia projects by dump size, and average compressed article size

I was looking at the list of Wikipedias, and I thought that after all, the number of articles is a very bad measure of the size of a Wikipedia, because articles can have various sizes. More specifically, some Wikipedias have bots which create stubs about all sorts of villages and other geographic locations (to take the random example of the Volapük Wikipedia), and this makes them seem bigger than they really are.

It seems fairer to compare the dump size of the wiki as a measure of the total quantity of text, or rather the compressed dump size to eliminate redundancy. Surprisingly, I didn't manage to find this information in a usable chart, so here we go... Note that I'm speaking about the dumps from the Wikimedia dump service, and more specifically the pages-articles.xml.bz2 dump. You might want to look at precisely what this measures (it does not include images or page history, for instance).

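# List every wiki on the dump index, fetch the size of its latest
# pages-articles.xml.bz2 dump, convert it to bytes, and sort by decreasing size.
curl -s 'http://dumps.wikimedia.org/backup-index.html' |
  grep '<li>' |
  grep '<a href' |
  cut -d '"' -f 2 |
  cut -d '/' -f 1 |
  sort -u |
  while read -r p; do
    echo -n "$p "
    curl -s "http://dumps.wikimedia.org/$p/latest/" |
      grep 'pages-articles.xml.bz2"' |
      head -1 |
      cut -d '>' -f 9 |
      cut -d '<' -f 1
    sleep 1  # be gentle with the server
  done |
  awk '
    # awk reads "6.8G" as the number 6.8; scale it by the unit suffix
    /K$/ {printf "%s %d\n", $1, $2*1024}
    /M$/ {printf "%s %d\n", $1, $2*1024*1024}
    /G$/ {printf "%s %d\n", $1, $2*1024*1024*1024}' |
  sort -k2,2nr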
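
The awk step at the end is the only non-obvious part: the directory listing gives rounded sizes such as "6.8G", and the script multiplies the numeric part by the unit and truncates. Here is the same conversion as a small Python sketch. The function name `parse_size` is mine, and the fallback for sizes without a suffix is an assumption (the awk script above silently drops such lines).

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a listing size such as '6.8G' to a whole number of bytes."""
    suffix = text[-1]
    if suffix in UNITS:
        # int() truncates, like awk's %d
        return int(float(text[:-1]) * UNITS[suffix])
    return int(text)  # assumption: no suffix means plain bytes


print(parse_size("6.8G"))  # 7301444403
```

This also explains why many wikis have exactly the same size in the list: the listing only shows two significant digits or so, so the byte counts are approximations.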

Yes, I know, this is really ugly and will break the day Wikimedia changes the layout of the dump service pages in the slightest way. It worked when I wrote it, but it no longer does. Here is the result it gave at the time:

enwiki 7301444403
dewiki 2040109465
frwiki 1503238553
commonswiki 1288490188
jawiki 1288490188
itwiki 1181116006
eswiki 1073741824
enwikisource 1041026252
ruwiki 966367641
plwiki 823866163
ptwiki 697093324
nlwiki 647495680
frwikisource 604294348
zhwiki 538548633
svwiki 371195904
cawiki 361863577
huwiki 357878988
zhwikisource 335439462
fiwiki 321493401
ukwiki 311531929
cswiki 297166438
dewikisource 295908147
nowiki 295698432
hewiki 250295091
trwiki 209924915
kowiki 208456908
enwiktionary 203318886
ruwikisource 201431449
viwiki 181403648
arwiki 178782208
rowiki 167981875
dawiki 155923251
idwiki 153826099
srwiki 149631795
bgwiki 147115212
eswikisource 126877696
fawiki 124256256
hrwiki 124256256
arwikisource 122788249
eowiki 118174515
frwiktionary 116706508
skwiki 114294784
thwiki 111463628
ltwiki 103704166
elwiki 102655590
slwiki 101711872
enwikibooks 88814387
itwikisource 88394956
mswiki 82522931
glwiki 82313216
hewikisource 80530636
etwiki 76860620
euwiki 75916902
simplewiki 69310873
shwiki 67423436
hiwiki 66794291
mkwiki 64067993
nnwiki 61236838
plwikisource 60293120
en_labswikimedia 58195968
metawiki 55469670
enwikiquote 54525952
kawiki 50226790
zhwiktionary 50226790
lvwiki 49073356
ptwikisource 47500492
tawiki 47290777
tewiki 45193625
slwikisource 44459622
ruwiktionary 43096473
azwiki 41418752
lawiki 40684748
dewiktionary 40160460
flaggedrevs_labswikimedia 39950745
readerfeedback_labswikimedia 39950745
specieswiki 38692454
bswiki 37224448
knwiki 37224448
dewikibooks 36595302
be_x_oldwiki 34603008
mlwiki 32820428
sqwiki 32505856
brwiki 31771852
plwiktionary 31142707
enwikinews 30513561
bnwiki 30303846
enwikiversity 30094131
iswiki 29150412
kowikisource 28206694
sourceswiki 28206694
cswikisource 27787264
svwikisource 27472691
bewiki 26738688
bgwiktionary 26528972
afwiki 26109542
cywiki 26004684
lbwiki 25794969
huwikisource 25165824
tlwiki 24117248
anwiki 23488102
nowikisource 23278387
ptwiktionary 22858956
lawikisource 22334668
ocwiki 22229811
mrwiki 22124953
srwikinews 21390950
mgwiktionary 20866662
viwiktionary 20552089
fiwiktionary 18874368
elwikisource 18664652
itwikiquote 18454937
vowiki 18350080
rowikisource 18245222
fywiki 18035507
huwiktionary 18035507
thwikisource 17930649
alswiki 17511219
ltwiktionary 17511219
swwiki 17196646
hrwikisource 17091788
jvwiki 17091788
zh_yuewiki 16882073
ndswiki 16672358
siwikibooks 16252928
frwikibooks 16148070
guwiki 16148070
incubatorwiki 16148070
eswiktionary 16043212
urwiki 15938355
nostalgiawiki 15728640
srwikisource 15623782
hywiki 15414067
siwiki 15309209
nlwikisource 14994636
itwikibooks 14889779
trwiktionary 14365491
fawikisource 13946060
astwiki 13841203
elwiktionary 13736345
kowiktionary 13736345
warwiki 13421772
nlwiktionary 13316915
fiwikisource 13212057
gawiki 13107200
mediawikiwiki 13002342
kkwiki 12687769
iowiktionary 12582912
jawiktionary 12268339
eswikibooks 12058624
plwikiquote 12058624
trwikisource 11953766
itwiktionary 11744051
dewikinews 11639193
lmowiki 11324620
tawiktionary 10800332
huwikibooks 10695475
frwikinews 10590617
scnwiki 10590617
jawikisource 10380902
barwiki 10276044
svwiktionary 10276044
iowiki 10171187
mywiki 10066329
iawiktionary 9961472
ukwikisource 9961472
suwiki 9856614
mnwiki 9751756
viwikisource 9542041
cvwiki 9437184
dewikiversity 9437184
quwiki 9437184
ptwikibooks 9332326
newwiki 9017753
yowiki 9017753
arzwiki 8912896
kuwiki 8912896
pmswiki 8912896
plwikinews 8808038
nlwikibooks 8598323
ttwiki 8493465
yiwiki 8388608
tawikisource 8283750
vecwiki 8178892
liwiki 8074035
newiki 7969177
cebwiki 7759462
itwikinews 7759462
plwikibooks 7654604
frwikiversity 7549747
kuwiktionary 7549747
bnwikisource 7444889
dawikisource 7444889
etwiktionary 7235174
mywiktionary 7235174
pswiki 7235174
tewikisource 7235174
eswikiquote 7130316
pamwiki 7025459
scowiki 6920601
uzwiki 6920601
eswikinews 6815744
ruwikiquote 6815744
de_labswikimedia 6710886
zh_classicalwiki 6710886
wawiki 6606028
ruwikibooks 6501171
idwikisource 6396313
jawikibooks 6186598
oswiki 6081740
dewikiquote 5976883
napwiki 5872025
ptwikinews 5872025
zh_min_nanwiki 5872025
hywikisource 5767168
sahwiki 5767168
htwiki 5662310
mtwiki 5662310
sqwikibooks 5662310
cawikisource 5557452
iawiki 5557452
nowiktionary 5557452
bpywiki 5452595
gvwiki 5452595
mwlwiki 5452595
strategywiki 5452595
hsbwiki 5347737
nds_nlwiki 5347737
rowiktionary 5347737
hewikibooks 5138022
ptwikiquote 5138022
vlswiki 5138022
gdwiki 5033164
tgwiki 4928307
foundationwiki 4823449
knwiktionary 4823449
pnbwiki 4823449
kmwiki 4718592
mlwiktionary 4613734
frwikiquote 4508876
jawikinews 4508876
tewiktionary 4404019
betawikiversity 4299161
cswiktionary 4299161
roa_tarawiki 4299161
ukwiktionary 4299161
bat_smgwiki 4194304
fowiki 4194304
itwikiversity 4089446
mlwikisource 4089446
nahwiki 3984588
bswikisource 3879731
amwiki 3774873
azwikisource 3774873
arwiktionary 3670016
ckbwiki 3670016
zhwikinews 3670016
angwiki 3565158
dvwiki 3565158
ganwiki 3565158
mgwiki 3565158
cawiktionary 3460300
ruwikiversity 3460300
eowiktionary 3355443
hifwiki 3355443
hewiktionary 3250585
kshwiki 3250585
eswikiversity 3145728
fiwikibooks 3145728
szlwiki 3145728
wuuwiki 3145728
cewiki 3040870
glwiktionary 3040870
liwiktionary 3040870
lowiktionary 3040870
ruwikinews 2936012
hewikiquote 2831155
rmwiki 2831155
sawikisource 2831155
simplewiktionary 2831155
ugwiki 2831155
vecwikisource 2831155
cowiki 2726297
arwikibooks 2621440
bowiki 2621440
zhwikibooks 2516582
bclwiki 2411724
cswikiversity 2411724
furwiki 2411724
huwikiquote 2411724
idwiktionary 2411724
iswikisource 2411724
kywiki 2411724
map_bmswiki 2411724
stqwiki 2411724
diqwiki 2306867
fiu_vrowiki 2306867
hrwiktionary 2306867
krcwiki 2306867
mkwikisource 2306867
ptwikiversity 2306867
afwiktionary 2202009
bawiki 2202009
extwiki 2202009
hiwiktionary 2202009
iswiktionary 2202009
lawiktionary 2202009
ocwiktionary 2202009
roa_rupwiki 2202009
sawikibooks 2202009
scwiki 2202009
tkwiki 2202009
fawikiquote 2097152
ilowiki 2097152
ladwiki 2097152
mhrwiki 2097152
nrmwiki 2097152
viwikibooks 2097152
bgwikisource 1992294
brwikisource 1992294
cawikibooks 1992294
dsbwiki 1992294
fawiktionary 1992294
pawiki 1992294
sewiki 1992294
sowiki 1992294
thwiktionary 1992294
wawiktionary 1992294
yiwikisource 1992294
zhwikiquote 1992294
cswikibooks 1887436
fawikibooks 1887436
mkwikibooks 1887436
mrjwiki 1887436
orwiki 1887436
bgwikiquote 1782579
emlwiki 1782579
eowikisource 1782579
lijwiki 1782579
mznwiki 1782579
scnwiktionary 1782579
skwikiquote 1782579
trwikiquote 1782579
wowiki 1782579
crhwiki 1677721
euwiktionary 1677721
hsbwiktionary 1677721
kwwiki 1677721
novwiki 1677721
plwikimedia 1677721
svwikibooks 1677721
bjnwiki 1572864
frpwiki 1572864
gnwiki 1572864
kowikibooks 1572864
kvwiki 1572864
miwiki 1572864
nlwikiquote 1572864
nvwiki 1572864
ruewiki 1572864
udmwiki 1572864
brwiktionary 1468006
csbwiki 1468006
cswikiquote 1468006
cywikisource 1468006
lnwiki 1468006
svwikinews 1468006
testwiki 1468006
xalwiki 1468006
bswikiquote 1363148
chrwiki 1363148
jbowiki 1363148
kmwiktionary 1363148
sawiki 1363148
tpiwiki 1363148
trwikibooks 1363148
cbk_zamwiki 1258291
eowikibooks 1258291
fiwikinews 1258291
frrwiki 1258291
idwikibooks 1258291
koiwiki 1258291
mrwikibooks 1258291
nowikibooks 1258291
pcdwiki 1258291
rwwiki 1258291
simplewikiquote 1258291
skwikibooks 1258291
sqwiktionary 1258291
elwikiquote 1153433
fawikinews 1153433
glkwiki 1153433
glwikisource 1153433
hakwiki 1153433
iswikibooks 1153433
mswiktionary 1153433
papwiki 1153433
ukwikimedia 1153433
arwikinews 1048576
azwiktionary 1048576
bugwiki 1048576
dawiktionary 1048576
fywiktionary 1048576
gagwiki 1048576
huwikinews 1048576
iawikibooks 1048576
iewiki 1048576
kbdwiki 1048576
klwiki 1048576
lowiki 1048576
nlwikimedia 1048576
rowikibooks 1048576
rowikinews 1048576
sdwiki 1048576
tawikinews 1048576
ukwikibooks 1016729
kawikibooks 1005363
cswikinews 996352
cvwikibooks 996352
dawikibooks 975667
elwikiversity 962048
swwiktionary 955596
zeawiki 955494
aywiki 954470
bgwikinews 949760
myvwiki 949452
arcwiki 946688
astwiktionary 943718
etwikisource 943718
fiwikiquote 943718
pdcwiki 943718
srwiktionary 943718
slwiktionary 937267
fiwikiversity 933888
vowiktionary 933785
hewikinews 931840
kabwiki 922931
ltwikiquote 922419
kawiktionary 915660
azwikibooks 894873
nlwikinews 894259
cywiktionary 889753
acewiki 889651
hrwikibooks 884224
thwikibooks 878080
kaawiki 869376
glwikibooks 841318
wikimania2007wiki 840192
outreachwiki 832512
jawikiquote 818995
test2wiki 818176
elwikibooks 813772
pflwiki 811315
wikimania2009wiki 805171
wikimania2011wiki 803123
ltwikisource 801382
svwikiquote 788787
jawikiversity 780492
igwiki 776192
ttwikibooks 775680
bhwiki 771379
ukwikinews 757248
cawikinews 754176
abwiki 744140
aswiki 736153
mrwiktionary 732569
skwikisource 725299
mdfwiki 721510
cuwiki 720486
nowikinews 718233
zh_min_nanwiktionary 714547
ltwikibooks 713318
hywiktionary 707174
bgwikibooks 700211
pagwiki 698675
sahwikisource 688640
arwikiquote 687001
slwikiquote 685158
trwikinews 683008
tetwiki 678809
srnwiki 663142
cawikiquote 660582
wikimania2008wiki 646041
towiki 644710
liwikisource 638873
ukwikiquote 621772
nowikiquote 617984
hawwiki 608870
srwikibooks 597299
ltgwiki 594841
rmywiki 590745
pntwiki 583270
lvwiktionary 565248
kowikiquote 565043
wikimania2010wiki 564019
gawiktionary 554496
lbewiki 544870
sawiktionary 516096
elwikinews 514150
sswiki 511180
nawiki 504012
cdowiki 499097
srwikiquote 498380
bmwiki 496025
brwikimedia 492748
tenwiki 489267
kkwiktionary 480665
eowikiquote 475750
avwiki 471654
slwikibooks 471449
eewiki 467046
omwiki 464691
hrwikiquote 464179
hywikiquote 459980
wikimania2006wiki 450867
piwiki 449331
kawikiquote 447488
dawikiquote 441241
kywikibooks 439603
hiwikibooks 439091
rowikiquote 430387
hawiki 429158
svwikiversity 425881
zh_min_nanwikisource 424448
tswiki 421580
viwikiquote 419840
angwiktionary 414310
tawikibooks 413081
idwikiquote 412262
tlwiktionary 411955
smwiki 405401
tywiki 404172
pihwiki 386457
sdwiktionary 382771
bxrwiki 379596
tlwikibooks 374476
mowiki 373555
sqwikinews 367718
simplewikibooks 363520
kgwiki 363315
cowiktionary 352665
iuwiki 350105
yiwiktionary 344985
ttwiktionary 342323
gotwiki 340889
ikwiki 340480
nowikimedia 338329
tkwiktionary 337408
mswikibooks 335667
glwikiquote 332288
newiktionary 329523
sewikimedia 328601
urwiktionary 326860
biwiki 324812
bewiktionary 322048
siwiktionary 320102
usabilitywiki 319078
amwiktionary 318259
skwiktionary 315596
angwikibooks 304025
thwikiquote 294297
mlwikibooks 293273
hiwikiquote 287744
iewiktionary 278630
bnwikibooks 277401
nnwikiquote 276070
etwikiquote 275046
xhwiki 269209
mlwikiquote 266035
bswikinews 264908
lbwiktionary 264806
kswiki 262041
kowikinews 261427
anwiktionary 261324
sdwikinews 261017
wowiktionary 258867
bswiktionary 257843
sgwiki 257024
zawiki 257024
knwikisource 246784
urwikibooks 235929
wikimania2005wiki 234803
kywiktionary 233881
zuwiki 232652
gnwiktionary 227328
etwikibooks 224665
ffwiki 221696
bnwiktionary 214220
nnwiktionary 213606
crwiki 211968
azwikiquote 208793
dzwiki 207974
csbwiktionary 206131
nywiki 205414
thwikinews 205312
cywikiquote 204492
tumwiki 202956
ndswiktionary 198041
eowikinews 195686
stwiki 194867
mkwiktionary 192204
tiwiki 189644
kmwikibooks 188620
iswikiquote 186368
rnwiki 185958
newikibooks 184320
tnwiki 182784
ptwikimedia 182169
vewiki 180224
kuwikiquote 177971
shwiktionary 175206
euwikiquote 174796
lgwiki 169062
chwiki 166297
vowikibooks 160768
fjwiki 154419
klwiktionary 146636
jvwiktionary 146227
uawikimedia 142131
guwiktionary 140800
bewikiquote 132300
snwiki 130457
nahwiktionary 130150
htwikisource 129331
pswikibooks 127590
pswiktionary 125440
ocwikibooks 124108
akwiki 120934
mnwiktionary 120115
mrwikiquote 120012
sqwikiquote 119910
tewikiquote 119296
gdwiktionary 117657
fowiktionary 117555
ugwiktionary 116121
rwwiktionary 111513
twwiki 109260
bswikibooks 107929
tgwiktionary 107008
quwiktionary 104345
hywikibooks 103731
kiwiki 101580
liwikiquote 101273
lawikibooks 100454
brwikiquote 99328
euwikibooks 99225
kuwikibooks 99123
kwwiktionary 97382
afwikiquote 95129
gvwiktionary 93900
tawikiquote 90009
lawikiquote 87449
stwiktionary 85196
mtwiktionary 84889
ruwikimedia 83763
rswikimedia 81510
miwiktionary 80384
tpiwiktionary 79462
ngwiki 79155
chrwiktionary 78233
zuwiktionary 75059
fywikibooks 73932
omwiktionary 73932
bewikibooks 73830
tewikibooks 73113
swwikibooks 70963
iuwiktionary 69939
lvwikibooks 69529
tkwikibooks 67276
sowiktionary 66252
fowikisource 64716
chywiki 63897
tiwiktionary 63692
tgwikibooks 62464
suwiktionary 61235
alswiktionary 60928
cowikimedia 60723
iewikibooks 60723
lnwiktionary 59596
pawiktionary 59596
fiwikimedia 58880
sswiktionary 58265
tswiktionary 58265
cowikibooks 57548
uzwiktionary 57344
nawiktionary 56422
pawikibooks 55398
roa_rupwiktionary 54886
ikwiktionary 54169
mkwikimedia 53964
smwiktionary 53350
hawiktionary 52633
afwikibooks 51916
pa_uswikimedia 51609
chwikimedia 51097
knwikiquote 50688
sgwiktionary 49971
tnwiktionary 46796
guwikiquote 46592
dvwiktionary 46284
kswiktionary 46284
liwikibooks 45568
mowiktionary 45363
fjwiktionary 45056
mgwikibooks 43212
aswiktionary 40448
angwikiquote 38912
trwikimedia 38809
bhwiktionary 38297
abwiktionary 37478
dzwiktionary 37273
orwiktionary 37068
snwiktionary 37068
yowiktionary 36864
biwiktionary 36761
twwiktionary 36761
towiktionary 36249
xhwiktionary 35840
lbwikiquote 35737
bowiktionary 34713
cywikibooks 33894
urwikiquote 33484
aywiktionary 31641
nzwikimedia 28057
uzwikiquote 27545
aawiki 27340
astwikiquote 26931
amwikiquote 26828
suwikibooks 25600
zawiktionary 24985
kywikiquote 23552
advisorywiki 22835
cowikiquote 22630
aawiktionary 21504
astwikibooks 20275
wikimania2012wiki 20275
zh_min_nanwikibooks 19148
iiwiki 18329
zh_min_nanwikiquote 18124
akwikibooks 17920
qualitywiki 17817
jbowiktionary 17408
mhwiki 16384
etwikimedia 14540
chowiki 14028
dkwikimedia 11776
kkwikiquote 11673
kkwikibooks 9420
mnwikibooks 9318
scwiktionary 9113
wowikiquote 8601
knwikibooks 8089
angwikisource 7884
bmwikibooks 7680
ndswikibooks 7680
quwikibooks 7680
nawikibooks 7475
zawikibooks 7270
howiki 7065
xhwikibooks 6860
suwikiquote 6656
ugwikibooks 6553
miwikibooks 6451
kswikibooks 6348
zuwikibooks 6246
guwikibooks 6041
kswikiquote 5836
tkwikiquote 5529
uzwikibooks 5529
chwikibooks 5222
kjwiki 5222
lbwikibooks 5222
lnwikibooks 5222
ugwikiquote 5222
gnwikibooks 5120
gawikibooks 5017
gawikiquote 4915
gotwikibooks 4812
bmwiktionary 4710
hzwiki 4608
nahwikibooks 4505
sewikibooks 4505
bowikibooks 4403
kwwikiquote 4300
rmwikibooks 4300
mhwiktionary 4096
ttwikiquote 4096
alswikibooks 3993
biwikibooks 3993
rmwiktionary 3788
yowikibooks 3788
vowikiquote 3686
avwiktionary 3584
bmwikiquote 3584
aawikibooks 3379
aswikibooks 3276
piwiktionary 3276
rnwiktionary 3276
bawikibooks 3174
crwiktionary 3174
wawikibooks 3072
crwikiquote 2969
mywikibooks 2969
muswiki 2867
nawikiquote 2867
akwiktionary 2764
alswikiquote 2662
krwiki 2560
zawikiquote 2252
aywikibooks 2150
chwiktionary 1945
vewikimedia 1945
quwikiquote 1843
ndswikiquote 1740
krwikiquote 1126

I won't comment much: the biggest wikis are mostly the same (though in a slightly different order), but the Volapük Wikipedia, to continue using this example, ranks much lower. Actually, the right way to look at this would be to plot the dump size against the number of articles. Here is the result, restricted to the first 80 Wikipedias (not Wikimedia projects) sorted by the number of articles. (If the graph is not correctly displayed below, you can get it here.)

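[Plot: compressed dump size against number of articles, for the first 80 Wikipedias by number of articles]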

The Volapük Wikipedia is indeed an outlier, along with a few others.
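
The quantity behind the plot is just the compressed dump size divided by the number of articles. A minimal sketch, with a big caveat: the dump sizes below are taken from the list above, but the article counts are placeholders that I made up to show the computation, not real figures, and the function names are hypothetical.

```python
def bytes_per_article(dump_bytes, article_count):
    """Average compressed size of an article, in bytes."""
    return dump_bytes / article_count


def rank(dump_sizes, article_counts):
    """Sort wikis by decreasing average compressed article size."""
    pairs = [
        (wiki, bytes_per_article(size, article_counts[wiki]))
        for wiki, size in dump_sizes.items()
        if wiki in article_counts
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


dump_sizes = {"vowiki": 18350080, "lawiki": 40684748}    # from the list above
article_counts = {"vowiki": 100_000, "lawiki": 50_000}   # PLACEHOLDERS, not real counts

for wiki, average in rank(dump_sizes, article_counts):
    print(wiki, round(average))
```

With real article counts, a wiki full of bot-created stubs would show up at the bottom of this ranking.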

Haspirater -- identifying initial aspirated 'h's in French words

English version

French version below.

I just wrote haspirater, a system to detect if the initial 'h' in a French word is aspirated or not. (I happened to need this, and no one had apparently done it yet.) For those who are unfamiliar with the context, the thing is that for French words which start with an 'h', the 'h' can be aspirated or non-aspirated, which changes the behavior of the word regarding elision and liaison. Of course, there are no known rules to find out from the structure of a word whether it is aspirated or not...

The simple approach would be to use word lists, but of course they are never complete and will fail for unseen words. A natural solution for such words would be to assume that they behave in the same way as the closest known word. This forces us to define "closest", and it seems reasonable to look at the word with the longest common prefix, because it looks like the property of whether the 'h' is aspirated or not should be mostly conditioned by the beginning of the word. (Actually, it would be better to look at the pronunciation of the beginning of the word if we could afford it.)
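
As a rough illustration of the "closest known word" idea, here is a naive Python sketch that scans a tiny hand-written word list and returns the answer of the word sharing the longest prefix. This is not the actual haspirater code; the function names and the four-word list are made up for the example, and ties are simply broken by dictionary order.

```python
def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def guess_aspirated(word, known):
    """Return the value of the known word sharing the longest prefix with word."""
    closest = max(known, key=lambda k: common_prefix_length(word, k))
    return known[closest]


# True means aspirated 'h'. A toy list, far too small for real use.
KNOWN = {"hache": True, "hirondelle": False, "haricot": True, "homme": False}

print(guess_aspirated("hachette", KNOWN))   # True (closest word: "hache")
print(guess_aspirated("hirondeau", KNOWN))  # False (closest word: "hirondelle")
```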

This suggests a simple optimization. If we are going to take the result of the closest word in this fashion, we might as well drop words from the known words list which do not contribute anything to the result. In other words, if all words starting with "hach" in the list have an aspirated 'h', then there is no need to store them all; just storing "hach" will be enough. In fact, this means that the appropriate structure for our word list is a trie, and the optimization that I mentioned is apparently called compression.
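
Here is a minimal sketch of that optimization in Python: build a trie where each node remembers which answers occur below it, then cut off everything below a node as soon as all the words under it agree. The node layout (a dict with `children` and `values`) and the toy word list are my assumptions for the example, not the format that haspirater actually uses.

```python
def new_node():
    return {"children": {}, "values": set()}


def build_trie(words):
    """words maps each word to True (aspirated) or False (non-aspirated)."""
    root = new_node()
    for word, aspirated in words.items():
        node = root
        node["values"].add(aspirated)
        for letter in word:
            node = node["children"].setdefault(letter, new_node())
            node["values"].add(aspirated)
    return root


def compress(node):
    """Drop the subtree below any node whose words all agree (in place)."""
    if len(node["values"]) == 1:
        node["children"] = {}
    for child in node["children"].values():
        compress(child)
    return node


def lookup(trie, word):
    """Walk down as far as the trie goes; return the answers seen at that point."""
    node = trie
    for letter in word:
        if letter not in node["children"]:
            break
        node = node["children"][letter]
    return node["values"]


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["children"].values())


WORDS = {"hache": True, "hachis": True, "haricot": True, "habit": False}
trie = build_trie(WORDS)
print(count_nodes(trie))           # 16
compress(trie)
print(count_nodes(trie))           # 6
print(lookup(trie, "hachette"))    # {True}
```

On this toy list the branch is cut at "hac" rather than "hach", simply because no known word starting with "hac" disagrees. A node where `values` contains both answers is an ambiguous prefix.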

Another trick is that we can try to infer word lists automatically from a corpus using some simple rules. If we read "la hache" in a text, it means that the initial 'h' is aspirated, whereas "l'hirondelle" indicates that "hirondelle" starts with a non-aspirated 'h'. We can easily process megabytes of text and get word lists in this way.
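
A crude Python sketch of this inference, using two regular expressions: one for an elided article ("l'h...", so non-aspirated) and one for a full article before an h-word ("le h..." or "la h...", so aspirated). The patterns and the `infer` function are illustrative assumptions; a real corpus would need more rules (other elidable words, plurals with liaison, noise filtering).

```python
import re

# "l'hirondelle": elision happened, so the 'h' is not aspirated.
ELIDED = re.compile(r"\b[lL]['’](h\w+)")
# "la hache", "le héros": no elision, so the 'h' is aspirated.
UNELIDED = re.compile(r"\b(?:le|la)\s+(h\w+)", re.IGNORECASE)


def infer(text):
    """Return {word: [aspirated_count, non_aspirated_count]} from raw text."""
    counts = {}
    for word in UNELIDED.findall(text):
        counts.setdefault(word.lower(), [0, 0])[0] += 1
    for word in ELIDED.findall(text):
        counts.setdefault(word.lower(), [0, 0])[1] += 1
    return counts


sample = "La hache est tombée sur l'herbe, et l'hirondelle a vu le héros."
print(infer(sample))
# {'hache': [1, 0], 'héros': [1, 0], 'herbe': [0, 1], 'hirondelle': [0, 1]}
```

Keeping counts rather than a single yes/no per word makes it possible to spot ambiguous or noisy cases afterwards.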

This approach works quite well, yielding a dataset which is under 5 KB in JSON (and less than 1 KB compressed) and a lookup program which is just 40 lines of Python. It gives the right result for any example I could come up with (though I had to add a few exceptions for words missing from the corpus, both manually and using Wikipedia and Wiktionary); hopefully it will also work for you, please tell me if it doesn't. Note that we could even compile the trie into a very efficient C program if speed were of the essence.

Another amusing thing is that we can draw the data and try to see what the "rules" are. Here is the trie (it is quite messy). The node border and edge thickness are a logarithmic function of the number of occurrences, the node labels indicate the prefix, and the color is red for aspirated h and blue for non-aspirated (or a mix thereof for ambiguous cases, depending on the proportion).

The Wikipedia article I linked above mentions that this approach is used to store Unicode character properties efficiently; it seems like it could be used for a lot of other things. For instance, you could imagine the trie indicating the gender of nouns (though you would probably build it on suffixes rather than prefixes in this case), the trie of prepositions used for a given noun (say "в" vs "на" in Russian), and probably tries for a lot of other arbitrary things that are so obvious to native speakers.

French version

English version above.

I have just written haspirater, a system to detect aspirated 'h's at the beginning of French words. (It turns out that I needed this and, apparently, no one had done it yet.) For those unfamiliar with the context, the problem is that in French words starting with an 'h', the 'h' can be aspirated or non-aspirated (which makes no difference to the pronunciation, but changes how elision and liaison work). Of course, there are no rules to determine from the structure of a word whether its initial 'h' is aspirated or not...

The simplest approach would be to use a word list, but of course such lists are never complete and will not work for unknown words. A natural solution for those would be to assume that they behave in the same way as the closest known word. This forces us to define "closest": it seems reasonable to look at the word sharing the longest common prefix with the unknown word, because one tends to think that the beginning of the word has the most influence on whether the 'h' is aspirated or not. (Actually, it is rather the pronunciation of the beginning of the word that we should look at if we knew it, but never mind...)

This leads to a simple optimization. If we always take the result of the closest word in the way described above, we can remove from the list of known words those that are not needed to deduce the result. In other words, if all words starting with "hach" in the list have an aspirated 'h', then there is no need to store them all; storing "hach" is enough. In fact, this means that the right structure for our word list is a trie, and the optimization I have just described is apparently called compression.

Another trick is that we can try to infer word lists automatically from a corpus using a few simple rules. If we read "la hache" in a text, it means that the initial 'h' is aspirated, whereas "l'hirondelle" indicates that "hirondelle" starts with a non-aspirated 'h'. We can easily process several megabytes of text to extract word lists this way.

This approach works rather well: the data file is under 5 KB in JSON (and under 1 KB compressed), and the lookup program is 40 lines of Python. It gives the right result for every example I tried (though I had to add a few exceptions by hand for words missing from the corpus, using the lists from Wikipedia and Wiktionary); I hope it will work for you too, and please report any error to me. Note that we could even compile the trie into a very efficient C program if speed mattered.

Another fun thing to do is to draw the data and see what the "rules" seem to be. Here is the trie (it is rather messy). The thickness of node borders and edges is a logarithmic function of the number of occurrences, node labels indicate the prefix, red nodes indicate aspiration, blue nodes indicate non-aspiration, and a mix indicates ambiguous cases (depending on the proportion).

The Wikipedia article cited above indicates that this approach is used to store Unicode character properties efficiently; I have the impression that it could be used for plenty of other things. One could imagine, for example, a trie of noun genders (though it would probably be better to look at suffixes rather than prefixes in this case), a trie of the choice of prepositions depending on the noun ("в" vs "на" in Russian, among others), and probably tries for a lot of other arbitrary things that are so obvious to native speakers.

A review of the TypeMatrix 2030

I was convinced by a friend who wanted to buy a TypeMatrix 2030 to buy one too so that we could get a discount. Here is a review of it, after roughly one month and a half of nearly exclusive use. The version I have is the blank one (because looking at the keyboard is a bad idea, and because it's kinda pretty).

To give a bit of background: I've been using the Dvorak US layout (with dead keys to get French accents) for about two years. I touch-type, and can reach around 110 WPM on speed typing games (but am nowhere near as fast in real-life usage). Before that, I touch-typed on the French Dvorak layout of Josselin Mouette adapted from that of Francis Leboutte (which got removed from Xorg for stupid reasons), and still before, I hunted and pecked on the Azerty layout. I love exclusive keyboard usage whenever I can afford it, and hate having to move my hands to reach things away from the home row like the arrow keys, the numpad, or (gasp) the mouse. Oh, and I love the command line and command-line apps, and use vim.

General comments

The blank version of the keyboard is very stylish (though it's sad that the design is so asymmetrical). It is guaranteed to confuse or impress people, which can be fun, and, if you're using an alternative layout, it is a gentle hint to other people that they'd better not try to use your keyboard.

The keyboard does have some unexpected features, like a hardcoded Dvorak layout that you can toggle and which is managed by the keyboard, not the OS. In other words, turning this on will make the keyboard interpret what you're typing as Dvorak and translate it for the OS, which will do what you expect if the OS is configured to receive Qwerty. Of course, if it already expects Dvorak, then you get garbage. This is useless to me because I need non-standard dead keys (and seldom have to share my machines with Qwerty users anyway), but can be useful to others. Or you might also be disgusted to see that the keyboard tries to do some fancy logic like this. Or maybe regret that since it does, it would have been cool to also have fancy features like the ability to remap keys and record macros on the fly...
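
To see why a double translation gives garbage, here is a small Python sketch that models one Qwerty-to-Dvorak translation step as a character mapping and applies it once (keyboard only) and twice (keyboard, then an OS that also expects Dvorak). The two layout strings cover the unshifted US keys only and are written out by hand, and the variable names are made up, so treat this as an illustration rather than a model of the actual firmware.

```python
# Characters produced by the same physical keys under each layout
# (US variants, unshifted keys only).
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,./-="
DVORAK = "',.pyfgcrl/=aoeuidhtns-;qjkxbmwvz[]"

TO_DVORAK = str.maketrans(QWERTY, DVORAK)
TO_QWERTY = str.maketrans(DVORAK, QWERTY)


def translate(keys):
    """Apply one Qwerty-to-Dvorak translation step to a string of keys."""
    return keys.translate(TO_DVORAK)


intended = "hello"
# Qwerty labels of the physical keys a Dvorak typist hits to write "hello".
pressed = intended.translate(TO_QWERTY)
once = translate(pressed)   # keyboard translates, OS expects Qwerty
twice = translate(once)     # keyboard translates, OS applies Dvorak again

print(pressed)  # jdpps
print(once)     # hello
print(twice)    # d.nnr
```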

There are a few multimedia shortcuts and all. Some of them are actual multimedia keys which you can map to whatever you want, and some of them (like cut, copy, paste) are hardcoded sequences which are indistinguishable from the separate keys. The precise status of those keys is described in this document.

The dots on the index home row keys are there, and at the center of the key, which can be surprising if you expect them to be at the bottom. There is also a dot on the delete key (which I don't see the use for), a dot on the lower row pinky key (which is '/' on Qwerty but 'z' on Dvorak) which I find useless and slightly confusing, and a dot on the down arrow (which is a nice touch to help you reach those keys without looking whenever you have to use software which requires them).

The TypeMatrix 2030 does not have N-key rollover (NKRO). That's a bit disappointing for a keyboard of this kind...

Adaptation period

Adapting to the TypeMatrix 2030 takes a bit of time. It's nowhere near as hard as learning a new layout, but it is definitely not instantaneous, and you might want to allow a week before you're up to speed. Here is a list of the things that I had to adapt to:

Touch
The key touch isn't really special (and I'm not really picky about that sort of thing anyway), but it is slightly hard. This, along with the fact that the position of the modifier keys somehow confused me at the beginning, meant that my wrists suffered a bit and that it was literally slightly painful at the beginning. This didn't last, fortunately.
Modifiers
The main modifiers that I use are left shift, left control, super, alt, and altgr. Finding out where they are, so as to press the right ones without even thinking about it, takes some time.
Enter and backspace
One of the most original things about this layout is the fact that the enter and backspace keys are in the middle of the keyboard (and pressed with the index) rather than far at the right (and pressed with the pinky). This means that you have to replace the very low level reflex of reaching for enter when done and backspace when wrong by the reflex of going to the center of the keyboard.
Matrix keys
The other important feature of the keyboard is that the keys are in a matrix (duh). This isn't that much of a deal, except for those keys which seem to be off by one relative to their position on usual keyboards. The worst for me was the right half of the lower row, and numbers.
Real touch-typing
Unless you're completely touch-typing (and it's easy to be mistaken), the absence of markings will make you notice those keys where you sometimes peek at the keyboard. In my case, the letter keys were fine, but not the numbers and symbols...

Assessment

Overall, I have to admit that I am not that enthusiastic about the benefits of this keyboard. Though aligning the keys in a matrix seems more logical, I do not feel it makes much of a difference. Maybe it's better somehow--but then, maybe not.

Another slight problem is that damn enter key. Putting it in the middle seems like a good idea; however, this means that pressing it by mistake still happens now and then, whereas I never had this problem with a regular keyboard. Yes, it's easier to reach, but then the usual enter and backspace can also be reached with the pinky almost without moving the hand, so it's not much of a benefit.

My main disappointment, though, is that it doesn't make you type faster, or give you the impression that you're typing faster, or make you feel better, or whatever. It feels like just another keyboard; a good keyboard, with a cool design, but definitely not worth the effort of carrying it around when you're using a laptop, and probably not worth the money and adaptation time. Granted, if you have RSI, you might want to see if this helps. If you do most of your work on a fixed workstation and you're willing to pay extra, this might still be a reasonable choice. But otherwise, if you're just a normal typist not especially dissatisfied with normal keyboards and just intellectually satisfied by the TypeMatrix design choices, if you're using a laptop or multiple computers, don't buy it and expect it to be enormously better to use. If you're like me, you won't really notice much.