haspirater

detect aspirated 'h' in French words
git clone https://a3nm.net/git/haspirater/


haspirater -- a toolkit to detect initial aspirated 'h' in French words
Copyright (C) 2011 by Antoine Amarilli

== 0. License (MIT license) ==

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

== 1. Features ==

haspirater is a tool to detect whether a French word starts with an
aspirated 'h' or not. It is not based on a list of words but on a trie
trained from a corpus, which ensures that it should do a reasonable job
on unseen words that are similar to known ones, without carrying a big
exception list. The JSON trie used is less than 10 KiB, and the lookup
script is 40 lines of Python.

== 2. Usage ==

If you just want to use the included training data, you can either run
haspirater.py, giving one word per line on stdin and getting the
annotation on stdout, or you can import it as a Python module and call
haspirater.lookup(word), which returns True if the leading 'h' is
aspirated, False if it isn't, and raises ValueError if there is no
leading 'h'.
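
For example, assuming haspirater.py and haspirater.json are available to
the interpreter, an interactive session might look like this ('hibou'
has an aspirated 'h', 'homme' a mute one):

  >>> import haspirater
  >>> haspirater.lookup('hibou')   # "le hibou": aspirated
  True
  >>> haspirater.lookup('homme')   # "l'homme": not aspirated
  False
  >>> haspirater.lookup('arbre')   # no leading 'h': raises ValueError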

Please report any errors in the training data, keeping in mind that only
one possibility is returned even when both are attested.

== 3. Training ==

The training data used by haspirater.py is loaded at runtime from the
haspirater.json file, which has been trained from French texts taken from
Project Gutenberg <www.gutenberg.org>, from the list in the Wikipedia
article <http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the
categories in the French Wiktionary
<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré>, and from a custom
set of exceptions. If you want to create your own data, or adapt the
approach here to other linguistic features, read on.

The master script is make.sh, which accepts French text on stdin and a
list of exception files as arguments. The included exception files are
additions and wikipedia. These exceptions are treated just like training
data and are not stored as-is; they are simply piped in later in the
training phase. make.sh produces the JSON trie on stdout. Thus, you would
run something like:

  $ cat corpus | ./make.sh exceptions > haspirater.json

== 4. Training details ==

=== 4.1. Corpus preparation (prepare.sh) ===

This script removes useless characters and splits the text into words
(one per line).

=== 4.2. Property inference (detect.pl) ===

This script examines the output, notices occurrences of words for which
the preceding word indicates the aspirated or non-aspirated status (for
instance "le hibou" as opposed to "l'homme"), and outputs them.

The format used for the output of this phase is the same as that of the
exception files.
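
As an illustration only (the actual logic lives in detect.pl, and the
cue lists below are simplified assumptions rather than the ones it uses),
the kind of inference involved could be sketched in Python as:

  # Simplified sketch, not the actual detect.pl: elision before the word
  # means the 'h' is mute, while an unelided article right before it
  # means the 'h' is aspirated.
  MUTE_CUES = ("l'", "d'", "qu'")      # "l'homme", "d'heure"
  ASPIRATED_CUES = ("le", "la", "du")  # "le hibou", "la hache"

  def infer(previous, word):
      """Return True/False when the previous word decides the case, else None."""
      if not word.startswith('h'):
          return None
      if previous in ASPIRATED_CUES:
          return True
      if any(previous.endswith(cue) for cue in MUTE_CUES):
          return False
      return None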

=== 4.3. Removing leading 'h' ===

This is a quick optimization: since every word considered starts with
'h', the leading 'h' carries no information and is stripped before the
trie is built.

=== 4.4. Trie construction (buildtrie.py) ===

The occurrences are read one after the other and are used to populate a
trie in which each node carries, for each value, the number of
occurrences sharing the corresponding prefix.
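
The underlying idea can be sketched as follows (the names and the exact
data layout are illustrative, not the actual buildtrie.py internals):

  # Illustrative sketch: each node keeps, for every value seen (aspirated
  # or not), the number of occurrences whose word has that prefix.
  # Start with trie = empty_node() and add each occurrence in turn.
  def empty_node():
      return {'counts': {}, 'children': {}}

  def add_occurrence(trie, word, value):
      node = trie
      node['counts'][value] = node['counts'].get(value, 0) + 1
      for char in word:
          node = node['children'].setdefault(char, empty_node())
          node['counts'][value] = node['counts'].get(value, 0) + 1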

=== 4.5. Trie compression (compresstrie.py) ===

The trie is then compressed by removing branches which are not needed to
infer a value. This step could be followed by a removal of branches with
very little dissent from the majority value if we wanted to reduce the
trie size at the expense of accuracy: for aspirated h, this isn't
needed.
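
In terms of the illustrative sketch above, the pruning step amounts to
something like:

  # Illustrative sketch: once all occurrences below a node agree on the
  # value, its subtrees cannot change the answer and can be dropped.
  def compress(node):
      if len(node['counts']) <= 1:
          node['children'] = {}
      else:
          for child in node['children'].values():
              compress(child)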

=== 4.6. Trie majority relabeling (majoritytrie.py) ===

Instead of the list of values with their counts, nodes are relabeled to
carry the most common value. This step could be skipped to keep
confidence values. We also drop useless leaf nodes there.
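
Continuing the same illustrative sketch, the relabeling could look like
this (the dropping of useless leaves is omitted):

  # Illustrative sketch: replace the per-value counts by the most common
  # value; confidence information is lost at this point.
  def relabel(node):
      node['value'] = max(node['counts'], key=node['counts'].get)
      del node['counts']
      for child in node['children'].values():
          relabel(child)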

== 5. Additional stuff ==

You can use trie2dot.py to convert the output of buildtrie.py or
compresstrie.py to the dot format, which can be used to render a drawing
of the trie ("trie2dot.py h 0 1"). The result of such a drawing is given
as haspirater.pdf (before majoritytrie.py: contains frequency info, but
more nodes) and haspirater_majority.pdf (no frequency, fewer nodes).

You can use leavestrie.py to get the leaves of a trie.