haspirater

detect aspirated 'h' in French words (local mirror of https://gitlab.com/a3nm/haspirater)
git clone https://a3nm.net/git/haspirater/


haspirater -- a toolkit to detect initial aspirated 'h' in French words
Repository URL: https://gitlab.com/a3nm/haspirater
Python package name: haspirater

== 0. Author and license ==

haspirater is copyright (C) 2011-2019 by Antoine Amarilli

haspirater is free software, distributed under an MIT license: see the
file LICENSE for details of the licensing terms.

Many thanks to Julien Romero, who maintains the PyPI package for
haspirater.

== 1. Features ==

haspirater is a tool to detect whether a French word starts with an
aspirated 'h' or not. It is not based on a list of words but on a trie
trained from a corpus, so it should do a reasonable job on unseen words
that are similar to known ones. The JSON trie used is less than 10 KiB,
and the lookup script is 40 lines of Python.

== 2. Installation ==

You need a working Python 3 environment to run haspirater.

You can install haspirater directly with pip by doing:

  pip3 install haspirater

You can also manually clone the project repository and use haspirater
directly from there.

== 3. Usage ==

If you just want to use the included training data, you can either run
haspirater/haspirater.py, giving one word per line on stdin and getting
the annotation on stdout, or you can import it in a Python file and call
haspirater.lookup(word), which returns True if the leading 'h' is
aspirated, False if it isn't, and raises ValueError if there is no
leading 'h'.

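For instance, a minimal Python session (the example words are standard
cases: "hibou" has an aspirated 'h', "homme" a mute one):

  import haspirater

  print(haspirater.lookup("hibou"))  # True: one says "le hibou"
  print(haspirater.lookup("homme"))  # False: one says "l'homme"

  try:
      haspirater.lookup("arbre")     # no leading 'h'
  except ValueError:
      print("no leading 'h'")
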
Please report any errors in the training data, keeping in mind that only
one possibility is returned even when both are attested.

== 4. Training ==

The training data used by haspirater/haspirater.py is loaded at runtime
from the haspirater/haspirater.json file, which has been trained from
French texts taken from Project Gutenberg <www.gutenberg.org>, from the
list in the Wikipedia article
<http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the categories in the
French Wiktionary
<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré>, and from a
custom set of exceptions. If you want to create your own data, or adapt
the approach here to other linguistic features, read on.

The master script is make.sh, which accepts French text on stdin and a
list of exception files as arguments. Included exception files are
datasets/additions and datasets/wikipedia. These exception files follow
the same format as the inferred training data and are not stored as-is;
they are simply piped in later during the training phase. make.sh
produces the JSON trie on stdout. Thus, you would run something like the
following, where corpus is your corpus:

  cd training
  cat corpus | ./make.sh ../datasets/additions ../datasets/wikipedia > ../haspirater/haspirater.json

The resulting JSON file will reflect both what can be identified in the
corpus and what is given in the exception files.

== 5. Training details ==

=== 5.1. Corpus preparation (prepare.sh) ===

This script removes useless characters and separates words (one per
line).

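prepare.sh itself is a shell script, but the idea is roughly the
following Python sketch (the exact set of characters treated as useless
is an assumption, not what prepare.sh actually does):

  import re
  import sys

  # Split the input into words, one per line, keeping letters (including
  # accented ones) and apostrophes and dropping everything else.
  for line in sys.stdin:
      for word in re.split(r"[^a-zàâäçéèêëîïôöùûüÿœ']+", line.lower()):
          if word:
              print(word)
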
=== 5.2. Property inference (training/detect.pl) ===

This script examines the prepared corpus, notices occurrences of words
for which the preceding word indicates the aspirated or non-aspirated
status, and outputs them.

The format used for the output of this phase is the same as that of the
exception files.

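detect.pl is a Perl script; the following Python sketch only illustrates
the underlying idea, with a small illustrative set of clue words and an
assumed "value word" output format:

  import sys

  # Function words that keep their full form before an aspirated 'h'
  # ("le hibou") but elide before a mute one ("l'homme").
  FULL_FORMS = {"le", "la", "du", "ce", "ma", "ta", "sa"}
  ELIDED = ("l'", "d'", "qu'")

  prev = ""
  for line in sys.stdin:
      word = line.strip().lower()
      if word.startswith("h") and prev in FULL_FORMS:
          print("1 " + word)                    # aspirated: no elision
      for prefix in ELIDED:
          if word.startswith(prefix + "h"):
              print("0 " + word[len(prefix):])  # mute: elision happened
      prev = word
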
=== 5.3. Removing leading 'h' ===

This is a quick optimization.

=== 5.4. Trie construction (buildtrie.py) ===

The occurrences are read one after the other and are used to populate a
trie carrying, for each prefix, the count of occurrences with each
value.

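A minimal sketch of this construction, under an assumed node
representation (a pair of a count dictionary and a child dictionary; the
real buildtrie.py may differ):

  import json
  import sys

  # Each node is [counts, children]: counts maps a value to the number
  # of occurrences whose word starts with this node's prefix, and
  # children maps a letter to a child node.
  root = [{}, {}]
  for line in sys.stdin:
      if not line.strip():
          continue
      value, word = line.split()
      node = root
      node[0][value] = node[0].get(value, 0) + 1
      for letter in word:
          node = node[1].setdefault(letter, [{}, {}])
          node[0][value] = node[0].get(value, 0) + 1
  json.dump(root, sys.stdout)
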
=== 5.5. Trie compression (compresstrie.py) ===

The trie is then compressed by removing branches which are not needed to
infer a value, because the only possible value is already determined at
that stage.

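On the node representation assumed above, the pruning can be sketched
as:

  # If all occurrences under a node carry the same value, the prefix
  # already determines the answer, so its subtrees can be dropped.
  def compress(node):
      counts, children = node
      if len(counts) == 1:
          node[1] = {}
      else:
          for child in children.values():
              compress(child)
      return node
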
=== 5.6. Trie majority relabeling (majoritytrie.py) ===

Instead of the list of values with their counts, nodes are relabeled to
carry the most common value. This step could be skipped to keep
confidence values. We also drop useless leaf nodes at this stage.

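Still on the assumed representation, a sketch of the relabeling; exactly
which leaves count as useless is an assumption here (those predicting
the same value as their parent):

  # Replace each node's count dictionary by its most common value,
  # dropping confidence information, then remove leaf children that
  # predict the same value as their parent.
  def relabel(node):
      value = max(node[0], key=node[0].get)
      for letter, child in list(node[1].items()):
          relabel(child)
          if not child[1] and child[0] == value:
              del node[1][letter]
      node[0] = value
      return node
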
== 6. Additional stuff ==

You can use trie2dot.py to convert the output of buildtrie.py or
compresstrie.py to the dot format, which can be used to render a drawing
of the trie ("trie2dot.py h 0 1"). The result of such a drawing is given
as plots/haspirater.pdf (before majoritytrie.py: contains frequency
info, but more nodes) and plots/haspirater_majority.pdf (no frequency,
fewer nodes).

You can use leavestrie.py to get the leaves of a trie.