haspirater -- a toolkit to detect initial aspirated 'h' in French words
Repository URL: https://gitlab.com/a3nm/haspirater
Python package name: haspirater

== 0. Author and license ==

haspirater is copyright (C) 2011-2019 by Antoine Amarilli

haspirater is free software, distributed under an MIT license: see the
file LICENSE for details of the licensing terms.

Many thanks to Julien Romero, who maintains the PyPI package for
haspirater.

== 1. Features ==

haspirater is a tool to detect whether a French word starts with an
aspirated 'h' or not. It is not based on a list of words but on a trie
trained from a corpus, which ensures that it should do a reasonable job
on unseen words that are similar to known ones. The JSON trie used is
less than 10 KiB, and the lookup script is 40 lines of Python.

== 2. Installation ==

You need a working Python 3 environment to run haspirater.

You can install haspirater directly with pip:

  pip3 install haspirater

You can also manually clone the project repository and use haspirater
directly from there.

== 3. Usage ==

If you just want to use the included training data, you can either run
haspirater/haspirater.py, giving one word per line on stdin and getting
the annotation on stdout, or you can import it in a Python file and call
haspirater.lookup(word), which returns True if the leading 'h' is
aspirated, False if it isn't, and raises ValueError if there is no
leading 'h'.

Please report any errors in the training data, keeping in mind that only
one possibility is returned even when both are attested.

== 4. Training ==

The training data used by haspirater/haspirater.py is loaded at runtime
from the haspirater/haspirater.json file, which has been trained from
French texts taken from Project Gutenberg <www.gutenberg.org>, from the
list in the Wikipedia article
<http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the categories in the
French Wiktionary
<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré>, and from a
custom set of exceptions. If you want to create your own data, or adapt
the approach here to other linguistic features, read on.

The master script is make.sh, which accepts French text on stdin and a
list of exception files as arguments. Included exception files are
datasets/additions and datasets/wikipedia. These exceptions are just
like training data and are not stored as-is; they are simply piped in
later during the training phase. make.sh produces the JSON trie on
stdout. Thus, you would run something like the following, where corpus
is your corpus:

  cd training
  cat corpus | ./make.sh ../datasets/additions ../datasets/wikipedia > ../haspirater/haspirater.json

The resulting JSON file reflects both what can be identified in the
corpus and what is given in the exception files.

== 5. Training details ==

=== 5.1. Corpus preparation (prepare.sh) ===

This script removes useless characters and separates words (one per
line).

=== 5.2. Property inference (training/detect.pl) ===

This script examines the output of the previous phase, notices
occurrences of words for which the preceding word indicates the
aspirated or non-aspirated status, and outputs them.

The format used for the output of this phase is the same as that of the
exceptions file.

=== 5.3. Removing the leading 'h' ===

This is a quick optimization.

=== 5.4. Trie construction (buildtrie.py) ===

The occurrences are read one after the other and are used to populate a
trie carrying, for each prefix, the count of occurrences with each
value.

=== 5.5. Trie compression (compresstrie.py) ===

The trie is then compressed by removing branches which are not needed to
infer a value, because the only possible value is already determined at
that stage.

=== 5.6. Trie majority relabeling (majoritytrie.py) ===

Instead of the list of values with their counts, nodes are relabeled to
carry the most common value. This step could be skipped to keep
confidence values. We also drop useless leaf nodes at this stage.

== 6. Additional stuff ==

You can use trie2dot.py to convert the output of buildtrie.py or
compresstrie.py into the dot format, which can be used to render a
drawing of the trie ("trie2dot.py h 0 1"). The result of such a drawing
is given as plots/haspirater.pdf (before majoritytrie.py: contains
frequency information, but more nodes) and plots/haspirater_majority.pdf
(no frequency information, fewer nodes).

You can use leavestrie.py to get the leaves of a trie.
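To make the trie-based lookup idea of sections 1 and 3 concrete, here is a minimal, self-contained Python sketch of the technique: walk a character trie over the part of the word after the leading 'h', remembering the last decided value, so that unseen words fall back on the longest known prefix. The trie layout below is a hypothetical simplification for illustration; it is not the actual format of haspirater/haspirater.json.

```python
def lookup(trie, word):
    """Return True if the word's leading 'h' is aspirated, False if mute.

    Each node is a pair (value, children): value is True/False/None
    (None meaning "not decided yet, keep reading"), and children maps
    the next character to a child node.
    """
    if not word.startswith('h'):
        raise ValueError("no leading 'h' in %r" % word)
    node = trie
    result = node[0]
    for char in word[1:]:  # the leading 'h' is stripped during training
        children = node[1]
        if char not in children:
            break  # unseen suffix: keep the value of the longest known prefix
        node = children[char]
        if node[0] is not None:
            result = node[0]
    return result

# Tiny hand-made trie: words in 'ha-' aspirated by default, except the
# 'hab-' branch (as in "habit"), which is mute; everything else mute.
toy_trie = (False, {'a': (True, {'b': (False, {})})})

print(lookup(toy_trie, "hache"))  # True
print(lookup(toy_trie, "habit"))  # False
```

Note how "hache" matches only the 'a' branch before hitting an unseen character, so it inherits the value decided at that prefix.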
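The trie-construction step (5.4) can be sketched as follows, under the assumption that occurrences arrive as (word, value) pairs with the leading 'h' already stripped (per step 5.3); the real buildtrie.py may differ in its exact input and node format.

```python
from collections import Counter

def build_trie(occurrences):
    """Build a nested-dict trie from (word, value) occurrences.

    Each node stores, for every value, the count of occurrences whose
    word shares that node's prefix.
    """
    root = {"counts": Counter(), "children": {}}
    for word, value in occurrences:
        node = root
        node["counts"][value] += 1
        for char in word:
            node = node["children"].setdefault(
                char, {"counts": Counter(), "children": {}})
            node["counts"][value] += 1
    return root

# Hypothetical occurrences: "hache" (aspirated), "habit" and "homme"
# (mute), with the leading 'h' already removed.
occurrences = [("ache", True), ("abit", False), ("omme", False)]
trie = build_trie(occurrences)
print(trie["counts"])                   # Counter({False: 2, True: 1})
print(trie["children"]["a"]["counts"])  # one True ("ache"), one False ("abit")
```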
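Steps 5.5 and 5.6 can likewise be sketched. The node layout below (a per-value count map plus a children map) is an illustrative assumption, not the actual format used by compresstrie.py or majoritytrie.py.

```python
def compress(node):
    """Prune subtrees whose value is already determined (step 5.5):
    if a node saw only one value, its children carry no information."""
    if len(node["counts"]) <= 1:
        node["children"] = {}  # value decided here: drop the branch below
    else:
        for child in node["children"].values():
            compress(child)
    return node

def majority(node):
    """Relabel each node with its most common value (step 5.6),
    discarding the counts (and hence the confidence information)."""
    node["value"] = max(node["counts"], key=node["counts"].get)
    del node["counts"]
    for child in node["children"].values():
        majority(child)
    return node

# Hand-built count trie for "ache" (True), "abit" (False), "omme" (False).
trie = {
    "counts": {True: 1, False: 2},
    "children": {
        "a": {"counts": {True: 1, False: 1},
              "children": {
                  "c": {"counts": {True: 1}, "children": {
                      "h": {"counts": {True: 1}, "children": {}}}},
                  "b": {"counts": {False: 1}, "children": {}}}},
        "o": {"counts": {False: 1},
              "children": {"m": {"counts": {False: 1}, "children": {}}}},
    },
}

compress(trie)
# The 'o' node saw only False, so its subtree is dropped (the node stays);
# under 'a', the single-value 'c' branch loses its children too.
```

After compression, reading "ac" already suffices to answer True, which is why the deeper nodes can go.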