detect aspirated 'h' in French words (local mirror of https://gitlab.com/a3nm/haspirater)
git clone https://a3nm.net/git/haspirater/

commit 90a1a9fba6e56abe7baca8792394f0a15f9f10eb
parent ae59e299f0d0b9ce4255d6b07dc718b32edd980d
Author: Antoine Amarilli <a3nm@a3nm.net>
Date:   Fri, 16 Aug 2019 00:01:45 +0200

go over README

README | 95+++++++++++++++++++++++++++++++++++++++----------------------------------------
1 file changed, 47 insertions(+), 48 deletions(-)

diff --git a/README b/README
@@ -1,42 +1,41 @@
 haspirater -- a toolkit to detect initial aspirated 'h' in French words
-Copyright (C) 2011-2019 by Antoine Amarilli
 Repository URL: https://gitlab.com/a3nm/haspirater
+Python package name: haspirater

-== 0. License (MIT license) ==
+== 0. Author and license ==

-Permission is hereby granted, free of charge, to any person obtaining a
-copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
+Haspirater is copyright (C) 2011-2019 by Antoine Amarilli

-The above copyright notice and this permission notice shall be included
-in all copies or substantial portions of the Software.
+Haspirater is free software, distributed under an MIT license: see the
+file LICENSE for details of the licensing terms.

-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
-OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
-IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
-CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
-TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Many thanks to Julien Romero who maintains the PyPI package for
+haspirater.

 == 1. Features ==

 haspirater is a tool to detect if a French word starts with an aspirated
 'h' or not. It is not based on a list of words but on a trie trained
 from a corpus, which ensures that it should do a reasonable job for
-unseen words which are similar to known ones, without carrying a big
-exceptions list. The json trie used is less than 10 Kio, and the lookup
-script is 40 lines of Python.
+unseen words which are similar to known ones. The JSON trie used is less
+than 10 KiB, and the lookup script is 40 lines of Python.

-== 2. Usage ==
+== 2. Installation ==
+
+You need a working Python3 environment to run haspirater.
+
+You can install haspirater directly with pip by doing:
+
+  pip3 install haspirater
+
+You can also manually clone the project repository and use haspirater
+directly from there.
+
+== 3. Usage ==

 If you just want to use the included training data, you can either run
-haspirater.py, giving one word per line in stdin and getting the
-annotation on stdout, or you can import it in a Python file and call
+haspirater/haspirater.py, giving one word per line in stdin and getting
+the annotation on stdout, or you can import it in a Python file and call
 haspirater.lookup(word) which returns True if the leading 'h' is
 aspirated, False if it isn't, and raises ValueError if there is no
 leading 'h'.
@@ -44,35 +43,37 @@ leading 'h'.
 Please report any errors in the training data, keeping in mind than only
 one possibility is returned even when both are attested.

-== 3. Training ==
+== 4. Training ==

-The training data used by haspirater.py is loaded at runtime from the
-haspirater.json file which has been trained from French texts taken from
-Project Gutenberg <www.gutenberg.org>, from the list in the Wikipedia
-article <http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the
-categories in the French Wiktionary
-<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
-<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré> and from a custom
-set of exceptions. If you want to create your own data, or adapt the
-approach here to other linguistic features, read on.
+The training data used by haspirater/haspirater.py is loaded at runtime from the
+haspirater/haspirater.json file which has been trained from French texts taken
+from Project Gutenberg <www.gutenberg.org>, from the list in the Wikipedia
+article <http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the categories in the
+French Wiktionary <http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
+<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré> and from a custom set
+of exceptions. If you want to create your own data, or adapt the approach here
+to other linguistic features, read on.

 The master script is make.sh which accepts French text on stdin and a
 list of exceptions files as arguments. Included exception files are
 additions and wikipedia. These exceptions are just like training data
 and are not stored as-is; they are just piped later on in the training
 phase. make.sh produces on stdout the json trie. Thus, you would run
-something like:
+something like the following, where corpus is your corpus:
+
+  $ cat corpus | ./make.sh additions wikipedia > haspirater/haspirater.json

-  $ cat corpus | ./make.sh exceptions > haspirater/haspirater.json
+The resulting JSON file would reflect both what can be identified in the
+corpus and what is given in the exception files.

-== 4. Training details ==
+== 5. Training details ==

-=== 4.1. Corpus preparation (prepare.sh) ===
+=== 5.1. Corpus preparation (prepare.sh) ===

 This script removes useless characters, and separates words (one per
 line).

-=== 4.2. Property inference (detect.pl) ===
+=== 5.2. Property inference (detect.pl) ===

 This script examines the output, notices occurrences of words for which
 the preceding word indicates the aspirated or non-aspirated status, and
@@ -81,30 +82,28 @@ outputs them.
 The format used for the output of this phase is the same as that of the
 exceptions file.

-=== 4.3. Removing leading 'h' ===
+=== 5.3. Removing leading 'h' ===

 This is a quick optimization.

-=== 4.4. Trie construction (buildtrie.py) ===
+=== 5.4. Trie construction (buildtrie.py) ===

 The occurrences are read one after the other and are used to populate a
 trie carrying the value count for each occurrence having a given prefix.

-=== 4.5. Trie compression (compresstrie.py) ===
+=== 5.5. Trie compression (compresstrie.py) ===

 The trie is then compressed by removing branches which are not needed to
-infer a value. This step could be followed by a removal of branches with
-very little dissent from the majority value if we wanted to reduce the
-trie size at the expense of accuracy: for aspirated h, this isn't
-needed.
+infer a value, because the only possible value is already determined at that
+stage.

-=== 4.6. Trie majority relabeling (majoritytrie.py) ===
+=== 5.6. Trie majority relabeling (majoritytrie.py) ===

 Instead of the list of values with their counts, nodes are relabeled to
 carry the most common value. This step could be skipped to keep
 confidence values. We also drop useless leaf nodes there.

-== 5. Additional stuff ==
+== 6. Additional stuff ==

 You can use trie2dot.py to convert the output of buildtrie.py or
 compresstrie.py in the dot format which can be used to render a drawing