frhyme

guess the last phonemes of a French word
git clone https://a3nm.net/git/frhyme/

commit 43892774f45ccf108f0f0b9409c4db4ebe20ec67
parent 87c740896784f74f58495a0dff3de29e576a1dc2
Author: Antoine Amarilli <a3nm@a3nm.net>
Date:   Tue, 13 Mar 2012 14:38:01 +0100

continue README

Diffstat:
README | 91+++++++++++++++++--------------------------------------------------------------
1 file changed, 19 insertions(+), 72 deletions(-)

diff --git a/README b/README
@@ -1,5 +1,5 @@
 frhyme -- a toolkit to guess the last phonemes of a French word
-Copyright (C) 2011 by Antoine Amarilli
+Copyright (C) 2011-2012 by Antoine Amarilli
 
 == 0. Licence ==
 
@@ -34,81 +34,28 @@
 the longest common prefix, using a trie for internal representation.
 
 To avoid licensing headaches, and because the data file is quite big, no
 pronunciation data is included, you have to generate it yourself. See
 section 3.
 
-Once you have pronunciation data ready in
-If you just want to use the included training data, you can either run
-haspirater.py, giving one word per line in stdin and getting the
-annotation on stout, or you can import it in a Python file and call
-haspirater.lookup(word) which returns True if the leading 'h' is
-aspirated, False if it isn't, and raises ValueError if there is no
-leading 'h'.
+Once you have pronunciation data ready in frhyme.json, you can either run
+frhyme.py NBEST, giving one word per line in stdin and getting the NBEST top
+pronunciations on stdout, or you can import it in a Python file and call
+frhyme.lookup(word, NBEST) which returns the NBEST top pronunciations.
 
-Please report any errors in the training data, keeping in mind than only
-one possibility is returned even when both are attested.
+The pronunciations returned are annotated with a confidence score (the number of
+occurrences in the training data). They should be sensible up to the longest
+prefix occuring in the training data, but may be prefixed by garbage.
 
 == 3. Training ==
 
-The training data used by haspirater.py is loaded at runtime from the
-haspirater.json file which has been trained from French texts taken from
-Project Gutenberg <www.gutenberg.org>, from the list in the Wikipedia
-article <http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the
-categories in the French Wiktionary
-<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
-<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré> and from a custom
-set of exceptions. If you want to create your own data, or adapt the
-approach here to other linguistic features, read on.
+The data used by frhyme.py is loaded at runtime from the haspirater.json file
+which should be trained from a pronunciation database. The recommended way to do
+so is to use a tweaked Lexique <http://lexique.org> along with a provided bugfix
+file, as follows:
 
-The master script is make.sh which accepts French text on stdin and a
-list of exceptions files as arguments. Included exception files are
-additions and wikipedia. These exceptions are just like training data
-and are not stored as-is; they are just piped later on in the training
-phase. make.sh produces on stdout the json trie. Thus, you would run
-something like:
+  lexique/lexique_retrieve.sh > lexique
+  ./make.sh NPHON lexique additions > frhyme.json
 
-  $ cat corpus | ./make.sh exceptions > haspirater.json
-
-
-
-
-
-== 4. Training details ==
-
-=== 4.1. Corpus preparation (prepare.sh) ===
-
-This script removes useless characters, and separates words (one per
-line).
-
-=== 4.2. Property inference (detect.pl) ===
-
-This script examines the output, notices occurrences of words for which
-the preceding word indicates the aspirated or non-aspirated status, and
-outputs them.
-
-=== 4.3. Removing leading 'h' ===
-
-This is a quick optimization.
-
-=== 4.4. Trie construction (buildtrie.py) ===
-
-The occurrences are read one after the other and are used to populate a
-trie carrying the value count for each occurrence having a given prefix.
-
-=== 4.5. Trie compression (compresstrie.py) ===
-
-The trie is then compressed by removing branches which are not needed to
-infer a value. This step could be followed by a removal of branches with
-very little dissent from the majority value if we wanted to reduce the
-trie size at the expense of accuracy: for aspirated h, this isn't
-needed.
-
-=== 4.5. Trie majority relabeling (majoritytrie.py) ===
-
-Instead of the list of values with their counts, nodes are relabeled to
-carry the most common value. This step could be skipped to keep
-confidence values.
-
-== 5. Additionnal stuff ==
-
-You can use trie2dot.py to convert the output of buildtrie.py or
-compresstrie.py in the dot format which can be used to render a drawing
-of the trie. The result of such a drawing is given as aspirated_h.pdf
+where NPHON is the number of trailing phonemes to keep. Beware, this may take up
+several hundred megabytes of RAM. The resulting file should be accurate on the
+French words of Lexique, and will return pronunciations in a variant of X-SAMPA
+which ensures that each phoneme is mapped to exactly one ASCII character: the
+substitutions are "A~" => "#", "O~" => "$", "E~" => ")", "9~" => "(".
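
For reference, the lookup interface described in the new README text can be
exercised as follows. This is only a sketch: frhyme.lookup(word, NBEST) and the
one-word-per-line stdin interface of frhyme.py NBEST are taken from the README
above, but the exact structure of the returned values is assumed rather than
specified by this commit.

  import frhyme

  # NBEST pronunciation guesses for the ending of a word; each guess is
  # annotated with a confidence score (its number of occurrences in the
  # training data).  The exact structure of each returned item is an
  # assumption, so we just print it as-is.
  for guess in frhyme.lookup("bateau", 3):
      print(guess)

The command-line equivalent, per the README, is to feed one word per line to
frhyme.py, e.g. something like: echo bateau | ./frhyme.py 3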
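
The one-character X-SAMPA variant mentioned at the end of the diff can be mapped
back to standard X-SAMPA with the four documented substitutions. The helper
below is hypothetical and not part of frhyme; it only illustrates the encoding.

  # The README maps each nasal vowel to a single ASCII character:
  # "A~" => "#", "O~" => "$", "E~" => ")", "9~" => "(".
  # This is the reverse mapping, back to standard X-SAMPA.
  DECODE = {"#": "A~", "$": "O~", ")": "E~", "(": "9~"}

  def decode_xsampa(encoded):
      """Expand each one-character phoneme back to standard X-SAMPA."""
      return [DECODE.get(c, c) for c in encoded]

  print(decode_xsampa("b$"))  # ['b', 'O~'], i.e. the X-SAMPA ending of "bon"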