commit 43892774f45ccf108f0f0b9409c4db4ebe20ec67
parent 87c740896784f74f58495a0dff3de29e576a1dc2
Author: Antoine Amarilli <a3nm@a3nm.net>
Date: Tue, 13 Mar 2012 14:38:01 +0100
continue README
Diffstat:
README | 91 +++++++++++++++++--------------------------------------------------------------
1 file changed, 19 insertions(+), 72 deletions(-)
diff --git a/README b/README
@@ -1,5 +1,5 @@
frhyme -- a toolkit to guess the last phonemes of a French word
-Copyright (C) 2011 by Antoine Amarilli
+Copyright (C) 2011-2012 by Antoine Amarilli
== 0. Licence ==
@@ -34,81 +34,28 @@ the longest common prefix, using a trie for internal representation.
To avoid licensing headaches, and because the data file is quite big, no
pronunciation data is included; you have to generate it yourself. See section 3.
-Once you have pronunciation data ready in
-If you just want to use the included training data, you can either run
-haspirater.py, giving one word per line in stdin and getting the
-annotation on stout, or you can import it in a Python file and call
-haspirater.lookup(word) which returns True if the leading 'h' is
-aspirated, False if it isn't, and raises ValueError if there is no
-leading 'h'.
+Once you have pronunciation data ready in frhyme.json, you can either run
+frhyme.py NBEST, giving one word per line on stdin and getting the NBEST top
+pronunciations on stdout, or you can import it in a Python file and call
+frhyme.lookup(word, NBEST), which returns the NBEST top pronunciations.
-Please report any errors in the training data, keeping in mind than only
-one possibility is returned even when both are attested.
+The pronunciations returned are annotated with a confidence score (the number
+of occurrences in the training data). They should be sensible up to the longest
+prefix occurring in the training data, but may be prefixed by garbage.
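+
+For instance, a minimal lookup session could look as follows (a sketch: only
+frhyme.lookup(word, NBEST) is described above; the exact shape of the return
+value, assumed here to pair each pronunciation with its occurrence count, is
+illustrative):
+
+  import frhyme
+
+  # Ask for the 5 most likely pronunciations of "chanson"; each result is
+  # assumed to be a (pronunciation, occurrence count) pair.
+  for pron, count in frhyme.lookup("chanson", 5):
+      print(pron, count)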
== 3. Training ==
-The training data used by haspirater.py is loaded at runtime from the
-haspirater.json file which has been trained from French texts taken from
-Project Gutenberg <www.gutenberg.org>, from the list in the Wikipedia
-article <http://fr.wikipedia.org/wiki/H_aspir%C3%A9>, from the
-categories in the French Wiktionary
-<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_muet> and
-<http://fr.wiktionary.org/wiki/Catégorie:Mots_à_h_aspiré> and from a custom
-set of exceptions. If you want to create your own data, or adapt the
-approach here to other linguistic features, read on.
+The data used by frhyme.py is loaded at runtime from the frhyme.json file,
+which should be trained from a pronunciation database. The recommended way to
+do so is to use a tweaked Lexique <http://lexique.org> along with a provided
+bugfix file, as follows:
-The master script is make.sh which accepts French text on stdin and a
-list of exceptions files as arguments. Included exception files are
-additions and wikipedia. These exceptions are just like training data
-and are not stored as-is; they are just piped later on in the training
-phase. make.sh produces on stdout the json trie. Thus, you would run
-something like:
+ lexique/lexique_retrieve.sh > lexique
+ ./make.sh NPHON lexique additions > frhyme.json
- $ cat corpus | ./make.sh exceptions > haspirater.json
-
-
-
-
-
-== 4. Training details ==
-
-=== 4.1. Corpus preparation (prepare.sh) ===
-
-This script removes useless characters, and separates words (one per
-line).
-
-=== 4.2. Property inference (detect.pl) ===
-
-This script examines the output, notices occurrences of words for which
-the preceding word indicates the aspirated or non-aspirated status, and
-outputs them.
-
-=== 4.3. Removing leading 'h' ===
-
-This is a quick optimization.
-
-=== 4.4. Trie construction (buildtrie.py) ===
-
-The occurrences are read one after the other and are used to populate a
-trie carrying the value count for each occurrence having a given prefix.
-
-=== 4.5. Trie compression (compresstrie.py) ===
-
-The trie is then compressed by removing branches which are not needed to
-infer a value. This step could be followed by a removal of branches with
-very little dissent from the majority value if we wanted to reduce the
-trie size at the expense of accuracy: for aspirated h, this isn't
-needed.
-
-=== 4.5. Trie majority relabeling (majoritytrie.py) ===
-
-Instead of the list of values with their counts, nodes are relabeled to
-carry the most common value. This step could be skipped to keep
-confidence values.
-
-== 5. Additionnal stuff ==
-
-You can use trie2dot.py to convert the output of buildtrie.py or
-compresstrie.py in the dot format which can be used to render a drawing
-of the trie. The result of such a drawing is given as aspirated_h.pdf
+where NPHON is the number of trailing phonemes to keep. Beware: this may take
+several hundred megabytes of RAM. The resulting file should be accurate on the
+French words of Lexique, and lookups will return pronunciations in a variant
+of X-SAMPA which ensures that each phoneme is mapped to exactly one ASCII
+character: the substitutions are "A~" => "#", "O~" => "$", "E~" => ")",
+"9~" => "(".
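+
+For instance, to map the one-character variant back to standard X-SAMPA, a
+minimal sketch (not part of frhyme; it simply reverses the four substitutions
+listed above) could be:
+
+  # Reverse of the substitutions above: one ASCII character per nasal vowel.
+  TO_XSAMPA = {"#": "A~", "$": "O~", ")": "E~", "(": "9~"}
+
+  def to_xsampa(pron):
+      # Expand each substituted character; all other phonemes pass through.
+      return "".join(TO_XSAMPA.get(c, c) for c in pron)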