frhyme

guess the last phonemes of a French word (local mirror of https://gitlab.com/a3nm/frhyme)
git clone https://a3nm.net/git/frhyme/

commit ebe63afe36c813a2cba408a37a3a8d3570b865c3
parent 6e9af935a279923df039026980c92b979e26947a
Author: Antoine Amarilli <a3nm@a3nm.net>
Date:   Fri, 16 Aug 2019 00:30:32 +0200

go over README

Diffstat:
README | 93 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/README b/README
@@ -1,27 +1,16 @@
 frhyme -- a toolkit to guess the last phonemes of a French word
-Copyright (C) 2011-2019 by Antoine Amarilli
 Repository URL: https://gitlab.com/a3nm/frhyme
+Python package name: frhyme
 
-== 0. Licence ==
+== 0. Author and license ==
 
-Permission is hereby granted, free of charge, to any person obtaining a
-copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
+frhyme is copyright (C) 2011-2019 by Antoine Amarilli
 
-The above copyright notice and this permission notice shall be included
-in all copies or substantial portions of the Software.
+frhyme is free software, distributed under an MIT license: see the
+file LICENSE for details of the licensing terms.
 
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
-OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
-IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
-CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
-TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Many thanks to Julien Romero who maintains the PyPI package for
+frhyme.
 
 == 1. Features ==
 
@@ -30,40 +19,60 @@ It is trained on a list of words with associated pronunciation, and will infer
 a few likely possibilities for unseen words using known words with the longest
 common prefix, using a trie for internal representation.
 
-== 2. Usage ==
+== 2. Installation ==
 
-To avoid licensing headaches, and because the data file is quite big, no
-pronunciation data is included, you have to generate it yourself. See section 3.
+You need a working Python3 environment to run frhyme.
 
-Once you have pronunciation data ready in frhyme.json, you can either run
-frhyme.py [NBEST], giving one word per line in stdin and getting the NBEST top
-pronunciations on stdout (default is 5), or you can import it in a Python file
-and call frhyme.lookup(word, NBEST) which returns the NBEST top pronunciations
-(default is 5).
+You can install frhyme directly with pip by doing:
 
-The pronunciations returned are annotated with a confidence score (the number of
-occurrences in the training data). They should be sensible up to the longest
-prefix occurring in the training data, but may be prefixed by garbage.
+  pip3 install frhyme
 
-== 3. Training ==
+You can also manually clone the project repository and use frhyme
+directly from there, but you must then follow the instructions in
+Section 4 below to prepare the file frhyme.json for frhyme. (By
+contrast, if you install frhyme using pip, a file frhyme.json is
+provided, which has been trained using the Lexique database:
+http://www.lexique.org/.)
 
-First, make sure that you have a working python3 installation and that you have
-unzip (Debian packages: python3, unzip).
+== 3. Usage ==
 
-The data used by frhyme.py is loaded at runtime from the fryme.json file which
-should be trained from a pronunciation database. The recommended way to do so is
-to use a tweaked Lexique <http://lexique.org> along with a provided bugfix file,
-as follows:
+You can either run
+
+  frhyme.py [NBEST]
+
+giving one word per line in stdin and getting the NBEST top
+pronunciations on stdout (default is 5), or you can import frhyme in a
+Python program and call frhyme.lookup(word, NBEST) which returns the
+NBEST top pronunciations (default is 5).
+
+The pronunciations returned are annotated with a confidence score (the
+number of occurrences in the training data). They should be sensible up
+to the longest prefix of the input word that occurs in the training
+data, but they may be prefixed by garbage.
+
+== 4. Training ==
+
+If you have cloned this repository, you need to prepare the file
+frhyme.json.
+
+First, make sure that you have a working python3 installation and that
+you have unzip (Debian packages: python3, unzip).
+
+The data used by frhyme.py is loaded at runtime from the frhyme.json
+file which should be trained from a pronunciation database. The
+recommended way to do so is to use Lexique <http://lexique.org> with
+some tweaks and some modifications according to provided files. The way
+to do this is as follows:
 
   cd scripts
   lexique/lexique_retrieve.sh > lexique.txt
   ./make.sh NPHON lexique.txt additions > ../frhyme/frhyme.json
   cd ..
 
-where NPHON is the number of trailing phonemes to keep (suggested value: 4).
-Beware, this may take up several hundred megabytes of RAM. The resulting file
-should be accurate on the French words of Lexique, and will return
-pronunciations in a variant of X-SAMPA which ensures that each phoneme is mapped
-to exactly one ASCII character: the substitutions are "A~" => "#", "O~" => "$",
-"E~" => ")", "9~" => "(".
+where NPHON is the number of trailing phonemes to keep (suggested value:
+4). Beware, this may take up several hundred megabytes of RAM. The
+resulting file should be accurate on the French words of Lexique, and
+will return pronunciations in a variant of X-SAMPA which ensures that
+each phoneme is mapped to exactly one ASCII character: the substitutions
+are "A~" => "#", "O~" => "$", "E~" => ")", "9~" => "(".