frhyme

guess the last phonemes of a French word
git clone https://a3nm.net/git/frhyme/

commit b686dd48ff2340e861aebd921be254fe1fdb7a04
parent a385809da4461daa2c93d244eb106391a4ed2c03
Author: Antoine Amarilli <a3nm@a3nm.net>
Date:   Wed, 10 Aug 2011 15:38:48 -0400

add Lexique import code

Diffstat:
README                      | 18 +++++++++++-------
lexique/lexique_prepare.sh  |  8 ++++++++
lexique/lexique_retrieve.sh | 12 ++++++++++++
lexique/subst.pl            | 36 ++++++++++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/README b/README
@@ -1,4 +1,4 @@
-haspirater -- a toolkit to detect initial aspirated 'h' in French words
+frhyme -- a toolkit to guess the last phonemes of a French word
 Copyright (C) 2011 by Antoine Amarilli
 
 == 0. Licence ==
@@ -24,15 +24,15 @@ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 
 == 1. Features ==
 
-haspirater is a tool to detect if a French word starts with an aspirated
-'h' or not. It is not based on a list of words but on a trie trained
-from a corpus, which ensures that it should do a reasonable job for
-unseen words which are similar to known ones, without carrying a big
-exceptions list. The json trie used is less than 10 Kio, and the lookup
-script is 40 lines of Python.
+frhyme is a tool to guess what the last phonemes of a French word are.
+It is trained on a list of words with associated pronunciation, and will
+infer a few likely possibilities for unseen words using known words with
+the longest common prefix, using a trie for internal representation.
+TODO
 
 == 2. Usage ==
 
+To avoid licensing headaches, no training data is included.
 If you just want to use the included training data, you can either run
 haspirater.py, giving one word per line in stdin and getting the
 annotation on stout, or you can import it in a Python file and call
@@ -64,6 +64,10 @@ something like:
 
 $ cat corpus | ./make.sh exceptions > haspirater.json
 
+
+
+
+
 == 4. Training details ==
 
 === 4.1. Corpus preparation (prepare.sh) ===

diff --git a/lexique/lexique_prepare.sh b/lexique/lexique_prepare.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+# Prepare the Lexique file for use with frhyme
+
+cd "$( dirname "$0" )"
+
+awk '{print $1, $2}' | iconv -f latin1 -t utf8 | ./subst.pl
+

diff --git a/lexique/lexique_retrieve.sh b/lexique/lexique_retrieve.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+
+ZIP="Lexique371b.zip"
+URL="http://www.lexique.org/public/$ZIP"
+FILE="Lexique371/Bases+Scripts/Lexique3.txt"
+
+cd "$( dirname "$0" )"
+
+wget $URL
+unzip $ZIP $FILE
+cat $FILE | ./prepare_lexique.sh
+

diff --git a/lexique/subst.pl b/lexique/subst.pl
@@ -0,0 +1,36 @@
+#!/usr/bin/perl
+
+# This file fixes Lexique's pronunciation info from the home-grown
+# format described in
+# http://www.lexique.org/outils/Manuel_Lexique.htm#_Toc108519023 to the
+# X-SAMPA standard
+
+
+sub subst {
+  my $a = shift;
+  # substitutions to apply
+  my @s = (
+    ["§", "O~"],
+    ["@", "A~"],
+    ["1", "E~"],
+    ["5", "9~"],
+    ["°", "@"],
+    ["3", "@"],
+    ["H", "8"],
+    ["N", "J"],
+    ["G", "N"],
+  );
+  foreach my $t (@s) {
+    $a =~ s/${$t}[0]/${$t}[1]/g
+  }
+  return $a;
+}
+
+while (<>) {
+  chop;
+  if (/^(.*) ([^ ]*)$/) {
+    my $repl = subst $2;
+    print "$1 $repl\n";
+  }
+}
+
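
For readers more at home in Python (the language of the lookup script mentioned in the README), the substitution step performed by subst.pl can be sketched as below. This is an illustrative sketch, not part of the commit: the names LEXIQUE_TO_XSAMPA, to_xsampa and convert_line are hypothetical, and the table simply mirrors the one in subst.pl as committed. It uses a single-pass str.translate instead of the chained s///g substitutions; for this particular table the two approaches should give the same result, because no replacement text contains a symbol that a later rule would rewrite.

```python
# Hypothetical Python counterpart of lexique/subst.pl (not part of the commit).
# The table mirrors the substitutions listed in subst.pl as committed.
LEXIQUE_TO_XSAMPA = str.maketrans({
    "§": "O~",
    "@": "A~",
    "1": "E~",
    "5": "9~",
    "°": "@",
    "3": "@",
    "H": "8",
    "N": "J",
    "G": "N",
})


def to_xsampa(pron):
    """Rewrite a Lexique pronunciation string using the table above."""
    # str.translate maps each input character once, so the output of one
    # rule (e.g. "°" -> "@") is never rewritten by another ("@" -> "A~").
    return pron.translate(LEXIQUE_TO_XSAMPA)


def convert_line(line):
    """Convert one 'word pronunciation' line; return None if it has no space.

    Like the Perl regex /^(.*) ([^ ]*)$/, this splits on the last space,
    so only the final field is treated as the pronunciation.
    """
    word, sep, pron = line.rstrip("\n").rpartition(" ")
    if not sep:
        return None
    return f"{word} {to_xsampa(pron)}"


if __name__ == "__main__":
    import sys

    # Same filter behaviour as subst.pl: read lines on stdin, print the
    # converted lines on stdout, silently skip lines without a space.
    for raw in sys.stdin:
        converted = convert_line(raw)
        if converted is not None:
            print(converted)
```

Used as a filter, this would take the place of ./subst.pl at the end of the pipeline in lexique_prepare.sh, assuming the input has already been converted to UTF-8 by iconv.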