Author: Antoine Amarilli <firstname.lastname@example.org>
Date: Mon, 10 Oct 2011 00:09:47 +0200
 README | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+), 0 deletions(-)
diff --git a/README b/README
@@ -0,0 +1,90 @@
+nlsplit -- a small tool to split natural language text into natural chunks
+Copyright (C) 2011 by Antoine Amarilli
+== 0. Licence ==
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+== 1. Features ==
+nlsplit is a tool to split natural language text into chunks at
+reasonable language boundaries. The program takes as argument a maximal
+size for chunks, reads stdin and produces chunks no larger than the
+maximal size.
+The general NLP problem of text splitting is AI-hard, and optimizing to
+keep chunks close to the specified size would be NP-complete. The text
+splitting done here is designed to be fast and sensible in most cases,
+using a few hardcoded heuristics to produce splitting possibilities with
+an estimated confidence. The chunking strategy is greedy: we only
+guarantee that no chunk exceeds the maximal size and that no individual
+chunk could be extended to reach a further splitting point with a higher
+confidence.
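The greedy strategy described above can be sketched as follows. This is a minimal illustration, not nlsplit's actual code: the scoring function `split_confidence` and its values (0.9, 0.7, 0.3) are assumptions made up for the example, as is the choice to break ties in favour of the furthest split point.

```python
def split_confidence(data: bytes, end: int) -> float:
    """Hypothetical score for cutting right before offset `end` (not nlsplit's heuristics)."""
    prev = data[end - 1:end]
    before = data[max(end - 2, 0):end - 1]
    if prev == b"\n":
        return 0.9  # cut after a newline
    if prev == b" " and before in (b".", b"!", b"?"):
        return 0.7  # cut after end-of-sentence punctuation and a space
    if prev == b" ":
        return 0.3  # cut after a plain space
    return 0.0      # anywhere else: last-resort hard cut


def greedy_chunks(data: bytes, size: int):
    """Return (chunk, confidence) pairs; every chunk has at most `size` bytes."""
    chunks = []
    start = 0
    while start < len(data):
        limit = min(start + size, len(data))
        if limit == len(data):
            # The remainder fits: end of input is treated as the best split.
            chunks.append((data[start:], 1.0))
            break
        best_end, best_conf = start, -1.0
        for end in range(start + 1, limit + 1):
            conf = split_confidence(data, end)
            if conf >= best_conf:  # ">=" keeps the furthest of the best splits
                best_end, best_conf = end, conf
        chunks.append((data[start:best_end], best_conf))
        start = best_end
    return chunks
```

With these made-up scores, splitting `b"Hello world. This is a test.\nBye now"` with a size of 16 cuts after the first sentence, then after the newline, then at the end of input. This naive version rescans each window; it does not attempt to match the running time stated in the Performance section.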
+== 2. Usage ==
+$ nlsplit SIZE [MINCONFIDENCE]
+nlsplit reads text on stdin and produces a sequence of chunks on stdout.
+Chunks are introduced by the following header:
+"-- chunk %d length %d confidence %f\n"
+The first %d is the number of the chunk (incrementing, starting at 0).
+The second %d is the number of bytes N in the chunk, with N <= SIZE. The
+%f is a confidence value for the chunk, indicating the confidence of the
+split performed at the end of the chunk (the confidence of the last
+chunk is therefore very high).
+The header is followed by the N bytes of the chunk, then by a newline.
+To illustrate this format and test the program, a sample program
+nlsplit_read is provided. nlsplit_read SIZE reads on stdin a list of
+chunks produced by an invocation of nlsplit SIZE, and writes the
+concatenation of the contents of the chunks to stdout. This
+concatenation should always reproduce the data originally fed to
+nlsplit on stdin, even in the case of arbitrary binary data.
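For readers who want to consume the chunk format from another language, here is a rough Python sketch of a reader, based only on the header format documented above. It is not the nlsplit_read program shipped with the tool; the names `read_chunks` and `write_chunks` are hypothetical, and `write_chunks` exists only to build test input for the round trip.

```python
import re

# Header as documented: "-- chunk %d length %d confidence %f\n"
HEADER = re.compile(rb"-- chunk (\d+) length (\d+) confidence (\S+)\n")


def write_chunks(chunks):
    """Serialize (body, confidence) pairs in the documented format (test helper)."""
    out = b""
    for number, (body, conf) in enumerate(chunks):
        out += b"-- chunk %d length %d confidence %f\n" % (number, len(body), conf)
        out += body + b"\n"
    return out


def read_chunks(stream: bytes):
    """Parse a chunk stream into (number, confidence, body) tuples."""
    chunks = []
    pos = 0
    while pos < len(stream):
        m = HEADER.match(stream, pos)
        if m is None:
            raise ValueError(f"bad chunk header at offset {pos}")
        length = int(m.group(2))
        body_start = m.end()
        body_end = body_start + length
        # The body is read by length, so it may itself contain newlines
        # or text that looks like a header.
        if stream[body_end:body_end + 1] != b"\n":
            raise ValueError(f"missing newline after chunk at offset {body_end}")
        chunks.append((int(m.group(1)), float(m.group(3)), stream[body_start:body_end]))
        pos = body_end + 1
    return chunks
```

Concatenating the bodies returned by `read_chunks` should give back the original input. Because bodies are delimited by their declared length rather than by a separator, binary data and embedded newlines pass through unchanged.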
+If MINCONFIDENCE is provided, nlsplit will systematically split at any
+position with confidence over MINCONFIDENCE. This can be useful if you
+want to regroup chunks, rescore splits, or perform other such
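As an illustration of regrouping, the sketch below merges consecutive chunks back together as long as the merged chunk stays within a size limit. It is one possible post-processing step, not something nlsplit provides: the function name `regroup`, the `max_size` parameter and the policy (plain left-to-right merging, keeping the confidence of the last split in each group) are all assumptions.

```python
def regroup(chunks, max_size):
    """Merge consecutive (body, confidence) chunks up to max_size bytes each.

    The confidence of a merged chunk is that of its final split, since
    that is the only split that remains after merging.
    """
    merged = []
    current = b""
    current_conf = None
    for body, conf in chunks:
        # Close the current group if adding this chunk would overflow it.
        if current and len(current) + len(body) > max_size:
            merged.append((current, current_conf))
            current = b""
        current += body
        current_conf = conf
    if current:
        merged.append((current, current_conf))
    return merged
```

For example, small chunks obtained with a low MINCONFIDENCE could be read back, rescored if desired, and then passed to `regroup` with the final target size. A chunk that is already larger than `max_size` is passed through unchanged, since this sketch never splits.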
+== 3. Performance ==
+For an input of N bytes and a maximal chunk size of SIZE, nlsplit runs
+in time O(N log(SIZE)): it reads stdin and produces splits on the fly,
+with O(log(SIZE)) worst-case running time at each character. The memory
+usage is O(SIZE).
+== 4. Limitations ==
+nlsplit's heuristics are not bulletproof. They can be fooled and perform
+bad splits, or miss good ones.
+nlsplit can produce arbitrarily small chunks and will do nothing to
+avoid that. It's up to you to regroup chunks if you don't find this
+nlsplit is not Unicode-aware. It will not perform splits according to
+extended characters, and could theoretically split an extended
+character. However, as long as you are using ASCII whitespace regularly
+enough, these splits should be favoured and that bad situation should