split natural language text in chunks at reasonable language boundaries

commit 31e2c0eac393b0799afc37cd69ad614d3ad2c314
parent 13ceb6bcc1c89f533d9ab3ccb19de9a060870e2a
Author: Antoine Amarilli <>
Date:   Mon, 10 Oct 2011 00:09:47 +0200

write doc

README | 90+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+), 0 deletions(-)

diff --git a/README b/README
@@ -0,0 +1,90 @@
+nlsplit -- a small tool to split natural language text in natural chunks
+Copyright (C) 2011 by Antoine Amarilli
+
+== 0. Licence ==
+
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+== 1. Features ==
+
+nlsplit is a tool to split natural language text in chunks at reasonable
+language boundaries. The program takes as argument a maximal size for
+chunks, reads stdin and produces chunks smaller than the maximal size.
+
+The general NLP problem of text splitting is AI-hard, and optimizing to
+keep chunks close to the specified size would be NP-complete. The text
+splitting done here is designed to be fast and sensible in most cases,
+using a few hardcoded heuristics to produce splitting possibilities with
+an estimated confidence. The chunking strategy is greedy: we only
+guarantee that no chunk exceeds the maximal size and that no individual
+chunk could be extended to reach a further splitting point with a higher
+confidence.
+
+== 2. Usage ==
+
+$ nlsplit SIZE [MINCONFIDENCE]
+
+nlsplit reads text on stdin and produces a sequence of chunks on stdout.
+Chunks are introduced by the following header:
+
+"-- chunk %d length %d confidence %f\n"
+
+The first %d is the number of the chunk (incrementing, starting at 0).
+The second %d is the number of bytes N in the chunk, with N <= SIZE. The
+%f is a confidence value for the chunk, indicating the confidence of the
+split performed at the end of the chunk (the confidence of the last
+chunk is therefore very high).
+
+The header is followed by N bytes and terminated by a newline.
+
+To illustrate this format and test the program, a sample program
+nlsplit_read is provided. nlsplit_read SIZE reads a list of chunks on
+stdout produced by an invocation of nlsplit SIZE, and produces the
+concatenation of the contents of the chunks on stdout. This
+concatenation should always assemble back to the original data on stdin,
+even in the case of arbitrary binary data.
+
+If MINCONFIDENCE is provided, nlsplit will systematically split at any
+position with confidence over MINCONFIDENCE. This can be useful if you
+want to regroup chunks, rescore splits, or perform other such
+operations.
+
+== 3. Performance ==
+
+For an input of size N, for SIZE the maximal size of fragments, nlsplit
+runs in time O(N log(SIZE)): it reads stdin and produces splits on the
+fly, with O(log(SIZE)) worst-case running time at each character. The
+memory usage is O(SIZE).
+
+== 4. Limitations ==
+
+nlsplit's heuristics are not bulletproof. They can be fooled and perform
+bad splits, or miss good ones.
+
+nlsplit can produce arbitrarily small chunks and will do nothing to
+avoid that. It's up to you to regroup chunks if you don't find this
+acceptable.
+
+nlsplit is not Unicode-aware. It will not perform splits according to
+extended characters, and could theoretically split an extended
+character. However, as long as you are using ASCII whitespace regularly
+enough, these splits should be favoured and that bad situation should
+not happen.
+
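The chunk format described in section 2 of the README (a header line,
then N raw bytes, then a newline) can be read back with a short script.
The following is only a sketch of such a reader, not the nlsplit_read
program shipped with the tool: the names read_chunks and reassemble are
hypothetical, and the pattern used for the %f field is an assumption
about how the confidence is printed. Because the payload is length
prefixed, it may itself contain newlines or text that looks like a
header.

```python
import io
import re

# Header layout taken from the README: "-- chunk %d length %d confidence %f\n".
# Matching the %f field with \S+ is an assumption about its printed form.
HEADER = re.compile(rb"-- chunk (\d+) length (\d+) confidence (\S+)\n")


def read_chunks(stream):
    """Yield (index, confidence, payload) for each chunk in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of input
        match = HEADER.fullmatch(line)
        if match is None:
            raise ValueError(f"malformed chunk header: {line!r}")
        index = int(match.group(1))
        length = int(match.group(2))
        confidence = float(match.group(3).decode("ascii"))
        # Read exactly `length` bytes; the payload is arbitrary binary data.
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated chunk payload")
        # Each payload is terminated by a single newline.
        if stream.read(1) != b"\n":
            raise ValueError("missing newline after chunk payload")
        yield index, confidence, payload


def reassemble(stream):
    """Concatenate all chunk payloads back into the original data."""
    return b"".join(payload for _, _, payload in read_chunks(stream))


# In-memory demonstration on made-up data (not real nlsplit output).
sample = (
    b"-- chunk 0 length 6 confidence 0.500000\nHello \n"
    b"-- chunk 1 length 7 confidence 1.000000\nwor\nld!\n"
)
restored = reassemble(io.BytesIO(sample))
```

To use this on real output, one would pass a binary stream such as
sys.stdin.buffer to reassemble and write the result to
sys.stdout.buffer. Reading by length rather than by line is what lets
the round trip work on arbitrary binary data.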