nlsplit -- a small tool to split natural language text into natural chunks
Copyright (C) 2011 by Antoine Amarilli

== 0. License (MIT license) ==

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

== 1. Features ==

nlsplit is a tool to split natural language text into chunks at
reasonable language boundaries. The program takes as argument a maximal
size for chunks, reads stdin and produces chunks no larger than the
maximal size on stdout.

The general NLP problem of text splitting is AI-hard, and optimizing to
keep chunks close to the specified size would be NP-complete. The text
splitting done here is designed to be fast and sensible in most cases,
using a few hardcoded heuristics to produce splitting possibilities with
an estimated confidence. The chunking strategy is greedy: we only
guarantee that no chunk exceeds the maximal size and that no individual
chunk could be extended to reach a further splitting point with a higher
confidence.

== 2. Usage ==

  $ nlsplit SIZE [MINCONFIDENCE]

nlsplit reads text on stdin and produces a sequence of chunks on stdout.
Chunks are introduced by the following header:

  "-- chunk %d length %d confidence %f\n"

The first %d is the number of the chunk (incrementing, starting at 0).
The second %d is the number of bytes N in the chunk, with N <= SIZE. The
%f is a confidence value for the chunk, indicating the confidence of the
split performed at the end of the chunk (the confidence of the last
chunk is therefore very high). The header is followed by the N bytes of
the chunk, then by a newline.

To illustrate this format and test the program, a sample program
nlsplit_read is provided. nlsplit_read SIZE reads on stdin a list of
chunks produced by an invocation of nlsplit SIZE, and produces the
concatenation of the contents of the chunks on stdout. This
concatenation should always assemble back to the original data given to
nlsplit on stdin, even in the case of arbitrary binary data.

If MINCONFIDENCE is provided, nlsplit will systematically split at any
position with confidence over MINCONFIDENCE. This can be useful if you
want to regroup chunks, rescore splits, or perform other such
operations.

== 3. Performance ==

For an input of size N and a maximal chunk size SIZE, nlsplit runs in
time O(N log(SIZE)): it reads stdin and produces splits on the fly, with
O(log(SIZE)) worst-case running time at each character. The memory usage
is O(SIZE).

== 4. Limitations ==

nlsplit assumes that newlines are encoded with LF, not CR, CR+LF or
something else. You will have to perform the conversion using another
tool if this can be an issue.

nlsplit's heuristics are not bulletproof. They can be fooled and perform
bad splits, or miss good ones.

nlsplit can produce arbitrarily small chunks and will do nothing to
avoid that. It's up to you to regroup chunks if you don't find this
acceptable.

nlsplit is not Unicode-aware. It will not take extended characters into
account when performing splits, and could theoretically split an
extended character. However, as long as you are using ASCII whitespace
regularly enough, these splits should not be favoured and that bad
situation should not happen.

nlsplit keeps whitespace at the beginning or at the end of chunks to
avoid losing any information. Depending on your application, you might
prefer to trim it.
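
The chunk format described in section 2, and the regrouping of small
chunks mentioned above, can be sketched in Python as follows. This is
only an illustration, not the code of nlsplit or nlsplit_read: the names
read_chunks and regroup are hypothetical, the header is matched with a
regular expression derived from the format string given in section 2,
and the regrouping simply merges neighbouring chunks greedily without
looking at their confidence.

```python
import re
from typing import BinaryIO, Iterable, Iterator, List, Tuple

# Mirrors the header "-- chunk %d length %d confidence %f\n" from section 2.
# The confidence field is matched loosely (any non-space run) because the
# exact output of %f for extreme values is an assumption here.
HEADER_RE = re.compile(rb"-- chunk (\d+) length (\d+) confidence (\S+)\n")

Chunk = Tuple[int, float, bytes]  # (chunk number, confidence, contents)


def read_chunks(stream: BinaryIO) -> Iterator[Chunk]:
    """Yield (number, confidence, data) for each chunk in a binary stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # clean end of input
        match = HEADER_RE.fullmatch(header)
        if match is None:
            raise ValueError("malformed chunk header: %r" % header)
        number = int(match.group(1).decode("ascii"))
        length = int(match.group(2).decode("ascii"))
        confidence = float(match.group(3).decode("ascii"))
        # Read exactly `length` bytes: the contents may themselves contain
        # newlines or arbitrary binary data, so they cannot be read by line.
        data = stream.read(length)
        if len(data) != length:
            raise ValueError("truncated chunk %d" % number)
        if stream.read(1) != b"\n":
            raise ValueError("missing newline after chunk %d" % number)
        yield number, confidence, data


def regroup(chunks: Iterable[Chunk], max_size: int) -> List[bytes]:
    """Greedily merge adjacent chunks while staying within max_size bytes."""
    merged: List[bytes] = []
    current = b""
    for _number, _confidence, data in chunks:
        if current and len(current) + len(data) > max_size:
            merged.append(current)
            current = b""
        current += data
    if current:
        merged.append(current)
    return merged
```

Joining the contents yielded by read_chunks(sys.stdin.buffer) should
give back the data originally passed to nlsplit, which is the property
stated for nlsplit_read in section 2. Whitespace is left untouched; an
application that prefers trimmed chunks can strip each merged chunk
afterwards.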