nlsplit -- a small tool to split natural language text into natural chunks
Copyright (C) 2011 by Antoine Amarilli

== 0. License (MIT license) ==

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

== 1. Features ==

nlsplit is a tool to split natural language text into chunks at
reasonable language boundaries. The program takes a maximal chunk size
as argument, reads stdin, and writes chunks no larger than that size to
stdout.

The general NLP problem of text splitting is AI-hard, and optimizing to
keep chunks close to the specified size would be NP-complete. The text
splitting done here is designed to be fast and sensible in most cases,
using a few hardcoded heuristics to produce splitting possibilities with
an estimated confidence. The chunking strategy is greedy: we only
guarantee that no chunk exceeds the maximal size and that no individual
chunk could be extended to reach a further splitting point with a higher
confidence.
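
The two guarantees above can be illustrated with a minimal Python
sketch. It is not nlsplit's actual code: the function names and the
confidence values in split_confidence are hypothetical stand-ins for the
hardcoded heuristics, and this naive version rescans each window instead
of working on the fly.

```python
def split_confidence(data: bytes, pos: int) -> float:
    """Hypothetical score for splitting between data[pos-1] and data[pos]."""
    if pos >= len(data):
        return 1.0  # end of input
    before = data[max(pos - 2, 0):pos]  # up to two bytes before the split
    if before == b"\n\n":
        return 0.9  # blank line: paragraph break
    if before in (b". ", b"! ", b"? "):
        return 0.7  # sentence end followed by a space
    if before.endswith(b"\n"):
        return 0.6  # single line break
    if before.endswith(b" "):
        return 0.3  # plain word boundary
    return 0.0  # mid-word: only chosen when nothing better is in reach


def greedy_chunks(data: bytes, size: int):
    """Yield (chunk, confidence) pairs with len(chunk) <= size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    start = 0
    while start < len(data):
        limit = min(start + size, len(data))
        best_pos, best_conf = start + 1, -1.0
        for pos in range(start + 1, limit + 1):
            conf = split_confidence(data, pos)
            # ">=" keeps the furthest of equally good candidates, so no
            # later split point in reach has a strictly higher confidence.
            if conf >= best_conf:
                best_pos, best_conf = pos, conf
        yield data[start:best_pos], best_conf
        start = best_pos
```

For example, list(greedy_chunks(text, 20)) never returns a chunk longer
than 20 bytes, and joining the chunks gives back text unchanged.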

== 2. Usage ==

$ nlsplit SIZE [MINCONFIDENCE]

nlsplit reads text on stdin and produces a sequence of chunks on stdout.
Chunks are introduced by the following header:

"-- chunk %d length %d confidence %f\n"

The first %d is the number of the chunk (incrementing, starting at 0).
The second %d is the number of bytes N in the chunk, with N <= SIZE. The
%f is a confidence value for the chunk, indicating the confidence of the
split performed at the end of the chunk (the confidence of the last
chunk is therefore very high).

The header is followed by N bytes and terminated by a newline.
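
As an illustration of this format, here is a minimal Python sketch that
writes and reads chunks. The names format_chunk and parse_chunks are
hypothetical and not part of nlsplit. The reader relies on the length
field rather than on newlines, so the payload may contain any bytes.

```python
import io
import re

# Header layout taken from the format string above.
HEADER = re.compile(rb"-- chunk (\d+) length (\d+) confidence (\S+)\n")


def format_chunk(index: int, payload: bytes, confidence: float) -> bytes:
    """Serialize one chunk: header, payload bytes, terminating newline."""
    header = b"-- chunk %d length %d confidence %f\n" % (
        index, len(payload), confidence)
    return header + payload + b"\n"


def parse_chunks(stream):
    """Yield (index, confidence, payload) from a binary stream of chunks."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of stream
        match = HEADER.fullmatch(line)
        if match is None:
            raise ValueError("malformed chunk header: %r" % line)
        index = int(match.group(1).decode("ascii"))
        length = int(match.group(2).decode("ascii"))
        confidence = float(match.group(3).decode("ascii"))
        payload = stream.read(length)
        # The payload must be complete and followed by exactly one newline.
        if len(payload) != length or stream.read(1) != b"\n":
            raise ValueError("truncated chunk %d" % index)
        yield index, confidence, payload
```

Joining the payloads, as in
b"".join(p for _, _, p in parse_chunks(sys.stdin.buffer)), gives back
the data that was split.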

To illustrate this format and test the program, a sample program
nlsplit_read is provided. nlsplit_read SIZE reads on stdin a list of
chunks produced by an invocation of nlsplit SIZE, and writes the
concatenation of the chunk contents to stdout. This concatenation
should always be identical to the original input of nlsplit, even for
arbitrary binary data.

If MINCONFIDENCE is provided, nlsplit will systematically split at any
position with confidence over MINCONFIDENCE. This can be useful if you
want to regroup chunks, rescore splits, or perform other such
operations.

== 3. Performance ==

For an input of size N and a maximal chunk size SIZE, nlsplit runs in
time O(N log(SIZE)): it reads stdin and produces splits on the fly,
with O(log(SIZE)) worst-case running time per character. The memory
usage is O(SIZE).

== 4. Limitations ==

nlsplit assumes that newlines are encoded with LF, not CR, CR+LF or
something else. You will have to perform the conversion using another
tool if this can be an issue.
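
One possible conversion step, as a minimal Python sketch (the function
name normalize_newlines is hypothetical). Note that the chunks will then
reassemble to the converted text, not to the original bytes.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line endings to LF."""
    # CR+LF must be replaced first, or each pair would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```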

nlsplit's heuristics are not bulletproof. They can be fooled and perform
bad splits, or miss good ones.

nlsplit can produce arbitrarily small chunks and will do nothing to
avoid that. It's up to you to regroup chunks if you don't find this
acceptable.
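
Regrouping can be done in many ways. The following Python sketch shows
one simple policy, under the assumption that chunks have already been
read as (payload, confidence) pairs. The function name regroup and the
policy itself are illustrative and not part of nlsplit.

```python
def regroup(chunks, max_size):
    """Merge adjacent (payload, confidence) pairs up to max_size bytes."""
    merged = []
    for payload, confidence in chunks:
        if merged and len(merged[-1][0]) + len(payload) <= max_size:
            # The merged chunk now ends where this chunk ended, so it
            # takes over this chunk's split confidence.
            merged[-1] = (merged[-1][0] + payload, confidence)
        else:
            merged.append((payload, confidence))
    return merged
```

This policy ignores confidence values when deciding what to merge. A
more careful one could prefer to keep the splits with the highest
confidence.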

nlsplit is not Unicode-aware. It does not take extended (multibyte)
characters into account when performing splits, and could in theory
split in the middle of one. However, as long as ASCII whitespace occurs
regularly enough in the input, such splits should not be favoured and
this should not happen in practice.
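
If this matters for your application, you can check chunks after the
fact. The Python sketch below assumes the input is UTF-8, which nlsplit
itself does not require. The function names are hypothetical.

```python
def is_utf8_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


def starts_mid_character(chunk: bytes) -> bool:
    """True if the chunk begins inside a UTF-8 multibyte character."""
    return bool(chunk) and is_utf8_continuation(chunk[0])
```

A chunk for which starts_mid_character returns True means that the split
just before it cut a character in two.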

nlsplit keeps whitespace at the beginning or at the end of chunks to
avoid losing any information. Depending on your application, you might
prefer to trim it.