nlsplit

split natural language text in chunks at reasonable language boundaries
git clone https://a3nm.net/git/nlsplit/

nlsplit -- a small tool to split natural language text into natural chunks
Copyright (C) 2011 by Antoine Amarilli

== 0. License (MIT license) ==

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

== 1. Features ==

nlsplit is a tool to split natural language text into chunks at
reasonable language boundaries. The program takes as argument a maximal
size for chunks, reads stdin and writes chunks no larger than the
maximal size to stdout.

The general NLP problem of text splitting is AI-hard, and optimizing to
keep chunks close to the specified size would be NP-complete. The text
splitting done here is designed to be fast and sensible in most cases,
using a few hardcoded heuristics to produce candidate split points, each
with an estimated confidence. The chunking strategy is greedy: we only
guarantee that no chunk exceeds the maximal size and that no individual
chunk could be extended to reach a further splitting point with a higher
confidence.
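
The greedy strategy described above can be sketched in Python as
follows. This is only an illustration and not nlsplit's actual code: the
HEURISTICS table, its confidence values and the function names are
assumptions made for the example, and the sketch does not try to match
the running time stated in the Performance section.

```python
import re

# Hypothetical heuristics: (pattern, confidence) pairs. A split is proposed
# right after each match. Patterns and values are illustrative only.
HEURISTICS = [
    (re.compile(rb"\n\n+"), 0.9),     # paragraph break
    (re.compile(rb"[.!?]\s"), 0.7),   # end of sentence
    (re.compile(rb"[,;:]\s"), 0.4),   # clause boundary
    (re.compile(rb"\s"), 0.2),        # any whitespace
]


def candidates(data):
    """Map each proposed split offset to its best confidence."""
    best = {}
    for pattern, conf in HEURISTICS:
        for m in pattern.finditer(data):
            best[m.end()] = max(conf, best.get(m.end(), 0.0))
    return best


def greedy_split(data, size):
    """Return (chunk, confidence) pairs with len(chunk) <= size (size >= 1)."""
    best = candidates(data)
    out, start = [], 0
    while len(data) - start > size:
        window = range(start + 1, start + size + 1)
        # Highest confidence wins; ties go to the furthest offset. Offsets
        # that no heuristic proposed count as 0.0 (a forced split).
        cut = max(window, key=lambda i: (best.get(i, 0.0), i))
        out.append((data[start:cut], best.get(cut, 0.0)))
        start = cut
    out.append((data[start:], 1.0))  # end of input is a certain boundary
    return out
```

Because each cut is the highest-confidence offset in its window, and the
furthest one among ties, no chunk could be extended to a later split
point with a higher confidence.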

== 2. Usage ==

$ nlsplit SIZE [MINCONFIDENCE]

nlsplit reads text on stdin and produces a sequence of chunks on stdout.
Chunks are introduced by the following header:

"-- chunk %d length %d confidence %f\n"

The first %d is the number of the chunk (incrementing, starting at 0).
The second %d is the number of bytes N in the chunk, with N <= SIZE. The
%f is a confidence value for the chunk, indicating the confidence of the
split performed at the end of the chunk (the confidence of the last
chunk is therefore very high).

The header is followed by N bytes and terminated by a newline.
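
As an illustration of this format, here is a minimal Python sketch of a
reader for nlsplit's output. It is not the bundled nlsplit_read program;
the names HEADER and read_chunks are made up for the example, and the
regular expression assumes that the header is printed exactly as shown
above.

```python
import io
import re

# Header line as documented: "-- chunk %d length %d confidence %f\n"
HEADER = re.compile(rb"-- chunk (\d+) length (\d+) confidence (\S+)\n")


def read_chunks(stream):
    """Yield (index, confidence, payload) for each chunk in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of input
        m = HEADER.fullmatch(line)
        if m is None:
            raise ValueError("bad chunk header: %r" % line)
        index, length = int(m.group(1)), int(m.group(2))
        # The payload is read by length, so it may itself contain newlines
        # or text that looks like a header.
        payload = stream.read(length)
        if len(payload) != length or stream.read(1) != b"\n":
            raise ValueError("truncated chunk %d" % index)
        yield index, float(m.group(3)), payload


sample = (b"-- chunk 0 length 6 confidence 0.700000\nHello\n\n"
          b"-- chunk 1 length 5 confidence 1.000000\nworld\n")
parsed = list(read_chunks(io.BytesIO(sample)))
```

Joining the payloads in order gives back the original data, here
b"Hello\nworld". To read real output, pass sys.stdin.buffer instead of
the BytesIO object.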

To illustrate this format and test the program, a sample program
nlsplit_read is provided. nlsplit_read SIZE reads on stdin a list of
chunks produced by an invocation of nlsplit SIZE, and writes the
concatenation of the contents of the chunks to stdout. This
concatenation should always reproduce the data originally given to
nlsplit on stdin, even in the case of arbitrary binary data.

If MINCONFIDENCE is provided, nlsplit will systematically split at any
position with confidence over MINCONFIDENCE. This can be useful if you
want to regroup chunks, rescore splits, or perform other such
operations.

== 3. Performance ==

For an input of size N, for SIZE the maximal size of fragments, nlsplit
runs in time O(N log(SIZE)): it reads stdin and produces splits on the
fly, with O(log(SIZE)) worst-case running time at each character. The
memory usage is O(SIZE).

== 4. Limitations ==

nlsplit assumes that newlines are encoded with LF, not CR, CR+LF or
something else. You will have to perform the conversion using another
tool if this can be an issue.
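
One possible way to do this conversion, sketched in Python. The function
name normalize_newlines is an assumption of this example and is not part
of nlsplit. Note that this rewrites every CR byte, so it is only
appropriate for text input.

```python
def normalize_newlines(data):
    """Convert CR+LF and lone CR line endings in a bytes object to LF."""
    # Replace CR+LF first so that each pair becomes a single LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


converted = normalize_newlines(b"one\r\ntwo\rthree\n")
```

The result can then be piped to nlsplit as usual.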

nlsplit's heuristics are not bulletproof. They can be fooled and perform
bad splits, or miss good ones.

nlsplit can produce arbitrarily small chunks and will do nothing to
avoid that. It's up to you to regroup chunks if you don't find this
acceptable.
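
A possible regrouping pass is sketched below in Python. It is an
assumption about how one might post-process the output, not a feature of
nlsplit: the function name regroup and its threshold parameter are
hypothetical. It takes (payload, confidence) pairs in order and merges a
chunk with its successor when the split between them is weak and the
result still fits in the size limit.

```python
def regroup(chunks, size, threshold):
    """Merge adjacent (payload, confidence) pairs without exceeding size.

    A chunk absorbs its successor when the split after it has confidence
    below threshold and the merged payload is at most size bytes long.
    """
    merged = []
    for payload, conf in chunks:
        if merged:
            last_payload, last_conf = merged[-1]
            fits = len(last_payload) + len(payload) <= size
            if last_conf < threshold and fits:
                # The merged chunk ends where the absorbed one ended, so it
                # takes over that chunk's confidence.
                merged[-1] = (last_payload + payload, conf)
                continue
        merged.append((payload, conf))
    return merged


pieces = [(b"Hi. ", 0.7), (b"Yes, ", 0.4), (b"fine.\n\n", 0.9), (b"End.", 1.0)]
regrouped = regroup(pieces, 20, 0.8)
```

Here the first three pieces are merged into one chunk of 16 bytes with
confidence 0.9, and the last piece is kept separate because the split
before it is above the threshold.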

nlsplit is not Unicode-aware. It will not take extended characters into
account when performing splits, and could theoretically split an
extended character. However, as long as you are using ASCII whitespace
regularly enough, these splits should not be favoured and that bad
situation should not happen.

nlsplit keeps whitespace at the beginning or at the end of chunks to
avoid losing any information. Depending on your application, you might
prefer to trim it.