nlsplit -- a small tool to split natural language text into natural chunks
Copyright (C) 2011 by Antoine Amarilli

== 0. License (MIT license) ==

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

== 1. Features ==

nlsplit is a tool to split natural language text into chunks at
reasonable language boundaries. The program takes as argument a maximal
size for chunks, reads stdin and writes chunks no larger than the
maximal size to stdout.

The general NLP problem of text splitting is AI-hard, and optimizing to
keep chunks close to the specified size would be NP-complete. The text
splitting done here is designed to be fast and sensible in most cases,
using a few hardcoded heuristics to produce splitting possibilities with
an estimated confidence.
The chunking strategy is greedy: we only guarantee that no chunk
exceeds the maximal size and that no individual chunk could be extended
to reach a further splitting point with a higher confidence.

== 2. Usage ==

$ nlsplit SIZE [MINCONFIDENCE]

nlsplit reads text on stdin and produces a sequence of chunks on stdout.
Each chunk is introduced by the following header:

"-- chunk %d length %d confidence %f\n"

The first %d is the number of the chunk (incrementing, starting at 0).
The second %d is the number of bytes N in the chunk, with N <= SIZE. The
%f is a confidence value for the chunk, indicating the confidence of the
split performed at the end of the chunk (the confidence of the last
chunk is therefore very high).

The header is followed by the N bytes of the chunk, then by a
terminating newline.

To illustrate this format and test the program, a sample program
nlsplit_read is provided. nlsplit_read SIZE reads on stdin a list of
chunks produced by an invocation of nlsplit SIZE, and writes the
concatenation of the contents of the chunks to stdout. This
concatenation should always reproduce the original data that nlsplit
read on stdin, even in the case of arbitrary binary data.

If MINCONFIDENCE is provided, nlsplit will systematically split at any
position with confidence over MINCONFIDENCE. This can be useful if you
want to regroup chunks, rescore splits, or perform other such
operations.

== 3. Performance ==

For an input of size N and a maximal chunk size SIZE, nlsplit runs in
time O(N log(SIZE)): it reads stdin and produces splits on the fly,
with O(log(SIZE)) worst-case running time at each character. The
memory usage is O(SIZE).

== 4. Limitations ==

nlsplit assumes that newlines are encoded with LF, not CR, CR+LF or
something else. You will have to perform the conversion using another
tool if this can be an issue.
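
The newline conversion mentioned above can be done with any standard
tool. As one possible sketch in Python (this is not part of nlsplit, and
the function name normalize_newlines is made up for this example), the
following converts CR+LF and lone CR line endings to LF in a byte
string:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line endings to LF.

    Hypothetical helper for illustration; not shipped with nlsplit.
    """
    # Replace CR+LF first so that each pair becomes a single LF,
    # then convert any CR that is still left on its own.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b"one\r\ntwo\rthree\n"
print(normalize_newlines(sample))  # b'one\ntwo\nthree\n'
```

Note that if the input is converted this way before being given to
nlsplit, the concatenated chunks will reproduce the converted text, not
the original bytes.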

nlsplit's heuristics are not bulletproof. They can be fooled and perform
bad splits, or miss good ones.

nlsplit can produce arbitrarily small chunks and will do nothing to
avoid that. It's up to you to regroup chunks if you don't find this
acceptable.

nlsplit is not Unicode-aware. It will not take extended characters into
account when performing splits, and could theoretically split an
extended character. However, as long as you are using ASCII whitespace
regularly enough, these splits should not be favoured and that bad
situation should not happen.

nlsplit keeps whitespace at the beginning or at the end of chunks to
avoid losing any information. Depending on your application, you might
prefer to trim it.
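
The regrouping and trimming suggested in this section are left to the
caller. The following Python sketch shows one way to do them, assuming
the output format is exactly the one described in section 2. The names
read_chunks and regroup are hypothetical, the merging policy (greedy, by
length only, ignoring confidence) is just one possible choice, and the
confidence values in the sample are made up for illustration:

```python
import io
import re
from typing import BinaryIO, Iterable, Iterator, List, Tuple

# Header layout from section 2: "-- chunk %d length %d confidence %f\n".
HEADER = re.compile(rb"-- chunk (\d+) length (\d+) confidence (\S+)\n")


def read_chunks(stream: BinaryIO) -> Iterator[Tuple[bytes, float]]:
    """Yield (payload, confidence) for each chunk in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        match = HEADER.fullmatch(line)
        if match is None:
            raise ValueError("malformed chunk header: %r" % line)
        length = int(match.group(2))
        # Read exactly `length` bytes: the payload may itself contain
        # newlines, so it cannot be read line by line.
        payload = stream.read(length)
        terminator = stream.read(1)  # newline that closes the chunk
        if len(payload) != length or terminator != b"\n":
            raise ValueError("truncated chunk")
        yield payload, float(match.group(3).decode("ascii"))


def regroup(chunks: Iterable[Tuple[bytes, float]], size: int) -> List[bytes]:
    """Greedily merge consecutive chunks while the result fits in size."""
    merged: List[bytes] = []
    current = b""
    for payload, _confidence in chunks:
        if current and len(current) + len(payload) > size:
            merged.append(current)
            current = payload
        else:
            current += payload
    if current:
        merged.append(current)
    return merged


# Hand-written sample in the documented format (invented confidences).
sample = (
    b"-- chunk 0 length 7 confidence 0.500000\nHello. \n"
    b"-- chunk 1 length 6 confidence 0.500000\nWorld.\n"
)

chunks = list(read_chunks(io.BytesIO(sample)))
print(regroup(chunks, 20))                         # [b'Hello. World.']
print([c.strip() for c in regroup(chunks, 10)])    # [b'Hello.', b'World.']
```

In real use the stream would be sys.stdin.buffer rather than an
in-memory buffer. Trimming with strip() discards leading and trailing
ASCII whitespace, so the trimmed chunks no longer concatenate back to
the original input.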