lt.seg/README.MD: 27 additions & 9 deletions
@@ -1,36 +1,53 @@
+###
+# Copyright 2015
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+###
+
 ### Prerequisities
 
 * Java v.8
 * (optional) bash v.4
 
 ### How to install
 
-* Download latest version from the releases site [https://github.com/de-tudarmstadt-lt/lt.core/releases][]
+* Download latest version from the releases site [https://github.com/de-tudarmstadt-lt/seg/releases][]
 * unpack into a directory of choice: `tar -xzvf lt.seg-version-dist.tar.gz -C <your-preferred-directory>`
-* (optional) to access `seg.sh` from anywhere you can add <your-preferred-directory>/bin to PATH: `PATH=<your-preferred-directory>:$PATH` or symlink `seg.sh` into a directory which is already in your PATH
+* executables can be found in the `bin` directory, e.g. `bin/seg`. You can execute it from any directory.
+* (optional) to access the `seg` binary from anywhere you can add <your-preferred-directory>/bin to PATH: `PATH=<your-preferred-directory>:$PATH` or symlink `seg` into a directory which is already in your PATH
 *Note: the following description is for unix based systems, you cannot run the startup shell scripts on MS Windows, consider using cygwin or run the java commands manually*
 
 ### How to use the lt.seg segmenter
 
 basic usage is as simple as
 
-cat text.txt | seg.sh > segmented_text.txt
+cat text.txt | seg > segmented_text.txt
 
 or
 
-seg.sh < text.txt > segmented_text.txt
+seg < text.txt > segmented_text.txt
 
 or
 
-seg.sh -f text.txt > segmented_text.txt
+seg -f text.txt > segmented_text.txt
 
 
-lt.seg comes with a number of parameters, run `seg.sh -?` to get a list of options
+lt.seg comes with a number of parameters, run `seg -?` to get a list of options
 
-*Note: for MS Windows based systems replace*`seg.sh`*with the correct java command, e.g.*`java -cp lt.seg-<version>-with-dependencies.jar de.tudarmstadt.lt.seg.app.Segmenter <options>`
+*Note: for MS Windows based systems replace*`seg`*with the correct java command, e.g.*`java -cp lt.seg-<version>-with-dependencies.jar de.tudarmstadt.lt.seg.app.Segmenter <options>`
 
 ### Options:
 *`--sentencesplitter <class>` (`-s`):
@@ -41,7 +58,8 @@ lt.seg comes with a number of parameters, run `seg.sh -?` to get a list of optio
 * `NullSplitter`: Convenience splitter, returns the complete input as one segment
 *`--tokenizer <class>` (`-t`)
 Sepcify the tokenizer class. Supported values are:
-* `DiffTokenizer` (default): Applies simple rules based on the change on Unicode category of consecutive characters
+* `RuleTokenizer` (default): Applies tokenization according to a ruleset specified by the `--token-ruleset` option parameter
+* `DiffTokenizer`: Applies simple rules based on the change on Unicode category of consecutive characters
 * `BreakTokenizer`: Java word breakiterator instance
 * `EmptySpaceTokenizer`: creates a new segment only when empty spaces are found (supported empty spaces include but are not limited to: `<blank>`, `<protected-blank>`, `\t`, `\n`, `\r`, `\f`, ...)
 * `NullTokenizer`: Convenience tokenizer, returns the complete input as one segment
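
Two of the tokenizer descriptions in the README hunk above refer to standard JDK facilities: a word `BreakIterator`, and the Unicode general category of a character. The sketch below illustrates only those two ideas, using plain JDK calls. It is not the lt.seg implementation: the class `TokenizerSketch` and the methods `breakTokens` and `categoryTokens` are hypothetical names chosen for illustration, and the real `DiffTokenizer` may well use coarser or additional rules than a strict category comparison.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of lt.seg.
public class TokenizerSketch {

    // Segments text at the word boundaries reported by the JDK's BreakIterator.
    // Whitespace and punctuation come back as segments of their own.
    public static List<String> breakTokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    // Naive assumption: start a new segment whenever the Unicode general
    // category (Character.getType) differs between consecutive code points.
    public static List<String> categoryTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int prevType = -1;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            if (i > start && type != prevType) {
                tokens.add(text.substring(start, i));
                start = i;
            }
            prevType = type;
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        if (start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(breakTokens("Hello, world!"));
        System.out.println(categoryTokens("abc123 x"));
    }
}
```

Note that the strict comparison in `categoryTokens` also separates an uppercase initial from the lowercase rest of a word, since those are different Unicode categories; a practical tokenizer would likely group categories before comparing.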
lt.seg/src/main/java/de/tudarmstadt/lt/seg/app/Segmenter.java: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ public Segmenter() {/* NOTHING TO DO */}
 opts.addOption(OptionBuilder.withLongOpt("tokenizer").withArgName("class").hasArg().withDescription("Specify the class of the word tokinzer that you want to use: {BreakTokenizer, DiffTokenizer, EmptySpaceTokenizer, NullTokenizer} (default: DiffTokenizer)").create("t"));
 opts.addOption(OptionBuilder.withLongOpt("parallel").withArgName("num").hasArg().withDescription("Specify the number of parallel threads. (Note: output might be genereated in a different order than provided by input, specify 1 if you need to keep the order. Parallel mode requires one document per line [ -l ] (default: 1).").create());
 opts.addOption(OptionBuilder.withLongOpt("normalize").withDescription("Specify the degree of token normalization [0...4] (default: 0).").hasArg().withArgName("level").create("nl"));
-opts.addOption(OptionBuilder.withLongOpt("filter").withDescription("Specify the degree of token filtering [0...6] (default: 2).").hasArg().withArgName("level").create("fl"));
+opts.addOption(OptionBuilder.withLongOpt("filter").withDescription("Specify the degree of token filtering [0...5] (default: 2).").hasArg().withArgName("level").create("fl"));
 opts.addOption(OptionBuilder.withLongOpt("merge").withDescription("Specify the degree of merging conscutive items {0,1,2} (default: 0).").hasOptionalArg().withArgName("level").create("ml"));
 opts.addOption(OptionBuilder.withLongOpt("onedocperline").withDescription("Specify if you want to process documents linewise and preserve document ids, i.e. map line numbers to sentences.").create("l"));
 opts.addOption(OptionBuilder.withLongOpt("sentence-ruleset").withArgName("languagecode").hasArg().withDescription(String.format("Specify the ruleset that you want to use together with RuleSplitter (avaliable: %s) (default: 'default')", de.tudarmstadt.lt.seg.sentence.rules.RuleSet.getAvailable())).create());