Commit 4b17609

updated documentation
1 parent ea048f7 commit 4b17609

3 files changed

+29
-11
lines changed

README.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
lt.seg/README.MD

lt.seg/README.MD

Lines changed: 27 additions & 9 deletions
@@ -1,36 +1,53 @@
1+
###
2+
# Copyright 2015
3+
#
4+
# Licensed under the Apache License, Version 2.0 (the "License");
5+
# you may not use this file except in compliance with the License.
6+
# You may obtain a copy of the License at
7+
#
8+
# http://www.apache.org/licenses/LICENSE-2.0
9+
#
10+
# Unless required by applicable law or agreed to in writing, software
11+
# distributed under the License is distributed on an "AS IS" BASIS,
12+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
# See the License for the specific language governing permissions and
14+
# limitations under the License.
15+
#
16+
###
17+
118
### Prerequisites
219

320
* Java v.8
421
* (optional) bash v.4
522

623
### How to install
724

8-
* Download latest version from the releases site [https://github.com/de-tudarmstadt-lt/lt.core/releases][]
25+
* Download latest version from the releases site [https://github.com/de-tudarmstadt-lt/seg/releases][]
926
* unpack into a directory of choice: `tar -xzvf lt.seg-version-dist.tar.gz -C <your-preferred-directory>`
10-
* (optional) to access `seg.sh` from anywhere you can add <your-preferred-directory>/bin to PATH: `PATH=<your-preferred-directory>:$PATH` or symlink `seg.sh` into a directory which is already in your PATH
27+
* executables can be found in the `bin` directory, e.g. `bin/seg`. You can execute it from any directory.
28+
* (optional) to access the `seg` binary from anywhere you can add `<your-preferred-directory>/bin` to PATH: `PATH=<your-preferred-directory>/bin:$PATH` or symlink `seg` into a directory which is already in your PATH
1129

12-
[https://github.com/de-tudarmstadt-lt/lt.core/releases]: https://github.com/de-tudarmstadt-lt/lt.core/releases "Releases"
1330

1431
*Note: the following description is for unix based systems, you cannot run the startup shell scripts on MS Windows, consider using cygwin or run the java commands manually*
1532

1633
### How to use the lt.seg segmenter
1734

1835
basic usage is as simple as
1936

20-
cat text.txt | seg.sh > segmented_text.txt
37+
cat text.txt | seg > segmented_text.txt
2138

2239
or
2340

24-
seg.sh < text.txt > segmented_text.txt
41+
seg < text.txt > segmented_text.txt
2542

2643
or
2744

28-
seg.sh -f text.txt > segmented_text.txt
45+
seg -f text.txt > segmented_text.txt
2946

3047

31-
lt.seg comes with a number of parameters, run `seg.sh -?` to get a list of options
48+
lt.seg comes with a number of parameters, run `seg -?` to get a list of options
3249

33-
*Note: for MS Windows based systems replace* `seg.sh` *with the correct java command, e.g.* `java -cp lt.seg-<version>-with-dependencies.jar de.tudarmstadt.lt.seg.app.Segmenter <options>`
50+
*Note: for MS Windows based systems replace* `seg` *with the correct java command, e.g.* `java -cp lt.seg-<version>-with-dependencies.jar de.tudarmstadt.lt.seg.app.Segmenter <options>`
3451

3552
### Options:
3653
* `--sentencesplitter <class>` (`-s`):
@@ -41,7 +58,8 @@ lt.seg comes with a number of parameters, run `seg.sh -?` to get a list of optio
4158
* `NullSplitter`: Convenience splitter, returns the complete input as one segment
4259
* `--tokenizer <class>` (`-t`)
4360
Specify the tokenizer class. Supported values are:
44-
* `DiffTokenizer` (default): Applies simple rules based on the change on Unicode category of consecutive characters
61+
* `RuleTokenizer` (default): Applies tokenization according to a ruleset specified by the `--token-ruleset` option parameter
62+
* `DiffTokenizer`: Applies simple rules based on the change on Unicode category of consecutive characters
4563
* `BreakTokenizer`: Java word breakiterator instance
4664
* `EmptySpaceTokenizer`: creates a new segment only when empty spaces are found (supported empty spaces include but are not limited to: `<blank>`, `<protected-blank>`, `\t`, `\n`, `\r`, `\f`, ...)
4765
* `NullTokenizer`: Convenience tokenizer, returns the complete input as one segment
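
The `DiffTokenizer` entry above describes splitting wherever the Unicode category of consecutive characters changes. The following is a minimal sketch of that general idea only; it is not the lt.seg implementation. The class name `CategoryDiffSketch`, the method names, and the four coarse character classes (letter, digit, whitespace, other) are assumptions made for illustration, and the real tokenizer may use different categories and rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of lt.seg.
public class CategoryDiffSketch {

    // Map a code point to a coarse class. The choice of four classes is an
    // assumption of this sketch, not taken from the lt.seg sources.
    public static int coarseClass(int codePoint) {
        if (Character.isLetter(codePoint)) return 0;
        if (Character.isDigit(codePoint)) return 1;
        if (Character.isWhitespace(codePoint)) return 2;
        return 3; // punctuation, symbols, and everything else
    }

    // Start a new segment whenever the coarse class differs from the
    // class of the previous code point.
    public static List<String> split(String text) {
        List<String> segments = new ArrayList<>();
        int start = 0;     // index where the current segment begins
        int previous = -1; // class of the previous code point
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int current = coarseClass(cp);
            if (i > 0 && current != previous) {
                segments.add(text.substring(start, i));
                start = i;
            }
            previous = current;
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        if (start < text.length()) {
            segments.add(text.substring(start));
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String segment : split("Call 112, now!")) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

In this sketch whitespace runs are emitted as segments of their own, and adjacent punctuation characters stay together in one segment; whether lt.seg behaves the same way is not stated in the README.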

lt.seg/src/main/java/de/tudarmstadt/lt/seg/app/Segmenter.java

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ public Segmenter() {/* NOTHING TO DO */}
8585
opts.addOption(OptionBuilder.withLongOpt("tokenizer").withArgName("class").hasArg().withDescription("Specify the class of the word tokenizer that you want to use: {BreakTokenizer, DiffTokenizer, EmptySpaceTokenizer, NullTokenizer} (default: DiffTokenizer)").create("t"));
8686
opts.addOption(OptionBuilder.withLongOpt("parallel").withArgName("num").hasArg().withDescription("Specify the number of parallel threads. (Note: output might be generated in a different order than provided by input, specify 1 if you need to keep the order. Parallel mode requires one document per line [ -l ]) (default: 1).").create());
8787
opts.addOption(OptionBuilder.withLongOpt("normalize").withDescription("Specify the degree of token normalization [0...4] (default: 0).").hasArg().withArgName("level").create("nl"));
88-
opts.addOption(OptionBuilder.withLongOpt("filter").withDescription("Specify the degree of token filtering [0...6] (default: 2).").hasArg().withArgName("level").create("fl"));
88+
opts.addOption(OptionBuilder.withLongOpt("filter").withDescription("Specify the degree of token filtering [0...5] (default: 2).").hasArg().withArgName("level").create("fl"));
8989
opts.addOption(OptionBuilder.withLongOpt("merge").withDescription("Specify the degree of merging consecutive items {0,1,2} (default: 0).").hasOptionalArg().withArgName("level").create("ml"));
9090
opts.addOption(OptionBuilder.withLongOpt("onedocperline").withDescription("Specify if you want to process documents linewise and preserve document ids, i.e. map line numbers to sentences.").create("l"));
9191
opts.addOption(OptionBuilder.withLongOpt("sentence-ruleset").withArgName("languagecode").hasArg().withDescription(String.format("Specify the ruleset that you want to use together with RuleSplitter (available: %s) (default: 'default')", de.tudarmstadt.lt.seg.sentence.rules.RuleSet.getAvailable())).create());
