Thai natural language processing library in Rust, with Python and Node bindings.

PyThaiNLP/nlpo3

SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3

Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.

To use as a library in a Rust project:

cargo add nlpo3

To use as a library in a Python project:

pip install nlpo3

Features

  • Thai word tokenizer
    • Uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
      • 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
    • Loads a dictionary from a plain text file (one word per line) or from a Vec<String>
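
To give a feel for dictionary-based tokenization, here is a minimal sketch of greedy longest-match segmentation using only the standard library. It is a simplified relative of maximal matching and is not nlpO3's implementation: it does not honor Thai Character Cluster boundaries, and the function name `segment_longest_match` is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation (illustrative only, not nlpO3's newmm).
/// At each position, take the longest dictionary word that starts there;
/// if no word matches, emit a single character and move on.
fn segment_longest_match(text: &str, dict: &HashSet<String>) -> Vec<String> {
    // Work on characters, not bytes, so multi-byte Thai text is handled safely.
    let chars: Vec<char> = text.chars().collect();
    // The longest dictionary entry (in characters) bounds the search window.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let longest = max_len.min(chars.len() - start);
        // Try candidates from longest to shortest.
        let matched = (1..=longest)
            .rev()
            .find(|&len| {
                let candidate: String = chars[start..start + len].iter().collect();
                dict.contains(&candidate)
            })
            .unwrap_or(1); // unknown character: fall back to one char
        tokens.push(chars[start..start + matched].iter().collect());
        start += matched;
    }
    tokens
}
```

With a dictionary containing ฉัน, กิน, and ข้าว, the input ฉันกินข้าว is split into those three words. A real tokenizer has to do better than this sketch on ambiguous input, where the greedy choice at one position can block a better segmentation later.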

Use

Node.js binding

See nlpo3-nodejs.

Python binding

Example:

from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

See more at nlpo3-python.

Rust library

Add as a dependency

To use as a library in a Rust project:

cargo add nlpo3

This adds nlpo3 to the [dependencies] section of Cargo.toml:

[dependencies]
# ...
nlpo3 = "1.4.0"

Example

Create a tokenizer from a dictionary file, then use it to tokenize a string (safe mode = true, parallel mode = false):

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();

Create a tokenizer using a dictionary from a vector of Strings:

let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
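
A word list like the one above could also be built from dictionary text in the one-word-per-line format. The following is a minimal sketch using only the standard library. The helper names `parse_word_list` and `read_word_list` are hypothetical, and trimming whitespace and skipping blank lines are assumptions rather than documented nlpO3 behavior.

```rust
use std::fs;
use std::io;

/// Parse dictionary text with one word per line into a word list.
/// Assumption: surrounding whitespace is trimmed and blank lines are skipped.
fn parse_word_list(content: &str) -> Vec<String> {
    content
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Read a dictionary file at a caller-supplied path and parse it.
fn read_word_list(path: &str) -> io::Result<Vec<String>> {
    Ok(parse_word_list(&fs::read_to_string(path)?))
}
```

The resulting Vec<String> has the same shape as the `words` vector passed to `from_word_list` above.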

Add words to an existing tokenizer:

tokenizer.add_word(&["มิวเซียม"]);

Remove words from an existing tokenizer:

tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
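
To illustrate how adding and removing words can change what a dictionary-based tokenizer matches, here is a minimal character-trie sketch using only the standard library. It is an assumption for illustration, not a description of nlpO3's internal data structure. The `Trie` type and its methods are hypothetical and merely echo the method names used above.

```rust
use std::collections::HashMap;

/// One trie node; `is_word` marks the end of a stored word.
#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_word: bool,
}

/// A minimal character trie (hypothetical, for illustration only).
#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    /// Insert a word, creating nodes along its character path.
    fn add_word(&mut self, word: &str) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_word = true;
    }

    /// Unmark a word. Nodes are left in place for simplicity (no pruning).
    fn remove_word(&mut self, word: &str) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            match node.children.get_mut(&ch) {
                Some(next) => node = next,
                None => return, // the word was never stored
            }
        }
        node.is_word = false;
    }

    /// Length in characters of the longest stored word that is a prefix of `text`.
    fn longest_prefix_len(&self, text: &str) -> Option<usize> {
        let mut node = &self.root;
        let mut best = None;
        for (i, ch) in text.chars().enumerate() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => break,
            }
            if node.is_word {
                best = Some(i + 1);
            }
        }
        best
    }
}
```

In this sketch, removing a longer word makes the lookup fall back to a shorter stored prefix, if one exists. This is why editing the dictionary can change segmentation results.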

Command-line interface

Example:

echo "ฉันกินข้าว" | nlpo3 segment

See more at nlpo3-cli.

Dictionary

  • In the interest of library size, nlpO3 does not assume which dictionary the user would like to use, and it does not ship with one.
  • A dictionary is needed for the dictionary-based word tokenizer.
  • For tokenization dictionary, try

Build

Requirements

Steps

Generic test:

cargo test

Build the API documentation and open it to check:

cargo doc --open

Build (remove --release to keep debug information):

cargo build --release

Check target/ for build artifacts.

Develop

Development document

Issues

License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.