Proposal: Corpus, Document, Sentence, Token, Language components #34

Open
rmarronnier opened this issue Sep 30, 2019 · 0 comments
Labels
enhancement New feature or request

Comments

@rmarronnier
Member

As you can see when browsing the source code of the Cadmium shards, several entities (for lack of a better word) are declared in different locations and in different ways.

This is not just a namespace or redundancy issue: we would benefit from having fundamental classes or structs describing the tokens, sentences and documents we're dealing with.

I've started declaring such structs and objects in the pos_tagger. It's a WIP, and things might change as we discover what we need and don't need for higher-level text processing functions.

What stands out to me is the neat way Language is declared in cadmium_tokenizer. Since language information (especially the language codes) is used in several shards, we could move a big part of it into a language module in cadmium_utils and keep the language-specific info (abbreviations, tag maps, etc.) in the respective shards.
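
A rough sketch of what that split could look like. Everything here is hypothetical: `Cadmium::Language`, `Info`, `REGISTRY`, `lookup` and the `ABBREVIATIONS` table are names made up for illustration and do not reflect what cadmium_tokenizer or cadmium_utils currently contain.

```crystal
# Hypothetical sketch: none of these names exist in cadmium_utils today.
module Cadmium
  module Language
    # Shared, shard-agnostic data: a language code and an English name.
    record Info, code : String, name : String

    REGISTRY = {
      "en" => Info.new("en", "English"),
      "fr" => Info.new("fr", "French"),
    }

    # Returns the shared info for a language code, or nil if it is unknown.
    def self.lookup(code : String) : Info?
      REGISTRY[code.downcase]?
    end
  end
end

# A shard (a tokenizer, say) would keep its own language-specific data,
# keyed by the shared language code. Illustrative values only.
ABBREVIATIONS = {
  "en" => ["Mr.", "Dr.", "etc."],
  "fr" => ["M.", "etc."],
}

if lang = Cadmium::Language.lookup("FR")
  puts "#{lang.name}: #{ABBREVIATIONS[lang.code]}"
end
```

The point of the split is that only the codes and names are shared, so each shard can add or drop its own per-language tables without touching cadmium_utils.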

More examples off the top of my head:

The Document struct or class might benefit from cadmium_tfidf, and vice versa.

Cadmium::Utils::Sentence might be renamed to Sentencizer (is that a word?)
so that a Cadmium::Sentence can exist without conflict.

I'd like to point out that having these classes or structs won't prevent users from processing raw text without creating these objects. But they are needed to keep morphosyntactic info about the tokens and sentences.
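
To make the idea concrete, here is a minimal sketch of such types. It is an assumption about a possible shape, not the WIP code from the pos_tagger: the `Token`, `Sentence` and `Document` definitions and their fields (`pos`, `lemma`) are invented for illustration.

```crystal
# Hypothetical shapes for the proposed types; field names are assumptions.
module Cadmium
  # A token keeps its surface form plus optional morphosyntactic annotations.
  struct Token
    getter text : String
    getter pos : String?
    getter lemma : String?

    def initialize(@text : String, @pos : String? = nil, @lemma : String? = nil)
    end
  end

  # A sentence is an ordered list of tokens.
  class Sentence
    getter tokens : Array(Token)

    def initialize(@tokens : Array(Token))
    end

    # Naive reconstruction: joins the token texts with single spaces.
    def text : String
      tokens.map(&.text).join(" ")
    end
  end

  # A document is an ordered list of sentences.
  class Document
    getter sentences : Array(Sentence)

    def initialize(@sentences : Array(Sentence))
    end

    # All tokens of the document, in reading order.
    def tokens : Array(Token)
      sentences.flat_map(&.tokens)
    end
  end
end

# Raw text can still be processed without any of these objects...
words = "The cat sleeps".split

# ...and wrapped only when annotations need to be kept.
tags = ["DET", "NOUN", "VERB"]
tokens = words.zip(tags).map { |word, tag| Cadmium::Token.new(word, pos: tag) }
sentence = Cadmium::Sentence.new(tokens)
doc = Cadmium::Document.new([sentence])

puts doc.tokens.map { |t| "#{t.text}/#{t.pos}" }.join(" ")
```

Here `Token` is an immutable struct, so annotating a token means building a new value; a class with setters would be the alternative if taggers are expected to fill in fields in place.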

@watzon transferred this issue from another repository Nov 7, 2019
@watzon added the enhancement and in progress labels, then removed the in progress label Nov 7, 2019