
feat: Support more tokenizers/stemmers/filter #10

Open
VoVAllen opened this issue Dec 9, 2024 · 9 comments

@VoVAllen
Member

VoVAllen commented Dec 9, 2024

Design an extensible syntax that lets users add tokenizers with different configurations.

-- Provide either config or index_name
CREATE FUNCTION create_tokenizer(tokenizer_name text, table_name text, column_name text, config text);
CREATE FUNCTION tokenize(query text, tokenizer_name text)
RETURNS bm25vector;

SELECT create_tokenizer('document_standard', $$
tokenizer = 'standard'
table = 'documents'
column = 'text'
pretokenizer = 'standard'
[tokenizer.config]
stemmer = 'porter2'
[pretokenizer.config]
punctuation = 'removed'
whitespace = '\w+|[^\w\s]+'
$$);
SELECT tokenize('I''m a doctor', 'document_standard');


CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops) WITH (options = $$
tokenizer = 'standard'
pre_tokenizers = 
$$);
SELECT tokenize('I''m a doctor', tokenizer_name => 'documents_standard');

Reference:

    es_bm25_settings = {
        "settings": {
            "index": {
                "similarity": {
                    "default": {
                        "type": "BM25",
                        "k1": k1,
                        "b": b,
                    }
                }
            },
            "analysis": {
                "analyzer": {
                    "custom_analyzer": {
                        "type": "standard",
                        "max_token_length": 1_000_000,
                        "stopwords": "_english_",
                        "filter": [ "lowercase", "custom_snowball"]
                    }
                },
                "filter": {
                    "custom_snowball": {
                        "type": "snowball",
                        "language": "English"
                    }
                }
            }
        }
    }
@kemingy kemingy self-assigned this Dec 10, 2024
@kemingy
Member

kemingy commented Dec 13, 2024

  1. Does the tokenizer require a model (even for the word tokenizer)?

Yes. Without a pre-trained model on a large dataset, the NDCG drops a lot.

  2. Can we support this kind of config?

No, because the tokenizer model is tightly coupled with those configurations (pre_tokenizer, stemmer, stopwords, etc.). What we can support is:

  • a hardcoded process like what we do now:

    impl Tokenizer for BertWithStemmerAndSplit {
        fn encode(&self, text: &str) -> Vec<u32> {
            let mut results = Vec::new();
            // lowercase, then split with the regex pre-tokenizer
            let lower_text = text.to_lowercase();
            let split = TOKEN_PATTERN_RE.find_iter(&lower_text);
            for token in split {
                // drop stopwords
                if STOP_WORDS.contains(token.as_str()) {
                    continue;
                }
                // stem, then map the token to ids with the BERT tokenizer
                let stemmed_token =
                    tantivy_stemmers::algorithms::english_porter_2(token.as_str()).to_string();
                let encoding = self.0.encode_fast(stemmed_token, false).unwrap();
                results.extend_from_slice(encoding.get_ids());
            }
            results
        }
    }
  • dynamically load user-trained Hugging Face tokenizer models (with a limited feature set), combined with several hardcoded processes (mainly for stemming or other language-specific processing)

@VoVAllen
Member Author

What I mean is to create a standard tokenizer like ES. It stores all the tokens in a table. When new documents are added, it checks whether each token already exists; if not, it adds the token to the table.

What do you mean by model? Do you mean the statistical model for a BPE tokenizer?
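
A minimal sketch of that idea, assuming a hypothetical vocabulary table tokenizer_vocab and a plain regex split standing in for the standard pre-tokenizer (both names are illustrative, not a concrete design):

-- Hypothetical vocabulary table for the ES-style "standard" tokenizer idea.
CREATE TABLE tokenizer_vocab (
    id    serial PRIMARY KEY,
    token text   NOT NULL UNIQUE
);

-- When a new document arrives, add only the tokens that are not yet known.
INSERT INTO tokenizer_vocab (token)
SELECT DISTINCT lower(t)
FROM regexp_split_to_table('I am a doctor', '\W+') AS t
WHERE t <> ''
ON CONFLICT (token) DO NOTHING;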

@VoVAllen
Member Author

ES mentions Unicode segmentation (https://docs.rs/unicode-segmentation/latest/unicode_segmentation/). Does it help with the results?

@kemingy
Member

kemingy commented Dec 13, 2024

Experiments

  • Current BERT uncased
  • WordLevel trained on wikitext-103-raw-v1 with r"(?u)\b\w\w+\b" and the English snowball stemmer
  • WordLevel trained on fiqa with r"(?u)\b\w\w+\b" and the English snowball stemmer
  • Tocken trained on wikitext-103-raw-v1 with Unicode segmentation and the snowball stemmer
  • Unicode is trained on the dataset online, so its indexing time is larger; otherwise it is similar to Tocken
  • Unicode(L) uses the Lucene stopword list, which is smaller than the NLTK stopword list

Tested with top-k = 10.

| Tokenizer | Dataset | QPS | NDCG@10 |
| --- | --- | --- | --- |
| BERT uncased | fiqa | 455.22/s | 0.22669 |
| Word(wiki/30k) | fiqa | 847.28/s | 0.2026 |
| Word(wiki/100k) | fiqa | 881.75/s | 0.21836 |
| Word(wiki/500k) | fiqa | 890.11/s | 0.22807 |
| Word(fiqa/30k) | fiqa | 751.22/s | 0.14533 |
| Word(fiqa/100k) | fiqa | 780.91/s | 0.16659 |
| Tocken | fiqa | 346.58/s | 0.24268 |
| Unicode | fiqa | 905.15/s | 0.23496 |
| Unicode(L) | fiqa | 340.32/s | 0.25295 |
| ES | fiqa | 350.59/s | 0.25364 |
| BERT uncased | trec-covid | 96.19/s | 0.67545 |
| Word(wiki/30k) | trec-covid | 287.03/s | 0.57424 |
| Word(wiki/100k) | trec-covid | 292.57/s | 0.63196 |
| Word(wiki/500k) | trec-covid | 282.97/s | 0.64036 |
| Tocken | trec-covid | 155.86/s | 0.59249 |
| Unicode | trec-covid | 268.10/s | 0.67253 |
| Unicode(L) | trec-covid | 148.46/s | 0.61241 |
| ES | trec-covid | 127.36/s | 0.68803 |
| BERT uncased | webis-touche2020 | 178.83/s | 0.31151 |
| Word(wiki/100k) | webis-touche2020 | 414.55/s | 0.31562 |
| Word(wiki/500k) | webis-touche2020 | 448.72/s | 0.31418 |
| Tocken | webis-touche2020 | 279.20/s | 0.34596 |
| Unicode | webis-touche2020 | 439.86/s | 0.32139 |
| Unicode(L) | webis-touche2020 | 287.05/s | 0.34009 |
| ES | webis-touche2020 | 137.45/s | 0.34707 |

@kemingy
Member

kemingy commented Dec 13, 2024

> What I mean is to create a standard tokenizer like ES. It stores all the tokens in a table. When new documents are added, it checks whether each token already exists; if not, it adds the token to the table.
>
> What do you mean by model? Do you mean the statistical model for a BPE tokenizer?

I understand this method, but it's not suitable:

  1. We usually limit the vocabulary to 30k tokens by frequency, which is not applicable with this method.
  2. Poor NDCG score: see the tests above for the tokenizer trained on the fiqa dataset itself.
  3. Poor performance, because you would need to sync the table every time you encounter new tokens.

@kemingy
Member

kemingy commented Dec 13, 2024

> ES mentions Unicode segmentation (https://docs.rs/unicode-segmentation/latest/unicode_segmentation/). Does it help with the results?

I tried Unicode normalization, but it doesn't help. Will do more experiments.

@VoVAllen
Member Author

Also, it's possible that the index might be wrong. You may want to try the query without the index if needed.

@VoVAllen
Member Author

VoVAllen commented Dec 16, 2024

SELECT create_standard_tokenizer(tokenizer_name, table_name, column_name, config)

This will create a trigger on the column, like:

-- Create the trigger function
CREATE OR REPLACE FUNCTION trigger_update_tokenizer()
RETURNS TRIGGER AS $$
BEGIN
    -- Check if the specific column is set
    IF NEW.column_name IS NOT NULL THEN
        -- Call the update_tokenizer function with the new value
        PERFORM update_tokenizer(NEW.column_name);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create the trigger on the table
CREATE TRIGGER trigger_on_column_name
AFTER INSERT ON table_name
FOR EACH ROW
EXECUTE FUNCTION trigger_update_tokenizer();

And update_tokenizer will update the vocab dict in a specific table (under our own schema like vchord_bm25.tokenizer_name).
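
A minimal sketch of what update_tokenizer could do, assuming the vocabulary lives in vchord_bm25.documents_standard (following the vchord_bm25.tokenizer_name naming above) with a UNIQUE constraint on token, and with a plain regex split standing in for the real pre-tokenizer:

-- Sketch only: the table name, the UNIQUE(token) constraint, and the regex split are assumptions.
CREATE OR REPLACE FUNCTION update_tokenizer(doc text)
RETURNS void AS $$
BEGIN
    -- Add every token from the new document that is not yet in the vocabulary
    INSERT INTO vchord_bm25.documents_standard (token)
    SELECT DISTINCT lower(t)
    FROM regexp_split_to_table(doc, '\W+') AS t
    WHERE t <> ''
    ON CONFLICT (token) DO NOTHING;
END;
$$ LANGUAGE plpgsql;

The trigger above would then run this for every inserted row via update_tokenizer(NEW.column_name).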

Then the user should call tokenize(document, tokenizer_name => 'documents_standard') to tokenize the document into a BM25 vector.
