Adding a new searchableAttribute should not re-index all the attributes #4492

Open
5 tasks
ManyTheFish opened this issue Mar 14, 2024 · 0 comments · May be fixed by #4656
Labels
performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption settings diff-indexing Issues related to settings diff-indexing

Comments

Member

ManyTheFish commented Mar 14, 2024

Related product team resources: PRD (internal only)

⚠️ This issue depends on #4480 and #4484 being implemented first.

Summary

This issue is a subset of the work implementing the settings diff-indexing enhancement.

When a new searchableAttribute is added in the settings, Meilisearch re-indexes all the searchableAttributes as if they were all new. This is currently required because the word-pair-proximity-docids, word-position-docids, and word-docids databases are field-agnostic: they aggregate the data of all the searchable fields, which forces Meilisearch to recompute them completely. Moreover, word-docids and word-position-docids have prefix databases that also need to be updated: prefix-position-docids and prefix-docs.

word-pair-proximity-docids

When one or several searchableAttributes are added, they could be indexed by extracting the data from the additional attributes only.
Then, when the extracted data is written to the database for a specific word pair, all the proximities of that pair should be fetched, and for the proximity being inserted:

  • Remove the documents in the inserted proximity that are already present in a lower proximity.
  • Make a union between the filtered value and the database data.

The following code sample illustrates the idea:

// Pseudocode: `db`, `Key`, `word1`, and `word2` stand for the
// word-pair-proximity-docids database, its key type, and the word pair.
let data_to_insert: RoaringBitmap;
let proximity_to_insert: u8;

let mut data_to_remove = RoaringBitmap::new();
for prox in 1..MAX_PROXIMITY {
  let key = Key { proximity: prox, word1, word2 };
  // Assumption: a missing entry is treated as an empty bitmap.
  let database_value = db.get(&key)?.unwrap_or_default();
  let value = if prox == proximity_to_insert {
    // Proximity that should be changed:
    // union the values and remove the lower-proximity data.
    (&database_value | &data_to_insert) - &data_to_remove
  } else {
    // Remove the lower-proximity data.
    &database_value - &data_to_remove
  };

  // Add the current data to data_to_remove for the next proximities.
  data_to_remove |= &value;

  // Only write back the entries that actually changed.
  if database_value != value {
    db.put(&key, &value)?;
  }
}

📝 Implementing #4398 should ease the implementation by putting all the proximities of the same word pair under the same database value.
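To make the merge rule above easier to test in isolation, here is a minimal, self-contained sketch of it. It is not Meilisearch code: `BTreeSet<u32>` stands in for `RoaringBitmap`, a `BTreeMap` stands in for the database, and `merge_word_pair`, `ProximityDb`, and the value of `MAX_PROXIMITY` are hypothetical names and values chosen for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for `RoaringBitmap`: a sorted set of document ids.
type DocIds = BTreeSet<u32>;
/// Stand-in for the database: (word1, word2, proximity) -> document ids.
type ProximityDb = BTreeMap<(String, String, u8), DocIds>;

/// Hypothetical bound; proximities are visited as `1..MAX_PROXIMITY`.
const MAX_PROXIMITY: u8 = 8;

/// Merges `data_to_insert` at `proximity_to_insert` for one word pair,
/// keeping each document only at its lowest proximity.
fn merge_word_pair(
    db: &mut ProximityDb,
    word1: &str,
    word2: &str,
    proximity_to_insert: u8,
    data_to_insert: &DocIds,
) {
    // Documents already seen at a lower proximity.
    let mut data_to_remove = DocIds::new();
    for prox in 1..MAX_PROXIMITY {
        let key = (word1.to_string(), word2.to_string(), prox);
        // A missing entry is treated as an empty set.
        let database_value = db.get(&key).cloned().unwrap_or_default();

        let mut value = database_value.clone();
        if prox == proximity_to_insert {
            // Union the stored value with the newly extracted data.
            value.extend(data_to_insert.iter().copied());
        }
        // Drop documents already present at a lower proximity.
        value.retain(|docid| !data_to_remove.contains(docid));

        // Everything kept here must be removed from higher proximities.
        data_to_remove.extend(value.iter().copied());

        // Only touch the entries that actually changed.
        if value != database_value {
            if value.is_empty() {
                db.remove(&key);
            } else {
                db.insert(key, value);
            }
        }
    }
}
```

For example, if the pair already stores document 1 at proximity 1 and documents 2 and 3 at proximity 3, inserting documents 1, 3, and 4 at proximity 2 leaves document 1 at proximity 1, puts 3 and 4 at proximity 2, and leaves only document 2 at proximity 3.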

word-position-docids

The word-position-docids database is additive when adding a searchable attribute, which means that processing only the additional attributes and making a union between the database data and the extracted data is valid.
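The additive case can be sketched as a plain per-key union. The snippet below is an illustration only, under the same stand-in assumptions: `BTreeSet<u32>` replaces `RoaringBitmap`, a `BTreeMap` replaces the database, and `merge_additive` is a hypothetical helper name, not a Meilisearch function.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for `RoaringBitmap`: a sorted set of document ids.
type DocIds = BTreeSet<u32>;

/// Unions the data extracted from the additional attributes into the
/// existing database, key by key. Existing entries are never shrunk.
fn merge_additive<K: Ord>(
    db: &mut BTreeMap<K, DocIds>,
    extracted: BTreeMap<K, DocIds>,
) {
    for (key, docids) in extracted {
        // Create the entry if it is new, then union the document ids.
        db.entry(key).or_default().extend(docids);
    }
}
```

With a `(word, position)` key this models word-position-docids; because the helper is generic over the key, the same union applies to any database with this additive property.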

word-docids

Like word-position-docids, the word-docids database is additive, so the same process is valid when adding one or several attributes to the searchableAttributes list.

Prefix databases

Like their non-prefix versions, the prefix databases are additive, so the same process is valid when adding one or several attributes to the searchableAttributes list.
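One way to picture the prefix case: each prefix entry is the union of the document ids of the words starting with that prefix, so the data extracted for the additional attributes can be unioned into it directly. The sketch below assumes that the set of indexed prefixes itself does not change, which may not hold in practice; `update_prefix_docids` and the map-based stand-ins are hypothetical and only illustrate the additive property.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for `RoaringBitmap`: a sorted set of document ids.
type DocIds = BTreeSet<u32>;

/// Unions the word docids extracted from the additional attributes into
/// every existing prefix entry that matches the word.
/// Assumption: the set of prefixes in `prefix_db` stays the same.
fn update_prefix_docids(
    prefix_db: &mut BTreeMap<String, DocIds>,
    extracted_word_docids: &BTreeMap<String, DocIds>,
) {
    for (prefix, docids) in prefix_db.iter_mut() {
        for (word, new_docids) in extracted_word_docids {
            if word.starts_with(prefix.as_str()) {
                // Additive update: existing document ids are kept as-is.
                docids.extend(new_docids.iter().copied());
            }
        }
    }
}
```

The nested loop keeps the sketch short; a real implementation would more likely walk the sorted word list once per prefix range.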

Related Benchmarks:

  • settings-proximity-precision.json
  • settings-remove-add-swap-searchable.json
  • settings-typo.json

TODO

  • Make Transform filter the additional attributes when creating the flattened OBKV documents
  • Adapt word-pair-proximity-docids
  • Adapt word-position-docids
  • Adapt word-docids
  • Adapt prefix databases
@ManyTheFish ManyTheFish added performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption settings diff-indexing Issues related to settings diff-indexing labels Mar 14, 2024