
[Feature Request] Cache and manage the embeddings in a persistent storage #390

Closed
0x7c13 opened this issue Apr 1, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

0x7c13 (Member) commented Apr 1, 2024

Context / Scenario

This post dives deeper into the topic raised by the related PR: #389

The problem

The problem is simple: we want to avoid calling the embedding API as much as possible, since it is often slow and expensive.
One quick and cheap solution is to cache the embeddings by content hash, betting that the same hash shows up again when feeding KM a large document, or multiple documents with repeated content (that is what the above PR is about).

BUT, I don't think this is an ideal solution for real-world scenarios. Why? Because:

  1. We rarely get repeated text or paragraphs in most cases.
  2. The above PR only helps within the scope of the current document(s) ingestion.

Let's skip the first one and go straight to the second scenario:

There are many cases where we want to update existing document(s) or re-ingest them as their content gets refreshed or updated, whether it is a text document or a web page. In both cases most of the content remains the same, yet the embedding happens again and again even if you re-import them using the same document id. This is the scenario where I believe a persistent embedding cache storage is needed to improve the speed and reduce the cost of continuously ingested documents.

Proposed solution

In addition to the FileStorageDb and MemoryDb for the vectors and text, we could have another abstraction + implementation, an EmbeddingsCacheDb, which can be configured and used by the GenerateEmbeddingsHandler to avoid re-generating embeddings for the same partitioned content over time and across workers. Ideally the content hash would be stored in a distributed cache storage like Redis, with the associated embeddings in a blob storage, so the cache works across multiple workers.
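
To make the shape concrete, here is a minimal sketch of what such an abstraction could look like. The interface name and members are hypothetical, not an existing KM API:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction, injected next to IMemoryDb / content storage.
// All names and shapes here are illustrative only.
public interface IEmbeddingsCacheDb
{
    // Returns the cached embedding for the given key, or null on a miss.
    Task<float[]?> TryGetAsync(string key, CancellationToken ct = default);

    // Stores the embedding under the given key.
    Task SetAsync(string key, float[] embedding, CancellationToken ct = default);

    // Evicts all cached entries associated with a document, for invalidation.
    Task RemoveByDocumentAsync(string index, string documentId, CancellationToken ct = default);
}
```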

We might just need to re-design or update how we store the embeddings, so that it is easy to check whether an embedding already exists for a given content hash and nothing needs to be stored twice. Ideally just an additional mapping between hash and embedding is needed, or maybe the hash could be included in the entity name itself.
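
For the lookup key itself, a stable content hash (e.g. SHA-256 over the partition text) would be enough; a minimal sketch, with all names illustrative:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class EmbeddingCacheKey
{
    // Hashes the partitioned text so identical content always maps to the
    // same cache entry, regardless of which document or ingestion run it
    // comes from.
    public static string FromContent(string partitionText)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(partitionText));
        return Convert.ToHexString(hash);
    }
}
```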

User should be able to (a rough config sketch follows the list):

  • Customize the storage type and location of this cache.
  • Control the behavior of this cache through config (a maximum storage limit, etc.).
  • Invalidate the cache by policy (e.g. all cached embeddings associated with a given document should be removed when that document is deleted by document id or index).
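
As a rough illustration of the knobs above, the cache could expose a config class along these lines; none of these settings exist in KM today, the names are hypothetical:

```csharp
// Hypothetical configuration surface mirroring the requirements above.
public class EmbeddingsCacheConfig
{
    // The whole feature is opt-in.
    public bool Enabled { get; set; } = false;

    // Storage type and location, e.g. "Disk", "AzureBlobs", "Redis".
    public string StorageType { get; set; } = "Disk";
    public string? ConnectionString { get; set; }

    // Maximum storage budget for cached embeddings, in megabytes.
    public int MaxSizeMb { get; set; } = 1024;

    // Whether deleting a document also evicts its cached embeddings.
    public bool EvictOnDocumentDelete { get; set; } = true;
}
```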

Importance

Would be great to have.

@0x7c13 0x7c13 added the enhancement New feature or request label Apr 1, 2024
@0x7c13 0x7c13 changed the title [Feature Request] Caching and manage the embeddings in a persistent storage [Feature Request] Cache and manage the embeddings in a persistent storage Apr 1, 2024
dluc (Collaborator) commented Apr 8, 2024

Posting here some notes from the PR:

  • KM uses multiple embedding generators, so it's important to consider not only the content, but also the generator used and its underlying configuration, e.g. which model.
  • The cache should persist across reboots to provide real benefit, and should be shared over the network when KM runs on multiple nodes.
  • As a persistence layer I would consider reusing the available Content Storage, which is itself configurable to store data on disk, in Azure Blobs, or in MongoDB.

In an early SK prototype I cached embeddings in the underlying HTTP layer of the embedding generator, so I could use as the cache key the AI provider (e.g. the OpenAI endpoint), the AI model name, and other parameters contributing to uniqueness. These parameters are more easily available inside the generator than in the code calling it.
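
A hedged sketch of such a composite key, assuming hypothetical names; the point is simply that endpoint, model and content all contribute to uniqueness:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class GeneratorCacheKey
{
    // Combines everything that affects the resulting vector: the AI provider
    // endpoint, the model name, and the text itself. Two different models can
    // never share a cache entry this way.
    public static string Compose(string endpoint, string modelName, string text)
    {
        string material = $"{endpoint}|{modelName}|{text}";
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(material));
        return Convert.ToHexString(hash);
    }
}
```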

My recommendation would be to integrate the cache behavior inside the generators, rather than caching in each client that calls an embedding generator.
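
In that spirit, a minimal decorator sketch, reusing the hypothetical IEmbeddingsCacheDb and GeneratorCacheKey from above; the IEmbeddingGenerator interface here is a simplified stand-in, not KM's actual generator interface:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the real embedding generator interface.
public interface IEmbeddingGenerator
{
    Task<float[]> GenerateEmbeddingAsync(string text, CancellationToken ct = default);
}

// Wraps any generator with a cache, so handlers and clients calling the
// generator remain unaware that caching happens at all.
public class CachedEmbeddingGenerator : IEmbeddingGenerator
{
    private readonly IEmbeddingGenerator _inner;
    private readonly IEmbeddingsCacheDb _cache;
    private readonly string _endpoint;
    private readonly string _modelName;

    public CachedEmbeddingGenerator(
        IEmbeddingGenerator inner,
        IEmbeddingsCacheDb cache,
        string endpoint,
        string modelName)
    {
        _inner = inner;
        _cache = cache;
        _endpoint = endpoint;
        _modelName = modelName;
    }

    public async Task<float[]> GenerateEmbeddingAsync(string text, CancellationToken ct = default)
    {
        string key = GeneratorCacheKey.Compose(_endpoint, _modelName, text);

        float[]? cached = await _cache.TryGetAsync(key, ct);
        if (cached is not null) { return cached; }

        float[] embedding = await _inner.GenerateEmbeddingAsync(text, ct);
        await _cache.SetAsync(key, embedding, ct);
        return embedding;
    }
}
```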

dluc (Collaborator) commented May 17, 2024

Looks like the PR has become stale, with a few things to address.

If this is a pressing problem, the approach should be reusable (e.g. not having to add caching logic to every handler; caching is usually a cross-cutting concern solved with generic KV stores decoupled from specific scenarios), scale over multiple VMs (e.g. allow extending the solution with Redis/Memcached), and be optional via config settings.
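
For illustration, "optional via config" could reduce to skipping the decorator wrap at wiring time. This reuses the hypothetical types sketched above and assumes placeholder OpenAIEmbeddingGenerator / RedisEmbeddingsCacheDb implementations, plus endpoint, modelName and an EmbeddingsCacheConfig in scope:

```csharp
// Hypothetical wiring: the cache is a pure decorator, so turning it off
// just skips the wrap and nothing else in the pipeline changes.
IEmbeddingGenerator generator = new OpenAIEmbeddingGenerator(endpoint, modelName);

if (config.Enabled)
{
    IEmbeddingsCacheDb cache = new RedisEmbeddingsCacheDb(config.ConnectionString!);
    generator = new CachedEmbeddingGenerator(generator, cache, endpoint, modelName);
}
```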

@dluc dluc closed this as completed May 17, 2024
@dluc dluc reopened this May 17, 2024
@microsoft microsoft locked and limited conversation to collaborators Jun 5, 2024
@dluc dluc converted this issue into discussion #616 Jun 5, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
