Add support for SQLRecordManager #8

Open
ckurze opened this issue Oct 30, 2023 · 6 comments
Labels
enhancement New feature or request wontfix This will not be worked on

Comments

@ckurze

ckurze commented Oct 30, 2023

The Record Manager capabilities in LangChain help to deduplicate content, clean-up deleted or mutated source content, etc.: https://python.langchain.com/docs/modules/data_connection/indexing

Adding such capabilities would make it easier to manage embeddings in CrateDB via LangChain.

@amotl

amotl commented Nov 7, 2023

Hi Christian,

this is interesting. The "Requirements" section of the corresponding documentation [1] says:

Only works with LangChain vectorstore's that support:

document addition by id (add_documents method with ids argument)
delete by id (delete method with ids argument)

... and lists a few compatible vector stores.

Did you already verify it does not work well with CrateDB, and why?

With kind regards,
Andreas.

Footnotes

  1. https://python.langchain.com/docs/modules/data_connection/indexing#requirements

@amotl amotl self-assigned this Nov 7, 2023
@thunderbug1

thunderbug1 commented Nov 14, 2023

I did try to use the record manager. It works when I use an SQLite table for the record manager's metadata, but not when I use CrateDB itself for that purpose. The reason is that the table uses some SQL features that the CrateDB dialect does not support.

Only works with LangChain vectorstore's that support:
document addition by id (add_documents method with ids argument)
delete by id (delete method with ids argument)

I think we need to distinguish two datastores: one stores the embeddings, and an additional SQL database stores the metadata.
The CrateDB vector store can already handle the embeddings, but not yet the SQL metadata table.
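
To make the two-datastore split concrete, here is a minimal sketch using only the Python standard library. It is not LangChain's implementation: `ToyVectorStore`, `sync_documents`, and the `seen` table are hypothetical names made up for illustration. The SQLite connection plays the role of the record manager's metadata store, and the in-memory class stands in for a vector store that supports add and delete by id.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """Derive a stable id from the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ToyVectorStore:
    """Hypothetical stand-in for a vector store with add/delete by id."""

    def __init__(self):
        self.docs = {}

    def add_documents(self, documents, ids):
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc

    def delete(self, ids):
        for doc_id in ids:
            self.docs.pop(doc_id, None)


def sync_documents(documents, records, store, namespace):
    """Add unseen documents, skip known ones, delete ones that disappeared."""
    records.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "namespace TEXT, key TEXT, PRIMARY KEY (namespace, key))"
    )
    incoming = {content_hash(d): d for d in documents}
    known = {
        row[0]
        for row in records.execute(
            "SELECT key FROM seen WHERE namespace = ?", (namespace,)
        )
    }
    to_add = [k for k in incoming if k not in known]
    to_delete = sorted(known - set(incoming))

    # Embeddings side: only touch what actually changed.
    store.add_documents([incoming[k] for k in to_add], ids=to_add)
    store.delete(ids=to_delete)

    # Metadata side: keep the bookkeeping table in step.
    records.executemany(
        "INSERT INTO seen (namespace, key) VALUES (?, ?)",
        [(namespace, k) for k in to_add],
    )
    records.executemany(
        "DELETE FROM seen WHERE namespace = ? AND key = ?",
        [(namespace, k) for k in to_delete],
    )
    records.commit()
    return {
        "num_added": len(to_add),
        "num_skipped": len(incoming) - len(to_add),
        "num_deleted": len(to_delete),
    }
```

Calling `sync_documents` twice with overlapping inputs writes only the new documents the second time and removes the ones that are gone, which is the behaviour I would like to get with CrateDB holding both sides.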

@amotl

amotl commented Nov 21, 2023

Hi.

Thank you for bringing this subsystem of LangChain to our attention. We apparently missed adding support for it in the first iteration. Given that the corresponding documentation lists the PGVector adapter as supported, there is a chance it can also be supported by CrateDB, but there may also be blockers. We will look into it.

With kind regards,
Andreas.

@amotl

amotl commented Nov 21, 2023

Hi again. crate/crate-python#18 explores the situation, but unfortunately, it can't be made to work, as there is indeed a blocker:

Because the composite uniqueness constraint on the upsertion_record table is currently being emulated already, it can't also emulate ON CONFLICT behaviour on top easily.

So, we will probably close this as "wontfix".

It doesn't mean it is impossible, but currently, it would stretch the capacity too much. Let us know if you consider this to be an important improvement with a high priority. Otherwise, let's close the issue?
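
For readers unfamiliar with the pattern, the following sketch shows the combination in question on SQLite, where it works natively: a composite uniqueness constraint plus an `ON CONFLICT ... DO UPDATE` upsert. Only the table name `upsertion_record` is taken from the discussion above; the column set and the `upsert` helper are simplified assumptions for illustration, not the actual schema. It needs SQLite 3.24 or newer for the upsert syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Composite uniqueness: one row per (key, namespace) pair.
# Column names are illustrative assumptions.
conn.execute(
    """
    CREATE TABLE upsertion_record (
        key TEXT NOT NULL,
        namespace TEXT NOT NULL,
        updated_at REAL NOT NULL,
        UNIQUE (key, namespace)
    )
    """
)


def upsert(conn, key, namespace, updated_at):
    """Insert a record, or refresh its timestamp if the pair already exists."""
    conn.execute(
        """
        INSERT INTO upsertion_record (key, namespace, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT (key, namespace)
        DO UPDATE SET updated_at = excluded.updated_at
        """,
        (key, namespace, updated_at),
    )
    conn.commit()
```

The database has to detect the conflict on the constraint itself for this statement to work, which is why an emulated constraint gets in the way.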

@amotl amotl added enhancement New feature or request wontfix This will not be worked on labels Nov 21, 2023
@amotl amotl removed their assignment Nov 22, 2023
@thunderbug1

It would be more of a nice-to-have. I can't close it for some reason.

@amotl

amotl commented Nov 27, 2023

Hi Alex,

thanks for clarifying. We can also keep the issue open to track this topic going forward. Once corresponding support is added to CrateDB, we can easily add support here as well. However, that is unlikely, because enforcing uniqueness constraints on larger-than-memory data would be a significant performance hog.

With kind regards,
Andreas.
