Description
RAG projects typically come with requirements like these:
- Import files into a vector database
  - From a directory structure
- Be able to update the files
  - Without re-importing everything
- Oh, and don't forget to remove files that are no longer present from the vector database
- Since the PDF format isn’t great, we also have some files in Word format
- It’s not just 10 sample documents, but 50,000 with 20 pages each, evolving daily
- The files are, of course, stored in cloud storage
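The update and removal requirements above come down to comparing what is on disk now against what was recorded at the last import. Here is a minimal, standard-library-only sketch of that idea. It is not LangChain's implementation; `SyncPlan`, `content_hash`, and `plan_sync` are hypothetical names used only for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    """Which sources must be added, re-imported, or removed."""
    to_add: set[str]
    to_update: set[str]
    to_delete: set[str]


def content_hash(data: bytes) -> str:
    """Fingerprint file content so unchanged files can be skipped."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(current: dict[str, bytes], indexed: dict[str, str]) -> SyncPlan:
    """Compare current files (path -> bytes) with the hashes recorded
    at the previous import (path -> hash)."""
    current_hashes = {path: content_hash(data) for path, data in current.items()}
    to_add = set(current_hashes) - set(indexed)
    to_delete = set(indexed) - set(current_hashes)
    to_update = {
        path
        for path in set(current_hashes) & set(indexed)
        if current_hashes[path] != indexed[path]
    }
    return SyncPlan(to_add=to_add, to_update=to_update, to_delete=to_delete)


# Hypothetical state: one file changed, one unchanged, one removed, one new.
indexed = {
    "a.pdf": content_hash(b"v1"),
    "b.docx": content_hash(b"same"),
    "old.pdf": content_hash(b"gone"),
}
current = {"a.pdf": b"v2", "b.docx": b"same", "new.pdf": b"hello"}
plan = plan_sync(current, indexed)
```

With 50,000 documents evolving daily, only the files in `to_add` and `to_update` need to be parsed and embedded again, and `to_delete` lists what must be dropped from the vector database.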
In my opinion, the best way to handle this with LangChain is code similar to this:
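    from langchain.indexes import index
    from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
    from langchain_community.document_loaders.generic import GenericLoader

    vector_store = ...
    record_manager = ...
    loader = GenericLoader(
        blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
            path="mydata/",
            glob="**/*",
            show_progress=True,
        ),
        blob_parser=DoclingParser(),  # The parser proposed in this issue
    )
    index(
        loader.lazy_load(),
        record_manager,
        vector_store,
        batch_size=100,
        cleanup="full",  # Also delete records for files that are no longer present
    )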
Change `FileSystemBlobLoader` to `CloudBlobLoader`, and you can manage complex scenarios in just a few lines.
To be compatible with this pattern and allow, for example, files to be read directly from cloud storage (see `CloudBlobLoader`), it would be a good idea to split the code into a Loader and a Parser.
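To make the benefit of the split concrete, here is a small standard-library-only sketch of the separation. It assumes nothing about Docling or LangChain internals: `Blob`, `DirectoryBlobLoader`, `InMemoryBlobLoader` (standing in for a cloud-backed loader), `ParagraphParser`, and `load_documents` are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Protocol


@dataclass
class Blob:
    """Raw bytes plus where they came from; no parsing has happened yet."""
    source: str
    data: bytes


class BlobLoader(Protocol):
    def yield_blobs(self) -> Iterator[Blob]: ...


class BlobParser(Protocol):
    def parse(self, blob: Blob) -> Iterator[str]: ...


class DirectoryBlobLoader:
    """Knows only how to fetch bytes from a local directory tree."""

    def __init__(self, root: str, glob: str = "**/*") -> None:
        self.root = Path(root)
        self.glob = glob

    def yield_blobs(self) -> Iterator[Blob]:
        for path in sorted(self.root.glob(self.glob)):
            if path.is_file():
                yield Blob(source=str(path), data=path.read_bytes())


class InMemoryBlobLoader:
    """Stand-in for a cloud-backed loader: same interface, different storage."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self.objects = objects

    def yield_blobs(self) -> Iterator[Blob]:
        for key, data in self.objects.items():
            yield Blob(source=key, data=data)


class ParagraphParser:
    """Knows only how to turn bytes into text chunks, not where they live."""

    def parse(self, blob: Blob) -> Iterator[str]:
        for paragraph in blob.data.decode("utf-8").split("\n\n"):
            if paragraph.strip():
                yield paragraph.strip()


def load_documents(loader: BlobLoader, parser: BlobParser) -> Iterator[str]:
    """Compose any loader with any parser, lazily."""
    for blob in loader.yield_blobs():
        yield from parser.parse(blob)


docs = list(
    load_documents(
        InMemoryBlobLoader({"bucket/a.txt": b"first\n\nsecond"}),
        ParagraphParser(),
    )
)
```

Because the parser never touches storage, moving from local files to cloud storage means swapping only the loader argument; the parsing code stays untouched.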
That would make it possible to write in 20 lines what usually takes 2000.
See this PR for more information.