The Docling Loader is not compatible with GenericLoader. #10

Open
@pprados

The typical requirements for RAG projects are generally as follows:

  • Import files into a vector database
  • From a directory structure
  • Be able to update the files
  • Without re-importing everything
  • Oh, and don't forget to remove files that are no longer present from the vector database
  • Since the PDF format isn’t great, we also have some files in Word format
  • It’s not just 10 sample documents, but 50,000 with 20 pages each, evolving daily
  • The files are, of course, stored in cloud storage
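The update and removal requirements above amount to comparing what is on disk with what was indexed last time. Here is a minimal sketch of that bookkeeping, using only the standard library. `ToyIndex` and `sync` are hypothetical names invented for this illustration; they stand in for a vector store plus record manager and are not LangChain APIs.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a vector store plus record manager."""

    hashes: dict = field(default_factory=dict)  # source -> content hash

    def sync(self, files: dict) -> dict:
        """Bring the index in line with `files` (source -> raw bytes)."""
        stats = {"added": 0, "updated": 0, "skipped": 0, "deleted": 0}
        for source, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            previous = self.hashes.get(source)
            if previous == digest:
                stats["skipped"] += 1  # unchanged file: nothing is re-imported
                continue
            stats["added" if previous is None else "updated"] += 1
            self.hashes[source] = digest  # a real store would embed and upsert here
        # Anything indexed earlier but absent from this run is removed.
        for source in set(self.hashes) - set(files):
            del self.hashes[source]
            stats["deleted"] += 1
        return stats


index_state = ToyIndex()
first = index_state.sync({"a.pdf": b"v1", "b.docx": b"v1"})
second = index_state.sync({"a.pdf": b"v2", "c.pdf": b"v1"})
```

In this run, the second call updates `a.pdf`, adds `c.pdf`, and deletes `b.docx`, without touching anything that did not change.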

In my opinion, the best way to handle this with LangChain is code similar to this:

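from langchain.indexes import index
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader

# DoclingParser is the blob parser proposed in this issue; import it from
# wherever it ends up being published.

vector_store = ...    # any LangChain vector store
record_manager = ...  # tracks what has already been indexed

# The blob loader finds the files; the blob parser turns each one into documents.
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    blob_parser=DoclingParser(),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)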

Change FileSystemBlobLoader to CloudBlobLoader, and you can manage complex scenarios in just a few lines.

To be compatible with GenericLoader, and to allow, for example, files to be loaded directly from cloud storage (see CloudBlobLoader), it would be a good idea to split the code into a Loader and a Parser.
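To illustrate why the split helps, here is a simplified sketch of the pattern using only the standard library. Every name in it (`Blob`, `Doc`, `Utf8Parser`, `directory_blobs`, `bucket_blobs`, `lazy_load`) is a hypothetical stand-in chosen for this example, not the LangChain or Docling API, and the "bucket" is just a dict in memory. The point is that the parser only ever sees bytes, so the place the bytes come from can change without touching it.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator, Protocol


@dataclass
class Blob:
    source: str  # where the bytes came from
    data: bytes  # raw file content


@dataclass
class Doc:
    text: str
    metadata: dict


class BlobParser(Protocol):
    def lazy_parse(self, blob: Blob) -> Iterator[Doc]: ...


class Utf8Parser:
    """Placeholder for a Docling-backed parser: bytes in, documents out."""

    def lazy_parse(self, blob: Blob) -> Iterator[Doc]:
        yield Doc(blob.data.decode("utf-8"), {"source": blob.source})


def directory_blobs(root: str, pattern: str = "**/*") -> Iterator[Blob]:
    """Yield one blob per file found under a local directory."""
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            yield Blob(str(path), path.read_bytes())


def bucket_blobs(objects: dict) -> Iterator[Blob]:
    """Yield blobs from a pretend cloud bucket (a dict of key -> bytes)."""
    for key, data in objects.items():
        yield Blob(f"mem://{key}", data)


def lazy_load(blobs: Iterable[Blob], parser: BlobParser) -> Iterator[Doc]:
    """Compose any blob source with any parser, one blob at a time."""
    for blob in blobs:
        yield from parser.lazy_parse(blob)


# Same parser, different storage: only the blob source changes.
docs = list(lazy_load(bucket_blobs({"notes.txt": b"hello"}), Utf8Parser()))
```

Swapping `bucket_blobs(...)` for `directory_blobs("mydata/")` would read from the local disk with the same parser, which is the kind of substitution the Loader/Parser split is meant to allow.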

That makes it possible to write in 20 lines what usually takes 2,000.

See this PR for more information.
