
Generalize RAG + PDF Chat feature #641

Draft · wants to merge 11 commits into main

Conversation

@mishig25 (Collaborator) commented Dec 18, 2023

TLDR: implement PDF-chat feature

Update 1 here

Closes #609

When a user uploads a PDF:

  1. Parse the PDF text, create embeddings, and save the embeddings in the files bucket (which is also used for saving images for multimodal models)
  2. On subsequent messages in that conversation, use the PDF embeddings for RAG
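For reference, the upload flow above could look roughly like the following sketch. It uses pdfjs-dist for parsing (as the diff does) and transformers.js for embeddings; the chunking granularity, the model name, and the storage step are assumptions for illustration, not the PR's exact implementation.

import * as pdfjsLib from "pdfjs-dist";
import { pipeline } from "@xenova/transformers";

const N_MAX_PAGES = 20; // only the first pages are parsed (see Limitations below)

// Extract plain text, page by page, from the uploaded PDF.
async function parsePdfPages(data: ArrayBuffer): Promise<string[]> {
	const pdf = await pdfjsLib.getDocument({ data }).promise;
	const pages: string[] = [];
	for (let i = 1; i <= Math.min(pdf.numPages, N_MAX_PAGES); i++) {
		const page = await pdf.getPage(i);
		const content = await page.getTextContent();
		pages.push(content.items.map((item) => ("str" in item ? item.str : "")).join(" "));
	}
	return pages;
}

// Embed the extracted chunks with transformers.js (model name is just an example).
async function embedChunks(chunks: string[]): Promise<number[][]> {
	const extractor = await pipeline("feature-extraction", "Xenova/gte-small");
	const output = await extractor(chunks, { pooling: "mean", normalize: true });
	return output.tolist(); // these vectors are what gets saved to the files bucket
}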

Limitations

  1. Only the first 20 pages of the PDF are parsed (this limit can be increased or decreased)
  2. A conversation can currently have only one uploaded PDF. When a user uploads a PDF, it overwrites the existing one, if any
  3. When the user enables websearch, websearch RAG is used and PDF RAG is not
  4. Just like websearch RAG, when PDF RAG is enabled, every message of that conversation will use PDF RAG. (In a subsequent PR, we need prompting and other techniques so the tool is only used when it makes sense)

Testing locally

Install the new pdf-parse dependency with npm ci:

npm ci 
npm run dev -- --open

Screen recording

Testing by uploading Mamba paper

Screen.Recording.2023-12-18.at.3.46.40.PM.mov

@mishig25 changed the title from "Implemend PDF-chat feature" to "Implement PDF-chat feature" on Dec 18, 2023
@mishig25 mishig25 marked this pull request as ready for review December 18, 2023 14:57
@nsarrazin (Collaborator)

🔥 So cool, will test it locally later, but just from looking at the demo, do you think there's an easy way to show an indicator of when a PDF is already uploaded and will be sent with the message?

@mishig25 (Collaborator, Author)

but just from looking at the demo, do you think there's an easy way to show an indicator of when a PDF is already uploaded and will be sent with the message?

At the moment, there is a websearch-like box that indicates PDF RAG was used.

image

@nsarrazin (Collaborator)

I meant more when the file is loaded and before the conversation is started, like for images:
image

I guess it would look a bit different since you can only have one PDF per conversation, but it would be nice to have an indication that a PDF will be used to answer the query 👀

@mishig25 (Collaborator, Author)

I see. Let me think about it

const loadingTask = pdfjsLib.getDocument({ data });
const pdf = await loadingTask.promise;

const N_MAX_PAGES = 20;
Collaborator:

This seems a bit low, 100 or 200 pages maybe?

@mishig25 (Collaborator, Author), Dec 20, 2023

I see. 100 or 200 pages would be a bit slow for creating embeddings on CPU using the current transformers.js solution (I will provide actual benchmark numbers).

In that case, should I review/push the community PR #646, which makes it possible to have an embedding endpoint for faster embedding creation? We can still use transformers.js embeddings (the current approach) for websearch, and use a TEI-powered embedding endpoint for PDF embedding creation & possibly for assistants #639 (if users can upload documents). Wdyt?
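For reference, a TEI server exposes a POST /embed route that takes { inputs: [...] } and returns one vector per input. A minimal client sketch, with the endpoint URL left as a parameter, could look like this; see #646 for the actual configurable implementation.

// Minimal sketch of requesting embeddings from a text-embeddings-inference (TEI) server.
async function embedWithTEI(texts: string[], endpoint: string): Promise<number[][]> {
	const response = await fetch(`${endpoint}/embed`, {
		method: "POST",
		headers: { "Content-Type": "application/json" },
		body: JSON.stringify({ inputs: texts }),
	});
	if (!response.ok) {
		throw new Error(`TEI embedding request failed with status ${response.status}`);
	}
	return (await response.json()) as number[][];
}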

src/lib/buildPrompt.ts (review thread resolved; outdated)
/>
<CarbonUpload class="mr-2 text-xs " /> Upload image
<CarbonUpload class="mr-2 text-xs " />
{#if uploadPdfStatus === PdfUploadStatus.Uploaded}
Collaborator:

Yes, as already said, you probably want to display uploaded files somewhere in the UI.

Collaborator (Author):

handled in #641 (comment)

@nsarrazin (Collaborator) left a comment

The feature itself works super well, I'm super impressed by what it can do tbh 🔥

There are just a couple of things around the user experience that are not super clear to me and that I think could be improved:

  • Why does it create an empty conversation on file upload? Seems to me like it should show that a PDF has been uploaded client-side, and create the conversation with the PDF only when the first message is sent, the way we deal with images
  • Would be cool to have an indicator in a conversation that shows that a PDF is available in a specific conversation. Like something at the top of the conversation that says "${filename}.pdf is available in this conversation" or something
  • It's not super clear from the UI perspective what happens when you upload a PDF to a conversation that already has one. I think it silently replaces the old PDF with the new one, but it's a bit confusing. I think we could just make it so that PDFs can only be uploaded at the beginning of a conversation, wdyt?

I also left a couple of fixes for type checking as comments/suggestions in the PR.

context: string;
}

/* eslint-disable no-shadow */
Collaborator:

Suggested change
/* eslint-disable no-shadow */

I think this can be removed by changing .eslintrc.cjs

-       "no-shadow": ["error"],
+ 	"@typescript-eslint/no-shadow": "error",

@@ -0,0 +1,114 @@
<script lang="ts">
Collaborator:

Would it have been possible to reuse OpenWebSearchResults.svelte and just make it a generic OpenResults maybe? Looking at the diff between the two, it looks like the only difference is the button name ("PDF search" vs "Web search") and the input type.

src/routes/conversation/[id]/upload-pdf/+server.ts (two review threads resolved; outdated)
</script>

<svelte:head>
<title>{PUBLIC_APP_NAME}</title>
</svelte:head>

<ChatWindow
on:message={(ev) => createConversation(ev.detail)}
on:message={(ev) => createConversationWithMsg(ev.detail)}
on:uploadpdf={(ev) => createConversationWithPdf(ev.detail)}
Collaborator:

I'm not sure why we create an empty chat with a document; seems to me like it should be done like for images, where the files are "stored" in the front-end and the conversation is created when the first message is sent?

src/routes/conversation/[id]/upload-pdf/+server.ts (review thread resolved; outdated)
src/lib/components/chat/ChatWindow.svelte (review thread resolved; outdated)
src/lib/components/UploadBtn.svelte (review thread resolved; outdated)
@nsarrazin (Collaborator)

Small note: if I drag and drop a non-PDF file, I get the following weird output:
Screenshot from 2023-12-21 13-40-59

@nsarrazin added the enhancement (new feature or request), front (related to the front-end of the app), and back (related to the Svelte backend or the DB) labels on Dec 26, 2023
@mishig25 mishig25 force-pushed the chatPDF branch 2 times, most recently from ba5f9b1 to e9ffdab Compare January 9, 2024 11:12
@mishig25 (Collaborator, Author) commented Jan 9, 2024

Updates:

  1. Fixed the upload-non-image-file bug here
  2. Provided better UI/UX for the uploaded file (see attached video). Specifically: 1. the name of the uploaded PDF appears with a PDF icon; 2. the file name and icon show a "blinking" animation while the PDF is being uploaded & embeddings are being created; 3. on hover, an x button appears that lets you delete the uploaded PDF file
  3. Added an env var (config) for enabling the pdf-chat feature, as here (a small sketch follows below the video)
Screen.Recording.2024-01-09.at.1.45.02.PM.mov
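For illustration only, gating the feature from config could look like the sketch below; the variable name ENABLE_PDF_CHAT is hypothetical, check the PR diff / .env for the real key.

// somewhere in the server code (sketch)
import { env } from "$env/dynamic/private";

// Hypothetical variable name; the PR defines the actual config key.
export const pdfChatEnabled = env.ENABLE_PDF_CHAT === "true";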

What I'm working on now:

  1. Address Nathan's comments here & here
  2. Test creating embeddings on a large PDF file with TEI (Add embedding models configurable, from both transformers.js and TEI #646)

@wdhorton commented Jan 9, 2024

Thanks for working on this! One question I had: what was your thought process for adding a new upload button for PDFs, versus using the drag-and-drop functionality that already exists for images?

@mishig25 (Collaborator, Author)

@wdhorton currently the UI might still evolve. For now, the reason for reusing the same upload button (instead of adding a new one) is that having two different upload buttons makes the UI look cluttered, especially on smaller screens.

@itaybar (Contributor) commented Jan 10, 2024

I have a few questions:

  • Why are you limiting this feature to PDF and not csv, txt, etc.?
  • I'm not sure you have to use embeddings for PDF (or at least make it optional, not mandatory)
  • Why use Mongo as a vectorDB and calculate the vector similarity client-side instead of using a real vectorDB that can do it way more efficiently and faster than running it in JS? This could make the code cleaner and remove any limitation on content size
  • Is storing all the embeddings a good idea? This can make the DB blow up relatively fast.

@mishig25 (Collaborator, Author)

@itaybar, thanks a lot for your questions

Why are you limiting this feature to PDF and not csv, txt, etc.?

Yes, we will add support for other text files. Once this PR is done, supporting other text files would be trivial. (It might even be included as part of this PR.)

I'm not sure you have to use embeddings for PDF (or at least make it optional, not mandatory)

Could you elaborate on that? And what would the alternatives be?

Why use Mongo as a vectorDB and calculate the vector similarity client-side instead of using a real vectorDB that can do it way more efficiently and faster than running it in JS? This could make the code cleaner and remove any limitation on content size. Is storing all the embeddings a good idea? This can make the DB blow up relatively fast.

Indeed, this is a good point. I/we will add support for a vectorDB (likely as part of this PR).
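For context on the point being discussed: retrieval currently amounts to an in-process similarity scan over the stored chunk embeddings (the findSimilarSentences path mentioned later in this thread), conceptually something like the sketch below (helper and field names are assumptions). A vectorDB would move exactly this scan, plus indexing, out of the Node process.

// Score every stored chunk against the query embedding and keep the top K.
// Assumes embeddings are L2-normalized, so the dot product equals cosine similarity.
function topKSimilar(
	queryEmbedding: number[],
	chunks: { text: string; embedding: number[] }[],
	k = 5
): { text: string; score: number }[] {
	const dot = (a: number[], b: number[]) => a.reduce((sum, x, i) => sum + x * b[i], 0);
	return chunks
		.map((chunk) => ({ text: chunk.text, score: dot(queryEmbedding, chunk.embedding) }))
		.sort((a, b) => b.score - a.score)
		.slice(0, k);
}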

mishig25 and others added 3 commits January 12, 2024 11:14
* Generlize RAG

* wip

* fix casting
@mishig25 mishig25 changed the title Implement PDF-chat feature Generalize RAG + PDF Chat feature Jan 12, 2024
@mishig25 (Collaborator, Author) commented Jan 12, 2024

Update: this PR is getting big. Unfortunately, there is no other option (I think). The specific points are:

  1. In Generlize RAG #689, I've generalized RAG. What does that mean? It means two things: 1. server side, 2. frontend side. In terms of 1. the server side, RAG applications now have to implement a RAG interface for consistency & better organization of the codebase (you can check out the directory src/lib/server/rag; a rough sketch of the idea is at the end of this comment). In terms of 2. the frontend side, the OpenWebSearchResults svelte component is generalized into an OpenRAGResults svelte component that will show up on RAG-augmented messages (as suggested here).
  2. Besides creating PDF embeddings through TEI for pdf-chat, we would need vectorDB support for multiple reasons:
    1. Without a vectorDB, the pdf-chat session will lose the PDF embeddings (for instance, when you close your browser & re-open the same chat-ui conversation, the PDF embeddings will no longer be available)
    2. Storing PDF embeddings on Mongo GridFS would slow down performance (as questioned here), and a large number of embeddings can cause a lot of latency since findSimilarSentences runs locally on the server. Therefore, we would need support for a vectorDB.
    3. We would need a vectorDB for other features as well. For instance, we would need vectorDB + PDF RAG for the Assistants feature #639. There was also an internal Slack discussion here.
    4. VectorDB support is a more general feature that can have multiple applications. For instance, PDF-chat is just a special case of vectorDB chat, since in PDF-chat one uses the PDF to populate the vectorDB and afterwards it becomes just chat with the vectorDB.
    5. FYI, I'm checking openai/chatgpt-retrieval-plugin to see whether we should follow a commonly used API for vectorDBs

Should I open a PR for vectorDB support against this branch? wdyt @nsarrazin @gary149
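As referenced above, here is a rough sketch of what the shared RAG contract could look like. Apart from retrieveRagContext (which appears in the diff), every name below is a guess rather than the actual code in src/lib/server/rag.

// Sketch of a generalized RAG interface (hypothetical names).
type Conversation = unknown; // stand-in for the app's Conversation type
type RagUpdate = (progress: unknown) => void; // streams status back to the client

interface RAGContextProvider {
	retrieveRagContext(conv: Conversation, prompt: string, update: RagUpdate): Promise<unknown>;
}

// Registry keyed by RAG type, matching usage like RAGs["pdfChat"].retrieveRagContext(...).
const RAGs: Record<string, RAGContextProvider> = {
	// webSearch: { retrieveRagContext: ... },
	// pdfChat: { retrieveRagContext: ... },
};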

@itaybar (Contributor) commented Jan 12, 2024


Thanks for the quick response guys!
About the optional embeddings for the PDF: correct me if I'm wrong, but you can just read the text from the PDF without using embeddings for the images, etc., and by that remove the hard limits on file content.
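In other words, the embedding-free alternative would be to skip retrieval and inline the (truncated) extracted text straight into the prompt; a rough sketch (the function name and character limit are made up for illustration):

// Paste the parsed PDF text directly into the prompt context, truncated to fit the model.
function buildPdfContext(pdfText: string, maxChars = 12_000): string {
	const truncated = pdfText.length > maxChars ? pdfText.slice(0, maxChars) + " [...]" : pdfText;
	return `The user uploaded a document with the following content:\n\n${truncated}`;
}

The trade-off is that the whole document then has to fit inside the context window, which is the limit embedding-based retrieval is meant to avoid.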

@itaybar (Contributor) commented Jan 16, 2024

@mishig25 What is your estimate for when this will be merged?

@mishig25 (Collaborator, Author)

Can't give an exact date, but will do my best to merge it soon :)

pdfSearchResults = await RAGs["pdfChat"].retrieveRagContext(conv, newPrompt, update);
}

messages[messages.length - 1].ragContext = pdfSearchResults;


This causes a bug with web search.
messages[messages.length - 1].ragContext = webSearchResults is being overridden with undefined if no pdf document has been uploaded by the user.
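A minimal sketch of the kind of guard that avoids the override (not necessarily how #745 fixes it):

// Only attach PDF results when a PDF was actually retrieved, so an earlier
// webSearchResults assignment is not clobbered with undefined.
if (pdfSearchResults) {
	messages[messages.length - 1].ragContext = pdfSearchResults;
}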


Submitted a PR to fix the issue #745

@mishig25 mishig25 marked this pull request as draft January 22, 2024 08:51
@johndpope

Merge. hire 10 more people to help.

@mishig25 (Collaborator, Author) commented Feb 4, 2024

Merge

Will merge soon

hire 10 more people to help.

Hiring 10 more people rarely results in 2x productivity (let alone 10x)

@flexchar (Contributor) commented Feb 6, 2024

It's a tough one to implement. I appreciate the work on this one.

I had a couple of thoughts on this I'd like to share.

First, I thought of a plugin-like system that is registered based on the file type, with a handler for processing/storing and a handler for retrieval. That would allow the community to scale the feature without choking the HF developers, who I'm beyond impressed with for being able to deliver such a variety of products.

Alternatively, it could also be a third-party API - much like OpenAI function calling works - so that the responsibility is NOT on you but on the end user who chooses to deploy. It's great for developers but would probably be a pain for those who just want to feel the power of deploying and have no use beyond that (inspired by the shutdown story of banana.dev).

I believe these would inherently fit better, as the nature of open-source deployments is to customize. As such, there is an infinite number of use cases and solutions... PDFs, images, audio files, web search as input; ChromaDB, Pinecone, Qdrant, PGVector, Meilisearch as storage/retrieval, to name a few.
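A rough sketch of the kind of registration being described, with every name hypothetical: one handler ingests/stores a file of a given type, another retrieves context for a prompt.

// Hypothetical plugin contract for file-type-specific RAG backends.
interface FileRagPlugin {
	fileTypes: string[]; // e.g. ["application/pdf", "text/csv"]
	ingest(file: Blob, conversationId: string): Promise<void>; // parse + store (vector DB, etc.)
	retrieve(conversationId: string, prompt: string): Promise<string>; // context for the prompt
}

const plugins = new Map<string, FileRagPlugin>();

function registerPlugin(plugin: FileRagPlugin): void {
	for (const type of plugin.fileTypes) plugins.set(type, plugin);
}

Storage/retrieval backends (ChromaDB, Qdrant, PGVector, ...) would then live behind the retrieve handler rather than in the core app.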

@zubu007 commented Feb 12, 2024

Hello. I have cloned the "chatPDF" branch to use the PDF upload feature. It works locally on my machine when I run npm run dev. However, when I do the same thing on my apache2 server, it shows an error while uploading the PDF. I am attaching a screenshot.
image

I have tried copying the exact .env.local file from my local directory to my server's directory but it still shows the same error. Should I open a new issue about it? What else can I provide to help you understand the error?

@flexchar (Contributor)

Can you ideally skip apache2? Or check its error logs to see if anything is being logged? It could be that some headers are lost, or the file is too big and gets blocked. What do your network devtools say about this upload response?

@zubu007

@zubu007 commented Feb 16, 2024

The error in the console shows a 403 Forbidden error, meaning it has something to do with authentication. I am using custom models with a separate endpoint rather than the default. However, the chat function with the custom model works as expected. When the PDF is being uploaded, it throws the 403. My question is: is the PDF fetch function using different authentication?

Let me add further console logs on both my local machine and the server to see the difference. I will let you know the results here.

@windprak

I have the same problem as #693. It retrieves correctly when looking at the prompt and parameters, but the model answers as if it had gotten only the question.

@zubu007 commented Mar 4, 2024

Let me state the problem I am facing simply.
Using chatPDF with the default HF model ("name": "mistralai/Mistral-7B-Instruct-v0.1"), it works. But I am trying to add a model to the .env.local file as an OpenAI endpoint and use that model for chatPDF. The chat function works with both models (default and ours), but pdf-chat only works with the default one, with the same prompt, same PDF, and same parameters.
I am adding pictures for better understanding.

For the HF default model, this was the prompt
Screenshot 2024-03-04 113803
And this was the response
Screenshot 2024-03-04 113718

For our own endpoint,
Screenshot 2024-03-04 113819
And this was the response
Screenshot 2024-03-04 113726

Is there something I am missing? It must be something simple, because just by changing the model in the app, one works and the other doesn't. If you need more information about the error, let me know.

@lmaosweqf1

Hey, what's the current status of the PR?

@C-Loftus commented Apr 5, 2024

Also very interested in this feature. Is there help needed on this? I wasn't sure what the blocker is or how to help.

Labels: back (Svelte backend or DB), enhancement (new feature or request), front (front-end of the app)

Successfully merging this pull request may close these issues: [Feature Request] Uploading PDFS/Text Files/Images?