GenAI eating image from DOCX #1462

Open
wants to merge 5 commits into base: main

Conversation

fadingNA
Contributor

@fadingNA fadingNA commented Nov 28, 2024

Change from docx2txt.process to extracting the content manually.
Ingest images as Base64 | tables as HTML tags | paragraphs as text
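
For illustration only, here is a minimal sketch of what walking the DOCX body in document order could look like. It is not the code in this PR: it uses only the standard library (`zipfile` and `xml.etree.ElementTree`), and the function name `extract_docx_blocks` and the block dictionary shape are assumptions made for the example. It only looks at top-level paragraphs and tables, so nested tables and text boxes are not handled.

```python
import base64
import html
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Standard Office Open XML namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def _paragraph_text(p):
    """Concatenate every text run inside a <w:p> element."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def _table_html(tbl):
    """Render a <w:tbl> element as a bare HTML table (no styling)."""
    rows = []
    for tr in tbl.findall(f"{{{W}}}tr"):
        cells = []
        for tc in tr.findall(f"{{{W}}}tc"):
            text = " ".join(_paragraph_text(p) for p in tc.findall(f"{{{W}}}p"))
            cells.append(f"<td>{html.escape(text)}</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def extract_docx_blocks(source):
    """Hypothetical helper: return text, table, and image blocks in body order.

    `source` is a path or a binary file-like object. Each block is a dict
    with a "type" ("text", "table", or "image") and a "content" string.
    """
    blocks = []
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())

        # Map relationship ids (e.g. "rId5") to part names such as "media/image1.png".
        rels = {}
        rels_name = "word/_rels/document.xml.rels"
        if rels_name in names:
            rels_root = ET.fromstring(zf.read(rels_name))
            for rel in rels_root.iter(f"{{{PKG_REL}}}Relationship"):
                rels[rel.get("Id")] = rel.get("Target")

        body = ET.fromstring(zf.read("word/document.xml")).find(f"{{{W}}}body")
        if body is None:
            return blocks

        for child in body:
            if child.tag == f"{{{W}}}p":
                text = _paragraph_text(child).strip()
                if text:
                    blocks.append({"type": "text", "content": text})
                # Inline images are referenced through <a:blip r:embed="...">.
                for blip in child.iter(f"{{{A}}}blip"):
                    target = rels.get(blip.get(f"{{{R}}}embed"))
                    if not target:
                        continue
                    if target.startswith("/"):
                        name = target.lstrip("/")
                    else:
                        name = posixpath.normpath(posixpath.join("word", target))
                    if name in names:  # skips external (linked) images
                        encoded = base64.b64encode(zf.read(name)).decode("ascii")
                        blocks.append({"type": "image", "content": encoded})
            elif child.tag == f"{{{W}}}tbl":
                blocks.append({"type": "table", "content": _table_html(child)})
    return blocks
```

Because the blocks come back in body order, an image keeps its position relative to the surrounding paragraphs, which is the property the manual extraction is after.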

  • Why was this change needed? (You can also link to an open issue here)

To use multiple vector stores so that the correct image can be retrieved in paragraph order, instead of having docx2txt.process() zip everything together.

  • Other information:
  • Progress: the remaining issue is on the retrieval side. We still need to find a way to convert the Base64 back so the AI can understand the image.
    Screenshot 2024-11-28 at 12 00 19 PM
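
One possible direction for the Base64 question, sketched under assumptions: decode the stored string back to raw bytes, guess the image type from its leading bytes, and wrap it as a `data:` URL, a form that some multimodal APIs accept. Whether the model used here accepts data URLs is not confirmed in this thread, and the helper names `decode_image` and `to_data_url` are hypothetical.

```python
import base64

# Leading "magic" bytes for a few common image formats.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def decode_image(b64_string):
    """Hypothetical helper: return (raw_bytes, mime_type) for a Base64 image."""
    # validate=True raises binascii.Error on characters outside the alphabet.
    raw = base64.b64decode(b64_string, validate=True)
    for magic, mime in _SIGNATURES.items():
        if raw.startswith(magic):
            return raw, mime
    return raw, "application/octet-stream"


def to_data_url(b64_string):
    """Hypothetical helper: wrap a Base64 image as a data URL."""
    raw, mime = decode_image(b64_string)
    payload = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The raw bytes from `decode_image` can also be written to a file or passed to an image library if the model expects pixels rather than a URL.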

@dartpain let me know whether this is the correct approach or the wrong direction.


vercel bot commented Nov 28, 2024

@fadingNA is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

Change mpnet to openai

format
@fadingNA
Contributor Author

@dartpain this is the progress we discussed on Discord. Let me know if this is the right direction.

  • Text embedding seems to work properly, except that the default document uses 768 dimensions while CLIP uses 512 dimensions.
  • For images, I may have to resize them to make the Base64 smaller before embedding with the CLIP model.
  • Issue: uploading a DOCX to the server gets stuck, apparently for documents that contain images.
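
To make the 768 vs. 512 mismatch concrete, here is a minimal sketch of keeping one store per modality and rejecting vectors of the wrong size at insert time. The class `DimensionCheckedStore` and the in-memory list are hypothetical stand-ins, not the vector store used in this project; the dimensions are the ones quoted in the list above.

```python
class DimensionCheckedStore:
    """Hypothetical in-memory store that only accepts vectors of one size."""

    def __init__(self, name, dim):
        self.name = name
        self.dim = dim
        self.items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        # Fail early instead of letting a mixed-dimension index get built.
        if len(vector) != self.dim:
            raise ValueError(
                f"{self.name} store expects {self.dim}-d vectors, "
                f"got {len(vector)}-d"
            )
        self.items.append((list(vector), payload))


# One store per modality, so text and image embeddings never share an index.
stores = {
    "text": DimensionCheckedStore("text", 768),
    "image": DimensionCheckedStore("image", 512),
}

stores["text"].add([0.0] * 768, {"position": 0, "content": "first paragraph"})
stores["image"].add([0.0] * 512, {"position": 1, "content": "<base64 image>"})
```

Storing a `position` alongside each payload (also an assumption of this sketch) is one way results from the two stores could later be merged back into paragraph order.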

Group Split | Split | Ignore Image and Table

update schema IMAGE and Table

vercel bot commented Dec 5, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| docs-gpt | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Dec 6, 2024 5:52am |
