feat/partition_metadata #2933
Comments
I've also wanted this: not just the title, but also meta tags such as keywords and description, and the og tags. Currently I fetch the URL myself, parse these out with BeautifulSoup, then pass the response text to |
Is your feature request related to a problem? Please describe.
I need to be able to extract additional metadata from HTML documents. Specifically, I would like to extract favicons and `head > title` elements.
Describe the solution you'd like
Some flexible way to define additional metadata to extract per document type. Text file types could use regex (which already seems to be supported), HTML could use selectors, etc.
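To make the idea concrete, here is a rough sketch of what per-document-type extractors could look like. It is not this project's API: `EXTRACTORS`, `extract_metadata`, `make_regex_extractor`, and `HeadMetadataParser` are all hypothetical names, and the HTML part uses the standard library's `html.parser` as a stand-in for real CSS selectors so that the example runs without extra dependencies.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin


class HeadMetadataParser(HTMLParser):
    """Collects the first <title>, <meta> tags, and favicon <link> tags."""

    def __init__(self, base_url: str = "") -> None:
        super().__init__()
        self.base_url = base_url
        self.metadata: dict = {"title": None, "meta": {}, "favicons": []}
        self._in_title = False
        self._title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        if tag == "title" and self.metadata["title"] is None:
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name=..., Open Graph tags use property="og:...".
            key = attr.get("name") or attr.get("property")
            if key and "content" in attr:
                self.metadata["meta"][key.lower()] = attr["content"]
        elif tag == "link" and "icon" in attr.get("rel", "").lower().split():
            # Resolve relative favicon paths against the page URL, if given.
            self.metadata["favicons"].append(
                urljoin(self.base_url, attr.get("href", ""))
            )

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.metadata["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_html_metadata(content: str, base_url: str = "") -> dict:
    parser = HeadMetadataParser(base_url)
    parser.feed(content)
    parser.close()
    return parser.metadata


def make_regex_extractor(patterns: Dict[str, str]) -> Callable[[str], dict]:
    """Builds an extractor that stores group 1 of each pattern's first match."""
    compiled = {name: re.compile(p, re.MULTILINE) for name, p in patterns.items()}

    def extract(content: str) -> dict:
        found = {}
        for name, pattern in compiled.items():
            match = pattern.search(content)
            if match:
                found[name] = match.group(1)
        return found

    return extract


# Hypothetical registry: one extractor per document type.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    "text/html": extract_html_metadata,
    "text/plain": make_regex_extractor({"author": r"^Author:\s*(.+)$"}),
}


def extract_metadata(content: str, filetype: str) -> dict:
    """Runs the extractor registered for filetype; returns {} if there is none."""
    extractor = EXTRACTORS.get(filetype)
    return extractor(content) if extractor else {}
```

The point of the registry is that users could plug in their own callable per file type (selector-based, regex-based, or anything else) and have the result merged into the document metadata during partitioning rather than in a separate pass afterwards.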
Describe alternatives you've considered
Doing it after partitioning and before indexing, but that is neither elegant nor efficient.
Additional context
Even using LLMs to extract metadata, as orchestration frameworks support, would be great.