feat/partition_metadata #2933

Falven · 2024-04-25T16:31:17Z

Is your feature request related to a problem? Please describe.
I need to be able to extract additional metadata from HTML documents. Specifically, I would like to extract favicons and the head > title element.

Describe the solution you'd like
Some flexible way to define additional metadata to extract per document type. Text file types could use regex (which already seems to be supported), HTML could use selectors, and so on.
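To make the request concrete, here is a minimal sketch of what a per-document-type extractor registry could look like. Everything in it (`EXTRACTORS`, `register`, `regex_extractor`, `extract_metadata`, and the `invoice_id` field) is hypothetical and only illustrates the idea; none of it is an existing unstructured API.

```python
import re

# Hypothetical registry: document type -> {metadata field name -> extractor callable}.
EXTRACTORS = {}


def register(doc_type, field, extractor):
    """Attach an extractor (a callable taking the raw content) to a document type."""
    EXTRACTORS.setdefault(doc_type, {})[field] = extractor


def regex_extractor(pattern, flags=0):
    """Build an extractor that returns the first capture group, or None if no match."""
    compiled = re.compile(pattern, flags)

    def extract(content):
        match = compiled.search(content)
        return match.group(1) if match else None

    return extract


def extract_metadata(doc_type, content):
    """Run every extractor registered for doc_type and collect the results."""
    return {field: fn(content) for field, fn in EXTRACTORS.get(doc_type, {}).items()}


# Example registration for plain text; an HTML entry would plug in a
# selector-based callable instead of a regex.
register("text", "invoice_id", regex_extractor(r"Invoice\s+#(\d+)"))
```

With this shape, the partitioner would only need to look up the extractors for the detected document type and merge their output into each document's metadata.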

Describe alternatives you've considered
Doing it after partitioning and before indexing, but that is neither elegant nor efficient.

Additional context
Even using LLMs to extract metadata, as orchestration frameworks support, would be great.


adieuadieu · 2024-04-29T17:23:46Z

I've also wanted this: the title, but also meta tags such as keywords and description, and the og tags. Currently I fetch the URL myself, parse these out with beautifulsoup, then pass the response text to partition for the rest. It would be nicer if partition_html could return these in a more structured way. For the title especially, it would be nice if it came back as, e.g., a PageTitle (or, I guess, HTMLHeadTitle?) element type, or something like that.
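For anyone who needs the same workaround, here is a rough sketch of the head-parsing step. It uses the standard library's `html.parser` rather than beautifulsoup so that it runs without extra dependencies; the class and function names (`HeadMetadataParser`, `extract_head_metadata`) are made up for this example and are not part of unstructured.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text, <meta> name/property values, and icon links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.icons = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Covers both <meta name="description"> and <meta property="og:...">.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link":
            # rel can hold several tokens, e.g. "shortcut icon".
            rel = (attrs.get("rel") or "").lower().split()
            if "icon" in rel and attrs.get("href"):
                self.icons.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_head_metadata(html_text):
    """Return the title, meta tags, and favicon hrefs found in html_text."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "icons": parser.icons,
    }
```

The same `html_text` can then be handed to `partition_html(text=html_text)` for the body elements, and the returned dictionary merged into each element's metadata before indexing.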

Falven added the enhancement New feature or request label Apr 25, 2024

scanny added the html label Apr 25, 2024
