I'm having a consistent problem getting the article content to include the main h1 heading when it's in a <heading> element that sits outside of the <article>.
Common example is WaPo, e.g. https://www.washingtonpost.com/dc-md-va/2024/05/14/maryland-democratic-senate-primary/
The extracted content begins at "Prince George’s County Executive Angela D. Alsobrooks..." and misses the h1 as well as the subheading right below it.
I've been trying to do some preprocessing of the HTML (basically moving the h1 element into the <article>), but I can't get it working. Going through the code, I can't quite figure out why exactly it's being filtered out in the first place.
Any thoughts?
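For reference, the preprocessing described above could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses `lxml.html` directly, the helper name `move_heading_into_article` is made up for this example, and it assumes the page has a single relevant <article> and that the first h1 outside of it is the headline. It does not explain why the extractor drops the heading in the first place.

```python
from lxml import html as lxml_html


def move_heading_into_article(page_source: str) -> str:
    """Hypothetical helper: move the first <h1> found outside <article>
    to the top of the first <article>, and return the modified HTML."""
    tree = lxml_html.fromstring(page_source)

    articles = tree.xpath("//article")
    if not articles:
        # Nothing to move the heading into; leave the page untouched.
        return page_source
    article = articles[0]

    # Only consider h1 elements that are not already inside an <article>.
    outside_headings = tree.xpath("//h1[not(ancestor::article)]")
    if not outside_headings:
        return page_source
    h1 = outside_headings[0]

    # Detach the h1 from its current parent. drop_tree() leaves any
    # trailing text in the old location, so clear the tail on the moved
    # element to avoid duplicating that text.
    h1.drop_tree()
    h1.tail = None

    # Insert the heading as the first child of the article.
    article.insert(0, h1)

    return lxml_html.tostring(tree, encoding="unicode")
```

The returned string could then be passed to `trafilatura.extract` in place of the original HTML. Whether the heading survives extraction after that still depends on the extractor's own filtering, so this is a starting point for debugging, not a confirmed fix.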
It is debatable whether titles are part of the main content; they are not always included in benchmarks. That being said, the main title should also be present in the metadata. Would that be a solution?
I would say no, because in instances where this happens (like the Washington Post), there might be other information, like a subtitle (or you could imagine author, etc.), in that block too.
I agree it's an incorrect use of the element and its semantics, but I have seen it across a few sites (WaPo just being the most prominent).
Anecdotally, when I've seen it, it has always been within a