I hope I'm not flooding you with too many issues 😄
I have an FAQ page that uses <button /> to show and hide the corresponding content. It looks roughly like this:
<div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question"><button is="toggle-button" class="collapsible-toggle text--strong"
aria-controls="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225" aria-expanded="true"
itemprop="name">1. Question: Dolor sit amet<span class="animated-plus"></span></button><collapsible-content id="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225"
class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"
style="overflow: visible;" open=""><div class="collapsible__content text-container" itemprop="text"><p>1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua</p></div></collapsible-content></div>
The <button /> contains the question and the <collapsible-content/> contains the answer. However, when I extract the content, the text from the button is not extracted. This usually makes sense because buttons are not really semantic elements, but in this case the buttons contain valuable information.
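As a possible stopgap, the question buttons could be rewritten into heading elements before the HTML is passed to the extractor. The following is only a minimal sketch using the standard library: `promote_question_buttons` and the choice of `h3` are hypothetical, the regular expression assumes markup shaped like the snippet above, and whether the extractor then keeps the headings has not been verified here.

```python
import re

# Matches <button ... itemprop="name" ...>inner</button>, including attributes
# that span several lines. Assumes double-quoted attribute values.
BUTTON_RE = re.compile(
    r'<button\b[^>]*\bitemprop="name"[^>]*>(.*?)</button>',
    flags=re.DOTALL | re.IGNORECASE,
)
TAG_RE = re.compile(r"<[^>]+>")


def promote_question_buttons(html: str, tag: str = "h3") -> str:
    """Replace question buttons with plain heading elements (hypothetical helper)."""

    def repl(match: re.Match) -> str:
        # Drop nested markup such as the decorative <span> and keep the text.
        text = TAG_RE.sub("", match.group(1)).strip()
        return f"<{tag}>{text}</{tag}>"

    return BUTTON_RE.sub(repl, html)


sample = (
    '<button is="toggle-button" itemprop="name">1. Question: Dolor sit amet'
    '<span class="animated-plus"></span></button>'
)
print(promote_question_buttons(sample))
```

The rewritten HTML could then be piped to the same command instead of the original file.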
Here's the full example faq.html:
<div class="faq"><div class="faq-navigation hidden-pocket"></div><div class="faq__wrapper" itemscope="" itemtype="https://schema.org/FAQPage"><h2 id="category-template--18681112592652__faq-6c8e6357-90c2-4ba6-8bbb-95ce71242bae"
class="faq__category heading h6 anchor">FAQs</h2><div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question"><button is="toggle-button" class="collapsible-toggle text--strong"
aria-controls="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225" aria-expanded="true"
itemprop="name">1. Question: Dolor sit amet<span class="animated-plus"></span></button><collapsible-content id="block-template--18681112592652__faq-18db33c8-b7be-456f-b5b4-8c935376f225"
class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"
style="overflow: visible;" open=""><div class="collapsible__content text-container" itemprop="text"><p>1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua</p></div></collapsible-content></div><div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question"><button is="toggle-button" class="collapsible-toggle text--strong"
aria-controls="block-template--18681112592652__faq-38d0d4d0-5a0b-4ba0-a733-8326f46abeef" aria-expanded="false"
itemprop="name">2. Question: Dolor sit amet<span class="animated-plus"></span></button><collapsible-content id="block-template--18681112592652__faq-38d0d4d0-5a0b-4ba0-a733-8326f46abeef"
class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"
style="overflow: hidden;"><div class="collapsible__content text-container" itemprop="text"><p>2. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna
aliqua</p></div></collapsible-content></div><div class="faq__item" itemscope="" itemprop="mainEntity" itemtype="https://schema.org/Question"><button is="toggle-button" class="collapsible-toggle text--strong"
aria-controls="block-template--18681112592652__faq-04c7cdf4-44b6-454c-9f88-8ecab2ab380a" aria-expanded="false"
itemprop="name">3. Question: Dolor sit amet<span class="animated-plus"></span></button><collapsible-content id="block-template--18681112592652__faq-04c7cdf4-44b6-454c-9f88-8ecab2ab380a"
class="collapsible anchor" itemscope="" itemprop="acceptedAnswer" itemtype="https://schema.org/Answer"><div class="collapsible__content text-container" itemprop="text"><p>3. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna
aliqua</p></div></collapsible-content></div></div></div>
Here is how I run it:
cat faq.html | trafilatura --formatting --links
The received result:
1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
3. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
There also seems to be a small bug: it extracts the first and third answers but omits the second one (2. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua).
That's the result I would like to get:
## FAQs
1. Question: Dolor sit amet
1. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
2. Question: Dolor sit amet
2. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
3. Question: Dolor sit amet
3. Answer: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
The FAQ elements contain semantic information from schema.org, like itemtype="https://schema.org/Question" and other attributes like itemprop and itemscope. Maybe it is possible to implement a rule that keeps these semantic elements in the result?
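To illustrate what such a rule might look like, here is a rough standard-library sketch that collects question/answer pairs purely from the microdata attributes, regardless of which tags carry them. It is not how trafilatura works internally; `FAQMicrodataParser` and `extract_faq` are hypothetical names, and the sketch assumes each `Question` item has one `itemprop="name"` element and one `itemprop="text"` element, as in the example above.

```python
from html.parser import HTMLParser


class FAQMicrodataParser(HTMLParser):
    """Collect [question, answer] pairs from schema.org Question microdata."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # one [question, answer] list per Question item
        self._field = None   # 0 = question, 1 = answer, None = not capturing
        self._tag = None     # tag name of the element being captured
        self._depth = 0      # nesting depth of same-named tags while capturing

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Inside a captured element: only track nesting of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        attrs = dict(attrs)
        if (attrs.get("itemtype") or "").endswith("schema.org/Question"):
            self.pairs.append(["", ""])
        prop = attrs.get("itemprop")
        if self.pairs and prop in ("name", "text"):
            self._field = 0 if prop == "name" else 1
            self._tag, self._depth = tag, 1

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.pairs[-1][self._field] += data


def extract_faq(html):
    """Return (question, answer) tuples with whitespace collapsed."""
    parser = FAQMicrodataParser()
    parser.feed(html)
    parser.close()
    return [(" ".join(q.split()), " ".join(a.split())) for q, a in parser.pairs]


sample = """
<div itemscope itemtype="https://schema.org/FAQPage">
  <div itemscope itemprop="mainEntity" itemtype="https://schema.org/Question">
    <button itemprop="name">1. Question: Dolor sit amet<span class="animated-plus"></span></button>
    <collapsible-content itemscope itemprop="acceptedAnswer" itemtype="https://schema.org/Answer">
      <div itemprop="text"><p>1. Answer: Lorem ipsum
      dolor sit amet</p></div>
    </collapsible-content>
  </div>
  <div itemscope itemprop="mainEntity" itemtype="https://schema.org/Question">
    <button itemprop="name">2. Question: Dolor sit amet</button>
    <collapsible-content style="overflow: hidden;" itemscope itemprop="acceptedAnswer" itemtype="https://schema.org/Answer">
      <div itemprop="text"><p>2. Answer: Lorem ipsum dolor sit amet</p></div>
    </collapsible-content>
  </div>
</div>
"""

for question, answer in extract_faq(sample):
    print(question)
    print(answer)
```

Because the sketch keys on `itemprop` and `itemtype` rather than on tag names, the `<button>` text and the content of the custom `<collapsible-content>` element are treated the same way as any other element.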
Hi, thanks for the detailed example. As you say, this seems to be a bug (item 2), a potential enhancement (button), and a problem with the source at the same time (I think <collapsible-content> is not a standard HTML tag). I need to check what can reasonably be done here.