Introduce paragraph count summary #2761

gagath · 2020-05-28T19:12:11Z

Let users use only the first n paragraphs of an article as its
summary.

This avoids word-count truncation, which cuts content off at an
arbitrary point with an ellipsis, without requiring authors to copy the
first paragraph of the article into the article's summary metadata.

Pull Request Checklist

Ensured tests pass and (if applicable) updated functional test output
Conformed to code style guidelines by running appropriate linting tools
Added tests for changed code
Updated documentation for changed code

boxydog · 2023-10-29T19:19:26Z

The functionality (paragraphs) seems reasonable. Using negative and positive SUMMARY_MAX_LENGTH feels confusing. I'd introduce a different config var. But, my opinion is not binding.

boxydog · 2023-10-29T19:22:16Z

Less confusing would be SUMMARY_MAX_LENGTH (for words) and SUMMARY_MAX_PARAGRAPHS (for paragraphs).

If both are set, could 1) apply both, or 2) pick a winner (e.g., SUMMARY_MAX_PARAGRAPHS), or 3) raise an error, or 4) hidden option 4.

pelican/tests/test_contents.py

paulocoutinhox · 2023-11-13T20:50:54Z

+1

avaris · 2023-11-13T21:49:55Z

pelican/utils.py

+ for i in range(count):
+     substr = substr[tag_stop:]
+     tag_start = substr.find('<p>')
+     tag_stop = substr.find('</p>') + len('</p>')


My time on the interwebs taught me that the possibility of finding a closing tag for <p> is like a coin toss. Even the HTML spec is permissive about it and "formally" allows omitting </p> in most cases. Long story short, I am trying to say that this is quite unreliable.

gagath · 2023-11-17

The HTML mostly comes from a generator (reST, Markdown…) that hopefully emits closing tags, so in practice this should not happen very much.

avaris · 2023-11-18

> in practice this should not happen very much.

That doesn't cut it though :). rst and md generators may play nice (for now) but content can come from any number of sources not limited to those, including direct html files, plugins for other formats etc. In software, edge cases happen more often than one would initially think :).

I'm not suggesting employing full DOM parsers like lxml or similar, although that would be more complete. But that can be postponed until the need arises. It should at least produce something reasonable in those cases: i.e. if </p> is not present (substr.find('</p>') == -1), take up to the next <p>:

tag_stop = substr.find('</p>')
if tag_stop == -1:  # no closing p
    tag_stop = substr.find('<p>', tag_start + len('<p>'))
    # if tag_stop == -1:  # if there are no more <p>s take everything else?
    #     tag_stop = None
else:
    tag_stop += len('</p>')
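
Put together as a standalone function, the fallback suggested above could look roughly like the sketch below. This is not the code from the PR: the name first_paragraphs is hypothetical, it only recognises a literal <p> with no attributes, and it settles the open question in the comment by taking the rest of the content when the last paragraph is unclosed. It also cuts at the next <p> when the nearest </p> belongs to a later paragraph, which goes slightly beyond the suggestion.

```python
def first_paragraphs(html: str, count: int) -> str:
    """Return the first `count` <p> blocks of `html`, tolerating a missing </p>."""
    open_tag, close_tag = "<p>", "</p>"
    pieces = []
    pos = 0
    for _ in range(count):
        start = html.find(open_tag, pos)
        if start == -1:  # no more paragraphs
            break
        next_open = html.find(open_tag, start + len(open_tag))
        stop = html.find(close_tag, start)
        if stop != -1 and (next_open == -1 or stop < next_open):
            stop += len(close_tag)  # normal case: include the closing tag
        elif next_open != -1:
            stop = next_open  # no closing tag for this <p>: cut at the next <p>
        else:
            stop = len(html)  # assumption: last paragraph unclosed, take the rest
        pieces.append(html[start:stop])
        pos = stop
    return "".join(pieces)


print(first_paragraphs("<p>a</p><p>b</p><p>c</p>", 2))  # <p>a</p><p>b</p>
print(first_paragraphs("<p>a<p>b</p>", 1))              # <p>a
```

A real HTML parser would handle attributes, uppercase tags, and nested markup; plain string search is kept here only to stay close to the approach under review.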

gagath · 2023-11-17T17:05:07Z

Rebased
Added new setting SUMMARY_MAX_PARAGRAPHS
Added a new test to verify what happens when both SUMMARY_MAX_PARAGRAPHS and SUMMARY_MAX_LENGTH are set
Fixed formatting issues
Added missing documentation
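
For illustration, a settings file that sets both options might contain something like the following. The values are arbitrary examples chosen for this sketch, not defaults taken from the PR.

```python
# pelicanconf.py (excerpt); the values below are illustrative assumptions
SUMMARY_MAX_PARAGRAPHS = 2  # keep at most the first two paragraphs
SUMMARY_MAX_LENGTH = 50     # then cap the result at 50 words
```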

docs/settings.rst

pelican/contents.py

Let users use only the first n paragraphs of an article as its summary. This avoids word-count truncation, which cuts content off at an arbitrary point with an ellipsis, without requiring authors to copy the first paragraph of the article into the article's summary metadata. If both SUMMARY_MAX_PARAGRAPHS and SUMMARY_MAX_LENGTH are set, the SUMMARY_MAX_LENGTH word limit is applied to the paragraphs selected by SUMMARY_MAX_PARAGRAPHS.
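
The interaction of the two settings described above can be sketched as a two-step pipeline: select paragraphs first, then apply the word limit to what is left. The functions summarize and truncate_words below are hypothetical stand-ins written for this example, not Pelican's own helpers. The word truncation is deliberately naive: it assumes simple, well-formed element tags with no void elements such as <br>.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # split pattern that keeps tags as separate tokens


def truncate_words(html, max_words, end_marker="…"):
    """Naive word truncation that closes any tags still open at the cut."""
    out, open_tags, remaining = [], [], max_words
    for token in TAG.split(html):
        if token.startswith("</"):
            if open_tags:
                open_tags.pop()
            out.append(token)
        elif token.startswith("<"):
            match = re.match(r"<\s*(\w+)", token)
            if match:  # skip comments and other non-element markup
                open_tags.append(match.group(1))
            out.append(token)
        else:
            words = token.split()
            if len(words) > remaining:
                out.append(" ".join(words[:remaining]) + end_marker)
                out.extend(f"</{name}>" for name in reversed(open_tags))
                break
            out.append(token)
            remaining -= len(words)
    return "".join(out)


def summarize(html, max_paragraphs=None, max_words=None):
    """Apply the paragraph limit first, then the word limit to the result."""
    if max_paragraphs is not None:
        paragraphs = re.findall(r"<p>.*?</p>", html, flags=re.DOTALL)
        html = "".join(paragraphs[:max_paragraphs])
    if max_words is not None:
        html = truncate_words(html, max_words)
    return html


content = "<p>one two three</p><p>four five</p><p>six</p>"
print(summarize(content, max_paragraphs=2, max_words=4))
# <p>one two three</p><p>four…</p>
```

With only max_paragraphs set, the same call returns the first two paragraphs untouched, so no sentence is cut mid-way.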

gagath · 2023-11-18T16:22:46Z

@avaris Thanks for your comments, fixed them.

gagath force-pushed the paragraph-summary branch from 17b7467 to d7935a4 May 28, 2020 19:14

justinmayer added the "awaiting review" and "needs docs" labels Oct 5, 2021

gagath force-pushed the paragraph-summary branch from d7935a4 to 298dd8c January 5, 2022 15:57

boxydog reviewed Oct 29, 2023

pelican/tests/test_contents.py (outdated)

avaris reviewed Nov 13, 2023

gagath force-pushed the paragraph-summary branch from 298dd8c to 1c044ed November 17, 2023 16:59

gagath changed the title ~~[WIP] Introduce paragraph count summary~~ Introduce paragraph count summary Nov 17, 2023

gagath force-pushed the paragraph-summary branch from 1c044ed to d01c5b7 November 17, 2023 17:06

avaris reviewed Nov 18, 2023

docs/settings.rst (outdated)

avaris reviewed Nov 18, 2023

pelican/contents.py (outdated)

gagath force-pushed the paragraph-summary branch from d01c5b7 to fd029b2 November 18, 2023 16:21

justinmayer removed the "needs docs" label Nov 29, 2023
