Description
Environment
ESLint version: HEAD
@eslint/markdown version: HEAD
Node version: 20.18.0
npm version: 10.9.2
Operating System: Windows 11
What problem do you want to solve?
@eslint/eslint-team
Hi team 😄
I’d like to suggest using the `cheerio` library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.
(Handling HTML directly may not be the main focus for the team, since the `prelint` feature is currently under RFC. However, HTML is a standard feature of CommonMark, and we already have a lot of logic built around it. That’s why I believe this change is necessary.)
While working on the `@eslint/markdown` repository, I’ve seen a lot of regex-related fixes. Since Markdown handles natural text, sometimes regex is unavoidable.
However, lately many fixes have focused on false positives or negatives with HTML nodes, as seen in these issues and PRs:
- Ongoing:
  - Rule Change: Improve HTML id/name attribute parsing in no-missing-link-fragments #481
  - Bug: `no-multiple-h1` and `require-alt-text` miss errors after a HTML comment is closed #464
  - fix: improve HTML id/name regex for unquoted values and spaces #480
  - fix: detect errors after comments in no-multiple-h1 and require-alt-text #468
- Merged:
Manual regex for HTML nodes worked fine initially, but as the project has grown and more users are adopting this plugin, more problems are cropping up. Some examples:
- False positives and negatives
- Inconsistent HTML node handling across rules
- Potential security issues (ReDoS)
- Increased maintenance cost for reviewing regex-related issues and writing robust patterns
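To make the first point concrete, here is a minimal sketch of how a hand-written pattern can produce both false negatives and false positives. The pattern, the `collectIdsWithRegex` helper, and the sample inputs are hypothetical and written only for illustration; they are not taken from the plugin's source.

```javascript
// Hypothetical pattern for illustration only; not taken from the plugin.
// It only recognizes double-quoted id values with no spaces around "=".
const naiveIdPattern = /<[^>]+\sid="([^"]*)"/g;

// Returns every id value the naive pattern finds in the given HTML string.
function collectIdsWithRegex(html) {
  return [...html.matchAll(naiveIdPattern)].map((match) => match[1]);
}

const samples = {
  quoted: '<a id="quoted"></a>', // found
  singleQuoted: "<a id='single'></a>", // missed (false negative)
  unquoted: "<a id=unquoted></a>", // missed (false negative)
  spaced: '<a id = "spaced"></a>', // missed (false negative)
  commentedOut: '<!-- <a id="hidden"></a> -->', // found inside a comment (false positive)
};

for (const [label, html] of Object.entries(samples)) {
  console.log(label, collectIdsWithRegex(html));
}
```

Each gap can be patched with a more elaborate pattern, but every patch makes the expression harder to review.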
Because of all this, there’s a growing need for consistent HTML node handling.
So, I’d like to propose introducing `cheerio` for HTML node handling.
Here are some pros and cons:
Pros:
- More robust code with fewer false positives/negatives
- Consistent HTML node handling across the repository, reducing maintenance cost
- Safer from ReDoS attacks (we have already merged a PR addressing this)
- Less time spent reviewing and maintaining custom regex
Cons:
- Adds a new dependency
- Some learning curve for the library
- There’s an ongoing RFC for using `prelint` for HTML handling (though I’m not sure this covers all cases, since HTML is a standard part of CommonMark and we already have a lot of logic around it; a quicker solution may be needed, as `prelint` could take a while)
As for the roadmap, I don’t think the transition would be too costly:
- For ongoing issues/PRs: Authors could switch to `cheerio` instead of regex
- For others: I’m happy to take this on and refactor as needed
What do you think is the correct solution?
I’d like to suggest using the `cheerio` library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.
Participation
- I am willing to submit a pull request for this change.
Additional comments
If it would help make things more robust and reliable, I’m open to other suggestions. Let’s think about how we can handle HTML nodes in a consistent and safe way.