Change Request: Using cheerio library instead of manual regex for handling HTML #483

@lumirlumir

Description

Environment

ESLint version: HEAD
@eslint/markdown version: HEAD
Node version: 20.18.0
npm version: 10.9.2
Operating System: Windows 11

What problem do you want to solve?

@eslint/eslint-team

Hi team 😄

I’d like to suggest using the cheerio library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.

(Handling HTML directly may not be the main focus for the team, since the prelint feature is currently under RFC. However, HTML is a standard feature of CommonMark, and we already have a lot of logic built around it. That’s why I believe this change is necessary.)


While working on the @eslint/markdown repository, I’ve seen a lot of regex-related fixes. Since Markdown handles natural text, sometimes regex is unavoidable.

However, lately many fixes have focused on false positives or negatives with HTML nodes, as seen in these issues and PRs:

Manual regex for HTML nodes worked fine initially, but as the project has grown and more users have adopted the plugin, more problems have cropped up. Some examples:

  • False positives and negatives
  • Inconsistent HTML node handling across rules
  • Potential security issues (ReDoS)
  • Increased maintenance cost for reviewing regex-related issues and writing robust patterns
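
To make the first bullet concrete, here is a minimal sketch of how a hand-written tag pattern can misfire. The pattern and the function name are hypothetical and are not taken from the repository; they only illustrate the kind of problem described above.

```javascript
// Hypothetical pattern (an assumption for illustration, not code from
// @eslint/markdown): find <img> tags that have no alt attribute.
const imgWithoutAlt = /<img(?![^>]*\balt=)[^>]*>/gi;

function findImagesWithoutAlt(html) {
  // String.prototype.match returns null when nothing matches.
  return html.match(imgWithoutAlt) ?? [];
}

// Works for the simple case: one tag, no alt attribute.
const simple = findImagesWithoutAlt('<img src="a.png">');

// False positive: the tag sits inside an HTML comment, so it is never rendered,
// but the pattern still reports it.
const commented = findImagesWithoutAlt('<!-- <img src="a.png"> -->');

// False positive: the ">" inside the quoted title ends the scan early,
// so the lookahead never reaches the alt attribute.
const quoted = findImagesWithoutAlt('<img title="a > b" alt="ok">');

console.log(simple.length, commented.length, quoted.length);
```

Each of these edge cases can be patched with a more elaborate pattern, but every patch adds review cost and another chance of catastrophic backtracking.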

Because of all this, there’s a growing need for consistent HTML node handling.

So, I’d like to propose introducing cheerio for HTML node handling.

Here are some pros and cons:

Pros:

  • More robust code with fewer false positives/negatives
  • Consistent HTML node handling across the repository, reducing maintenance cost
  • Safer from ReDoS attacks (we have already merged a PR about this)
  • Less time spent reviewing and maintaining custom regex

Cons:

  • Adds a new dependency
  • Some learning curve for the library
  • There’s an ongoing RFC for using prelint for HTML handling. I’m not sure it covers all cases, though: HTML is a standard part of CommonMark, we already have a lot of logic around it, and prelint could take a while, so a quicker solution may be needed.

As for the roadmap, I don’t think the transition would be too costly:

  • For ongoing issues/PRs: Authors could switch to cheerio instead of regex
  • For others: I’m happy to take this on and refactor as needed

What do you think is the correct solution?

I’d like to suggest using the cheerio library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.

Participation

  • I am willing to submit a pull request for this change.

Additional comments

If it would help make things more robust and reliable, I’m open to other suggestions. Let’s think about how we can handle HTML nodes in a consistent and safe way.

Status

Ready to Implement
