Skip to content

Replace bleach with nh3 for HTML sanitization#163

Open
hoheinzollern wants to merge 7 commits intotorchbox:mainfrom
hoheinzollern:main
Open

Replace bleach with nh3 for HTML sanitization#163
hoheinzollern wants to merge 7 commits intotorchbox:mainfrom
hoheinzollern:main

Conversation

@hoheinzollern
Copy link
Copy Markdown

@hoheinzollern hoheinzollern commented Mar 14, 2026

This PR replaces bleach with nh3 for HTML sanitization, providing significant performance improvements while maintaining the same security guarantees.

Key Changes

  • API Compatibility: All existing configuration options continue to work exactly as before
  • Performance: nh3 is significantly faster than bleach (benchmarks show ~20x improvement)
  • Security: Maintains the same security guarantees with automatic attribute handling

Technical Details

  • Updated constants to use sets instead of lists (nh3 requirement)
  • Modified configuration functions to work with nh3's API
  • Removed rel from allowed attributes (nh3 handles this automatically for security)
  • Updated all tests to work with the new data structures

Testing

  • All existing tests pass (note the necessary upgrade, nh3 adds the rel attribute to links with no option to remove it)
  • Basic functionality verified (markdown rendering, HTML sanitization)

Migration Guide

This change is mostly transparent to users. No code changes are required in projects using wagtail-markdown. Visible changes will be improved performance and the loss of rel attribute.

References

This commit replaces the bleach library with nh3 for HTML sanitization,
providing significant performance improvements while maintaining the same
security guarantees.

Key changes:
- Updated dependency from bleach to nh3
- Modified constants to use sets instead of lists (nh3 requirement)
- Updated configuration functions to work with nh3's API
- Adjusted tests to work with the new data structures
- Removed 'rel' from allowed attributes (nh3 handles this automatically)
- Updated documentation to reflect the change

The migration is transparent to users - all existing configuration
options continue to work exactly as before, but with improved performance.
@hoheinzollern
Copy link
Copy Markdown
Author

Currently the CI is failing with weird messages (one wants me to do the opposite of the other, in two instances where dict() and dict comprehensions are used) so I'm unsure how to fix it.

else:
# Convert nh3 default attributes to same format as user attributes
default_attrs = nh3_kwargs.get("attributes", {})
if default_attrs and isinstance(next(iter(default_attrs.values())), set):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: why is this check needed? you set the attributes defaults to a set in https://github.com/torchbox/wagtail-markdown/pull/163/changes#diff-842b7897cbee817049ee8502502a7bc00491ba84ca8162d1ea20670be14e0398R96

should the user attributes be converted to a set?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted this check

@zerolab
Copy link
Copy Markdown
Member

zerolab commented Mar 16, 2026

Currently the CI is failing with weird messages (one wants me to do the opposite of the other, in two instances where dict() and dict comprehensions are used) so I'm unsure how to fix it.

Ruff fails with https://docs.astral.sh/ruff/rules/unnecessary-generator-dict/ and https://docs.astral.sh/ruff/rules/unnecessary-comprehension/

@hoheinzollern
Copy link
Copy Markdown
Author

Ruff fails with https://docs.astral.sh/ruff/rules/unnecessary-generator-dict/ and https://docs.astral.sh/ruff/rules/unnecessary-comprehension/

Ok, I'm not sure precisely how to address this.

I tried a different approach where all the logic that converts lists to sets (seems necessary with the new library) is localized in the function _sanitise_markdown_html.

@hoheinzollern
Copy link
Copy Markdown
Author

OK, looks like the CI passes now, seems like it's addressed. One critical bit is the requirement to use sets. I've made the choice to do the conversion, but it's equally adequate to pass on the change of interface to the users of wagtail-markdown.
My reason for sending this patch is that I am relying on puput which depends on this project. My own pipeline was complaining about the outdated bleach package, so I figured it's an easy enough fix to switch to nh3 instead of just upgrading a deprecated package to a newer version.

Comment on lines +36 to +40
nh3_kwargs["tags"] = set(nh3_kwargs["tags"])
nh3_kwargs["attributes"] = {
key: set(value) for key, value in nh3_kwargs["attributes"].items()
}
nh3_kwargs["filter_style_properties"] = set(nh3_kwargs["filter_style_properties"])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, these belongs in _get_nhs3_kwargs

"<a>anchor tag</a> &lt;script&gt;alert('boom!')&lt;/script&gt;</p>"
'text with a <a href="https://example.com" rel="noopener noreferrer">link</a>\n'
"and some disallowed tag and attributes: italic, "
'<a rel="noopener noreferrer">anchor tag</a> </p>'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree to noopener noreferrer, not everyone will. I think we should leave this to the user, but default to the nh3 defaults.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants