Replace bleach with nh3 for HTML sanitization#163
Replace bleach with nh3 for HTML sanitization#163hoheinzollern wants to merge 7 commits intotorchbox:mainfrom
Conversation
This commit replaces the bleach library with nh3 for HTML sanitization, providing significant performance improvements while maintaining the same security guarantees. Key changes: - Updated dependency from bleach to nh3 - Modified constants to use sets instead of lists (nh3 requirement) - Updated configuration functions to work with nh3's API - Adjusted tests to work with the new data structures - Removed 'rel' from allowed attributes (nh3 handles this automatically) - Updated documentation to reflect the change The migration is transparent to users - all existing configuration options continue to work exactly as before, but with improved performance.
|
Currently the CI is failing with weird messages (one wants me to do the opposite of the other, in two instances where dict() and dict comprehensions are used) so I'm unsure how to fix it. |
src/wagtailmarkdown/utils.py
Outdated
| else: | ||
| # Convert nh3 default attributes to same format as user attributes | ||
| default_attrs = nh3_kwargs.get("attributes", {}) | ||
| if default_attrs and isinstance(next(iter(default_attrs.values())), set): |
There was a problem hiding this comment.
q: why is this check needed? you set the attributes defaults to a set in https://github.com/torchbox/wagtail-markdown/pull/163/changes#diff-842b7897cbee817049ee8502502a7bc00491ba84ca8162d1ea20670be14e0398R96
should the user attributes be converted to a set?
Ruff fails with https://docs.astral.sh/ruff/rules/unnecessary-generator-dict/ and https://docs.astral.sh/ruff/rules/unnecessary-comprehension/ |
Ok, I'm not sure precisely how to address this. I tried a different approach where all the logic that converts lists to sets (seems necessary with the new library) is localized in the function |
|
OK, looks like the CI passes now, seems like it's addressed. One critical bit is the requirement to use sets. I've made the choice to do the conversion, but it's equally adequate to pass on the change of interface to the users of |
| nh3_kwargs["tags"] = set(nh3_kwargs["tags"]) | ||
| nh3_kwargs["attributes"] = { | ||
| key: set(value) for key, value in nh3_kwargs["attributes"].items() | ||
| } | ||
| nh3_kwargs["filter_style_properties"] = set(nh3_kwargs["filter_style_properties"]) |
There was a problem hiding this comment.
IMHO, these belongs in _get_nhs3_kwargs
| "<a>anchor tag</a> <script>alert('boom!')</script></p>" | ||
| 'text with a <a href="https://example.com" rel="noopener noreferrer">link</a>\n' | ||
| "and some disallowed tag and attributes: italic, " | ||
| '<a rel="noopener noreferrer">anchor tag</a> </p>' |
There was a problem hiding this comment.
While I agree to noopener noreferrer, not everyone will. I think we should leave this to the user, but default to the nh3 defaults.
This PR replaces bleach with nh3 for HTML sanitization, providing significant performance improvements while maintaining the same security guarantees.
Key Changes
Technical Details
relfrom allowed attributes (nh3 handles this automatically for security)Testing
relattribute to links with no option to remove it)Migration Guide
This change is mostly transparent to users. No code changes are required in projects using wagtail-markdown. Visible changes will be improved performance and the loss of
relattribute.References