Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Gumbo.parse and .fragment now use keyword arguments #3199

Merged
merged 6 commits into from
May 24, 2024

Conversation

flavorjones
Copy link
Member

@flavorjones flavorjones commented May 21, 2024

What problem is this PR intended to solve?

Refactor Gumbo.parse and Gumbo.fragment to use keyword arguments for parse options (instead of positional arguments).

These methods are internal and not part of the public API, so this isn't considered a breaking change. We're changing them to be more extensible going forward.

See comments at #3178 for context

This PR also

  • introduces checking for the Gumbo options passed in
  • takes care to delete deprecated kwarg max_parse_errors
  • cleans up options passed around in tests

Have you included adequate test coverage?

It's a refactor, existing test coverage should be sufficient.

Does this change affect the behavior of either the C or the Java implementations?

No behavior changes, but the HTML5 parser is only available in CRuby.

@stevecheckoway
Copy link
Contributor

Hi @flavorjones,

This looks pretty much exactly like what I had in mind. A few questions:

  1. The names rb_gumbo_s_parse and rb_gumbo_s_fragment seem like they're from the Ruby extension API due to the leading rb_. How do you feel about just dropping the rb_?

  2. Should rb_scan_args_kw() be used instead of rb_scan_args()? I'm not entirely sure I understand from the documentation when the former should be used but it seems like it could be more specific about what it is looking for in terms of keyword arguments.

  3. How do you feel about something like this to ensure that the required keyword arguments are being passed.

    diff --git a/ext/nokogiri/gumbo.c b/ext/nokogiri/gumbo.c
    index dd333f7b..a9bfe6cf 100644
    --- a/ext/nokogiri/gumbo.c
    +++ b/ext/nokogiri/gumbo.c
    @@ -316,6 +316,32 @@ parse_cleanup(VALUE parse_args)
       return Qnil;
     }
    
    +// Scan the keyword arguments for options common to the document and fragment
    +// parse.
    +static GumboOptions
    +common_options(VALUE kwargs) {
    +  // The order of the keywords determines the order of the values below.
    +  // If this order is changed, then setting the options below must change as
    +  // well.
    +  ID keywords[] = {
    +    rb_intern_const("max_attributes"),
    +    rb_intern_const("max_errors"),
    +    rb_intern_const("max_tree_depth"),
    +  };
    +  VALUE values[sizeof keywords / sizeof keywords[0]];
    +
    +  // Extract the values coresponding to the required keywords. Raise an error
    +  // if required arguments are missing.
    +  rb_get_kwargs(kwargs, keywords, 3, -1, values);
    +
    +  GumboOptions options = kGumboDefaultOptions;
    +  options.max_attributes = NUM2INT(values[0]);
    +  options.max_errors = NUM2INT(values[1]);
    +  options.max_tree_depth = NUM2INT(values[2]);
    +
    +  return options;
    +}
    +
     static VALUE parse_continue(VALUE parse_args);
    
     /*
    @@ -331,10 +357,7 @@ rb_gumbo_s_parse(int argc, VALUE *argv, VALUE _self)
         kwargs = rb_hash_new();
       }
    
    -  GumboOptions options = kGumboDefaultOptions;
    -  options.max_attributes = NUM2INT(rb_hash_aref(kwargs, ID2SYM(rb_intern_const("max_attributes"))));
    -  options.max_errors = NUM2INT(rb_hash_aref(kwargs, ID2SYM(rb_intern_const("max_errors"))));
    -  options.max_tree_depth = NUM2INT(rb_hash_aref(kwargs, ID2SYM(rb_intern_const("max_tree_depth"))));
    +  GumboOptions options = common_options(kwargs);
    
       GumboOutput *output = perform_parse(&options, input);
       ParseArgs args = {
    @@ -440,6 +463,8 @@ rb_gumbo_s_fragment(int argc, VALUE *argv, VALUE _self)
         kwargs = rb_hash_new();
       }
    
    +  GumboOptions options = common_options(kwargs);
    +
       if (NIL_P(ctx)) {
         ctx_tag = "body";
         ctx_ns = GUMBO_NAMESPACE_HTML;
    @@ -543,14 +568,8 @@ error:
       }
    
       // Perform a fragment parse.
    -  GumboOptions options = kGumboDefaultOptions;
    -  options.max_attributes = NUM2INT(rb_hash_aref(kwargs, ID2SYM(rb_intern_const("max_attributes"))));
    -  options.max_errors = NUM2INT(rb_hash_aref(kwargs, ID2SYM(rb_intern_const("max_errors"))));
    -
    -  // Add one to account for the HTML element.
    -  int depth = NUM2INT(rb_hash_aref(kwargs, ID2SYM(rb_intern_const("max_tree_depth"))));
    -  options.max_tree_depth = depth < 0 ? -1 : (depth + 1);
    -
    +  // Add one to the max tree depth to account for the HTML element.
    +  options.max_tree_depth = options.max_tree_depth < 0 ? -1 : (options.max_tree_depth + 1);
       options.fragment_context = ctx_tag;
       options.fragment_namespace = ctx_ns;
       options.fragment_encoding = encoding;

    I initially made it an error to pass additional keyword arguments (by changing the -1 argument to 0), but then several tests fail because of unexpected keyword arguments encoding and max_parse_errors. It looks like the former is making it into the keyword args due to a bug in our tests where we're passing the encoding as an (ignored) keyword argument rather than as a positional argument (Pass encoding as a positional argument #3201). The latter is apparently a deprecated name for max_errors.

    I do like the idea of getting an error if the wrong keyword arguments are passed (although maybe that would be a breaking change?). We could remove max_parse_errors from the options prior to calling the C function.

@stevecheckoway
Copy link
Contributor

Here's a branch which incorporates #3201 as well as removes the max_parse_errors from the options and then causes errors for unexpected keyword arguments. flavorjones-use-kwargs-html5-methods...stevecheckoway:nokogiri:common-options

@flavorjones
Copy link
Member Author

The names rb_gumbo_s_parse and rb_gumbo_s_fragment seem like they're from the Ruby extension API due to the leading rb_. How do you feel about just dropping the rb_?

I've been trying to gradually improve the naming on our C functions, in particular the ones that are ruby methods, so that stack traces are a bit more meaningful. I don't feel strongly about this convention, but I'm already using it in a lot of places (try grep "^rb_" ext/nokogiri). An alternative I've also used in places is noko_ which is probably better. I'll change it to that.

Should rb_scan_args_kw() be used instead of rb_scan_args()?

Maybe? I'll take another look. The intention of that function was to ease the migration to first-class keyword arguments in Ruby 2.7/3.0, so it's probably worth a second look here.

How do you feel about something like this to ensure that the required keyword arguments are being passed?

Yeah, I like that a lot. I'll pull your branch into this PR.

This was referenced May 24, 2024
flavorjones added a commit that referenced this pull request May 24, 2024
**What problem is this PR intended to solve?**

See discussion at #3199 

cc @stevecheckoway 

(Recreation of #3204 with additional commits)
These methods are internal and not part of the public API, so this
isn't considered a breaking change. We're changing them to be more
extensible going forward.
@flavorjones flavorjones force-pushed the flavorjones-use-kwargs-html5-methods branch from bb8bc2b to 0d2b635 Compare May 24, 2024 18:14
stevecheckoway and others added 2 commits May 24, 2024 14:59
and ensure coverage with and without the explicit encoding argument.

Co-authored-by: Mike Dalessio <[email protected]>
@flavorjones flavorjones force-pushed the flavorjones-use-kwargs-html5-methods branch from 0d2b635 to b924b4b Compare May 24, 2024 19:00
@flavorjones
Copy link
Member Author

@stevecheckoway OK, I think this is ready for a final 👀 if you have a minute.

Copy link
Contributor

@stevecheckoway stevecheckoway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +580 to +582
// Add one to the max tree depth to account for the HTML element.
if (options.max_tree_depth < UINT_MAX) { options.max_tree_depth++; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the overflow here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was caught by the downstream integration test with the Sanitize gem.

And it turns out there's no good way for the compiler to warn us because we can't even use the -Wconversion compiler flag against Ruby header files. I've opened a PR upstream to try to fix that: ruby/ruby#10843

@flavorjones flavorjones force-pushed the flavorjones-use-kwargs-html5-methods branch from b924b4b to e53ea29 Compare May 24, 2024 20:02
@flavorjones flavorjones merged commit fab511c into main May 24, 2024
131 of 132 checks passed
@flavorjones flavorjones deleted the flavorjones-use-kwargs-html5-methods branch May 24, 2024 20:27
@flavorjones flavorjones added this to the v1.17.0 milestone Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants