|
| 1 | += Hacker's guide to the `NodePattern` compiler |
| 2 | + |
| 3 | +This documentation is aimed at anyone wanting to understand / modify the `NodePattern` compiler. |
| 4 | +It assumes some familiarity with the syntax of https://github.com/rubocop-hq/rubocop-ast/blob/master/doc/modules/ROOT/pages/node_pattern.md[`NodePattern`], as well as the AST produced by the `parser` gem. |
| 5 | + |
| 6 | +== High level view |
| 7 | + |
| 8 | +The `NodePattern` compiler uses the same techniques as the `parser` gem: |
| 9 | + |
| 10 | +* a `Lexer` that breaks source into tokens |
| 11 | +* a `Parser` that uses tokens and a `Builder` to emit an AST |
| 12 | +* a `Compiler` that converts this AST into Ruby code |
| 13 | + |
| 14 | +Example: |
| 15 | + |
| 16 | +* Pattern: `+(send nil? {:puts :p} $...)+` |
| 17 | +* Tokens: `+'(', [:tNODE_TYPE, :send], [:tPREDICATE, :nil?], '{', ...+` |
| 18 | +* AST: `+s(:sequence, s(:node_type, :send), s(:predicate, :nil?), s(:union, ...+` |
| 19 | +* Ruby code: |
| 20 | ++ |
| 21 | +[source,ruby] |
| 22 | +---- |
| 23 | +node.is_a?(::RuboCop::AST::Node) && node.children.size >= 2 && |
| 24 | +node.send_type? && |
| 25 | +node.children[0].nil?() && |
| 26 | +(union2 = node.children[1]; ... |
| 27 | +---- |
| 28 | + |
| 29 | +The different parts are described below. |
| 30 | + |
| 31 | +== Vocabulary |
| 32 | + |
| 33 | +*"node pattern"*: something that can be matched against a single `AST::Node`. |
| 34 | +While `(int 42)` and `#is_fun?` both correspond to node patterns, `+...+` (without the parentheses) is not a node pattern. |
| 35 | + |
| 36 | +*"sequence"*: a node pattern that describes the sequence of children of a node (and its type): `+(type first_child second_child ...)+` |
| 37 | + |
| 38 | +*"variadic"*: element of a sequence that can match a variable number of children. |
| 39 | +`+(send _ int* ...)+` has two variadic elements (`int*` and `+...+`). |
| 40 | +`(send _ :name)` contains no variadic element. |
| 41 | +Note that a sequence is itself never variadic. |
| 42 | + |
| 43 | +*"atom"*: element of a pattern that corresponds with a simple Ruby object. |
| 44 | +`(send nil? |
| 45 | +:puts (str 'hello'))` has two atoms: `:puts` and `'hello'`. |
| 46 | + |
| 47 | +== Lexer |
| 48 | + |
| 49 | +The `lexer.rb` defines `Lexer` and has the few definitions needed for the lexer to work. |
| 50 | +The bulk of the processing is in the inherited class that is generated by https://github.com/seattlerb/oedipus_lex[`oedipus_lex`]. |
| 51 | + |
| 52 | +[discrete] |
| 53 | +==== Rules |
| 54 | + |
| 55 | +https://github.com/seattlerb/oedipus_lex[`oedipus_lex`] generates the Ruby file `lexer.rex.rb` from the rules defined in `lexer.rex`. |
| 56 | + |
| 57 | +These rules map a Regexp to code that emits a token. |
| 58 | + |
| 59 | +`oedipus_lex` aims to be simple and the generated file is readable. |
| 60 | +It uses https://ruby-doc.org/stdlib-2.7.1/libdoc/strscan/rdoc/StringScanner.html[`StringScanner`] behind the scenes. |
| 61 | +It selects the first rule that matches, contrary to many lexing tools that prioritize longest match. |
| 62 | + |
| 63 | +[discrete] |
| 64 | +==== Tokens |
| 65 | + |
| 66 | +The `Lexer` emits tokens with types that are: |
| 67 | + |
| 68 | +* strings for the syntactic symbols (e.g. |
| 69 | +`'('`, `'$'`, `+'...'+`) |
| 70 | +* symbols of the form `:tTOKEN_TYPE` for the rest (e.g. |
| 71 | +`:tPREDICATE`) |
| 72 | + |
| 73 | +Tokens are stored as `[type, value]`. |
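
The first-match behaviour and the `[type, value]` token shape can be illustrated with a small sketch built directly on `StringScanner`. This is not the generated `lexer.rex.rb`: the rule table, the `tokenize` helper, the `:tSYMBOL` token type, and the choice to repeat the string as the value of a syntactic token are all assumptions made for illustration.

```ruby
require 'strscan'

# Hypothetical rule table: each entry maps a Regexp to a block that builds a token.
# Order matters, because the first rule that matches wins.
RULES = [
  [/\s+/,       ->(_s) { nil }],                              # whitespace: no token
  [/\.\.\./,    ->(s) { [s.matched, s.matched] }],            # syntactic symbol '...'
  [/[(){}$]/,   ->(s) { [s.matched, s.matched] }],            # other syntactic symbols
  [/:(\w+)/,    ->(s) { [:tSYMBOL, s[1].to_sym] }],           # assumed token type
  [/[a-z_]+\?/, ->(s) { [:tPREDICATE, s.matched.to_sym] }],   # must precede the next rule
  [/[a-z_]+/,   ->(s) { [:tNODE_TYPE, s.matched.to_sym] }]
].freeze

def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    # `scan` only matches at the current position and advances on success.
    _regexp, action = RULES.find { |regexp, _| scanner.scan(regexp) }
    raise ArgumentError, "unexpected input at position #{scanner.pos}" unless action

    token = action.call(scanner)
    tokens << token if token
  end
  tokens
end

tokens = tokenize('(send nil? {:puts :p} $...)')
```

Because the first matching rule is selected, the predicate rule has to be listed before the node type rule; otherwise `nil?` would be read as the node type `nil` followed by a stray `?`.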
| 74 | + |
| 75 | +[discrete] |
| 76 | +==== Generation |
| 77 | + |
| 78 | +Use `rake generate:lexer` to generate `lexer.rex.rb` from the `lexer.rex` file. |
| 79 | +This is done automatically by `rake spec`. |
| 80 | + |
| 81 | +NOTE: the `lexer.rex.rb` is not under source control, but is included in the gem. |
| 82 | + |
| 83 | +== Parser |
| 84 | + |
| 85 | +Similarly to the `Lexer`, the `parser.rb` defines `Parser` and has the few definitions needed for the parser to work. |
| 86 | +The bulk of the processing is in the inherited class `parser.racc.rb` that is generated by https://ruby-doc.org/stdlib-2.7.0/libdoc/racc/parser/rdoc/Racc.html#module-Racc-label-Writing+A+Racc+Grammar+File[`racc`] from the rules in `parser.y`. |
| 87 | + |
| 88 | +[discrete] |
| 89 | +==== Nodes |
| 90 | + |
| 91 | +The `Parser` emits `NodePattern::Node` objects, which are similar to RuboCop's nodes. |
| 92 | +They both inherit from ``parser``'s `Parser::AST::Node`, and share additional methods too. |
| 93 | + |
| 94 | +As with RuboCop's nodes, some nodes have specialized classes (e.g. |
| 95 | +`Sequence`) while other nodes use the base class directly (e.g. |
| 96 | +`s(:number, 42)`). |
| 97 | + |
| 98 | +[discrete] |
| 99 | +==== Rules |
| 100 | + |
| 101 | +The rules closely follow the definitions above. |
| 102 | +In particular, they distinguish `node_pattern_list`, a list of node patterns (each term matches a single node), from the more generic `variadic_pattern_list`, a list of elements, some of which may be variadic while others are simple node patterns. |
| 103 | + |
| 104 | +[discrete] |
| 105 | +==== Generation |
| 106 | + |
| 107 | +Similarly to the lexer, use `rake generate:parser` to generate `parser.racc.rb` from the `parser.y` file. |
| 108 | +This is done automatically by `rake spec`. |
| 109 | + |
| 110 | +NOTE: the `parser.racc.rb` is not under source control, but is included in the gem. |
| 111 | + |
| 112 | +== Compiler |
| 113 | + |
| 114 | +The compiler's core is the `Compiler` class. |
| 115 | +It holds the global state (e.g. |
| 116 | +references to named arguments). |
| 117 | +The goal of the compiler is to produce `matching_code`, Ruby code that can be run against an `AST::Node`, or any Ruby object for that matter. |
| 118 | + |
| 119 | +Packaging of that `matching_code` into code for a `lambda`, or method `def` is handled separately by the `MethodDefiner` module. |
| 120 | + |
| 121 | +The compilation itself is handled by three subcompilers: |
| 122 | + |
| 123 | +* `NodePatternSubcompiler` |
| 124 | +* `AtomSubcompiler` |
| 125 | +* `SequenceSubcompiler` |
| 126 | + |
| 127 | +=== Visitors |
| 128 | + |
| 129 | +The subcompilers use the https://en.wikipedia.org/wiki/Visitor_pattern[visitor pattern]. |
| 130 | + |
| 131 | +The methods starting with `visit_` are used to process the different types of nodes. |
| 132 | +For a node of type `:capture`, the method `visit_capture` will be called, or if none is defined then `visit_other_type` will be called. |
| 133 | + |
| 134 | +No argument is passed, as the visited node is accessible with the `node` attribute reader. |
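
The dispatch described above can be sketched as follows. This is a simplified illustration, not the actual subcompiler code: `SketchNode`, `SketchSubcompiler`, and the strings returned by the `visit_` methods are hypothetical.

```ruby
# Minimal stand-in for a pattern node; the real nodes are emitted by the `Parser`.
SketchNode = Struct.new(:type, :children)

class SketchSubcompiler
  # The visited node is exposed through a reader instead of being passed around.
  attr_reader :node

  def compile(node)
    @node = node
    visitor = :"visit_#{node.type}"
    # Use the type-specific method when it exists, otherwise fall back.
    respond_to?(visitor, true) ? send(visitor) : visit_other_type
  end

  private

  def visit_capture
    "capture of #{node.children.first.inspect}"
  end

  def visit_other_type
    "fallback for :#{node.type}"
  end
end

subcompiler = SketchSubcompiler.new
subcompiler.compile(SketchNode.new(:capture, [:int])) # handled by visit_capture
subcompiler.compile(SketchNode.new(:number, [42]))    # handled by visit_other_type
```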
| 135 | + |
| 136 | +=== NodePatternSubcompiler |
| 137 | + |
| 138 | +Given any `NodePattern::Node`, it generates the Ruby code that can return `true` or `false` for the given node, or node type for sequence head. |
| 139 | + |
| 140 | +==== `var` vs `access` |
| 141 | + |
| 142 | +The subcompiler can be called with the current node stored either in a variable (provided with the `var:` keyword argument) or via a Ruby expression (e.g. |
| 143 | +`access: 'current_node.children[2]'`). |
| 144 | + |
| 145 | +The subcompiler will not generate code that executes this `access` expression more than once or twice. |
| 146 | +If it might access the node more than that, `multiple_access` will store the result in a temporary variable (e.g. |
| 147 | +`union`). |
| 148 | + |
| 149 | +==== Sequences |
| 150 | + |
| 151 | +Sequences are the most difficult elements to handle and are deferred to the `SequenceSubcompiler`. |
| 152 | + |
| 153 | +==== Atoms |
| 154 | + |
| 155 | +Atoms are handled with `visit_other_type`, which defers to the `AtomSubcompiler` and converts that result to a node pattern by appending `=== cur_node` (or `=== cur_node.type` if in sequence head). |
| 156 | + |
| 157 | +This way, the two arguments in `(_ #func?(%1) %2)` would be compiled differently; |
| 158 | +`%1` would be compiled as `param1`, while `%2` gets compiled as `param2 === node.children[1]`. |
| 159 | + |
| 160 | +==== Precedence |
| 161 | + |
| 162 | +The code generated has higher or equal precedence to `&&`, so as to make chaining convenient. |
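
To see why this matters, consider what happens when fragments are joined with `&&`. The sketch below uses a hypothetical `chain` helper and literal fragments; it only demonstrates Ruby's operator precedence and does not reproduce the compiler's output.

```ruby
# Hypothetical helper: chain generated fragments with `&&`.
def chain(*fragments)
  fragments.join(' && ')
end

bare_union    = 'true || false'    # `||` binds more loosely than `&&`
wrapped_union = "(#{bare_union})"  # parenthesized, so it can be chained safely

unsafe = chain(bare_union, 'false')    # "true || false && false"
safe   = chain(wrapped_union, 'false') # "(true || false) && false"

unsafe_result = eval(unsafe) # parsed as true || (false && false) => true
safe_result   = eval(safe)   # => false
```

A fragment that binds more loosely than `&&` silently changes the meaning of the whole chain, which is why such fragments need to be parenthesized before they are combined.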
| 163 | + |
| 164 | +=== AtomSubcompiler |
| 165 | + |
| 166 | +This subcompiler produces Ruby code that gets evaluated to a Ruby object. |
| 167 | +E.g. |
| 168 | +`"42"`, `:a_symbol`, `param1`. |
| 169 | + |
| 170 | +A good way to think about its output is as code that can be passed as an argument to a function call. |
| 171 | +For example: |
| 172 | + |
| 173 | +[source,ruby] |
| 174 | +---- |
| 175 | +# Pattern '#func(42, %1)' compiles to |
| 176 | +func(node, 42, param1) |
| 177 | +---- |
| 178 | + |
| 179 | +Note that any node pattern can be output by this subcompiler, but those that don't correspond to a Ruby literal will be output as a lambda so they can be combined. |
| 180 | +For example: |
| 181 | + |
| 182 | +[source,ruby] |
| 183 | +---- |
| 184 | +# Pattern '#func(int)' compiles to |
| 185 | +func(node, ->(compare) { compare.is_a?(::RuboCop::AST::Node) && compare.int_type? }) |
| 186 | +---- |
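
One way a method receiving such arguments could treat literals and lambdas uniformly is to compare with `===`, since `Proc#===` calls the lambda. The sketch below assumes that convention; `FakeNode`, `int_matcher`, and `matches?` are hypothetical names, and a `Struct` stands in for `::RuboCop::AST::Node`.

```ruby
# Stand-in for an AST node with just a type.
FakeNode = Struct.new(:type)

# Stand-in for the lambda emitted for the `int` pattern.
int_matcher = ->(compare) { compare.is_a?(FakeNode) && compare.type == :int }

# Hypothetical callee: it does not need to know whether `expected`
# is a literal or a lambda, because `===` works for both.
def matches?(expected, actual)
  expected === actual
end

matches?(42, 42)                         # literal compared by value
matches?(:puts, :puts)                   # symbol compared by value
matches?(int_matcher, FakeNode.new(:int)) # lambda invoked with the node
matches?(int_matcher, 42)                # lambda invoked, rejects a non-node
```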
| 187 | + |
| 188 | +=== SequenceSubcompiler |
| 189 | + |
| 190 | +The subcompiler compiles the sequences' terms in turn, keeping track of which children of the `AST::Node` are being matched. |
| 191 | + |
| 192 | +==== Variadic terms |
| 193 | + |
| 194 | +The complexity comes from variadic elements, which have complex processing _and_ may make it impossible to know at compile time which children are matched by the subsequent terms. |
| 195 | + |
| 196 | +*Example* (no variadic terms) |
| 197 | + |
| 198 | +---- |
| 199 | +(_type int _ str) |
| 200 | +---- |
| 201 | + |
| 202 | +First child must match `int`, third child must match `str`. |
| 203 | +The subcompiler will use `children[0]` and `children[2]`. |
| 204 | + |
| 205 | +*Example* (one variadic term) |
| 206 | + |
| 207 | +---- |
| 208 | +(_type int _* str) |
| 209 | +---- |
| 210 | + |
| 211 | +First child must match `int` and _last_ child must match `str`. |
| 212 | +The subcompiler will use `children[0]` and `children[-1]`. |
| 213 | + |
| 214 | +*Example* (multiple variadic terms) |
| 215 | + |
| 216 | +---- |
| 217 | +(_type int+ sym str+) |
| 218 | +---- |
| 219 | + |
| 220 | +The subcompiler cannot use a fixed integer index into `children[]` to match `sym`. |
| 221 | +This must be tracked at runtime in a variable (`cur_index`). |
| 222 | + |
| 223 | +The subcompiler will use fixed indices before the first variadic element and after the last one. |
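
The indexing rule in the three examples can be sketched as a small planning function. This is an illustration of the rule only, not the `SequenceSubcompiler` implementation: `plan_indices` and the hash-based description of terms are hypothetical, and the terms listed exclude the sequence head.

```ruby
# Hypothetical helper: returns, for each term, how its child would be addressed.
def plan_indices(terms)
  first = terms.index { |term| term[:variadic] }
  last  = terms.rindex { |term| term[:variadic] }

  terms.each_with_index.map do |term, i|
    if term[:variadic]
      :variadic
    elsif first.nil? || i < first
      i               # fixed index counted from the start
    elsif i > last
      i - terms.size  # fixed negative index counted from the end
    else
      :cur_index      # between two variadic terms: only known at runtime
    end
  end
end

fixed    = ->(source) { { source: source, variadic: false } }
variadic = ->(source) { { source: source, variadic: true } }

plan_indices([fixed['int'], fixed['_'], fixed['str']])          # (_type int _ str)
plan_indices([fixed['int'], variadic['_*'], fixed['str']])      # (_type int _* str)
plan_indices([variadic['int+'], fixed['sym'], variadic['str+']]) # (_type int+ sym str+)
```

The three calls yield `[0, 1, 2]`, `[0, :variadic, -1]`, and `[:variadic, :cur_index, :variadic]` respectively, matching `children[0]` and `children[2]`, `children[0]` and `children[-1]`, and the runtime-tracked `cur_index` from the examples.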
| 224 | + |
| 225 | +==== Node pattern terms |
| 226 | + |
| 227 | +The node pattern terms are delegated to the `NodePatternSubcompiler`. |
| 228 | + |
| 229 | +In the pattern `(:sym :sym)`, the two `:sym` will be compiled differently because the first `:sym` is in "sequence head": `:sym === node.type` and `:sym === node.children[0]` respectively. |
| 230 | +The subcompiler indicates if the pattern is in "sequence head" or not, so the `NodePatternSubcompiler` can produce the right code. |
| 231 | + |
| 232 | +Variadic elements may not (currently) cover the sequence head. |
| 233 | +As a convenience, `+(...)+` is understood as `+(_ ...)+`. |
| 234 | +Other types of nodes will raise an error (e.g. |
| 235 | +`(<will not compile>)`; |
| 236 | +see `Node#in_sequence_head`). |
| 237 | + |
| 238 | +==== Precedence |
| 239 | + |
| 240 | +Like the node pattern subcompiler, it generates code that has higher or equal precedence to `&&`, so as to make chaining convenient. |