Porting from pyparsing match_previous_literal #1437

kevemueller · 2024-07-08T17:48:12Z

What is your question?
I am porting from pyparsing to Lark due to expectations on increased performance. Initial test show very promising.

One of the few constructs that I could not identify how to express in Lark is match_previous_literal. It allows to dynamically match based on a previously matched literal.

# pyparsing example
first = Word(nums)
match_expr = first + ":" + match_previous_literal(first)

(see pyparsing-docs.readthedocs.io

I need this functionality to match a sed search and replace like construct.
${var:S/from/to/g} including all of its equivalents, like e.g. ${var:S#from#to#g}
I would like to match the character after the signaling token :S and use that as the delimiter for the expression. The delimited content must be parsed as well, i.e. may contain constructs like ${var2}.

If you're having trouble with your code or grammar

Currently I am using a workaround using templates simply listing some commonly used separators, but this is not exhaustive and not generic. Almost any character can be used as the separator

?sc_template{x, sep}: sep x sep x sep /[1gW]+/?
expansion_modifier_sc: /[SC]/ (sc_template{token, "/"} | sc_template{token, "%"})

I wonder if there is any possibility to define this generically in the grammar. I could live with an interim solution that retrieves the delimited content and run a parse on it in a post-processing step.

The text was updated successfully, but these errors were encountered:

erezsh · 2024-07-08T18:44:09Z

LALR only parses context-free grammars. However, a function like match_previous_literal() is context-sensitive. So the parser can't help you there.

But a regular expression should be capable of matching this construct.

It sounds like the best solution is to create a regexp terminal that matches the entire expression, and then in post-processing run the regexp again to parse the string. (using groups)

kevemueller · 2024-07-09T15:39:28Z

Hi Erez,

thanks for pointing me into the right direction with RE.

I confirm that it is as easy as

expansion_modifier_sc: /[SC]/ /(.).*?\1.*?\1[1gW]*/

to back-reference in the RE the matching first character. The complete expression is handed to the post-processor which will then still need to take it apart using Python code.
To complete the exercise I will still need to add escaping to the RE, but that is beyond Lark.

Thx for this nice tool.

kevemueller · 2024-07-10T18:59:12Z

Update:
This works with earley/dynamic, but not with larl/contextual.
larl/contextual adds additional groups which make calculating the backreference impossible (sometimes \1, sometimes \2).

The error message is re.error: cannot refer to an open group at position

The workaround is to use named capturing group which results in a grammar of

expansion_modifier_sc: /[SC]/ /(?P<sep>.).*?(?P=sep).*?(?P=sep)[1gW]*/

This is agnostic to the lexer adding additional groups.

erezsh · 2024-07-10T19:47:59Z

That makes sense. Maybe there's a way we could make it work, by fixing indexes for example.

Although requiring names for back-references almost sounds like a win :)

kevemueller added the question label Jul 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Porting from pyparsing match_previous_literal #1437

Porting from pyparsing match_previous_literal #1437

kevemueller commented Jul 8, 2024

erezsh commented Jul 8, 2024

kevemueller commented Jul 9, 2024

kevemueller commented Jul 10, 2024 •

edited

Loading

erezsh commented Jul 10, 2024

Porting from pyparsing match_previous_literal #1437

Porting from pyparsing match_previous_literal #1437

Comments

kevemueller commented Jul 8, 2024

erezsh commented Jul 8, 2024

kevemueller commented Jul 9, 2024

kevemueller commented Jul 10, 2024 • edited Loading

erezsh commented Jul 10, 2024

kevemueller commented Jul 10, 2024 •

edited

Loading