Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Superfluous identical ambiguities in Earley #1436

Open
erezsh opened this issue Jul 1, 2024 · 3 comments
Open

Superfluous identical ambiguities in Earley #1436

erezsh opened this issue Jul 1, 2024 · 3 comments
Labels
bug Earley Issues regarding the Earley parser

Comments

@erezsh
Copy link
Member

erezsh commented Jul 1, 2024

Below are two simple grammars. The first one produces 2 identical derivations, and the second one produces 4 identical derivations.

@chanicpanic If you have the time, maybe you could take a look? It looks like it's happening inside the earley-forest.

from lark import Lark

grammar1 ="""
start: (a?)*
a: /a/
"""

grammar2 ="""
start: (/a/?)*
"""

for g in (grammar1, grammar2):
    t = Lark(g, ambiguity="explicit").parse('a')
    print(t.pretty())

Output:

_ambig
  start
    a   a
  start
    a   a

_ambig
  start a
  start a
  start a
  start a
@erezsh erezsh added bug Earley Issues regarding the Earley parser labels Jul 1, 2024
@erezsh
Copy link
Member Author

erezsh commented Jul 1, 2024

P.S. looks like this is happening because of empty rules. For example this is working correctly:

grammar1 ="""
start: (/b/ a?)*
a: /a/
"""

grammar2 ="""
start: (/b/ /a/?)*
"""

for g in (grammar1, grammar2):
    t = Lark(g, ambiguity="explicit").parse('ba')
    print(t.pretty())

So, perhaps it's related to #1312

@chanicpanic
Copy link
Contributor

All of the derivations are distinct. They just happen to produce identical trees after tree shaping. This is most easily seen with the TreeForestTransformer:

from lark import Lark
from lark.parsers.earley_forest import TreeForestTransformer

grammar1 ="""
start: (a?)*
a: /a/
"""

grammar2 ="""
start: (/a/?)*
"""

transformer = TreeForestTransformer(resolve_ambiguity=False)

for g in (grammar1, grammar2):
    node = Lark(g, ambiguity="forest", debug=True).parse('a')
    print(transformer.transform(node).pretty())

Output:

start
  _ambig
    __start_star_0
      a a
    __start_star_0
      __start_star_0
      a a

_ambig
  start
    _ambig
      __start_star_0    a
      __start_star_0
        __start_star_0
        a
  start
    __start_star_0
      _ambig
        __start_star_0  a
        __start_star_0
          __start_star_0
          a

Note that the "identical" ambiguity behavior can also result from user-defined inlined rules.

Example Grammar:

start: _a1 | _a2
_a1: /a/
_a2: /a/

IMO, it's not worthwhile for us to detect and filter out identical tree structures in general. Perhaps we could have special logic for the empty star/plus rule case if desired.

@erezsh
Copy link
Member Author

erezsh commented Aug 30, 2024

All of the derivations are distinct

I just checked, and the second example returns two derivations instead of four, now that we added caching of the nodes. That's progress anyhow :)

# the new result
_ambig
  start	a
  start	a

I wonder if we can solve the remaining ambiguity using such a caching mechanism, but calculate the hash in a way that ignores specific rules (i.e. __rules) ?

Note that the "identical" ambiguity behavior can also result from user-defined inlined rules.

That's a very good point!

But, the repetition operators (+/*) are a core functionality, and Lark kind of pretends that they are actual operators (when in reality they are macros). So it's better not to break this illusion, if we can avoid it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Earley Issues regarding the Earley parser
Projects
None yet
Development

No branches or pull requests

2 participants