Superfluous identical ambiguities in Earley #1436

erezsh · 2024-07-01T08:05:51Z

Below are two simple grammars. The first one produces 2 identical derivations, and the second one produces 4 identical derivations.

@chanicpanic If you have the time, maybe you could take a look? It looks like it's happening inside the earley-forest.

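from lark import Lark

# Optional rule reference inside a star
grammar1 = """
start: (a?)*
a: /a/
"""

# Optional anonymous terminal inside a star
grammar2 = """
start: (/a/?)*
"""

# Parse the same one-character input with both grammars,
# keeping every derivation as an explicit _ambig node
for g in (grammar1, grammar2):
    t = Lark(g, ambiguity="explicit").parse('a')
    print(t.pretty())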

Output:

_ambig
  start
    a   a
  start
    a   a

_ambig
  start a
  start a
  start a
  start a


erezsh · 2024-07-01T08:08:21Z

P.S. It looks like this happens because of empty rules. For example, the following works correctly:

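from lark import Lark

# Same grammars as above, but each repetition must consume /b/,
# so the starred group can no longer match the empty string
grammar1 = """
start: (/b/ a?)*
a: /a/
"""

grammar2 = """
start: (/b/ /a/?)*
"""

for g in (grammar1, grammar2):
    t = Lark(g, ambiguity="explicit").parse('ba')
    print(t.pretty())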

So, perhaps it's related to #1312

chanicpanic · 2024-07-02T02:48:41Z

All of the derivations are distinct. They just happen to produce identical trees after tree shaping. This is most easily seen with the TreeForestTransformer:

from lark import Lark
from lark.parsers.earley_forest import TreeForestTransformer

grammar1 = """
start: (a?)*
a: /a/
"""

grammar2 = """
start: (/a/?)*
"""

# Keep every alternative instead of resolving the ambiguity
transformer = TreeForestTransformer(resolve_ambiguity=False)

for g in (grammar1, grammar2):
    # ambiguity="forest" returns the parse forest rather than a tree
    node = Lark(g, ambiguity="forest", debug=True).parse('a')
    print(transformer.transform(node).pretty())

Output:

start
  _ambig
    __start_star_0
      a a
    __start_star_0
      __start_star_0
      a a

_ambig
  start
    _ambig
      __start_star_0    a
      __start_star_0
        __start_star_0
        a
  start
    __start_star_0
      _ambig
        __start_star_0  a
        __start_star_0
          __start_star_0
          a

Note that the "identical" ambiguity behavior can also result from user-defined inlined rules.

Example Grammar:

start: _a1 | _a2
_a1: /a/
_a2: /a/

IMO, it's not worthwhile for us to detect and filter out identical tree structures in general. Perhaps we could have special logic for the empty star/plus rule case if desired.

erezsh · 2024-08-30T11:10:30Z

> All of the derivations are distinct

I just checked, and the second example returns two derivations instead of four, now that we added caching of the nodes. That's progress anyhow :)

# the new result
_ambig
  start	a
  start	a

I wonder if we can solve the remaining ambiguity using such a caching mechanism, but calculate the hash in a way that ignores specific rules (i.e. the generated __rules)?

> Note that the "identical" ambiguity behavior can also result from user-defined inlined rules.

That's a very good point!

But the repetition operators (+/*) are core functionality, and Lark kind of pretends that they are actual operators (when in reality they are macros). So it's better not to break this illusion, if we can avoid it.
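To make the "hash that ignores `__` rules" idea concrete, here is a minimal sketch on a plain-tuple model of derivations. It does not use Lark's forest classes or its node cache; `canonical` and `dedupe` are hypothetical helpers, and the two derivations are transcribed by hand from the forest output reported for grammar1 earlier in the thread.

```python
# Hypothetical model, not Lark's API: a derivation is (rule_name, children),
# and a matched token is a plain string.

def canonical(node):
    """Return a hashable form of `node` with generated `__` rules inlined."""
    if isinstance(node, str):
        return node
    name, children = node
    flat = []
    for child in children:
        c = canonical(child)
        if not isinstance(c, str) and c[0].startswith("__"):
            # Splice the helper rule's children into its parent.
            flat.extend(c[1])
        else:
            flat.append(c)
    return (name, tuple(flat))


def dedupe(derivations):
    """Keep the first derivation for each distinct canonical form."""
    seen = set()
    unique = []
    for d in derivations:
        key = canonical(d)
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


# The two distinct derivations of 'a' under grammar1.
d1 = ("start", (("__start_star_0", (("a", ("a",)),)),))
d2 = ("start", (("__start_star_0", (("__start_star_0", ()), ("a", ("a",)))),))

print(canonical(d1) == canonical(d2))
print(len(dedupe([d1, d2])))
```

Both derivations collapse to the same key once the helper rule is inlined, which is exactly the case where the shaped trees come out identical. Note that this keys on rule names only, so it would not merge user-defined inlined rules such as `_a1` and `_a2` unless the prefix test were widened.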

erezsh added the bug and Earley (Issues regarding the Earley parser) labels on Jul 1, 2024
