Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow defining grammar in native language constructs #78

Open
SupraSummus opened this issue Oct 29, 2018 · 13 comments
Open

Allow defining grammar in native language constructs #78

SupraSummus opened this issue Oct 29, 2018 · 13 comments

Comments

@SupraSummus
Copy link
Contributor

SupraSummus commented Oct 29, 2018

This is a very common practice that a parser library has it's own grammar syntax (sometimes standardized, like ABNF), but I wonder why is that? Why not just allow user to encode grammar in native language constructs?

Encoding grammar in native language has advantages. Here are some, I can think of:

  • Custom recognizers can be embedded into the grammar. This is better than declaring them as "externs" and then adding in separate file.
  • No problems with string/regexp escaping. Python do this for us.
  • Generally no problems with parsing grammar language.
  • No hassle with imports. They are handled by python. (Import grammar from Python package #46 )
  • Possibility to abstract grammar structure (by use of functions, consts, etc.).
  • Easier/more-standard packaging of grammar files.

When I came across parglare and started digging into the code and found Grammar.from_struct i realized this is the thing. Is there any reason why one shouldn't use this method outside parglare? Maybe all is needed to make this method usable for user is documentation?

I don't mean parglare should get rid of it's grammar parser - just allow to not use it.

@SupraSummus
Copy link
Contributor Author

Another advantage of grammar as a python construct is possibility to inline actions into the grammar.

@igordejanovic
Copy link
Owner

I understand where you are coming from. And your points are valid. I did it in Arpeggio where you can define your grammar either by Python (a.k.a. internal DSL) or using PEG notation (external DSL), or you can even make another notation (there are two variants of PEG notations as a showcase).

The problem with the approach is maintenance overhead and issues regarding consistency between different specification approaches. Another cons is that your grammar is tied up to the particular parsing framework and it is very hard to migrate to something else. With EBNF variants you can easily switch, even when there are differences usually they are not big. For example, see here for an implementation of C parser in parglare where the grammar was taken from elsewhere and minimally modified.

I don't have an intention to support internal grammar spec in parglare. Actually, I have plans to port parglare to other languages and technologies and enable interoperability. E.g. compile parse table in Go and use it in Python. Thus the grammar should be defined a portable external DSL spec.

Yes, you could use from_struct but the purpose of that func is to bootstrap the parser. I don't find it very user friendly for general use and I don't plan to support it.

@SupraSummus
Copy link
Contributor Author

Maintenance overhead

I think maintenance overhead can be reduced by use of from_struct (or equivalent) method in from_string code. from_string would first parse grammar string into python representation and then call from_struct.

This way duplication of code is reduced (and thus bugs and some issues) and consistency is guaranteed. Some bugs in some scenarios can even be avoided: if user prefers to use python representation and skip parsing step then she is safe from bugs in grammar parser code.

Overhead on documentation remains, but there is no such thing as too much documentation ;)

Lack of portability

If users wants the portability (at the cost of advantages I listed above) then they are free to specify the grammar in BNF.

But lets assume that the python representation is "treeish" enough to be easily dumped to/loaded from json/yaml/protobuf/dhall/etc. Then enabling users to use it is absolutely great for portability beacuse they can load dumped grammars into any language they want without need to write relatively complex BNF parser (compared to json.loads).

And a depiction

                                           +----------------------+
                                           | Encoded in json/yaml |
                                           +----------------------+
                                                   ▲
                                                   |
                                                   ▼
+----------------+  parse_grammar_string   +-----------------------+  Grammar.from_struct    +------------------------------------+
| Grammar string | ----------------------> | Python representation | ----------------------> | Grammar ready to be used by parser |
+----------------+                         |                       |                         +------------------------------------+
             \                             |                       |                             ▲
              \    Grammar.from_string     |                       |                            /
               ----------------------------| - - - - - - - - - - - |---------------------------/
                                           +-----------------------+

And a note on EBNF portability and others

EBNF is not standardized notation and in general it's not easily switchable. Usually yes, but not in general. For example I consider rewriting examples on wikipedia page to EBNF comparable (regarding time) to rewriting them into python:

{
  'digit excluding zero': [["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"], ["8"], ["9"]],
  'digit': [["0"], [ident('digit excluding zero')]],
  'integer': [["0"], [optional( "-"), ident('natural number')]],
}

In fact I opened this discussion because of problems with portability. I started adapting existing grammar. I prefer generic solutions so doing this by hand wasn't acceptable. I wrote a parser for ABNF using parglare, then the "desugaring transformer" which converts nested brackets/repetitions/optionals to structures acceptable by parglare, and now I wonder how to load converted grammar to parglare. Do I need to serialize it from python into EBNF and load using from_string into python again? It seems like two unnecessary transformations.

@igordejanovic
Copy link
Owner

Thanks for the detailed explanation of the idea.

Yes, this approach if implemented would be cool. Dumping grammar to json together with LALR parser table would make possible for parser to be used without grammar textual representation and grammar parsing overhead. Still though there would be a need to maintain from_string and from_struct although from_string could be made simpler.

Didn't gave it a thorough thought but I think quite a big chunk of work should be done to make this happen while retaining current set of features.

Anyway, I agree with the general idea and would like to see something along this line implemented in the future.

@SupraSummus
Copy link
Contributor Author

I wrote crude function, that converts grammar description (aka. grammar struct) into parglare.Grammar object. Grammar description accepted by this function is quite language agnostic. For example it can be serialized into json doing just json.dumps(grammar).

Let me know if the format I came up with is suitable for integrating into parglare (replacing the current format)? Of course more sophisticated functionality like specifying associativity and priority still has to be incorporated in it, but I think it'll be straight forward (oh, naive me).

If you are curious how I imagine this format being incorporated into loading grammar form string take a look at ABNF parser and grammar desugaring code - parglare-EBNF can be supported similarily, and I think even easier, because unlike ABNF it forbids nested structures.

@igordejanovic
Copy link
Owner

@SupraSummus Just to let you know that I have been working on a redesign lately for the upcoming 0.10 version. There are a lot of changes and one of them is ability to specify grammar as a Python struct you suggested. Other interesting stuff include action/recognizer definition as python class/objects utilizing inheritance for override and various smaller improvements. Generally, the design is in my view now much simpler and more maintainable.

The work is currently on language-def-redesing, tests are passing, I still need to update the docs but you can look at examples and tests to see the basic ideas.

Interesting module to look at is parglare.lang where a textual grammar syntax is defined using a canonical struct-based approach. It is now easy to define other grammar syntax styles as the built-in is in no way special.

Import semantics is changed to be a simple file import with merging using the same namespace. I'm planning to add a proper language registration/discovery/reference that will inherit a part of the previous import functionality.

@KOLANICH
Copy link
Contributor

@SupraSummus, you may want to take a look at https://gitlab.com/UniGrammar/UniGrammar.py

@hosford42
Copy link

hosford42 commented Sep 5, 2020

@igordejanovic, I see you mentioned working on this for version 0.10. Were these changes merged in to the master branch?

I wrote a natural language parser (Pyramids) a while back. It's terribly slow, due to its naive approach, but it's indispensable because it handles a broad array of natural language constructs, converting them to small semantic graphs that I use in a larger NLU project I'm working on. I'm trying to port the parser to use parglare instead, as it seems you have done an excellent job on this package and mine is shamefully slow in comparison.

However, I have a very large natural language grammar written in a DSL with very specific design requirements that prevent it from being ported easily to parglare's grammar DSL. This leaves me in the position of needing to programmatically construct a natural language grammar as the grammar file written in my own DSL is read, but I'm iffy on how to approach the task. I had expected to see some sort of "grammar builder" object used in the Grammar.from_file and Grammar.from_string methods, which I planned to utilize in the actions for the DSL grammar to construct the natural language grammar on the fly.

I'm considering implementing a grammar builder class that feeds into the struct-based mechanism you already mentioned. This would not only meet my own needs, but it would likely meet @SupraSummus's needs as well. If I do eventually implement something along these lines, would you be interested in merging it into the code base (assuming it meets your code quality standards, etc.) or linking to a separate repo from your own?

@igordejanovic
Copy link
Owner

@hosford42 Yes, I did a rework on a language-def-redesign branch but there was a different in a way import system is working so I was reluctant to merge to master until it is either made compatible or thoroughly tested and documented with a migration path. I was quite happy with the end result and definitely will work on merging it to master when find some time (probably won't happen soon unfortunately).

If I recall correctly, this branch was complete, all tests were passing.

You can start to look here https://github.com/igordejanovic/parglare/blob/language-def-redesign/tests/func/rework/test_grammar.py

And this module is interesting as it represents an implementation of parglare grammar DSL using internal struct-based definition. This means that parglare grammar language is just one of many possible grammar languages you can create. You can provide your own if you would like.

@igordejanovic
Copy link
Owner

I just need to warn you that there is an important rework of GLR algorithm on master I did recently that greatly improves handling of ambiguous grammar (e.g. natural languages). It is not yet merged to language-def-redesign branch as it will require some manual work as the two branches have diverged considerably. That sync is something I plan to do in the near future.

@hosford42
Copy link

If I am understanding you correctly, the language-def-redesign branch will be synced up to master soon, but will not be merged in the near future due to test coverage and backwards compatibility concerns. Is that correct? If so, it sounds like I should be working off of language-def-redesign for the time being.

@igordejanovic
Copy link
Owner

Yes, you understood correctly.

@hosford42
Copy link

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants