Terminal Editing before Compression #1461

Open
clementfaisandier opened this issue Sep 3, 2024 · 3 comments

@clementfaisandier

The Problem

I am trying to apply one general grammar to various types of text files, specifically code and documentation files in languages such as Python, C, and LaTeX. These all use different comment characters, and since my grammar is mostly concerned with comments, the COMMENT_CHAR terminal must be set to the right value for each file I need to parse.

I suggest that Lark let users change or set terminal values before the terminals are processed and compressed.

Alternatives

I have already tried using edit_terminals to get this behavior. However, the edit_terminals callback runs after the terminal definitions have been processed and compressed into a minimal set of tokens. One could modify the resulting complex regexes, but that requires the user to understand how Lark works behind the scenes, and it would be clunky and error-prone.
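
To illustrate why patching the compressed regexes is fragile, here is a minimal sketch using only Python's re module. It assumes that Lark escapes string literals the way re.escape does (an assumption, not a confirmed detail of Lark's internals), and PLACEHOLDER, patch_pattern, and the prefix pattern are hypothetical stand-ins rather than anything Lark provides.

```python
import re

# Hypothetical placeholder literal standing in for the COMMENT_CHAR definition.
PLACEHOLDER = "COMMENT_CHAR_PLACEHOLDER"


def patch_pattern(compiled: str, comment: str) -> str:
    """Swap the escaped placeholder inside an already-merged regex string."""
    escaped = re.escape(PLACEHOLDER)
    if escaped not in compiled:
        # The placeholder may have been escaped or merged differently.
        raise ValueError("placeholder not found in pattern")
    return compiled.replace(escaped, re.escape(comment))


# Stand-in for a merged terminal such as PREFIX: whitespace, comment, whitespace.
prefix = r"(?:[\t ])*" + re.escape(PLACEHOLDER) + r"(?:[\t ])*"
patched = patch_pattern(prefix, "//")
```

Every terminal that absorbed the placeholder would need the same treatment, and the replacement only works if the escaping assumption holds.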

Although one could also modify the input grammar before passing it to lark, text processing should be lark's responsibility. Having to parse a grammar and modify it so lark can parse the user's files seems a bit backwards.
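
For reference, the grammar-preprocessing workaround could look roughly like the sketch below. It uses only the standard library and assumes COMMENT_CHAR is defined exactly once, on a single line, without a priority suffix. The COMMENT_CHARS mapping and the set_comment_char helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical mapping from language name to its line-comment marker.
COMMENT_CHARS = {"python": "#", "c": "//", "latex": "%"}

# Matches a single-line definition such as: COMMENT_CHAR: "..."
_DEFINITION = re.compile(r"^COMMENT_CHAR\s*:.*$", flags=re.MULTILINE)


def set_comment_char(grammar_text: str, language: str) -> str:
    """Return grammar_text with the COMMENT_CHAR definition replaced."""
    comment = COMMENT_CHARS[language]
    # Escape backslashes and double quotes for use inside a quoted literal.
    literal = comment.replace("\\", "\\\\").replace('"', '\\"')
    new_line = f'COMMENT_CHAR: "{literal}"'
    # A callable replacement avoids backslash processing in the new text.
    result, count = _DEFINITION.subn(lambda match: new_line, grammar_text)
    if count != 1:
        raise ValueError(f"expected one COMMENT_CHAR definition, found {count}")
    return result
```

The returned text would then be passed to lark.Lark in place of the original file contents, which is exactly the extra parsing step that feels backwards here.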

Context

This is my original post, which includes an example that describes the problem:

I'm having an issue using edit_terminals: I'm finding that Lark is compressing the terminals I've defined before I use edit_terminals.

This is my grammar; the terminal I am looking to modify is COMMENT_CHAR to support multiple languages:

start: (snippet | LINE)*

snippet: snippet_marker LINE* 

// TODO: Evaluate if the prefix for each token should be in or out of the token.
snippet_marker.1: PREFIX MARKER _IWS* /.+/ SUFFIX
PREFIX: _IWS* COMMENT_CHAR _IWS*
MARKER: _SPIDER
SUFFIX: _EOL
_DEFINITION_TOKENS: CONTEXT _IWS* TOPIC _IWS* CONTENT_TYPE
_BOOLEAN_FLAGS: _DEFINITION_TOKENS? (LINK | EMBEDDING)          // Makes the definition tokens filtering options
CONTEXT: "#" /\w{1,16}/     // What's the general theme?
TOPIC: "@" /\w{1,16}/       // What is this snippet about?
CONTENT_TYPE: "$" /[DRAC]/  // Documentation, Reasoning, API, Code
LINK: "?"
EMBEDDING: "!"

// Resources

_SPIDER: "//\(oo)/\\"
_BORING: "SNIPPET"
_ROBOT: "[o_o]"

LINE: _IWS* COMMENT_CHAR? _SENTENCE? _EOL
_SENTENCE: (_WORD _IWS+)* _WORD
_WORD: /\S+/
COMMENT_CHAR: "TO BE OVERRIDE BY PROGRAM- DO NOT REMOVE - DO NOT USE ANOTHER TOKEN FOR COMMENT CHAR"

_EOL: _IWS* _NL
_IWS: /[\t ]/
_NL: /\r?\n/

This is the python:

import lark

def terminal_callback(terminal_definition):
    # Print each terminal definition as Lark passes it to the callback
    print(terminal_definition)

with open('grammar.lark', 'rt') as file:
    parser = lark.Lark(file.read(), edit_terminals=terminal_callback)

with open('sandbox/src/base_calc.py', 'rt') as file:
    parser.parse(text=file.read())

But the output is:

TerminalDef('PREFIX', '(?:[\t ])*TO\\ BE\\ OVERRIDE\\ BY\\ LARIAT\\ \\-\\ DO\\ NOT\\ REMOVE\\ \\-\\ DO\\ NOT\\ USE\\ ANOTHER\\ TOKEN\\ FOR\\ COMMENT\\ CHAR(?:[\t ])*')
TerminalDef('MARKER', '//\\\\\\(oo\\)/\\\\')
TerminalDef('SUFFIX', '(?:[\t ])*\r?\n')
TerminalDef('LINE', '(?:[\t ])*(?:TO\\ BE\\ OVERRIDE\\ BY\\ LARIAT\\ \\-\\ DO\\ NOT\\ REMOVE\\ \\-\\ DO\\ NOT\\ USE\\ ANOTHER\\ TOKEN\\ FOR\\ COMMENT\\ CHAR)?(?:(?:\\S+(?:[\t ])+)*\\S+)?(?:[\t ])*\r?\n')
TerminalDef('_IWS', '[\t ]')
TerminalDef('__ANON_0', '.+')

Clearly COMMENT_CHAR was absorbed into PREFIX and LINE, which makes it difficult to edit consistently.

There's a possibility I'm not using this right, but I also feel the edit_terminals option should occur before compression, otherwise users need to predict compression to use it consistently.

Thank you,
Clement

@clementfaisandier
Author

My first issue! Let me know if you'd like me to make modifications or if there's anything I can do to help @erezsh.

@MegaIng
Member

MegaIng commented Sep 3, 2024

IMO, a better way to solve the problem of overwriting COMMENT_CHAR is to use the Grammar builder:

import lark
from lark.load_grammar import GrammarBuilder

with open('grammar.lark', 'rt') as file:
    content = file.read()

gb = GrammarBuilder()
gb.load_grammar(content, 'grammar.lark')
# Replace the placeholder definition of COMMENT_CHAR
gb.load_grammar('%override COMMENT_CHAR: "//"')
parser = lark.Lark(gb.build())


with open('sandbox/src/base_calc.py', 'rt') as file:
    parser.parse(text=file.read())

This already works.

@clementfaisandier
Author

@MegaIng That looks promising. Is there documentation for GrammarBuilder? I can't seem to find it...

I'll try this solution soon and close the ticket if it solves the issue, I just don't have the time at the moment.
