Terminal Editing before Compression #1461

clementfaisandier · 2024-09-03T13:48:42Z

The Problem

I am trying to apply a general grammar to various types of text files, specifically code and documentation files in languages such as Python, C, and LaTeX. Each of these uses a different comment character, and since my grammar has a keen interest in comments, the COMMENT_CHAR terminal must be set to the right value for each file I need to parse.

I recommend lark should allow users to change/set terminal values. Specifically,

Alternatives

I have already tried using edit_terminals to produce this behavior. However, edit_terminals runs after terminal values have been processed and compressed into a minimal set of tokens. Although one could modify these complex regexes, doing so requires the user to understand how lark works behind the scenes, and would be clunky and error-prone.

Although one could also modify the input grammar before passing it to lark, text processing should be lark's responsibility. Having to parse a grammar and modify it so lark can parse the user's files seems a bit backwards.
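
For reference, the text-rewriting workaround described above could look roughly like the sketch below. It is only an illustration: the helper name set_comment_char is hypothetical, and it assumes COMMENT_CHAR is defined exactly once, on its own line, as in the grammar shown under Context.

```python
import re


def set_comment_char(grammar_text: str, comment_char: str) -> str:
    """Return grammar_text with the COMMENT_CHAR terminal redefined.

    Hypothetical helper: assumes COMMENT_CHAR is defined once, on a single line.
    """
    # Escape backslashes and double quotes so the value is a valid string literal.
    literal = comment_char.replace("\\", "\\\\").replace('"', '\\"')
    pattern = re.compile(r"^COMMENT_CHAR\s*:.*$", flags=re.MULTILINE)
    # A function replacement avoids re interpreting backslashes in the new text.
    new_text, count = pattern.subn(
        lambda match: f'COMMENT_CHAR: "{literal}"', grammar_text
    )
    if count != 1:
        raise ValueError(f"expected one COMMENT_CHAR definition, found {count}")
    return new_text
```

The result would then be passed to lark.Lark as usual. The fragility is visible here: the regex has to guess how the definition is laid out in the grammar file.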

Context

This is my original post, which has an example describing the problem:

I'm having an issue using edit_terminals: I'm finding that Lark is compressing the terminals I've defined before I use edit_terminals.

This is my grammar; the terminal I am looking to modify is COMMENT_CHAR to support multiple languages:

start: (snippet | LINE)*

snippet: snippet_marker LINE* 

// TODO: Evaluate if the prefix for each token should be in or out of the token.
snippet_marker.1: PREFIX MARKER _IWS* /.+/ SUFFIX
PREFIX: _IWS* COMMENT_CHAR _IWS*
MARKER: _SPIDER
SUFFIX: _EOL
_DEFINITION_TOKENS: CONTEXT _IWS* TOPIC _IWS* CONTENT_TYPE
_BOOLEAN_FLAGS: _DEFINITION_TOKENS? (LINK | EMBEDDING)          // Makes the definition tokens filtering options
CONTEXT: "#" /\w{1,16}/     // What's the general theme?
TOPIC: "@" /\w{1,16}/       // What is this snippet about?
CONTENT_TYPE: "$" /[DRAC]/  // Documentation, Reasoning, API, Code
LINK: "?"
EMBEDDING: "!"

// Resources

_SPIDER: "//\(oo)/\\"
_BORING: "SNIPPET"
_ROBOT: "[o_o]"

LINE: _IWS* COMMENT_CHAR? _SENTENCE? _EOL
_SENTENCE: (_WORD _IWS+)* _WORD
_WORD: /\S+/
COMMENT_CHAR: "TO BE OVERRIDE BY PROGRAM- DO NOT REMOVE - DO NOT USE ANOTHER TOKEN FOR COMMENT CHAR"

_EOL: _IWS* _NL
_IWS: /[\t ]/
_NL: /\r?\n/

This is the python:

import lark

def terminal_callback(terminal_definition):
    print(terminal_definition)

with open('grammar.lark', 'rt') as file:
    parser = lark.Lark(file.read(), edit_terminals=terminal_callback)

with open('sandbox/src/base_calc.py', 'rt') as file:
    parser.parse(text=file.read())

But the output is:

TerminalDef('PREFIX', '(?:[\t ])*TO\\ BE\\ OVERRIDE\\ BY\\ LARIAT\\ \\-\\ DO\\ NOT\\ REMOVE\\ \\-\\ DO\\ NOT\\ USE\\ ANOTHER\\ TOKEN\\ FOR\\ COMMENT\\ CHAR(?:[\t ])*')
TerminalDef('MARKER', '//\\\\\\(oo\\)/\\\\')
TerminalDef('SUFFIX', '(?:[\t ])*\r?\n')
TerminalDef('LINE', '(?:[\t ])*(?:TO\\ BE\\ OVERRIDE\\ BY\\ LARIAT\\ \\-\\ DO\\ NOT\\ REMOVE\\ \\-\\ DO\\ NOT\\ USE\\ ANOTHER\\ TOKEN\\ FOR\\ COMMENT\\ CHAR)?(?:(?:\\S+(?:[\t ])+)*\\S+)?(?:[\t ])*\r?\n')
TerminalDef('_IWS', '[\t ]')
TerminalDef('__ANON_0', '.+')

Clearly COMMENT_CHAR was absorbed into PREFIX and LINE, which makes it difficult to edit consistently.

There's a possibility I'm not using this right, but I also feel the edit_terminals option should occur before compression, otherwise users need to predict compression to use it consistently.

Thank you,
Clement


clementfaisandier · 2024-09-03T13:49:57Z

My first issue! Let me know if you'd like me to make modifications or if there's anything I can do to help @erezsh.

MegaIng · 2024-09-03T14:56:30Z

IMO, a better way to solve the problem of overwriting COMMENT_CHAR is to use the Grammar builder:

import lark
from lark.load_grammar import GrammarBuilder

with open('grammar.lark', 'rt') as file:
    gb = GrammarBuilder()
    # Load the base grammar, then a second snippet that overrides COMMENT_CHAR.
    gb.load_grammar(file.read(), 'grammar.lark')
    gb.load_grammar('%override COMMENT_CHAR: "//"')
    parser = lark.Lark(gb.build())

with open('sandbox/src/base_calc.py', 'rt') as file:
    parser.parse(text=file.read())

This already works.
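
To cover the multi-language case from the original post, the same idea can be wrapped so the override is chosen per file. This is a minimal sketch, not a tested recipe: the COMMENT_CHARS mapping and the helper names override_statement and build_parser are hypothetical, and it relies only on the GrammarBuilder calls shown above.

```python
from pathlib import Path

# Hypothetical mapping from file extension to comment character.
COMMENT_CHARS = {".py": "#", ".c": "//", ".tex": "%"}


def override_statement(source_path: str) -> str:
    """Build the %override line for the file type of source_path."""
    suffix = Path(source_path).suffix.lower()
    if suffix not in COMMENT_CHARS:
        raise ValueError(f"no comment character known for {suffix!r}")
    # Escape backslashes and double quotes for the grammar string literal.
    escaped = COMMENT_CHARS[suffix].replace("\\", "\\\\").replace('"', '\\"')
    return f'%override COMMENT_CHAR: "{escaped}"'


def build_parser(grammar_path: str, source_path: str):
    """Load the base grammar, apply the override, and return a Lark parser."""
    # Imported here so override_statement can be used without lark installed.
    import lark
    from lark.load_grammar import GrammarBuilder

    gb = GrammarBuilder()
    gb.load_grammar(Path(grammar_path).read_text(), grammar_path)
    gb.load_grammar(override_statement(source_path))
    return lark.Lark(gb.build())
```

Usage would be something like build_parser('grammar.lark', 'sandbox/src/base_calc.py'), which builds a separate parser for each file type.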

clementfaisandier · 2024-09-13T20:59:29Z

@MegaIng That looks promising. Is there documentation for GrammarBuilder? I can't seem to find it...

I'll try this solution soon and close the ticket if it solves the issue; I just don't have the time at the moment.

clementfaisandier added the enhancement label Sep 3, 2024
