-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parsing MRS #371
Comments
I see the difference, and I don't recall if that was intentional. However, SimpleMRS is just one serialization format of MRS (e.g., the XML format of MRS has no such constraints, and the JSON format has the pattern So my question is: does this discrepancy cause a problem in your processing? If not, I'm not particularly inclined to change things. However, the first priority of PyDelphin's development philosophy is:
So if we have a notion what is correct for the syntax of MRS variables encompassing all serialization formats, processors, and grammars, then I would accept a pull request to make that change. |
From https://delphinqa.ling.washington.edu/t/mrs-dmrs-not-displayed-in-ltdb/986/6 The quotedstring in the MRS grammar is defined in https://github.com/delph-in/docs/wiki/MrsRFC#simple as:
Is this the expected behavior? See "named_rel" and "named rel", double quotes were removed and the
|
Perhaps you'd be interested in this thread from the old mailing list about predicate names: Note that some replies spill over into the new year, so they aren't properly threaded with the above: The use of Back to the quoted string predicates, we don't enforce any real restrictions of the quoted string, and escaped quotes are fine. Note that >>> m = simplemrs.decode('[RELS: < [ "named_rel" LBL: h1 ARG0: e2 ] [ named LBL: h3 ARG0: e4 ] >]')
>>> m.rels[0].predicate == m.rels[1].predicate
True
>>> m.rels[0].predicate
'named'
>>> m = simplemrs.decode('[RELS: < [ "named rel" LBL: h1 ARG0: e2 ] [ named LBL: h3 ARG0: e4 ] >]')
>>> m.rels[0].predicate == m.rels[1].predicate
False
>>> m.rels[0].predicate
'named rel' PyDelphin should only be outputting forms that it can parse again, so your second example indicates a bug in serialization. If there's a space in the predicate name, it should probably be quoted. See how the following cannot be parsed again: >>> m = simplemrs.decode('[RELS: < [ named rel LBL: h1 ARG0: e2 ] >]')
Traceback (most recent call last):
[...]
delphin.mrs._exceptions.MRSSyntaxError:
line 1, character 17
[RELS: < [ named rel LBL: h1 ARG0: e2 ] >]
^
MRSSyntaxError: expected: a feature |
Thanks for the references. So from https://github.com/delph-in/docs/wiki/PredicateRfc#type-vs-string and https://github.com/delph-in/docs/wiki/PredicateRfc#limitations-and-conventions
But I also found
A TypePred could never have a space in the name, right? If so, how do you escape the spaces? The StringPred, on the other hand, can have spaces, but grammar writers should avoid it. Actually, the current WSJ profiles from ERG do not seem to contain StringPred
|
Essentially, yes. In MRS-land there is no distinction between
Not sure what you mean by this. There are more limits as to what characters a non-string (type) predicate can contain because of parsing concerns. The current grammar on MrsRFC does not allow, for instance, non-breaking spaces, although I think I recall Stephan wanting to allow them...
We don't currently allow escaped characters outside of strings, and spaces delimit tokens, so string preds are the only way to put spaces (non-breaking or not) inside a predicate symbol. I would like to think that most grammar engineers would choose to not represent spaces in their semantics. I can imagine it happening with generic (unknown word handling) predicates, in some cases, or maybe in Thai, but there are other options for representing words with spaces (e.g., |
In (2), I just agree with what you said:
By the way, MrsRFC has another mistake, TypePred should be
This is what you have in simplemrs.py but simplified; this means that I know that @oepen cares about these things... hope to have his attention on this issue. I'm not sure if @john-a-carroll can also make comments. The (defun read-mrs-predicate (stream)
(loop
for c = (peek-char nil stream nil nil)
then (peek-char nil stream nil nil)
while (and c (whitespacep c)) do (read-char stream nil nil))
(let* ((c (peek-char nil stream nil nil))
(string
(if (char= c #\")
(read stream nil nil)
(coerce
(loop
for c = (read-char stream nil nil)
while (and c (not (whitespacep c))
(not (member c '(#\< #\[ #\") :test #'char=)))
collect c
finally (when (and c (not (whitespacep c)))
(unread-char c stream)))
'string))))
(cond
((zerop (length string))
(error
"unexpected end of file in read-mrs-predicate() at position ~a"
(file-position stream)))
(*normalize-predicates-p* (normalize-predicate string))
((char= c #\") string)
(t (vsym string))))) If I got it right, it will not read
I prefer not to mix the conventions about the names (e.g., the surface predicate names convention of _lemma_pos_sense) with the semantics of MRS. In other words, I would prefer to follow the MrsRFC rather than PyDelphin. What I didn't get is the SQSYMBOL! The REGEX matches only a single quote followed by one char In the
|
I think you're right that there's a bug in the >>> import re
>>> r = re.compile(r"_?([^_\s]+_)*(_rel)?")
>>> r.match('_foo_n_unmatched')
<re.Match object; span=(0, 7), match='_foo_n_'> But your proposal also has the same bug on the other side: >>> r2 = re.compile(r"_?(_[^_\s]+)*(_rel)?")
>>> r2.match('_foo_n_unmatched')
<re.Match object; span=(0, 1), match='_'> Something like this would work better: >>> r3 = re.compile(r"_?([^_\s]+_)*[^_\s]+(_rel)?")
>>> r3.match('_foo_n_matched')
<re.Match object; span=(0, 14), match='_foo_n_matched'> But the next problem is to exclude the lnk characterization: >>> r3.match('_foo_n_matched<0:4> ')
<re.Match object; span=(0, 19), match='_foo_n_matched<0:4>'> If we change those character classes to >>> r4 = re.compile(r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?")
>>> r4.match('_foo_n_matched')
<re.Match object; span=(0, 14), match='_foo_n_matched'>
>>> r4.match('_foo_n_matched<0:4> ')
<re.Match object; span=(0, 14), match='_foo_n_matched'>
>>> r4.match('_<cat>/NN_u_unknown')
<re.Match object; span=(0, 19), match='_<cat>/NN_u_unknown'> Two notes about this:
The issues above are why I said that the MrsRFC page could use some revision. For instance, it might be good to say that
I agree. The reason the regex spells out the convention for unquoted surface predicates is that I rely on the structure to allow |
This is a deprecated form that persisted in the grammar matrix (and all customized grammars) until mid-2014: Also see the last two lines from the Type vs String paragraph of the PredicateRfc wiki:
The single-quoted variant was from Lisp's quoted symbols: https://www.gnu.org/software/emacs/manual/html_node/elisp/Quoting.html The only reason PyDelphin allows it is to be able to parse the output from old grammars and profiles. However, it seems like there is a bug in the current pattern as it only accepts single-character single-quoted symbols: >>> from delphin.codecs import simplemrs
>>> simplemrs.decode("[RELS: < [ 'a LBL: h0 ] > ]").rels
[<EP object (h0:a()) at 139694271815616>]
>>> simplemrs.decode("[RELS: < [ 'abc LBL: h0 ] > ]").rels
Traceback (most recent call last):
[...]
delphin.mrs._exceptions.MRSSyntaxError:
line 1, character 13
[RELS: < [ 'abc LBL: h0 ] > ]
^
MRSSyntaxError: expected: a feature I haven't heard any complaints about this, so maybe there isn't much demand for support for this legacy syntax. |
Hum, your regex is capturing the Lnk caracterization too, with or without following spaces: >>> import re
>>> r4 = re.compile(r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?")
>>> r4.match('_<cat>/NN_u_unknown<1,2>')
<re.Match object; span=(0, 24), match='_<cat>/NN_u_unknown<1,2>'>
>>> r4.match('_<cat>/NN_u_unknown<1,2> ')
<re.Match object; span=(0, 24), match='_<cat>/NN_u_unknown<1,2>'> |
I also opened an issue at delph-in/erg#45 |
The negative Lookahead would make my life hard! Not sure how to implement it using the parser combinators. |
In the WSJ profiles I found 1. |
You have an invalid lnk characterization string: >>> import re
>>> r4 = re.compile(r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?")
>>> r4.match('_<cat>/NN_u_unknown<0:4> ')
<re.Match object; span=(0, 19), match='_<cat>/NN_u_unknown'>
For ease of parsing, even better is if we could simply say:
Anything else is a lexing error. Control characters and quotes may not be in symbols. If you want control characters in things like predicates, they must be quoted. This would be completely oblivious to predicate conventions. Unfortunately this would mean that we could no longer parse some predicates that our tools currently output, so it would need to be a backwards incompatible change involving coordination with tool developers and a deprecation period. If you feel strongly about it, propose the changes to MrsRFC, invite discussion, and get consensus. |
Oh... sorry. My mistake. I didn't know about the token indices type of Link. Does Ace support it? I agree with your suggestion. The (coerce
(loop
for c = (read-char stream nil nil)
while (and c (not (whitespacep c))
(not (member c '(#\< #\[ #\") :test #'char=)))
collect c
finally (when (and c (not (whitespacep c)))
(unread-char c stream)))
'string) it collects chars that are:
Surely all problems with |
@arademaker that's not more restricted, it's more permissive. That means that |
Yet another error in the MRS grammar
but CARG appears first in the Ace output:
I don't know how to serialize the MRS from LKB. (the SimpleMRS format as you call), but the UI shows as the last argument! So it is not defined by the grammar, but by the tool. |
Yes you are write, they can. But the real problem is the (defun read-mrs-predicate (stream)
(loop for c = (peek-char nil stream nil nil)
then (peek-char nil stream nil nil)
while (and c (whitespacep c)) do (read-char stream nil nil))
(let* ((c (peek-char nil stream nil nil))
(string
(if (char= c #\")
;; if true read the whole string using the Lisp read function
(read stream nil nil)
;; else reads while NOT whitespace and NOT one of < or [ or "
(coerce
(loop for c = (read-char stream nil nil)
while (and c (not (whitespacep c))
(not (member c '(#\< #\[ #\") :test #'char=)))
collect c
finally (when (and c (not (whitespacep c)))
(unread-char c stream)))
'string))))
(cond
((zerop (length string))
(error
"unexpected end of file in read-mrs-predicate() at position ~a"
(file-position stream)))
(*normalize-predicates-p* (normalize-predicate string))
((char= c #\") string)
(t (vsym string))))) |
Hmm, I think you're right that this is overly-prescriptive, and I hadn't noticed that ACE puts it before the role-arguments. There should only be one CARG, but maybe the syntax is the wrong place to enforce that constraint. A minimal change would be something like:
Yes, it's not very ambiguous if you see |
Ace does not impose any particular order on the arguments of a predication. It is perfectly happy with
|
@arademaker did that MRS come out of ACE? Because it looks like ACE has the l += safe_snprintf(str+l, len-l, "%c[ %s<%d:%d> LBL: ", mrs_tab, ep->pred, ep->cfrom, ep->cto);
l += snprint_mrs_var_marked(str+l, len-l, ep->lbl, marked);
for(j=0;j<ep->nargs;j++)
{
l += safe_snprintf(str+l, len-l, " %s: ", ep->args[j].name);
if(ep->args[j].value)
l += snprint_mrs_var_marked(str+l, len-l, ep->args[j].value, marked);
else l += safe_snprintf(str+l, len-l, " (null)");
}
l += safe_snprintf(str+l, len-l, " ]"); (permalink: https://github.com/delph-in/ace/blob/19576aff0f7c74e6ff904405e2ca21f2c9afe8ff/mrs.c#L644-L653)
|
The MRS was edited manually by me, but parsed by Ace
actually, even if I add properties to handlers, see the non-sense MRS below (properties in h6)
Ace parsed, the error was fired during the generation:
You pointed to the pretty-printer, not to the parser, during parser, Ace is very flexible: |
Yes, I intentionally pointed to the simple-mrs printing code because I was showing that the position of My point of view is that PyDelphin should be able to read unconventional MRS serializations that come from tools like ACE or the LKB, but I don't feel a particular need to accommodate variations that come from hand-editing the MRSs. If you feel strongly about such flexibility and make a PR, I'll consider it since it probably doesn't cause any harm, but I'm not interested in making the change myself. Since this GitHub issue is for PyDelphin and not the DELPH-IN wiki, let me try to review the suggested changes to PyDelphin:
I think everything else is just questions about the DELPH-IN wiki, ACE, or the LKB, and such questions are probably better on the Discourse site as they'd reach a wider audience. |
Closing this. See the links above for further developments on those issues. |
Variable names are defined as:
https://github.com/delph-in/pydelphin/blob/develop/delphin/variable.py#L22
They are post-processed after being parsed as strings. But https://github.com/delph-in/docs/wiki/MrsRFC#simple says
That is, variables can't start with
-
, and the\w
class accepts unicode word characters without the ASCII flag in there.compile
, according to https://docs.python.org/3/library/re.html.The text was updated successfully, but these errors were encountered: