Rule-based handling of punctuation #14
Comments
I have a large dataset that I want to parse using UCCA. One kind of punctuation that is common in this dataset is recognized as a word after parsing. How can I deal with this problem?
For plain text, TUPA uses spaCy for tokenization and punctuation identification. This is the relevant line of code: https://github.com/danielhers/ucca/blob/master/ucca/convert.py#L769
Where |
Thank you for your detailed answer!
Punctuation has a fixed position in the graph: according to UCCA normalization rules, it has to be a child of the lowest common ancestor of its preceding and following terminals. There is therefore no need to make the classifier decide where it should go.
Punctuation should stay in the list of terminals so that the BiLSTM sees it when going over the text, but it should not go into the buffer as a node.
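To make the rule concrete, here is a minimal sketch of what such rule-based attachment could look like. It is not TUPA or ucca code: the `Node` class, `attach_punctuation`, and the `is_punct` flags are hypothetical names made up for illustration. It also assumes that punctuation with no preceding or following non-punctuation terminal falls back to the root, which the normalization rule as stated here does not cover.

```python
class Node:
    """Hypothetical tree node for illustration; not the ucca library's node class."""

    def __init__(self, label):
        self.label = label
        self.parent = None
        self.children = []

    def add(self, child):
        """Attach child under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Return the ancestors of node, from its parent up to the root."""
    path = []
    while node.parent is not None:
        node = node.parent
        path.append(node)
    return path


def lowest_common_ancestor(a, b):
    """Return the deepest node that is an ancestor of both a and b, or None."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    return None


def attach_punctuation(root, terminals, is_punct):
    """Attach each unattached punctuation terminal by rule, in text order.

    terminals: nodes in text order; non-punctuation ones are already in the tree.
    is_punct:  parallel list of booleans (e.g. from the tokenizer).
    """
    n = len(terminals)
    for i, term in enumerate(terminals):
        if not is_punct[i]:
            continue
        # Nearest non-punctuation terminals on each side, if any.
        prev = next((terminals[j] for j in range(i - 1, -1, -1) if not is_punct[j]), None)
        nxt = next((terminals[j] for j in range(i + 1, n) if not is_punct[j]), None)
        target = None
        if prev is not None and nxt is not None:
            target = lowest_common_ancestor(prev, nxt)
        if target is None:
            target = root  # assumption: fall back to the root at text boundaries
        target.add(term)  # appended last; child order is not restored here


# Toy tree for "he ( probably ) left ."
root = Node("root")
scene = root.add(Node("H"))
he = scene.add(Node("A")).add(Node("he"))
probably = scene.add(Node("D")).add(Node("probably"))
left = scene.add(Node("P")).add(Node("left"))
open_paren, close_paren, period = Node("("), Node(")"), Node(".")

terminals = [he, open_paren, probably, close_paren, left, period]
is_punct = [False, True, False, True, False, True]

# All terminals stay in the input sequence, but only non-punctuation
# terminals would be handed to the parser's buffer.
buffer = [t for t, p in zip(terminals, is_punct) if not p]

attach_punctuation(root, terminals, is_punct)
print(open_paren.parent.label, close_paren.parent.label, period.parent.label)
```

In this toy example both parentheses end up under the scene node `H`, because that is the lowest common ancestor of the words on either side, while the final period goes to the root only because of the fallback assumption.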