note on spacesafter #58

jwijffels · 2019-10-22T12:55:59Z

From the UDPipe user's manual: https://ufal.mff.cuni.cz/udpipe/users-manual

Basically, this means the misc field can contain SpacesBefore=, SpacesAfter= and SpacesInToken= entries,
whose values use the following escape sequences:

\s: space
\t: tab
\r: CR character
\n: LF character
\p: | (pipe character)
\\: \ (backslash character)

You can see this in, e.g.:

> library(udpipe)
> x <- udpipe(" .It remains all spaces. You see\n\n\n. We started a new paragraph.", "english")
> x[, c("doc_id", "paragraph_id", "sentence_id", "term_id", "token", "misc")]
   doc_id paragraph_id sentence_id term_id     token                           misc
1    doc1            1           1       1         . SpacesBefore=\\s|SpaceAfter=No
2    doc1            1           2       2        It                           <NA>
3    doc1            1           2       3   remains                           <NA>
4    doc1            1           2       4       all                           <NA>
5    doc1            1           2       5    spaces                  SpaceAfter=No
6    doc1            1           2       6         .                           <NA>
7    doc1            1           3       7       You                           <NA>
8    doc1            1           3       8       see          SpacesAfter=\\n\\n\\n
9    doc1            2           4       9         .                           <NA>
10   doc1            2           5      10        We                           <NA>
11   doc1            2           5      11   started                           <NA>
12   doc1            2           5      12         a                           <NA>
13   doc1            2           5      13       new                           <NA>
14   doc1            2           5      14 paragraph                  SpaceAfter=No
15   doc1            2           5      15         .                SpacesAfter=\\n

The last row is an exception: it is caused by a bug in the R package I maintain (bnosac/udpipe#27), which I still need to fix.
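To make the misc column above easier to work with, here is a minimal base R sketch that pulls out the SpacesBefore/SpacesAfter entries and fills in the defaults the UDPipe manual describes (SpacesBefore is empty; SpacesAfter is a single space unless SpaceAfter=No is present). The helper name `effective_spaces` is hypothetical and not part of the udpipe package, and the values it returns are still in escaped form.

```r
# Hypothetical helper (not part of udpipe): effective, still escaped,
# SpacesBefore/SpacesAfter values for a character vector of MISC strings.
effective_spaces <- function(misc) {
  misc <- as.character(misc)
  get_field <- function(m, key) {
    if (is.na(m)) return(NA_character_)
    # MISC entries are separated by "|"; a literal pipe in a value is escaped as \p
    parts <- strsplit(m, "|", fixed = TRUE)[[1]]
    hit <- parts[startsWith(parts, paste0(key, "="))]
    if (length(hit) == 0) NA_character_ else substring(hit[1], nchar(key) + 2)
  }
  before   <- vapply(misc, get_field, character(1), key = "SpacesBefore", USE.NAMES = FALSE)
  after    <- vapply(misc, get_field, character(1), key = "SpacesAfter",  USE.NAMES = FALSE)
  no_space <- vapply(misc, get_field, character(1), key = "SpaceAfter",   USE.NAMES = FALSE)

  # Defaults: SpacesBefore is empty; SpacesAfter is one space (\s)
  # unless the CoNLL-U feature SpaceAfter=No is present.
  before[is.na(before)] <- ""
  default_after <- ifelse(!is.na(no_space) & no_space == "No", "", "\\s")
  missing_after <- is.na(after)
  after[missing_after] <- default_after[missing_after]

  data.frame(spaces_before = before, spaces_after = after, stringsAsFactors = FALSE)
}

# MISC values taken from rows 1, 2, 5 and 8 of the output above
misc <- c("SpacesBefore=\\s|SpaceAfter=No", NA, "SpaceAfter=No", "SpacesAfter=\\n\\n\\n")
effective_spaces(misc)
```

With the real data this would be called as `effective_spaces(x$misc)`.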

By default, UDPipe uses custom MISC fields to store all spaces in the original document. This markup is backward compatible with CoNLL-U v2 SpaceAfter=No feature. This markup can be utilized by the plaintext output format, which allows reconstructing the original document.

Note that in theory not only spaces, but also other original content can be saved in this way (for example XML tags if the input was encoded in an XML file).

The markup uses the following MISC fields on tokens (not words in multi-word tokens):

SpacesBefore=content (by default empty): spaces/other content preceding the token
SpacesAfter=content (by default a space if SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
SpacesInToken=content (by default equal to the FORM of the token): FORM of the token including original spaces (this is needed only if tokens are allowed to contain spaces and a token contains a tab or newline characters)

The content of all the three fields must be escaped to allow storing tabs and newlines. The following C-like schema is used:

\s: space
\t: tab
\r: CR character
\n: LF character
\p: | (pipe character)
\\: \ (backslash character)
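As an illustration of the escape scheme listed above, here is a small base R sketch that turns an escaped value back into the original characters. The function name `unescape_spaces` is made up for this note and is not a udpipe function. Keep in mind that R's print method appears to double backslashes, so the `\\n\\n\\n` shown in the console output corresponds to the stored string `\n\n\n` (a backslash followed by `n`, three times).

```r
# Hypothetical helper (not part of udpipe): decode the escapes used in
# SpacesBefore / SpacesAfter / SpacesInToken values.
unescape_spaces <- function(x) {
  map <- c(s = " ", t = "\t", r = "\r", n = "\n", p = "|", "\\" = "\\")
  chars <- strsplit(x, "")[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    if (chars[i] == "\\" && i < length(chars) && chars[i + 1] %in% names(map)) {
      # Known escape: consume the backslash and the character after it
      out <- c(out, map[[chars[i + 1]]])
      i <- i + 2
    } else {
      # Anything else is copied through unchanged
      out <- c(out, chars[i])
      i <- i + 1
    }
  }
  paste(out, collapse = "")
}

unescape_spaces("\\n\\n\\n")  # three LF characters
unescape_spaces("\\s")        # a single space
```

Scanning left to right, rather than chaining `gsub()` calls, avoids misreading an escaped backslash followed by a letter (for example `\\n` in the stored value) as a newline escape.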


jwijffels · 2019-10-22T12:57:12Z

This was just to inform you.

jwijffels · 2019-10-22T21:25:33Z

FYI: this function reconstructs the text from a udpipe-tokenised dataset: https://github.com/bnosac/udpipe/blob/master/R/udpipe_reconstruct.R

taylor-arnold · 2019-10-23T14:02:02Z

Oh, that's actually even better (reconstructing the text is what I really always want anyway)! Thanks again!

taylor-arnold · 2019-10-23T14:03:01Z

I needed to push out the updated 3.0.0 ahead of a workshop next week, but will be working on more minor revisions for 3.0.1; I'll probably just return something similar to the text_with_ws that spaCy yields.

jwijffels · 2019-10-23T14:09:37Z

Good luck with the workshop.

jwijffels closed this as completed Oct 22, 2019

taylor-arnold mentioned this issue Oct 31, 2019

cnlp_annotate crashes when the string is empty #61

Closed
