note on spacesafter #58

Closed
jwijffels opened this issue Oct 22, 2019 · 5 comments

@jwijffels

From the UDPipe user manual (https://ufal.mff.cuni.cz/udpipe/users-manual):

The misc field can contain SpacesBefore=, SpacesAfter= and SpacesInToken=, whose values use the following escape sequences:

  • \s: space
  • \t: tab
  • \r: CR character
  • \n: LF character
  • \p: | (pipe character)
  • \\: \ (backslash character)

You can see that in e.g.

> library(udpipe)
> x <- udpipe(" .It remains all spaces. You see\n\n\n. We started a new paragraph.", "english")
> x[, c("doc_id", "paragraph_id", "sentence_id", "term_id", "token", "misc")]
   doc_id paragraph_id sentence_id term_id     token                           misc
1    doc1            1           1       1         . SpacesBefore=\\s|SpaceAfter=No
2    doc1            1           2       2        It                           <NA>
3    doc1            1           2       3   remains                           <NA>
4    doc1            1           2       4       all                           <NA>
5    doc1            1           2       5    spaces                  SpaceAfter=No
6    doc1            1           2       6         .                           <NA>
7    doc1            1           3       7       You                           <NA>
8    doc1            1           3       8       see          SpacesAfter=\\n\\n\\n
9    doc1            2           4       9         .                           <NA>
10   doc1            2           5      10        We                           <NA>
11   doc1            2           5      11   started                           <NA>
12   doc1            2           5      12         a                           <NA>
13   doc1            2           5      13       new                           <NA>
14   doc1            2           5      14 paragraph                  SpaceAfter=No
15   doc1            2           5      15         .                SpacesAfter=\\n

This matches the manual, except for the last row (SpacesAfter=\\n on the final token). That one is caused by a bug in the R package I maintain (bnosac/udpipe#27), which I still need to fix.
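
To work with these values programmatically, the misc column first has to be split into its individual attributes. A minimal sketch in base R, assuming attributes are separated by `|` and written either as `key=value` or as a bare key (`parse_misc` is a hypothetical helper name, not a function exported by udpipe):

```r
# Hypothetical helper (not part of udpipe): split one MISC string such as
# "SpacesBefore=\\s|SpaceAfter=No" into a named character vector.
parse_misc <- function(misc) {
  # NA or an empty string means the token has no MISC attributes.
  if (is.na(misc) || misc == "") return(character(0))
  # Splitting on "|" is safe because a literal pipe inside a value is escaped as \p.
  fields <- strsplit(misc, "|", fixed = TRUE)[[1]]
  # Split each attribute at the first "=" only, so values may contain "=".
  pos <- regexpr("=", fields, fixed = TRUE)
  keys <- ifelse(pos > 0, substr(fields, 1, pos - 1), fields)
  values <- ifelse(pos > 0, substr(fields, pos + 1, nchar(fields)), NA_character_)
  stats::setNames(values, keys)
}

# Usage on the first row of the table above (values are still escaped).
parse_misc("SpacesBefore=\\s|SpaceAfter=No")
```

The values come back still escaped; decoding them is a separate step.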

By default, UDPipe uses custom MISC fields to store all spaces in the original document. This markup is backward compatible with CoNLL-U v2 SpaceAfter=No feature. This markup can be utilized by the plaintext output format, which allows reconstructing the original document.

Note that in theory not only spaces, but also other original content can be saved in this way (for example XML tags if the input was encoded in an XML file).

The markup uses the following MISC fields on tokens (not words in multi-word tokens):

  • SpacesBefore=content (by default empty): spaces/other content preceding the token
  • SpacesAfter=content (by default a space if SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
  • SpacesInToken=content (by default equal to the FORM of the token): FORM of the token including original spaces (this is needed only if tokens are allowed to contain spaces and a token contains a tab or newline characters)

The content of all the three fields must be escaped to allow storing tabs and newlines. The following C-like schema is used:

  • \s: space
  • \t: tab
  • \r: CR character
  • \n: LF character
  • \p: | (pipe character)
  • \\: \ (backslash character)
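
The escape schema and the SpacesAfter default can be sketched in base R as follows. Both function names are hypothetical and not part of udpipe. The sketch assumes that an escaped value is scanned left to right, so an escaped backslash followed by `n` decodes to a backslash plus `n` rather than a newline, and that unknown escape pairs are left unchanged.

```r
# Hypothetical helper: decode the C-like escapes used in
# SpacesBefore / SpacesAfter / SpacesInToken values.
unescape_spaces <- function(x) {
  map <- c("\\s" = " ", "\\t" = "\t", "\\r" = "\r", "\\n" = "\n",
           "\\p" = "|", "\\\\" = "\\")
  vapply(x, function(value) {
    if (is.na(value)) return(NA_character_)
    # Tokenise into escape pairs (backslash + one character) and ordinary
    # characters; a single left-to-right pass avoids decoding an escaped
    # backslash followed by "n" as a newline.
    pieces <- regmatches(value, gregexpr("\\\\.|[^\\\\]", value, perl = TRUE))[[1]]
    known <- pieces %in% names(map)
    pieces[known] <- map[pieces[known]]
    paste(pieces, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: the whitespace that follows each token, applying the
# default described above (a single space unless SpaceAfter=No is present).
spaces_after <- function(misc) {
  vapply(misc, function(m) {
    if (is.na(m)) return(" ")
    fields <- strsplit(m, "|", fixed = TRUE)[[1]]
    explicit <- fields[startsWith(fields, "SpacesAfter=")]
    if (length(explicit) > 0) {
      return(unescape_spaces(sub("SpacesAfter=", "", explicit[1], fixed = TRUE)))
    }
    if ("SpaceAfter=No" %in% fields) "" else " "
  }, character(1), USE.NAMES = FALSE)
}

# Usage with three MISC values taken from the example output.
spaces_after(c(NA, "SpaceAfter=No", "SpacesAfter=\\n\\n\\n"))
```

Concatenating each token with its `spaces_after()` value (plus any decoded SpacesBefore content) is the basic idea behind reconstructing the original text.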
@jwijffels
Author

This was just to inform you.

@jwijffels
Author

FYI: this function reconstructs the text from a udpipe-tokenised dataset: https://github.com/bnosac/udpipe/blob/master/R/udpipe_reconstruct.R

@taylor-arnold
Owner

Oh, that's actually even better (reconstructing the text is what I really always want anyway)! Thanks again!

@taylor-arnold
Owner

I needed to push out the updated 3.0.0 ahead of a workshop next week, but will be working on more minor revisions for 3.0.1; I'll probably just return something similar to the text_with_ws that spaCy yields.

@jwijffels
Author

Good luck with the workshop.
