Strange behaviour of command-line detokenizer #43

mlforcada · 2019-04-01T13:55:17Z

No description provided.

mlforcada · 2019-04-01T13:57:08Z

I can't get the command-line detokenizer to work properly. I have tried this:

$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr

and I get

L & a p o s ; a m i t i é n o u s a f a i t f o r t s d & a p o s ; e s p r i t

What am I doing wrong?
Cheers!

alvations · 2019-04-01T14:35:23Z

Sorry about that, I think it was caused by a mistake in a previous version, which was patched in #36.

Could you try the latest version with pip install -U sacremoses? It should be version 0.0.13 now.
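
For illustration only: the reported output has the shape one would expect when a whole line is iterated character by character instead of token by token. The sketch below shows that failure mode in plain Python. naive_detokenize is a hypothetical stand-in, not sacremoses code, and this is not claimed to be the exact mistake fixed in #36.

```python
def naive_detokenize(tokens):
    # Hypothetical stand-in for a detokenizer: it joins whatever it iterates over.
    return " ".join(tokens)


token_line = "d&apos;"  # a tokenized fragment as it might arrive on stdin

# Passing the raw string iterates over its characters.
print(naive_detokenize(token_line))

# Splitting on whitespace first iterates over tokens.
print(naive_detokenize(token_line.split()))
```

The first call spreads every character apart, while the second leaves the token intact.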

It should work now:

alvas@ubi:~$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr

[out]:

L' amitié nous a fait forts d' esprit

Seems like this https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L481 isn't used when iterating... I'll check that first thing tomorrow morning =)

alvations · 2019-04-01T14:36:52Z

Hmmm, it seems the apostrophe handling for French isn't working as expected though:

From original moses:

$ echo "L'amitié nous a fait forts d'esprit" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr | ~/mosesdecoder/scripts/tokenizer/detokenizer.perl -l fr
Detokenizer Version $Revision: 4134 $
Language: fr
Tokenizer Version 1.1
Language: fr
Number of threads: 1
L'amitié nous a fait forts d'esprit

mlforcada · 2019-04-01T14:48:20Z

Wow, that was fast. Yes, apostrophes don't look good when detokenized (they are separated with spaces).

mlforcada · 2019-04-01T20:51:15Z

I'll be grateful if you let me know of any progress.

mlforcada · 2019-04-10T16:16:10Z

(1) Have you had a chance to solve the problem with spaces when detokenizing, @alvations ?

(2) Also, apparently, there is a way to specify the language when creating the tokenizer.

from sacremoses import MosesTokenizer
mt=MosesTokenizer(lang="fr")

It would be nice to document this in the README.md. By the way, when the language is not "en", "it" or "fr", but you specify it, the apostrophes are doubly escaped with backslashes for a reason that escapes me:

mt=MosesTokenizer(lang="es")
print(mt.tokenize("Un texto con 'comillas' para probar"))

produces

['Un', 'texto', 'con', '\\&apos;', 'comillas', '\\&apos;', 'para', 'probar']

where it would be more appropriate to have (as with lang="en")

['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']

By the way, I could consider offering my help with Catalan ("ca") tokenization. The current French and Italian model partly works, but Catalan has post-verbal pronouns such as

Informa-te'n

which should be tokenized as

Informa -te &apos;n

But first I'd have to get better acquainted with your code.

Cheers,
Mikel

alvations · 2019-04-12T01:18:24Z

@mlforcada Sorry for the delay!

The latest version should now have the French apostrophes patched.

from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='fr')
md = MosesDetokenizer(lang='fr')
md.detokenize(mt.tokenize("L'amitié nous a fait forts d'esprit")) == "L'amitié nous a fait forts d'esprit"

I was wrongly matching the end of the string in the token after the apostrophe clitic, i.e. re.search(u'^[{}]$'.format(self.IsAlpha), tokens[i + 1]). The original Moses detokenizer has no such anchor: $words[$i+1] =~ /^[\p{IsAlpha}]/.
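
A minimal sketch of why the trailing $ matters, using only Python's re module. ALPHA is a small hypothetical stand-in for the library's IsAlpha character set, and both function names are made up for this example.

```python
import re

# Hypothetical stand-in for the IsAlpha character class (assumption, not the real set).
ALPHA = "a-zA-Zàâçéèêëîïôùûü"


def next_is_alpha_anchored(token):
    # With "$" the pattern only accepts a token that is exactly one letter long.
    return re.search(u'^[{}]$'.format(ALPHA), token) is not None


def next_is_alpha_prefix(token):
    # Without "$" it is enough that the token starts with a letter,
    # which is what the Perl expression above checks.
    return re.search(u'^[{}]'.format(ALPHA), token) is not None


for token in ["amitié", "a", "123"]:
    print(token, next_is_alpha_anchored(token), next_is_alpha_prefix(token))
```

With the anchored pattern a multi-letter word such as "amitié" is rejected, so the clitic before it would not be re-attached.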

alvations · 2019-04-12T01:21:51Z

Regarding the Spanish escaping of the ampersand, I'm not able to reproduce it; it shouldn't be a problem with version >=0.0.13. The latest French patch is in version >=0.0.14.

Which version of sacremoses are you using?

>>> import sacremoses
>>> sacremoses.__version__
'0.0.19'

>>> from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
>>> mt=MosesTokenizer(lang="es")
>>> print(mt.tokenize("Un texto con 'comillas' para probar"))

['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']
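
A small sketch for checking a reported version against the thresholds mentioned in this thread. It assumes purely numeric dotted versions such as '0.0.19'; version_tuple and has_french_fix are hypothetical helpers, not part of sacremoses.

```python
FRENCH_CLITIC_FIX = "0.0.14"  # threshold named in this thread


def version_tuple(version):
    """Turn a dotted version such as '0.0.19' into (0, 0, 19)."""
    return tuple(int(part) for part in version.split("."))


def has_french_fix(installed):
    # Compare numerically; plain string comparison would rank '0.0.9' above '0.0.14'.
    return version_tuple(installed) >= version_tuple(FRENCH_CLITIC_FIX)


print(has_french_fix("0.0.19"))
print(has_french_fix("0.0.13"))
```

The string to pass in is the value of sacremoses.__version__ shown above.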

Also, having Catalan-specific rules would be awesome! I have a vested interest in Catalan text processing =)

Do you have a list of rules and words that should prevent weird splitting for Catalan?

mlforcada · 2019-04-12T09:01:47Z

Thanks a million, @alvations !
I updated. Sacremoses now says it is 0.0.19.
Detokenization for French works like a breeze now! Cheers!

Catalan rules for apostrophes and hyphens with pronouns, articles and prepositions:

Works as in French and Italian:

[dlmnts]'WORD → [dlmnts]&apos; WORD

Single pronoun after verb, apostrophe.

VERB'[lmnst] → VERB &apos;[lmnst]
VERB'ns → VERB &apos;ns
VERB'ls → VERB &apos;ls

Single pronoun, after verb, with hyphen:
VERB-(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi) → VERB -(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi)

Two pronouns, after verb, two hyphens (bit overgenerated, but should work)

VERB-(me|te|se|lo|la|li|nos|us|vos|los|les)-(em|el|la|li|en|ens|us|els|les|hi|ho) →
VERB -(me|te|se|lo|la|li|nos|us|vos|los|les) -(em|el|la|li|en|ens|us|els|les|hi|ho)

Two pronouns, apostrophe and hyphenated

VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en) → VERB &apos;(ns|ls) -(el|la|els|les|li|ho|hi|en)

Two pronouns, hyphenated and apostrophe

VERB'(me|te|se|li|)-(m|t|s|l|ns|ls) → VERB '(me|te|se|li) -(m|t|s|l|ns|ls)

In this last case, probably the second part could be processed with the single apostrophe rule above.

Thanks again,

@mlforcada

mlforcada · 2019-04-12T09:02:59Z

Aggh, the last one is wrong.
It should be

VERB-(me|te|se|li)'(m|t|s|l|ns|ls) → VERB -(me|te|se|li) &apos;(m|t|s|l|ns|ls)

Sorry about that!

alvations · 2019-04-13T03:43:32Z

Thanks @mlforcada! Let me see how I could convert the rules above =)
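
One possible way to express the rules above as regular expressions is sketched here. It is only an illustration in plain Python, not sacremoses code: split_catalan_clitics is a hypothetical helper, and merging the pronoun lists into two sets is a simplification that overgenerates (any combination of up to two enclitics is accepted, and ordinary hyphenated words ending in one of these forms would also be split).

```python
import re

# Pronoun forms copied from the rules above, merged into two sets (simplification).
HYPHEN_FORMS = "me|te|se|lo|la|li|nos|vos|us|los|les|ne|hi|em|el|en|ens|els|ho"
APOS_FORMS = "ns|ls|[lmnst]"

# One enclitic at the very end of the word, preceded by a word character.
ENCLITIC = re.compile(r"(?<=\w)(-(?:%s)|'(?:%s))$" % (HYPHEN_FORMS, APOS_FORMS))
# Elided form at the start of the word: d', l', m', n', t', s'.
PROCLITIC = re.compile(r"^([dlmnts]')(\w.*)$", re.IGNORECASE)


def split_catalan_clitics(word):
    """Split one whitespace-delimited word into tokens (hypothetical helper)."""
    tail = []
    while len(tail) < 2:  # the rules cover at most two pronouns after the verb
        match = ENCLITIC.search(word)
        if match is None:
            break
        tail.insert(0, match.group(1))   # keep enclitics in their original order
        word = word[:match.start(1)]     # strip the enclitic and look again
    match = PROCLITIC.match(word)
    head = [match.group(1), match.group(2)] if match else [word]
    # Escape apostrophes the way the tokenizer output in this thread does.
    return [token.replace("'", "&apos;") for token in head + tail]


print(" ".join(split_catalan_clitics("Informa-te'n")))
```

Stripping enclitics from the end one at a time means the mixed hyphen and apostrophe cases do not need rules of their own, which matches the remark that the second part can be handled by the single-apostrophe rule.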

alvations added the bug label Apr 12, 2019

alvations mentioned this issue Apr 12, 2019

Patching detokenize for french clitics #45

Merged

ZJaume mentioned this issue May 4, 2020

Consider sacremoses bitextor/bicleaner#35

Closed
