Strange behaviour of command-line detokenizer #43

Open
mlforcada opened this issue Apr 1, 2019 · 11 comments
Labels
bug Something isn't working

@mlforcada
Author

I can't get the command-line detokenizer to work properly. I have tried this:

$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr

and I get

L & a p o s ; a m i t i é n o u s a f a i t f o r t s d & a p o s ; e s p r i t

What am I doing wrong?
Cheers!

@alvations
Contributor

alvations commented Apr 1, 2019

Sorry about that. I think it was caused by a mistake in a previous version, which was patched in #36.

Could you try the latest version with pip install -U sacremoses? It should be version 0.0.13 now.

It should work now:

alvas@ubi:~$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr

[out]:

L' amitié nous a fait forts d' esprit

Seems like this https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L481 isn't used when iterating... I'll check that first thing tomorrow morning =)
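
The spaced-out output in the original report is what one would see if a string were iterated character by character where a list of tokens was expected. The thread does not say whether that is exactly what the code did before #36, so the sketch below only illustrates the general pitfall. The helper name join_tokens is hypothetical and is not part of sacremoses.

```python
def join_tokens(tokens):
    """Join tokens with single spaces (hypothetical helper, not sacremoses API)."""
    return " ".join(tokens)


tokenized_line = "L&apos; amitié"

# Passing the raw string: Python iterates over its characters.
spaced_out = join_tokens(tokenized_line)

# Passing a list of tokens: the tokens are kept whole.
intact = join_tokens(tokenized_line.split())

print(spaced_out)
print(intact)
```

Splitting the tokenized line on whitespace before handing it to a detokenizer-like function avoids the problem.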

@alvations
Contributor

Hmmm, it seems the apostrophe for French isn't working as expected, though:

From original moses:

$ echo "L'amitié nous a fait forts d'esprit" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr | ~/mosesdecoder/scripts/tokenizer/detokenizer.perl -l fr
Detokenizer Version $Revision: 4134 $
Language: fr
Tokenizer Version 1.1
Language: fr
Number of threads: 1
L'amitié nous a fait forts d'esprit

@mlforcada
Author

Wow, that was fast. Yes, apostrophes don't look good when detokenized (they are separated with spaces).

@mlforcada
Author

I'll be grateful if you let me know of any progress.

@mlforcada
Author

(1) Have you had a chance to solve the problem with spaces when detokenizing, @alvations ?

(2) Also, apparently, there is a way to specify the language when creating the tokenizer.

from sacremoses import MosesTokenizer
mt=MosesTokenizer(lang="fr")

It would be nice to document this in the README.md. By the way, when you specify a language other than "en", "it" or "fr", the apostrophes are doubly escaped with backslashes for a reason that escapes me:

mt=MosesTokenizer(lang="es")
print(mt.tokenize("Un texto con 'comillas' para probar"))

produces

['Un', 'texto', 'con', '\\'', 'comillas', '\\'', 'para', 'probar']

where it would be more appropriate to have (as with lang="en")

['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']

By the way, I could consider offering my help with Catalan ("ca") tokenization. The current French and Italian model partly works, but Catalan has post-verbal pronouns such as

Informa-te'n

which should be tokenized as

Informa -te 'n

But first I'd have to get better acquainted with your code.

Cheers,
Mikel

@alvations
Contributor

alvations commented Apr 12, 2019

@mlforcada Sorry for the delay!

The latest version should now have the French apostrophes patched.

from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='fr')
md = MosesDetokenizer(lang='fr')
md.detokenize(mt.tokenize("L'amitié nous a fait forts d'esprit")) == "L'amitié nous a fait forts d'esprit"

I was matching the end-of-string symbol in the token after the apostrophe clitics, which was wrong: re.search(u'^[{}]$'.format(self.IsAlpha), tokens[i + 1])). The original Moses detokenizer doesn't have it: ($words[$i+1] =~ /^[\p{IsAlpha}]/))
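
A minimal sketch of why the trailing $ matters. The character class [^\W\d_] is only a stand-in for the IsAlpha class used by sacremoses (an assumption made so the example runs with the standard library alone), and the function name is hypothetical.

```python
import re

# Stand-in for sacremoses' IsAlpha character class (assumption).
ALPHA = r"[^\W\d_]"

# With "$", only a token that is exactly one alphabetic character matches.
anchored = re.compile(r"^%s$" % ALPHA)

# Without "$", any token that starts with an alphabetic character matches,
# which mirrors the Perl test /^[\p{IsAlpha}]/.
prefix_only = re.compile(r"^%s" % ALPHA)


def starts_alpha(token, pattern):
    """Return True if the pattern matches the token."""
    return pattern.search(token) is not None


print(starts_alpha("amitié", anchored))     # multi-letter token, anchored pattern
print(starts_alpha("amitié", prefix_only))  # multi-letter token, prefix pattern
```

With the anchored pattern, a token such as "amitié" after "L'" is not recognized as alphabetic, so the space after the apostrophe is kept.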

@alvations
Contributor

alvations commented Apr 12, 2019

Regarding the Spanish escaping of the ampersand, I'm not able to reproduce it; it shouldn't be a problem with version >=0.0.13. The latest French patch is in >=0.0.14.

Which version of sacremoses are you using?

>>> import sacremoses
>>> sacremoses.__version__
0.0.19

>>> from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
>>> mt=MosesTokenizer(lang="es")
>>> print(mt.tokenize("Un texto con 'comillas' para probar"))

['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']

Also, having Catalan-specific rules would be awesome! I have a vested interest in Catalan text processing =)

Do you have a list of rules and words that should prevent weird splitting for Catalan?

@alvations alvations added the bug Something isn't working label Apr 12, 2019
@mlforcada
Author

Thanks a million, @alvations !
I updated. Sacremoses now says it is 0.0.19.
Detokenization for French works like a breeze now! Cheers!

Catalan rules for apostrophes and hyphens with pronouns, articles and prepositions:

Works as in French and Italian:

[dlmnts]'WORD → [dlmnts]' WORD

Single pronoun after verb, apostrophe.

VERB'[lmnst] → VERB '[lmnst]
VERB'ns → VERB 'ns
VERB'ls → VERB 'ls

Single pronoun, after verb, with hyphen:
VERB-(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi) → VERB -(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi)

Two pronouns, after verb, two hyphens (bit overgenerated, but should work)

VERB-(me|te|se|lo|la|li|nos|us|vos|los|les)-(em|el|la|li|en|ens|us|els|les|hi|ho) →
VERB -(me|te|se|lo|la|li|nos|us|vos|los|les) -(em|el|la|li|en|ens|us|els|les|hi|ho)

Two pronouns, apostrophe and hyphenated

VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en) → VERB '(ns|ls) -(el|la|els|les|li|ho|hi|en)

Two pronouns, hyphenated and apostrophe

VERB'(me|te|se|li|)-(m|t|s|l|ns|ls) → VERB '(me|te|se|li) -(m|t|s|l|ns|ls)

In this last case, probably the second part could be processed with the single apostrophe rule above.

Thanks again,

@mlforcada 

@mlforcada
Author

Aggh, the last one is wrong.
It should be

VERB-(me|te|se|li|)'(m|t|s|l|ns|ls) → VERB -(me|te|se|li) '(m|t|s|l|ns|ls)

Sorry about that!

@alvations
Contributor

Thanks @mlforcada! Let me see how I could convert the rules above =)
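
One possible starting point for converting the rules, as a rough sketch rather than the sacremoses implementation. It peels up to two enclitic pronouns off the end of a word and one proclitic off the front. As a simplification it merges the first-position and second-position hyphenated pronoun lists into a single list, so it overgenerates more than the rules as written. The function name split_catalan_clitics is hypothetical.

```python
import re

# Hyphenated pronouns: union of both lists in the rules above (simplification).
HYPHENATED = r"-(?:me|te|se|lo|la|li|nos|vos|us|los|les|ne|hi|em|el|en|ens|els|ho)"
# Apostrophized pronouns after the verb: 'l 'm 'n 's 't 'ns 'ls
APOSTROPHIZED = r"'(?:ns|ls|[lmnst])"

ENCLITIC = re.compile(r"(?:%s|%s)$" % (HYPHENATED, APOSTROPHIZED), re.IGNORECASE)
# Proclitic article/preposition/pronoun before a word: d' l' m' n' t' s'
PROCLITIC = re.compile(r"^[dlmnts]'(?=\w)", re.IGNORECASE)


def split_catalan_clitics(word):
    """Split one whitespace-delimited word into [proclitic] + stem + [enclitics]."""
    trailing = []
    for _ in range(2):  # the rules allow at most two pronouns after the verb
        match = ENCLITIC.search(word)
        if match is None or match.start() == 0:
            break
        trailing.insert(0, match.group())
        word = word[:match.start()]

    leading = []
    match = PROCLITIC.match(word)
    if match is not None:
        leading.append(match.group())
        word = word[match.end():]

    return leading + [word] + trailing


print(split_catalan_clitics("Informa-te'n"))
print(split_catalan_clitics("l'amistat"))
```

Because the sketch does not check that the stem is a verb, it will also split hyphenated non-verbs that happen to end in one of these strings, so a list of exceptions would likely be needed.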
