"p.m." is not tokenized as in the original script. #21

Open
pypae opened this issue Jan 22, 2019 · 2 comments

Comments

@pypae
Contributor

pypae commented Jan 22, 2019

I haven't figured out why yet, but the original script does not split off the final dot of "p.m." at the end of a sentence, while this port does.

The original script even explicitly leaves out p.m from its nonbreaking prefixes, so I'd expect the behavior seen in the port.
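
To illustrate why I'd expect the split, here is a rough sketch of how a nonbreaking-prefix check typically works. This is a simplified, hypothetical stand-in, not the actual Moses or sacremoses code: the function name `split_trailing_periods` and the prefix list are made up for illustration, and the real tokenizers apply additional rules.

```python
# Hypothetical, simplified nonbreaking-prefix check.
# NOT the Moses or sacremoses implementation; names and list are illustrative.
NONBREAKING_PREFIXES = {"Mr", "Dr", "etc"}  # "p.m" is deliberately absent


def split_trailing_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing '.' off each whitespace-separated word unless the
    part before the dot is a listed nonbreaking prefix."""
    out = []
    for word in text.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            # Not a known prefix: treat the dot as its own token.
            out.extend([stem, "."])
        else:
            out.append(word)
    return out


print(split_trailing_periods("Dr. Smith arrives at 5 p.m."))
# ['Dr.', 'Smith', 'arrives', 'at', '5', 'p.m', '.']
```

Under this simplified rule, "Dr." stays intact because "Dr" is in the list, while "p.m." is split because "p.m" is not.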

@alvations
Contributor

The original script recently added a new hack that changed this behavior: moses-smt/mosesdecoder#204

sacremoses doesn't account for this difference, and I'm really not sure whether it should.

@ZJaume
Collaborator

ZJaume commented Jun 4, 2020

Why shouldn't sacremoses include this?
