Hyphen at line break removed #180
Comments
Actually, there is another not-so-clear word hyphenation example. Becomes: In the nxml file it is annotated as 'non-compartmental'. I believe both versions are valid, so I would probably not treat that as a bug (but I thought it was worth mentioning anyhow).
Thanks for the issue! Dehyphenization is tricky ;). I was aware of these issues, but it's very useful to have a dedicated issue to discuss them. What is implemented so far is very simplistic: it does not require dictionaries or resources, but it produces the errors you are mentioning. Basically, when there is a hyphen at the end of a line, we always dehyphenize. We could introduce additional rules to improve it, like checking whether the tokens contain numbers before concatenating them, and/or having a language-specific list of prefixes (like anti-, non-, post-). We might always have some rare errors/exceptions. Adding rules manually is endless and not really in the spirit of GROBID... Another approach could be to use a machine-learning based text cleaner/normalizer.
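The additional rules mentioned above (keep the hyphen next to digits, keep it after a known prefix) can be sketched as follows. This is a minimal illustration, not GROBID's actual implementation; the class and method names are hypothetical, and the prefix list is just an example.

```java
import java.util.Arrays;
import java.util.List;

public class Dehyphenizer {
    // Example language-specific prefixes that normally keep their hyphen
    // (hypothetical list, would need curation per language).
    private static final List<String> PREFIXES =
        Arrays.asList("anti", "non", "post", "pre", "self");

    /** Joins `head` (the part before the end-of-line hyphen) with `tail`
     *  (the first token on the next line). */
    public static String join(String head, String tail) {
        // Rule 1: never remove the hyphen next to a digit (e.g. "α2-integrin").
        if (endsWithDigit(head) || startsWithDigit(tail)) {
            return head + "-" + tail;
        }
        // Rule 2: keep the hyphen after a known prefix (e.g. "non-compartmental").
        if (PREFIXES.contains(head.toLowerCase())) {
            return head + "-" + tail;
        }
        // Default: assume plain line-break hyphenation and concatenate.
        return head + tail;
    }

    private static boolean endsWithDigit(String s) {
        return !s.isEmpty() && Character.isDigit(s.charAt(s.length() - 1));
    }

    private static boolean startsWithDigit(String s) {
        return !s.isEmpty() && Character.isDigit(s.charAt(0));
    }
}
```

As noted above, such rules will still leave rare exceptions, which is the limitation of any purely rule-based approach.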
Intuitively, I was going to claim that it isn't common for words to be broken across lines, but scanning through the same PDF, the evidence shows it is actually quite common. One approach could also be to look at other occurrences within the same document. I checked a few examples (around 7); the only one I couldn't find a second time so far was 'NON-MEM'. If it were up to me, I would like to have the option to include an element for the hyphens at line boundaries. Then I could do some post-processing. Would you be happy to use the PMC dataset as training data? Otherwise, that seems to be training data that should be fairly easily extracted from any PDF with XML / text data (at least in English). Maybe even without the PDF.
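The "look at other occurrences within the same document" idea above can be sketched as a simple frequency vote: count how often each variant (hyphenated vs. merged) appears elsewhere in the document text, and pick the more frequent one. This is a hypothetical illustration, not part of GROBID; all names are made up for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CorpusVote {
    /** Returns whichever form of head/tail is more frequent in docText,
     *  falling back to the merged form on a tie. */
    public static String resolve(String head, String tail, String docText) {
        int hyphenated = count(docText, head + "-" + tail);
        int merged = count(docText, head + tail);
        return merged >= hyphenated ? head + tail : head + "-" + tail;
    }

    // Counts literal (quoted) occurrences of word in text.
    private static int count(String text, String word) {
        Matcher m = Pattern.compile(Pattern.quote(word)).matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```

As the comment notes, this fails when the word appears only once in the document (like 'NON-MEM' above), so it would only complement, not replace, other rules.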
On 05/04/17 13:16, Daniel Ecer wrote:
Intuitively I was going to claim that it isn't common that words are broken across lines. Scanning through the same PDF the evidence shows it is quite common actually.
That's one of the many differences between MS Word and LaTeX: by default, Word does not "hyphenize" words (and did not offer an option to do so before v2010), whereas LaTeX has always had an option to do so (at least manually, to my knowledge dating back to the 90s). Since most scientific papers are written with LaTeX, you will indeed encounter many hyphenated words :)
-- Guillaume MULLER, PhD
As a complement, here is a blog post from one of the founders of Authorea about the ratio of scientific papers written with LaTeX (~18% of all articles but, as expected, largely dominating in a couple of domains).
We are also running into this issue. Example PDF: http://ecp.acponline.org/sepoct01/kent.pdf Is it possible to at least return the newlines, so we can modify the result ourselves based on some rules?
@borkdude thank you for reporting these errors. I think the best approach would be a more robust dehyphenization process (it normally does not work that badly...). The problems with outputting the end-of-line markers in the final TEI result are:
@lfoppiano hello Luca, would you have some time to look at these dehyphenization errors? My bad excuse: you're the last one who modified it :D :D
FYI, I'm checking the PDF.
What I would do is:
About the dehyphenisation: the current method using LayoutToken is not working well. Dehyphenisation using text is much better for the moment because it is more flexible with the surrounding spaces, which is why that one was used. The method using LayoutToken should be reviewed/extended, I think. Be careful that dehyphenize must be called only on certain fields where we are sure to have only text; performing it at clusteror level does not seem like the right moment, because we do not yet know the exact type of the current labelled segment.
The current dehyphenisation method using layout tokens is not complete. I would aim to merge the three methods and produce a single one using layout tokens, with the possibility of a more aggressive approach.
The idea was to use the clusteror to extract, and apply the dehyphenisation after the text is recomposed, not at the same moment.
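A single layout-token based pass, as proposed above, could look roughly like the following. This is a hedged sketch only: the `LayoutToken` here is a stripped-down stand-in for GROBID's class, with just the two fields the sketch needs, and the merge rule is deliberately naive (it always joins, with none of the digit/prefix safeguards discussed earlier).

```java
import java.util.ArrayList;
import java.util.List;

public class LayoutDehyphenise {
    // Minimal stand-in for GROBID's LayoutToken (illustrative only).
    static class LayoutToken {
        final String text;
        final boolean newLineAfter; // true if a line break follows this token
        LayoutToken(String text, boolean newLineAfter) {
            this.text = text;
            this.newLineAfter = newLineAfter;
        }
    }

    /** Merges the pattern [word, "-" followed by a line break, word]
     *  into a single token; other tokens pass through unchanged. */
    public static List<LayoutToken> dehyphenise(List<LayoutToken> tokens) {
        List<LayoutToken> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            LayoutToken t = tokens.get(i);
            if (i + 2 < tokens.size()
                    && tokens.get(i + 1).text.equals("-")
                    && tokens.get(i + 1).newLineAfter) {
                LayoutToken tail = tokens.get(i + 2);
                out.add(new LayoutToken(t.text + tail.text, tail.newLineAfter));
                i += 2; // skip the hyphen and the tail token
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

Working at the token level like this is what would let the bounding-box information survive the merge, which plain-text dehyphenisation loses.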
OK I see, you were talking about the abstract for the clusteror. As I said, the old-fashioned header model does not use LayoutToken for decoding the CRF results; it follows a different logic where the EOLs are (voluntarily) not preserved. They were actually used to represent two discontinuous segments for the same field, for instance for keyword or author fields... hence the different dehyphenization method (which works fine in the …).

It would be necessary to entirely rewrite the method HeaderParser.resultExtraction() (with the clusteror for decoding CRF results) and pay attention to some other things in BiblioItem (there is a special hack to propagate LayoutToken for authors, in order to make bounding boxes for authors present in the TEI; we would need to find a way to generalize that, in order to keep the layout tokens for any field when creating the corresponding bounding boxes).

For me this was a different task, issue #136, to have every aspect updated at the same time, which is why I mentioned that it is quite a lot of work (and also why it has been an open issue for a year and a half ;) ). Then the textual fields extracted from the header would be aligned with all the other models, and ready to use the common dehyphenization method.
OK, so I will focus on the dehypenise() method using LayoutTokens, and we could have a version that takes text and tokenizes it under the hood. I will see whether to also merge in the aggressive version or not.
I've implemented something to fix the dehyphenisation. I'm sure it will require a couple of iterations.
I've run the PubMed end-to-end evaluation.
@kermitt2 do you see any differences (hopefully a little improvement) with the previous e2e measures?
There are differences; in particular I see a loss in citation metadata and an improvement on the abstract. However, the only way to be sure is to run it on the same architecture with and without the fixes (in case you have a branch). It also depends on whether you use consolidation or not.
I've checked based on the first comment, and with 0.5.6 the hyphens are safe :-) I also checked the comment from @borkdude, and we improved the results on kent.pdf. @borkdude, could you have a look, especially if you have other cases?
In the first PubMed evaluation manuscript, 'α2-integrin' appears at a line break a number of times, e.g.:
"was mediated through the inhibition of expression of α2-
integrin (1,2). Integrins are receptors that mediate attachment"
In the output it becomes:
"was mediated through the inhibition of expression of α 2integrin..."
(The space is another issue #179)
In some cases it may be desirable to remove the hyphen, but not in this case. Probably never when there is a number?