Interface for Truncation #6
Hi, When dealing with long input texts, I think the trivial solution is to use Longformer or its successors. Of course, there is still a limit on sequence length, so truncation is still needed at some point. For truncation, I think the right solution is to handle it at the very beginning, before passing the input texts to the metric. This is because the metric tokenizes every input text twice, producing words (via UDPipe) and word pieces (via the BERT tokenizer) for different purposes. It is necessary to ensure that the same text undergoes this two-step tokenization process. Hope this helps. |
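To illustrate the point above, here is a minimal sketch of truncating once, up front, so that both tokenization passes (words via UDPipe, word pieces via the BERT tokenizer) receive exactly the same string. The whitespace split below is a stand-in for real word-piece tokenization, not the project's actual code:

```python
def truncate_text(text: str, max_words: int) -> str:
    """Keep at most `max_words` whitespace-separated words.

    A toy stand-in for word-piece truncation; in practice you would
    truncate by model tokens and decode the surviving ids back to text.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])

doc = "one two three four five six"
truncated = truncate_text(doc, 4)
# Both tokenizers downstream now see the identical truncated text.
```

Because the truncation happens before the metric is called, the two-step tokenization inside the metric needs no changes.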
Hi, I created a fork for this topic. Would you please have a look? If you think the change is of any help, I could open a pull request. |
Hi, thank you for this! I think your change is correct. Did you run tests for verification? Regarding Longformer, the answer is negative -- it underperforms Conpono and BERT-NLI on SUMMEval. |
Hi, If you are interested, I could share the bug with you once I get time. Thanks for the performance info. |
Hi, I am happy to look into the issue once you provide me with details. |
I created an example showing the error. Replace the ... in the doc definition with the content of the attached file. (It is too long to display here.)
The final error is
(Should I add the whole stack trace?) I tried to google it, but due to a lack of experience I couldn't find the connection between the discussed issues and this problem. I tried to analyse the error using this script:
Result on my machine:
As test 3 produces an error while tests 1 and 2 don't, it seems the problem is not about the content but about the length. So maybe Longformer triggers an error for texts that are too long. What is your opinion on this strange behavior? |
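If the length hypothesis above is right, a simple probe can confirm it: count the tokens of each test input and compare against the model's maximum sequence length. The token counts and the `MAX_LEN` constant below are hypothetical placeholders (4096 is the commonly cited Longformer limit), not values from the actual failing run:

```python
MAX_LEN = 4096  # assumed Longformer sequence limit

def exceeds_limit(num_tokens: int, max_len: int = MAX_LEN) -> bool:
    """True if an input of `num_tokens` tokens would overflow the model."""
    return num_tokens > max_len

# Hypothetical token counts for the three tests described above:
tests = {"test1": 120, "test2": 900, "test3": 5200}
over = {name: exceeds_limit(n) for name, n in tests.items()}
# Only test3 exceeds the limit, matching the observed failure pattern.
```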
Hi,
I think an interface to configure whether the input is truncated would be pretty handy.
Of course, one could handle the truncation before passing the texts into the metric, but that seems like unnecessary overhead for a use case that could appear quite often.
I have seen that one fork started fiddling in this direction, but I assume this approach probably does not work. They configured the tokenizer to perform truncation, but if I see things right, the input text is not processed only by the tokenizer. For example, DS_Sent in line 153: the creation of the entity graph.
I am wondering what you think would be the best approach to handle this issue?
I assume it would be best to handle the truncation of the text as early as possible, so that not every code bit which uses it has to be altered and bloated. So I see two options:
This would make all calls of one scorer be aligned and the individual calls remain with their easy interface.
But I am not sure whether this is flexible enough and whether truncation makes sense for the non-DS_... methods.
Both methods would perform truncation in the scorer file and pass the preprocessed texts into the DS logic. The disadvantage is that one has to tokenize, truncate, and repack the tokens into text, which introduces some computational overhead.
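The scorer-level option described above can be sketched as a thin wrapper that truncates every input once and then hands the shortened texts to the downstream DS logic unchanged. Note that `ds_score` and the whitespace "tokenizer" here are hypothetical stand-ins for the project's real scoring function and word-piece tokenizer:

```python
def truncate(text: str, max_tokens: int) -> str:
    tokens = text.split()                  # stand-in for word-piece tokenization
    return " ".join(tokens[:max_tokens])   # repack tokens into plain text

def score_with_truncation(ds_score, system: str, reference: str,
                          max_tokens: int = 512) -> float:
    # Both texts are truncated here, in the scorer, so no code in the
    # downstream DS logic has to change.
    return ds_score(truncate(system, max_tokens),
                    truncate(reference, max_tokens))

# Usage with a dummy scorer that just compares word counts:
score = score_with_truncation(
    lambda s, r: float(len(s.split()) == len(r.split())),
    "a b c d e", "v w x y z", max_tokens=3)
```

The tokenize/truncate/repack round trip is exactly the overhead noted above; it is the price of keeping the truncation in one place.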