
Interface for Truncation #6

Open · UntotaufUrlaub opened this issue Jun 10, 2023 · 6 comments

@UntotaufUrlaub commented Jun 10, 2023

Hi,

I think an interface to configure whether the input is truncated would be pretty handy.

Of course one could handle the truncation before passing the texts into the metric, but that seems like unnecessary overhead for a use case that could come up quite often.
I have seen one fork start fiddling in this direction, but I assume that approach does not work: they configured the tokenizer to perform truncation, but if I see things right, the input text is not only processed by the tokenizer. For example DS_Sent in line 153: the creation of the entity graph.

I am wondering what you think would be the best approach to handle this issue.
I assume it would be best to handle the truncation of the text as early as possible, so that not every piece of code that uses it has to be altered and bloated. So I see 2 options:

  1. Pass a truncation argument to DiscoScorer(), which is then used in all the methods (LexicalChain, DS_SENT_NN, ...) to decide truncation.
    This would keep all calls of one scorer aligned, and the individual calls would retain their simple interface.
    But I am not sure whether this is flexible enough and whether truncation makes sense for the non-DS_... methods.
  2. Add a truncation argument to the 4 DS_... methods.

Both options would perform truncation in the scorer file and pass the preprocessed texts into the DS logic. The disadvantage is that one has to tokenize, truncate, and repack the tokens into text, which adds some computational overhead.
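
For concreteness, option 1 could look roughly like this (a minimal sketch assuming a Hugging Face tokenizer is available on the scorer; the truncation parameter and the _truncate helper are hypothetical, not the current DiscoScorer API):

from transformers import AutoTokenizer

class DiscoScorer:
    def __init__(self, model_name, truncation=None):
        # truncation: maximum length in word pieces, or None to disable
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.truncation = truncation

    def _truncate(self, text):
        if self.truncation is None:
            return text
        # Tokenize, cut, and repack into text, so every downstream consumer
        # (BERT tokenizer, UDPipe, entity graph) sees the same truncated input.
        ids = self.tokenizer(text, truncation=True,
                             max_length=self.truncation)["input_ids"]
        return self.tokenizer.decode(ids, skip_special_tokens=True)

    def DS_SENT_NN(self, summary, documents):
        summary = self._truncate(summary)
        documents = [self._truncate(d) for d in documents]
        ...  # existing DS logic operates on the truncated texts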

@andyweizhao (Contributor)

Hi,

When dealing with long input texts, I think the trivial solution is to use Longformer or one of its successors. Of course, there is still a limit on sequence length, so truncation is still needed at some point.

For truncation, I think the right solution is to handle it at the very beginning, before passing the input texts on to the metric. This is because the metric tokenizes every input text twice, producing words (via UDPipe) and word pieces (via the BERT tokenizer) for different purposes. It is necessary to ensure that the same text undergoes this two-step tokenization process.
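
Roughly, such early truncation could look like this (a minimal sketch assuming the Hugging Face BERT tokenizer; truncate_text and MAX_LEN are illustrative names, not part of the metric):

from transformers import AutoTokenizer

MAX_LEN = 512  # illustrative limit in word pieces
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate_text(text, max_length=MAX_LEN):
    # Truncate once, in word pieces, and decode back to a plain string so the
    # metric's two tokenization passes (UDPipe and BERT) see the same text.
    ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

long_document = "..."  # some document longer than the model limit
doc = truncate_text(long_document)  # then pass `doc` to the metric as usual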

Hope this helps.

@UntotaufUrlaub (Author)

Hi,

I created a fork for this topic. Would you please have a look? If you think the change is of any help, I could open a pull request.
I added truncation and adjusted the code so that one can use Longformer without manual code changes.
Do you know whether the suggested Longformer model performs comparably to Conpono and BERT-NLI?

@andyweizhao (Contributor)

Hi, thank you for this! I think your change is correct. Did you run tests for verification? Regarding Longformer, the answer is negative: it underperforms Conpono and BERT-NLI on SummEval.

@UntotaufUrlaub (Author)

Hi,
Yes, I did run some tests. The truncation part seems to work fine.
The Longformer part has some issues:
one bug I could not solve yet (maybe running on CPU will make debugging easier :) ),
and the config of the downloaded models needs to be adjusted to work with the Auto classes. See the open issue in the fork.

If you are interested, I could share the bug with you once I find time.

Thanks for the performance info.

@andyweizhao (Contributor)

Hi, I am happy to look into the issue once you provide me with details.

@UntotaufUrlaub (Author)

I created an example showing the error. Replace the ... in the doc definition with the content of the attached file. (It is too long to display here.)

from disco_score import DiscoScorer

doc = """..."""
summ = """the daily and sunday politics are on-air six days a week for much of the year reporting the political news from westminster and beyond."""

DiscoScorer(
    # device='cuda:0',
    device="cpu",
    model_name='allenai/longformer-base-4096',
    truncation=None,
).DS_SENT_NN(summ, [doc])

The final error is

File "/.../python3.11/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

(Should I add the whole stack trace?)

I tried to google it, but due to a lack of experience I couldn't find the connection between the issues discussed there and this problem.

I tried to analyse the error using this script:

from disco_score import DiscoScorer

doc = """..."""
summ = """the daily and sunday politics are on-air six days a week for much of the year reporting the political news from westminster and beyond."""

print("test 1")
try:
    DiscoScorer(    # device='cuda:0',
                    device="cpu",
                    model_name='allenai/longformer-base-4096',
                    truncation=None,
                ).DS_SENT_NN(summ, [doc[:len(doc) // 8]])
    print("Passed")
except:
    print("Error")

print("test 2")
try:
    DiscoScorer(    # device='cuda:0',
                    device="cpu",
                    model_name='allenai/longformer-base-4096',
                    truncation=None,
                ).DS_SENT_NN(summ, [doc[len(doc) // 8:len(doc) // 4]])
    print("Passed")
except:
    print("Error")

print("test 3")
try:
    DiscoScorer(    # device='cuda:0',
                    device="cpu",
                    model_name='allenai/longformer-base-4096',
                    truncation=None,
                ).DS_SENT_NN(summ, [doc[:len(doc) // 4]])
    print("Passed")
except:
    print("Error")

print("test 4")
try:
    DiscoScorer(    # device='cuda:0',
                    device="cpu",
                    model_name='allenai/longformer-base-4096',
                    truncation=3500,
                ).DS_SENT_NN(summ, [doc[:len(doc) // 4]])
    print("Passed")
except:
    print("Error")

print("test 5")
try:
    DiscoScorer(    # device='cuda:0',
                    device="cpu",
                    model_name='allenai/longformer-base-4096',
                    truncation=4000,
                ).DS_SENT_NN(summ, [doc[:len(doc) // 4]])
    print("Passed")
except:
    print("Error")

Result on my machine:

test 1
Passed
test 2
Passed
test 3
Error
test 4
Passed
test 5
Error

As test 3 produces an error while tests 1 and 2 don't, the problem seems to be not the content but the length. So maybe Longformer triggers an error for texts that are too long.
Test 4 passing shows that the error can be circumvented by truncating the text. However, I need to truncate to a smaller size than I would expect: I thought truncating to 4000 should be fine for Longformer's 4096 limit, but test 5 fails. So if the cause of the bug is the input length, then the truncation code seems flawed, at least when combined with Longformer; maybe there is a mismatch between tokenizers somewhere.
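
One way to check the length hypothesis (a minimal sketch; it counts the word pieces the Longformer tokenizer actually produces, since truncating in other units, e.g. characters or another tokenizer's pieces, can still exceed the 4096 positions and cause exactly this kind of IndexError in the embedding lookup):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
text = "..."  # the (truncated) document from test 5
n_pieces = len(tok(text)["input_ids"])
print(n_pieces, tok.model_max_length)  # an error is plausible if n_pieces exceeds the limit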

What is your opinion on this strange behavior?
