[Question]: Reproducing the score of official microsoft/llmlingua-2-xlm-roberta-large-meetingbank #156
Comments
Hi @xvyaward, thanks for your interest and the very detailed description.
Hope the explanations below help.

> Is there a standard full training script available? We also expect to train a compressor ourselves, including the word_embedding mentioned earlier.

Yes! We have provided the experiment code for LLMLingua-2 in ./experiments/llmlingua2. The training data for the compressor is also available on HuggingFace. You can run ./experiments/llmlingua2/data_collection/collect_data.sh first, which obtains word labels for the original data and filters out bad samples. Then use the train.sh script in ./experiments/llmlingua2/model_training to train the compressor. You may need to modify the training code to include special tokens during training.
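A minimal sketch of what that modification could look like, assuming the training uses Hugging Face transformers; the extra token strings below are placeholders, not the actual tokens from prompt_compressor.py:

```python
# Minimal sketch (not the official train_roberta.py): register extra special tokens
# and resize the word embedding matrix before fine-tuning the token classifier.
# The token strings below are placeholders -- take the real ones from prompt_compressor.py.
from transformers import AutoModelForTokenClassification, AutoTokenizer

base_model = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForTokenClassification.from_pretrained(base_model, num_labels=2)

extra_tokens = ["<placeholder_token_1>", "<placeholder_token_2>"]  # hypothetical tokens
num_added = tokenizer.add_special_tokens({"additional_special_tokens": extra_tokens})
if num_added > 0:
    # Grows the embedding from [250002, 1024] to [250002 + num_added, 1024].
    model.resize_token_embeddings(len(tokenizer))
```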
Hi @pzs19, thank you for your kind reply.
I used temperature=0.0 and top_p=1.0 following the paper, and I believe answers are truncated at "\n" during evaluation by experiments/llmlingua2/evaluation/metrics.py. However, I still can't reproduce the score of the official llmlingua-2-xlm-roberta-large-meetingbank. The in-domain meetingbank_qa score in particular drops significantly, from 73.6 to 68.
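For reference, the "\n" truncation described here amounts to something like the following sketch (the scoring itself lives in experiments/llmlingua2/evaluation/metrics.py; the helper name below is illustrative, not the actual function):

```python
# Sketch of the answer post-processing described above (illustrative, not copied
# from metrics.py): keep only the text before the first newline before scoring.
def truncate_at_newline(answer: str) -> str:
    return answer.split("\n")[0].strip()

print(truncate_at_newline("Seattle City Council\nIt also discussed..."))
# -> "Seattle City Council"
```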
So which answer is correct for reproducing the score of llmlingua-2-xlm-roberta-large-meetingbank using the official MeetingBank-LLMCompressed dataset? And if possible, could you share example code that handles special tokens during training? Thank you.
Hi @xvyaward, sorry for the misunderstanding. In my last response, I meant that if you want to add special tokens during training, you need to modify our training code. In our experiments, special tokens are not added during training.
Describe the issue
Following issue #155, I'm trying to reproduce the results of the official llmlingua-2-xlm-roberta-large-meetingbank model, using Mistral-7B as the black-box LLM.
Specifically, I tried to fine-tune the XLM-RoBERTa model on the officially provided dataset using this train.sh.
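As a point of reference, the released checkpoint can be loaded for a side-by-side comparison roughly as in the LLMLingua README; the compression rate, force_tokens, and local path below are illustrative placeholders:

```python
# Sketch of a side-by-side check between the released compressor and a locally
# fine-tuned one. Usage follows the LLMLingua README; the compression rate,
# force_tokens, and the local path are illustrative placeholders.
from llmlingua import PromptCompressor

official = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
mine = PromptCompressor(
    model_name="path/to/my_finetuned_xlm_roberta",  # placeholder path
    use_llmlingua2=True,
)

prompt = "..."  # a MeetingBank transcript
for compressor in (official, mine):
    out = compressor.compress_prompt(prompt, rate=0.33, force_tokens=["\n", "?"])
    print(out["compressed_prompt"][:200])
```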
Here is my detailed process:
Here are the current issues:
I guess this is related to the special tokens added in prompt_compressor.py, but the train_roberta.py example does nothing about them, so my fine-tuned model has the same word_embedding weight size as the original RoBERTa ([250002, 1024]).
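A quick way to check this kind of mismatch (a sketch; the local path is a placeholder):

```python
# Sketch: compare the word_embedding shapes of the released checkpoint and a
# locally fine-tuned model. The local path below is a placeholder.
from transformers import AutoModelForTokenClassification

official = AutoModelForTokenClassification.from_pretrained(
    "microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)
mine = AutoModelForTokenClassification.from_pretrained("path/to/my_finetuned_model")

print(official.get_input_embeddings().weight.shape)  # vocab rows of the released model
print(mine.get_input_embeddings().weight.shape)      # [250002, 1024] here, same as xlm-roberta-large
```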
If the official model was trained with the same process as the example provided here, could you please let me know what needs to be changed in the process above?
Thank you for reading.