
[Question]: Reproducing the score of official microsoft/llmlingua-2-xlm-roberta-large-meetingbank #156

xvyaward opened this issue May 23, 2024 · 6 comments

xvyaward commented May 23, 2024

Describe the issue

Following issue #155, I'm trying to reproduce the results of the official llmlingua-2-xlm-roberta-large-meetingbank model using Mistral-7B as the black-box LLM.

Specifically, I tried to fine-tune the XLM-RoBERTa model on the officially provided dataset using train.sh.

Here is my detailed process:

  1. Format, label, and filter the official dataset following collect_data.sh.
  2. Fine-tune the XLM-RoBERTa model using train.sh, with the hyperparameters from the LLMLingua-2 paper.
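For evaluation, both the official checkpoint and my fine-tuned one go through the same PromptCompressor entry point, roughly like this (a sketch only; the local path and the compression arguments are placeholders, not my exact evaluation setup):

```python
from llmlingua import PromptCompressor

official = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
mine = PromptCompressor(
    model_name="path/to/my_finetuned_xlm_roberta_large",  # checkpoint produced by train.sh
    use_llmlingua2=True,
)

context = "..."  # a MeetingBank transcript
compressed = official.compress_prompt(context, rate=0.33, force_tokens=["\n", "?"])
```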

Here are the current issues:

  1. It's hard to reproduce the Table 4 results of the LLMLingua-2 paper, or even the scores in issue #155. Here are my reproduced results:
| 2,000-token constraint | MeetingBank QA | MeetingBank Summary | LongBench avg. | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLMLingua-2, official model weights | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| LLMLingua-2, reproduced with fine-tuning | 68.95 | 30.05 | 24.67 | 9.14 | 33.91 | 26.49 | 29.12 |
  2. I found that the official llmlingua-2-xlm-roberta-large-meetingbank checkpoint has a word_embedding of size [250102, 1024], which is larger than the original XLM-RoBERTa size of [250002, 1024].
    I guess this is related to the special tokens added in prompt_compressor.py, but the train_roberta.py example does nothing about this, so my fine-tuned model keeps the original word_embedding size of [250002, 1024].
  • I also tried resizing the token embeddings first and then fine-tuning (see the sketch after this list), but the results were almost the same.
  3. I guess the example in train.sh does not use the filtered data, which is named annotation_kept_cs512_meetingbank_train_formated.pt in collect_data.sh. This seems like a minor issue :)
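For reference, the embedding resize I tried in point 2 looks roughly like this (a minimal sketch; the 100 added token names are placeholders I made up, not whatever the official checkpoint actually uses):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large", num_labels=2)

# 250102 - 250002 = 100 extra embedding rows in the official checkpoint.
extra_tokens = [f"[NEW{i:02d}]" for i in range(100)]  # placeholder names
tokenizer.add_special_tokens({"additional_special_tokens": extra_tokens})
model.resize_token_embeddings(len(tokenizer))

print(model.get_input_embeddings().weight.shape)  # torch.Size([250102, 1024])
```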

If the official model was trained with the same process as the example provided here, could you please let me know what needs to change in the steps above?
Thank you for reading.

xvyaward added the question (Further information is requested) label on May 23, 2024
iofu728 (Contributor) commented May 24, 2024

Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing the detailed experimental results.

Hi @pzs19, could you provide more details to help @xvyaward reproduce the experiments? Thanks!

pzs19 (Contributor) commented May 24, 2024

Hi @xvyaward, thanks for your interest and the very detailed description.

  1. Could you please share more information on how you run the Mistral model for inference? The sampling and evaluation strategies can affect the overall results, for example the temperature used in sampling and whether the answer is truncated when "\n" appears.

  2. The reason the word_embedding is larger than the original one is that we add special tokens for words that need to be forcibly retained. For example, "llmlingua" may be tokenized into "llm" and "lingua"; if we want to always keep "llmlingua", we need to replace it with a new token before running the tokenizer. We did not add these additional tokens during training.

  3. Thank you for pointing this out; we will fix it soon.
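To make point 2 concrete, the replacement works roughly like this (a simplified illustration; the placeholder token name is made up and this is not the actual prompt_compressor.py code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

force_word = "llmlingua"
placeholder = "[FORCE_0]"  # made-up name for one of the added special tokens
tokenizer.add_special_tokens({"additional_special_tokens": [placeholder]})

text = "llmlingua compresses prompts"
print(tokenizer.tokenize(text))                                   # "llmlingua" is split into sub-word pieces
print(tokenizer.tokenize(text.replace(force_word, placeholder)))  # the placeholder survives as a single token
```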

Hope these explanations can help you.

dingjingzhen commented
> Hi @xvyaward, thanks for your interest and the very detailed description.
>
> 1. Could you please share more information on how you run the Mistral model for inference? The sampling and evaluation strategies can affect the overall results, for example the temperature used in sampling and whether the answer is truncated when "\n" appears.
> 2. The reason the word_embedding is larger than the original one is that we add special tokens for words that need to be forcibly retained. For example, "llmlingua" may be tokenized into "llm" and "lingua"; if we want to always keep "llmlingua", we need to replace it with a new token before running the tokenizer. We did not add these additional tokens during training.
> 3. Thank you for pointing this out; we will fix it soon.
>
> Hope these explanations can help you.

Is there a standard full training script available? We would also like to train a compressor ourselves, including the word_embedding handling mentioned above.

pzs19 (Contributor) commented May 30, 2024

> Is there a standard full training script available? We would also like to train a compressor ourselves, including the word_embedding handling mentioned above.

Yes! We have provided the experiment code for LLMLingua-2 in ./experiments/llmlingua2. The training data for the compressor is also available on HuggingFace.

You can run ./experiments/llmlingua2/data_collection/collect_data.sh first, which assigns word labels to the original data and filters out bad samples. Then use the train.sh script in ./experiments/llmlingua2/model_training to train the compressor. You may need to modify the training code to include special tokens during training.
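For example, the data can be loaded with the datasets library (a minimal sketch; substitute the exact dataset ID from the HuggingFace page):

```python
from datasets import load_dataset

# Dataset ID is illustrative; the dataset pairs original MeetingBank transcripts with their
# LLM-compressed versions, from which collect_data.sh derives the word-level keep/drop labels.
data = load_dataset("microsoft/MeetingBank-LLMCompressed")
print(data)
```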

xvyaward (Author) commented
Hi @pzs19, thank you for your kind reply.

  1. I modified eval_meetingbank_qa.py to use the vLLM version of the Mistral model. Here is my code for the generation part:
```python
from vllm import SamplingParams

# `model` is the vllm.LLM instance wrapping Mistral-7B that the script loads earlier.
terminators = [model.get_tokenizer().eos_token_id]

sampling_params = SamplingParams(
    max_tokens=args.n_max_token_ans,
    stop_token_ids=terminators,
    temperature=0.0,  # greedy decoding, following the paper
    top_p=1.0,
)

response = model.generate(query, sampling_params=sampling_params)
```

I used temperature=0.0 and top_p=1.0 following the paper, and I believe answers are truncated at "\n" during evaluation by experiments/llmlingua2/evaluation/metrics.py (see the snippet at the end of this comment for what I mean).

However, I still can't reproduce the scores of the official llmlingua-2-xlm-roberta-large-meetingbank model. The in-domain MeetingBank QA score in particular drops significantly, from 73.59 to 68.95.

  2. In your first answer, you mentioned "We did not add these additional tokens during training."
    However, in your answer to dingjingzhen you also suggested "You may need to modify the training code to include special tokens during training."

So which of the two is correct for reproducing the score of llmlingua-2-xlm-roberta-large-meetingbank using the official MeetingBank-LLMCompressed dataset? And if possible, could you share example code that handles special tokens during training?

Thank you.
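For completeness, this is the kind of "\n" truncation I am referring to in point 1 (a rough sketch of the behaviour, not the actual metrics.py code):

```python
def truncate_at_newline(prediction: str) -> str:
    # Keep only the first non-empty line of the model's answer before scoring.
    prediction = prediction.lstrip("\n")
    return prediction.split("\n")[0].strip()

print(truncate_at_newline("\nSeattle City Council\nAdditional commentary..."))  # -> "Seattle City Council"
```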

xvyaward reopened this May 30, 2024
pzs19 (Contributor) commented May 30, 2024

Hi @xvyaward, sorry for the misunderstanding.

In my last response, I meant that if you want to add special tokens during training, you need to modify our training code yourself. In our experiments, special tokens were not added during training.
