
Commit ce5f857: iclr code update
Parent: 26c7b35

8 files changed: +174 -449 lines

README.md

Lines changed: 9 additions & 22 deletions

````diff
@@ -67,43 +67,30 @@ cd LLM_scoring && bash scoring_api.sh
 ---
 
 ### 🧩 Step 2. Score curation
-The score curation codebase is from [Docta](https://github.com/Docta-ai/docta) in the `./score_curation` path. You can execute the score curation by running
+One can execute the score curation by running
 ```
 cd score_curation && bash diagnose.sh
 ```
-The corresponding curation report files could be found in the path `./score_curation/results`.
+The corresponding curation report files can be found in the path `score_curation_results/`.
 
 
 ---
 
 ### 🧩 Step 3. Data selection
-Given the existing score curation reports, you can directly use the following jupyter notebooks to do data selection including all baselines: `data_generation.ipynb`. The generated subsets can be further used for LLM instruction tuning. Other selected datasets used for ablation study can also be generated from the following jupyter notebooks located in the `./score_curation` path: `data_gen_score_curation.ipynb` and `data_gen_data_scale.ipynb`. In particular, we use `data_gen_score_curation.ipynb` to generate subsets after curating machine-generated raw scores.
-
+Given the existing score curation reports, one can directly generate the high-quality subset by
+```
+python subset_generation.py
+```
+The generated subsets can be further used for the following LLM instruction tuning.
 
 
 ---
 ### 🧩 Step 4. Finetune & Evaluation
-Given the selected subsets in the path `model_finetune/selected_data/`, you can use the code base from [TULU](https://github.com/allenai/open-instruct) to finetune base models (Mistral or LLaMA) and then do evaluation.
-In particular, you can submit the jobs via launcher under the path `model_finetune/`. For example, you can submit the job by running the code
-```
-cd model_finetune/ && launcher run job_pipeline_all.yaml
-```
-
-
-Furthermore, we can also execute the code locally, e.g.,
+Given the selected subsets in the path `selected_data/`, one can use the code base from [TULU](https://github.com/allenai/open-instruct) to finetune base models (Mistral or LLaMA) and then do evaluation. Here, for convenience, one can also finetune the model by
 ```
-cd model_finetune/ && bash run_pipeline_all.sh
+cd model_finetune/ && bash run_pipeline.sh
 ```
 
-One can present the final result by running
-```
-python model_finetune/read_results.py
-```
-
-------
-
-## Final results
-The final results of LLM judging compared with human-annotated dataset LIMA can be found in `lima_compare_plot.ipynb`. Moreover, for the tabular results, you can check the `reading_results.ipynb` jupyter notebook.
 
 ------
 
````
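Since `subset_generation.py` replaces the earlier notebooks as the single entry point for data selection, it helps to see the shape of that step. Below is a minimal, hypothetical Python sketch of score-based top-k selection, not the repository's actual implementation: the report filename, the `score` field, and the flat output path are assumptions for illustration (in the pipeline the subset lives under `selected_data/<rating_model>/<raw_dataset>/<data_type>_dataset.json`).

```python
import json

def select_top_k(report_path: str, out_path: str, k: int = 10_000) -> None:
    """Keep the k highest-scoring samples from a score curation report."""
    with open(report_path, "r") as f:
        samples = json.load(f)  # assumed layout: list of {"text": ..., "score": ...}
    # Rank by curated quality score, highest first, and keep the top k.
    ranked = sorted(samples, key=lambda s: s["score"], reverse=True)
    with open(out_path, "w") as f:
        json.dump(ranked[:k], f, indent=2)

# A 10k subset named to match the data_type tag used by run_pipeline.sh.
select_top_k("score_curation_results/report.json",
             "selected_data/ds2_10k_dataset.json")
```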

model_finetune/read_results.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -6,10 +6,10 @@
 
 def main(
     root_result_path = 'results',
-    train_dataset='all_train',
+    raw_dataset='tulu_300k',
     base_model = "meta-llama/Meta-Llama-3.1-8B",
     rating_model='mistralai/Mistral-7B-Instruct-v0.3',
-    baseline_tag = 'filtered',
+    baseline_tag = 'ds2_10k',
 ):
 
     all_results = {}
@@ -20,7 +20,7 @@ def main(
     for tag in baseline_tags:
         baseline_results = {}
         for eval_dataset in eval_dataset_lists:
-            path = root_result_path + f'/{rating_model}/{train_dataset}/{eval_dataset}/{base_model}/{tag}/metrics.json'
+            path = root_result_path + f'/{rating_model}/{raw_dataset}/{eval_dataset}/{base_model}/{tag}/metrics.json'
             try:
                 with open(path, 'r') as f:
                     json_file = json.load(f)
```
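The parameter rename from `train_dataset` to `raw_dataset` (with new defaults `tulu_300k` and `ds2_10k`) changes the directory that `read_results.py` probes for metrics. A minimal sketch of the resulting lookup, mirroring the f-string in the hunk above; the evaluation dataset names are inferred from `run_pipeline.sh`, and the surrounding aggregation logic is elided in this diff:

```python
import json

root_result_path = "results"
rating_model = "mistralai/Mistral-7B-Instruct-v0.3"
raw_dataset = "tulu_300k"
base_model = "meta-llama/Meta-Llama-3.1-8B"
tag = "ds2_10k"

for eval_dataset in ("mmlu", "gsm", "bbh", "truthfulqa", "tydiqa"):
    # Same directory layout as the f-string in the diff above.
    path = f"{root_result_path}/{rating_model}/{raw_dataset}/{eval_dataset}/{base_model}/{tag}/metrics.json"
    try:
        with open(path, "r") as f:
            print(eval_dataset, json.load(f))
    except FileNotFoundError:
        print(eval_dataset, "no metrics at", path)
```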

model_finetune/run_pipeline.sh

Lines changed: 20 additions & 21 deletions

```diff
@@ -3,22 +3,21 @@ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 NUM_GPUS=8
 SEED=42
 
-TRAIN_DATASET_LIST=('flan_v2' 'oasst1' 'wizardlm' 'dolly' 'stanford_alpaca' 'all_train') # full data list
-
+RAW_DATASET_LIST=('tulu_300k') # data source
 rating_model="meta-llama/Meta-Llama-3.1-8B-Instruct" #"gpt-4o-mini" 'mistralai/Mistral-7B-Instruct-v0.3'
 
 declare -A base_models
 base_models["meta-llama/Meta-Llama-3.1-8B"]="128 1 2048" # TOTAL_BATCH_SIZE BATCH_SIZE_PER_GPU max_seq_length
-# data types represent the generated subsets by baselines
-data_types=('completion' 'perplexity' 'knn' 'less' 'full' 'random' 'label-filtered' 'diversity-filtered' 'filtered')
 
+# data types represent the generated subsets by baselines
+data_types=('ds2_10k')
 
 
 #############################################################
 ######## model finetuning on selected training data #########
 #############################################################
 
-cluster_root_path="output"
+cluster_root_path="../model_output"
 mkdir -p $cluster_root_path
 
 for base_model in "${!base_models[@]}"
@@ -29,7 +28,7 @@ do
     max_seq_length=${params[2]}
 
 
-    for train_dataset_name in "${TRAIN_DATASET_LIST[@]}"
+    for raw_dataset_name in "${RAW_DATASET_LIST[@]}"
     do
 
         for data_type in "${data_types[@]}"
@@ -41,7 +40,7 @@ do
            fi
 
            mkdir -p $cluster_root_path/models/
-           train_data="selected_data/${rating_model}/${train_dataset_name}/${data_type}_dataset.json"
+           train_data="../selected_data/${rating_model}/${raw_dataset_name}/${data_type}_dataset.json"
 
            GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_GPUS/$BATCH_SIZE_PER_GPU))
            echo "Training ${base_model} using $NUM_GPUS GPUs, $BATCH_SIZE_PER_GPU batch size per GPU, $GRADIENT_ACC_STEPS gradient accumulation steps"
```
```diff
@@ -72,20 +71,20 @@ do
                --warmup_ratio 0.03 \
                --weight_decay 0. \
                --num_train_epochs 5 \
-               --output_dir $cluster_root_path/models/${rating_model}/${train_dataset_name}/${base_model}/lora_${data_type}/ \
+               --output_dir $cluster_root_path/models/${rating_model}/${raw_dataset_name}/${base_model}/lora_${data_type}/ \
                --with_tracking \
                --report_to tensorboard \
                --logging_steps 1
 
            python merge_lora.py \
                --base_model_name_or_path $base_model \
-               --lora_model_name_or_path $cluster_root_path/models/${rating_model}/${train_dataset_name}/${base_model}/lora_${data_type}/ \
-               --output_dir $cluster_root_path/models/${rating_model}/${train_dataset_name}/${base_model}/lora_merged_${data_type}/ \
+               --lora_model_name_or_path $cluster_root_path/models/${rating_model}/${raw_dataset_name}/${base_model}/lora_${data_type}/ \
+               --output_dir $cluster_root_path/models/${rating_model}/${raw_dataset_name}/${base_model}/lora_merged_${data_type}/ \
                --save_tokenizer
 
            sleep 10s
 
-           rm -rf $cluster_root_path/models/${rating_model}/${train_dataset_name}/${base_model}/lora_${data_type}
+           rm -rf $cluster_root_path/models/${rating_model}/${raw_dataset_name}/${base_model}/lora_${data_type}
 
        done
    done
```
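The hunk above trains a LoRA adapter, merges it into the base model via `merge_lora.py`, and then deletes the unmerged adapter to reclaim disk space. `merge_lora.py` itself is not shown in this commit; below is a minimal sketch of what such a merge typically looks like with HuggingFace `peft`. The paths and the use of `peft` are assumptions for illustration, not the repository's code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Meta-Llama-3.1-8B"
lora_dir = "../model_output/models/lora_ds2_10k"           # hypothetical adapter path
merged_dir = "../model_output/models/lora_merged_ds2_10k"  # hypothetical output path

# Load the base model, apply the LoRA adapter, and fold its weights in.
base = AutoModelForCausalLM.from_pretrained(base_name)
merged = PeftModel.from_pretrained(base, lora_dir).merge_and_unload()
merged.save_pretrained(merged_dir)

# Equivalent of the script's --save_tokenizer flag.
AutoTokenizer.from_pretrained(base_name).save_pretrained(merged_dir)
```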
```diff
@@ -102,11 +101,11 @@ echo "starting evaluating finetuned models..."
 
 for base_model in "${!base_models[@]}"; do
 
-    for train_dataset_name in "${TRAIN_DATASET_LIST[@]}"; do
+    for raw_dataset_name in "${RAW_DATASET_LIST[@]}"; do
 
        for data_type in "${data_types[@]}"; do
 
-           model_name_or_path=$cluster_root_path/models/${rating_model}/${train_dataset_name}/${base_model}/lora_merged_${data_type}
+           model_name_or_path=$cluster_root_path/models/${rating_model}/${raw_dataset_name}/${base_model}/lora_merged_${data_type}
 
            if [[ $data_type == "base" ]]; then
                echo "base model evaluation"
@@ -117,7 +116,7 @@ for base_model in "${!base_models[@]}"; do
 
            #### MMLU: factual knowledge
            eval_dataset_name='mmlu'
-           local_save_dir=${cluster_root_path}/results/${rating_model}/${train_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
+           local_save_dir=${cluster_root_path}/results/${rating_model}/${raw_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
 
            CUDA_VISIBLE_DEVICES=0 python -m eval.mmlu.run_eval \
                --ntrain 0 \
@@ -129,7 +128,7 @@ for base_model in "${!base_models[@]}"; do
 
            ##### GSM8k: reasoning
            eval_dataset_name='gsm'
-           local_save_dir=${cluster_root_path}/results/${rating_model}/${train_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
+           local_save_dir=${cluster_root_path}/results/${rating_model}/${raw_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
 
            CUDA_VISIBLE_DEVICES=1 python -m eval.gsm.run_eval \
                --data_dir raw_data/eval/gsm/ \
@@ -142,7 +141,7 @@ for base_model in "${!base_models[@]}"; do
 
            ###### BBH: reasoning
            eval_dataset_name='bbh'
-           local_save_dir=${cluster_root_path}/results/${rating_model}/${train_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
+           local_save_dir=${cluster_root_path}/results/${rating_model}/${raw_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
 
            CUDA_VISIBLE_DEVICES=2 python -m eval.bbh.run_eval \
                --data_dir raw_data/eval/bbh \
@@ -154,7 +153,7 @@ for base_model in "${!base_models[@]}"; do
 
            ##### truthfulness
            eval_dataset_name='truthfulqa'
-           local_save_dir=${cluster_root_path}/results/${rating_model}/${train_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
+           local_save_dir=${cluster_root_path}/results/${rating_model}/${raw_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
 
            CUDA_VISIBLE_DEVICES=3 python -m eval.truthfulqa.run_eval \
                --data_dir raw_data/eval/truthfulqa \
@@ -171,7 +170,7 @@ for base_model in "${!base_models[@]}"; do
 
            ###### multilinguality
            eval_dataset_name='tydiqa'
-           local_save_dir=${cluster_root_path}/results/${rating_model}/${train_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
+           local_save_dir=${cluster_root_path}/results/${rating_model}/${raw_dataset_name}/${eval_dataset_name}/${base_model}/$data_type
 
            CUDA_VISIBLE_DEVICES=4 python -m eval.tydiqa.run_eval \
                --data_dir raw_data/eval/tydiqa/ \
@@ -194,15 +193,15 @@ done
 sleep 10s
 
 for base_model in "${!base_models[@]}"; do
-    for train_dataset_name in "${TRAIN_DATASET_LIST[@]}"; do
+    for raw_dataset_name in "${RAW_DATASET_LIST[@]}"; do
 
        for data_type in "${data_types[@]}"; do
            echo "*** Processing rating model:: ${rating_model} ***"
            echo "*** Processing Base model:: ${base_model} ***"
-           echo "*** Processing training dataset:: ${train_dataset_name} ***"
+           echo "*** Processing training dataset:: ${raw_dataset_name} ***"
            echo "*** Processing data type:: ${data_type} ***"
 
-           python3 read_results.py --root_result_path "${cluster_root_path}/results" --train_dataset $train_dataset_name --base_model $base_model --rating_model $rating_model --baseline_tag $data_type
+           python3 read_results.py --root_result_path "${cluster_root_path}/results" --raw_dataset $raw_dataset_name --base_model $base_model --rating_model $rating_model --baseline_tag $data_type
 
        done
 
```