
Commit f87d6a0 ("init")

Author: daven

1 parent: 5a43aa6

18 files changed (+261,014 additions, -2 deletions)

README.md

Lines changed: 110 additions & 2 deletions
@@ -1,2 +1,110 @@
-# K2
-Code and datasets for paper "Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization"
<div style="text-align:center">
<img src="https://big-cheng.com/k2/k2.png" alt="k2-logo" width="200"/>
<h2>🏔️ Large Language Model for Geoscience</h2>
</div>

Code and data for the paper ***"Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization"***
## Introduction

We introduce **K2** (7B), an open-source language model built by first further pretraining LLaMA on a collected and cleaned geoscience corpus (open-access geoscience papers and Wikipedia pages), and then fine-tuning it with knowledge-intensive instruction-tuning data (GeoSignal). We run a preliminary evaluation on GeoBenchmark, which consists of NPEE and AP Test questions in geology, geography, and environmental science. Compared with several baseline models of similar size, K2 outperforms the baselines on both the objective and the subjective tasks.

In this repository, we share the following code and data:
- We release the K2 weights in two parts (one can add our delta to the original LLaMA weights, then load the adapter with `peft` and `transformers` to obtain the full K2 model; see the sketch after this list).
  - Delta weights after further pretraining on the geoscience text corpus, released this way to comply with the LLaMA model license.
  - Adapter model weights trained with PEFT (LoRA).
- We release the core data of GeoSignal under the constraints of DDE. If you want the full version of GeoSignal, you can [email](mailto:[email protected]) the authors for further cooperation.
- We release GeoBenchmark, the first-ever benchmark for evaluating the capability of LLMs in geoscience.
- We release the code for further pretraining and instruction tuning of K2.
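Since the weights ship as a pretraining delta plus a LoRA adapter, the sketch below shows one way to assemble a working K2 model. It is a minimal, hedged example rather than the authors' inference code: it assumes the delta has already been merged into LLaMA-7B (for example with `apply_delta.py` from this commit) at a placeholder local path, and uses the adapter name linked later in this README.

```python
# Minimal sketch (not the official inference script). Assumptions: the further-
# pretrained base already exists locally after applying daven3/k2_fp_delta to
# LLaMA-7B, and the instruction-tuning adapter is the one linked in this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/path/to/k2_fp"            # placeholder: LLaMA-7B merged with the delta
adapter_path = "daven3/k2_it_adapter"   # LoRA adapter released on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(base_path, use_fast=False)
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_path)  # attach the adapter weights
model.eval()

prompt = "What is the main cause of earthquakes?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```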
***The following is an overview of the training of K2:***

![overview](https://big-cheng.com/k2/overview.png)

## Data

### Further Pretraining

Our text corpus for further pretraining LLaMA-7B consists of 3.9 billion tokens from geoscience papers published in selected high-quality earth-science journals, mainly collected by [GAKG](https://gakg.acemap.info/).

**Delta Model on [Huggingface](https://huggingface.co/): [daven3/k2_fp_delta](https://huggingface.co/daven3/k2_fp_delta)**
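If you want the delta weights on disk before merging, the snippet below is an optional, hedged convenience using `huggingface_hub` (which `apply_delta.py` in this commit already imports); `apply_delta.py` can also resolve the Hugging Face repo id directly.

```python
# Optional: download the released delta weights into the local cache.
from huggingface_hub import snapshot_download

delta_dir = snapshot_download(repo_id="daven3/k2_fp_delta")
print(f"Delta weights downloaded to {delta_dir}")
```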
### Instruction Tuning: GeoSignal

Scientific domain adaptation involves two main steps during instruction tuning:

- Instruction tuning with general instruction-tuning data; here we use Alpaca-GPT4.
- Instruction tuning with restructured domain knowledge, which we call expertise instruction tuning. For K2, we use the knowledge-intensive instruction data GeoSignal (an illustrative record is sketched below).
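For intuition only, here is what a single knowledge-intensive record might look like, assuming GeoSignal follows the same instruction/input/output schema as the Alpaca-style JSON consumed by `finetune.py`; the field names and the example itself are illustrative, not drawn from the released data.

```python
# Hypothetical GeoSignal-style record; the real schema may differ.
example = {
    "instruction": "Explain the difference between intrusive and extrusive igneous rocks.",
    "input": "",
    "output": (
        "Intrusive igneous rocks crystallize slowly beneath the surface and are coarse-grained, "
        "while extrusive rocks cool rapidly at the surface and are fine-grained."
    ),
}
```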
***The following illustrates our recipe for training a domain-specific language model:***

![recipe](https://big-cheng.com/k2/recipe.png)

- **Adapter Model on [Huggingface](https://huggingface.co/): [daven3/k2_it_adapter](https://huggingface.co/daven3/k2_it_adapter)**
- **Dataset on [Huggingface](https://huggingface.co/): [geosignal](https://huggingface.co/datasets/daven3/geosignal)**
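A small sketch of pulling the instruction data from the Hub, assuming the `datasets` library is installed; the split and column names are whatever the dataset card defines, so we only load and inspect.

```python
# Load the released GeoSignal data and inspect its splits and columns.
from datasets import load_dataset

geosignal = load_dataset("daven3/geosignal")
print(geosignal)
```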
### Benchmark: GeoBenchmark

For the objective tasks in GeoBenchmark, we collect 183 multiple-choice questions from NPEE and 1,395 in total from the AP Test. For the subjective tasks, we gather all 939 subjective questions from NPEE and use 50 of them to measure the baselines with human evaluation.

- **Dataset on [Huggingface](https://huggingface.co/): [geobenchmark](https://huggingface.co/datasets/daven3/geobenchmark)**
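The benchmark can be fetched the same way; again a hedged sketch, since the split layout is defined by the dataset card rather than this README.

```python
# Load GeoBenchmark and report a rough item count per split.
from datasets import load_dataset

geobench = load_dataset("daven3/geobenchmark")
for split, ds in geobench.items():
    print(split, len(ds))
```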
## Code

### Further Pretraining

The training script is **`run_clm.py`**:

```bash
deepspeed --num_gpus=4 run_clm.py --deepspeed ds_config_zero3.json >log 2>&1 &
```
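The command above expects a DeepSpeed ZeRO-3 configuration file (`ds_config_zero3.json`). The snippet below only writes a generic, minimal stage-3 config as a stand-in, with `auto` values left for the Hugging Face Trainer integration to fill in; it may differ from the file actually used in the repository.

```python
# Hedged stand-in for ds_config_zero3.json; the repo's actual config may differ.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```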
### Instruction Tuning

The training script is **`finetune.py`**.

- For the first step: alignment with humans

```bash
python finetune.py --base_model /path/to/checkpoint-30140 --data_path /path/to/alpaca.json --output_dir /path/to/stage/one/model/ --cuda_id 2 --lora_target_modules "q_proj" "k_proj" "v_proj"
```

- For the second step: alignment with experts

```bash
python finetune.py --base_model /path/to/checkpoint-30140 --data_path /path/to/geosignal.json --output_dir /path/to/stage/two/model/ --cuda_id 2 --lora_target_modules "q_proj" "k_proj" "v_proj" --resume_from_checkpoint /path/to/stage/one/model/
```
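After the second stage, one might merge the LoRA adapter into the base weights for easier deployment. This is a hedged sketch using `peft`'s `merge_and_unload`, not a script provided by this repo; the paths reuse the placeholders from the commands above.

```python
# Optional post-training step: fold the stage-two adapter into the base model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("/path/to/checkpoint-30140")
merged = PeftModel.from_pretrained(base, "/path/to/stage/two/model/").merge_and_unload()
merged.save_pretrained("/path/to/k2-merged")  # hypothetical output directory
```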
## Why the name K2?

K2 takes its name from the second-highest mountain in the world, reflecting our belief that larger and more powerful geoscience language models will be created in the future. Moreover, in training a model to shift to a discipline with a high domain barrier, we encountered many difficulties *(collecting the corpus, cleaning academic data, computing power, ...)*, which echoes the fact that climbing K2 is no less difficult than climbing Mount Everest 🏔️.
## Contributors

This project was founded by Acemap at Shanghai Jiao Tong University, including [Cheng Deng](https://github.com/davendw49), [Tianhang Zhang](https://github.com/zthang), [Zhongmou He](https://github.com/twelfth-star), [Qiyuan Chen](), [Yuanyuan Shi](), and [Le Zhou](), supervised by Weinan Zhang, Luoyi Fu, Zhouhan Lin, Junxian He, and Xinbing Wang. The whole project is supported by Chenghu Zhou and the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, and the [Deep-time Digital Earth Big Science Project](https://www.iugs.org/dde).
## Acknowledgements

K2 has referred to the following open-source projects. We would like to express our gratitude and respect to their researchers.

- Facebook LLaMA: https://github.com/facebookresearch/llama
- Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
- alpaca-lora by @tloen: https://github.com/tloen/alpaca-lora

K2 is supported by the [Deep-time Digital Earth Big Science Project](https://www.iugs.org/dde).
## TO-DO

- [ ] Release the full version of GeoSignal.
- [ ] Release the evaluation code for GeoBenchmark.
- [ ] A series of applications built on K2.

## License

K2 is a research preview intended for non-commercial use only, subject to the model license of LLaMA and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations. The code is released under the Apache License 2.0.
## Citation

If you use the code or data of **K2**, please cite:

```
@misc{deng2023k2,
      title={Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization},
      author={Cheng Deng and Tianhang Zhang and Zhongmou He and Qiyuan Chen and Yuanyuan Shi and Le Zhou and Luoyi Fu and Weinan Zhang and Xinbing Wang and Chenghu Zhou and Zhouhan Lin and Junxian He},
      year={2023}
}
```

apply_delta.py

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
"""
Apply the delta weights on top of a base model.

Usage:
python3 apply_delta.py --base-model-path /path/to/llama-7b --target-model-path /path/to/k2_fp --delta-path daven3/k2_fp_delta
"""
import argparse
import gc
import glob
import json
import os
import shutil
import tempfile

from huggingface_hub import snapshot_download
import torch
from torch import nn
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig


GB = 1 << 30


def split_files(model_path, tmp_path, split_size):
    # Split large checkpoint shards into pieces no larger than split_size bytes.
    if not os.path.exists(model_path):
        model_path = snapshot_download(repo_id=model_path)
    if not os.path.exists(tmp_path):
        os.makedirs(tmp_path)

    file_pattern = os.path.join(model_path, "pytorch_model-*.bin")
    files = glob.glob(file_pattern)

    part = 0
    try:
        for file_path in tqdm(files):
            state_dict = torch.load(file_path)
            new_state_dict = {}

            current_size = 0
            for name, param in state_dict.items():
                param_size = param.numel() * param.element_size()

                if current_size + param_size > split_size:
                    new_file_name = f"pytorch_model-{part}.bin"
                    new_file_path = os.path.join(tmp_path, new_file_name)
                    torch.save(new_state_dict, new_file_path)
                    current_size = 0
                    new_state_dict = None
                    gc.collect()
                    new_state_dict = {}
                    part += 1

                new_state_dict[name] = param
                current_size += param_size

            new_file_name = f"pytorch_model-{part}.bin"
            new_file_path = os.path.join(tmp_path, new_file_name)
            torch.save(new_state_dict, new_file_path)
            new_state_dict = None
            gc.collect()
            new_state_dict = {}
            part += 1
    except Exception as e:
        print(f"An error occurred during split_files: {e}")
        shutil.rmtree(tmp_path)
        raise


def apply_delta_low_cpu_mem(base_model_path, target_model_path, delta_path):
    # Apply the delta shard-by-shard, using disk as swap to keep CPU memory low.
    delta_tokenizer = AutoTokenizer.from_pretrained(delta_path, use_fast=False)
    delta_config = AutoConfig.from_pretrained(delta_path)

    if os.path.exists(target_model_path):
        shutil.rmtree(target_model_path)
    os.makedirs(target_model_path)

    split_size = 4 * GB

    with tempfile.TemporaryDirectory() as tmp_base_path, tempfile.TemporaryDirectory() as tmp_delta_path:
        print(f"Split files for the base model to {tmp_base_path}")
        split_files(base_model_path, tmp_base_path, split_size)
        print(f"Split files for the delta weights to {tmp_delta_path}")
        split_files(delta_path, tmp_delta_path, split_size)

        base_pattern = os.path.join(tmp_base_path, "pytorch_model-*.bin")
        base_files = glob.glob(base_pattern)
        delta_pattern = os.path.join(tmp_delta_path, "pytorch_model-*.bin")
        delta_files = glob.glob(delta_pattern)
        delta_state_dict = torch.load(delta_files[0])

        print("Applying the delta")
        weight_map = {}
        total_size = 0

        for i, base_file in tqdm(enumerate(base_files)):
            state_dict = torch.load(base_file)
            file_name = f"pytorch_model-{i}.bin"
            for name, param in state_dict.items():
                if name not in delta_state_dict:
                    # Search the other delta shards for this parameter.
                    for delta_file in delta_files:
                        delta_state_dict = torch.load(delta_file)
                        gc.collect()
                        if name in delta_state_dict:
                            break

                state_dict[name] += delta_state_dict[name]
                weight_map[name] = file_name
                total_size += param.numel() * param.element_size()
                gc.collect()
            torch.save(state_dict, os.path.join(target_model_path, file_name))

    with open(
        os.path.join(target_model_path, "pytorch_model.bin.index.json"), "w"
    ) as f:
        json.dump(
            {"weight_map": weight_map, "metadata": {"total_size": total_size}}, f
        )

    print(f"Saving the target model to {target_model_path}")
    delta_tokenizer.save_pretrained(target_model_path)
    delta_config.save_pretrained(target_model_path)


def apply_delta(base_model_path, target_model_path, delta_path):
    # Load base and delta fully into memory, add them, and save the result.
    print(f"Loading the delta weights from {delta_path}")
    delta_tokenizer = AutoTokenizer.from_pretrained(delta_path, use_fast=False)
    delta = AutoModelForCausalLM.from_pretrained(
        delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
    )

    print(f"Loading the base model from {base_model_path}")
    base = AutoModelForCausalLM.from_pretrained(
        base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
    )

    print("Applying the delta")
    for name, param in tqdm(base.state_dict().items(), desc="Applying delta"):
        assert name in delta.state_dict()
        param.data += delta.state_dict()[name]

    print(f"Saving the target model to {target_model_path}")
    base.save_pretrained(target_model_path)
    delta_tokenizer.save_pretrained(target_model_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-model-path", type=str, required=True)
    parser.add_argument("--target-model-path", type=str, required=True)
    parser.add_argument("--delta-path", type=str, required=True)
    parser.add_argument(
        "--low-cpu-mem",
        action="store_true",
        help="Lower the cpu memory usage. This will split large files and use "
        "disk as swap to reduce the memory usage below 10GB.",
    )
    args = parser.parse_args()

    if args.low_cpu_mem:
        apply_delta_low_cpu_mem(
            args.base_model_path, args.target_model_path, args.delta_path
        )
    else:
        apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
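For completeness, a hedged example of calling the functions above from Python rather than the command line; the local paths are placeholders and the delta repo id is the one listed in the README.

```python
# Hedged usage sketch for apply_delta.py; run from the repository root.
from apply_delta import apply_delta

apply_delta(
    base_model_path="/path/to/llama-7b",  # original LLaMA-7B weights in Hugging Face format
    target_model_path="/path/to/k2_fp",   # output directory for the merged, further-pretrained model
    delta_path="daven3/k2_fp_delta",      # delta weights from the README
)
```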
