
Commit f87d6a0 ("init")

Author: daven

1 parent: 5a43aa6

18 files changed (+261,014 additions, -2 deletions)

README.md

Lines changed: 110 additions & 2 deletions
@@ -1,2 +1,110 @@
-# K2
-Code and datasets for paper "Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization"
<div style="text-align:center">
<img src="https://big-cheng.com/k2/k2.png" alt="k2-logo" width="200"/>
<h2>🏔️ Large Language Model for Geoscience</h2>
</div>

Code and data for the paper ***"Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization"***
## Introduction

We introduce **K2** (7B), an open-source language model built by first further pretraining LLaMA on a collected and cleaned geoscience corpus (open-access geoscience papers and Wikipedia pages), and then fine-tuning it with knowledge-intensive instruction-tuning data (GeoSignal). We run a preliminary evaluation on GeoBenchmark, which consists of NPEE and AP Test questions in geology, geography, and environmental science. Compared with several baseline models of similar size, K2 outperforms the baselines on both the objective and the subjective tasks.

In this repository, we share the following code and data:
- We release the K2 weights in two parts (one can add our delta to the original LLaMA weights, then load the adapter with `peft` and `transformers` to obtain the full K2 model; see the sketch after this list).
  - Delta weights after further pretraining on the geoscience text corpus, released this way to comply with the LLaMA model license.
  - Adapter model weights trained with PEFT (LoRA).
- We release the core data of GeoSignal under the constraints of DDE. If you want the full version of GeoSignal, you can [email](mailto:[email protected]) the authors for further cooperation.
- We release GeoBenchmark, the first-ever benchmark for evaluating the capability of LLMs in geoscience.
- We release the code for further pretraining and instruction tuning of K2.
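Since the weights ship as a pretraining delta plus a LoRA adapter, the sketch below shows one way to assemble a working K2 model. It is a minimal, hedged example rather than the authors' inference code: it assumes the delta has already been merged into LLaMA-7B (for example with `apply_delta.py` from this commit) at a placeholder local path, and uses the adapter name linked later in this README.

```python
# Minimal sketch (not the official inference script). Assumptions: the further-
# pretrained base already exists locally after applying daven3/k2_fp_delta to
# LLaMA-7B, and the instruction-tuning adapter is the one linked in this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/path/to/k2_fp"            # placeholder: LLaMA-7B merged with the delta
adapter_path = "daven3/k2_it_adapter"   # LoRA adapter released on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(base_path, use_fast=False)
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_path)  # attach the adapter weights
model.eval()

prompt = "What is the main cause of earthquakes?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```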
***The following is an overview of the training of K2:***

![overview](https://big-cheng.com/k2/overview.png)

## Data

### Further Pretraining

Our text corpus for further pretraining LLaMA-7B consists of 3.9 billion tokens from geoscience papers published in selected high-quality earth-science journals, mainly collected by [GAKG](https://gakg.acemap.info/).

**Delta Model on [Huggingface](https://huggingface.co/): [daven3/k2_fp_delta](https://huggingface.co/daven3/k2_fp_delta)**
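If you want the delta weights on disk before merging, the snippet below is an optional, hedged convenience using `huggingface_hub` (which `apply_delta.py` in this commit already imports); `apply_delta.py` can also resolve the Hugging Face repo id directly.

```python
# Optional: download the released delta weights into the local cache.
from huggingface_hub import snapshot_download

delta_dir = snapshot_download(repo_id="daven3/k2_fp_delta")
print(f"Delta weights downloaded to {delta_dir}")
```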
### Instruction Tuning: GeoSignal

Scientific domain adaptation involves two main steps during instruction tuning:

- Instruction tuning with general instruction-tuning data; here we use Alpaca-GPT4.
- Instruction tuning with restructured domain knowledge, which we call expertise instruction tuning. For K2, we use the knowledge-intensive instruction data GeoSignal (an illustrative record is sketched below).
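For intuition only, here is what a single knowledge-intensive record might look like, assuming GeoSignal follows the same instruction/input/output schema as the Alpaca-style JSON consumed by `finetune.py`; the field names and the example itself are illustrative, not drawn from the released data.

```python
# Hypothetical GeoSignal-style record; the real schema may differ.
example = {
    "instruction": "Explain the difference between intrusive and extrusive igneous rocks.",
    "input": "",
    "output": (
        "Intrusive igneous rocks crystallize slowly beneath the surface and are coarse-grained, "
        "while extrusive rocks cool rapidly at the surface and are fine-grained."
    ),
}
```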
***The following illustrates our recipe for training a domain-specific language model:***

![recipe](https://big-cheng.com/k2/recipe.png)

- **Adapter Model on [Huggingface](https://huggingface.co/): [daven3/k2_it_adapter](https://huggingface.co/daven3/k2_it_adapter)**
- **Dataset on [Huggingface](https://huggingface.co/): [geosignal](https://huggingface.co/datasets/daven3/geosignal)**
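A small sketch of pulling the instruction data from the Hub, assuming the `datasets` library is installed; the split and column names are whatever the dataset card defines, so we only load and inspect.

```python
# Load the released GeoSignal data and inspect its splits and columns.
from datasets import load_dataset

geosignal = load_dataset("daven3/geosignal")
print(geosignal)
```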
### Benchmark: GeoBenchmark

For the objective tasks in GeoBenchmark, we collect 183 multiple-choice questions from NPEE and 1,395 in total from the AP Test. For the subjective tasks, we gather all 939 subjective questions from NPEE and use 50 of them to measure the baselines with human evaluation.

- **Dataset on [Huggingface](https://huggingface.co/): [geobenchmark](https://huggingface.co/datasets/daven3/geobenchmark)**
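The benchmark can be fetched the same way; again a hedged sketch, since the split layout is defined by the dataset card rather than this README.

```python
# Load GeoBenchmark and report a rough item count per split.
from datasets import load_dataset

geobench = load_dataset("daven3/geobenchmark")
for split, ds in geobench.items():
    print(split, len(ds))
```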
## Code

### Further Pretraining

The training script is **`run_clm.py`**:

```bash
deepspeed --num_gpus=4 run_clm.py --deepspeed ds_config_zero3.json >log 2>&1 &
```
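The command above expects a DeepSpeed ZeRO-3 configuration file (`ds_config_zero3.json`). The snippet below only writes a generic, minimal stage-3 config as a stand-in, with `auto` values left for the Hugging Face Trainer integration to fill in; it may differ from the file actually used in the repository.

```python
# Hedged stand-in for ds_config_zero3.json; the repo's actual config may differ.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```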
### Instruction Tuning

The training script is **`finetune.py`**.

- For the first step: alignment with humans

```bash
python finetune.py --base_model /path/to/checkpoint-30140 --data_path /path/to/alpaca.json --output_dir /path/to/stage/one/model/ --cuda_id 2 --lora_target_modules "q_proj" "k_proj" "v_proj"
```

- For the second step: alignment with experts

```bash
python finetune.py --base_model /path/to/checkpoint-30140 --data_path /path/to/geosignal.json --output_dir /path/to/stage/two/model/ --cuda_id 2 --lora_target_modules "q_proj" "k_proj" "v_proj" --resume_from_checkpoint /path/to/stage/one/model/
```
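After the second stage, one might merge the LoRA adapter into the base weights for easier deployment. This is a hedged sketch using `peft`'s `merge_and_unload`, not a script provided by this repo; the paths reuse the placeholders from the commands above.

```python
# Optional post-training step: fold the stage-two adapter into the base model.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("/path/to/checkpoint-30140")
merged = PeftModel.from_pretrained(base, "/path/to/stage/two/model/").merge_and_unload()
merged.save_pretrained("/path/to/k2-merged")  # hypothetical output directory
```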
## Why the name K2?

K2 takes its name from the second-highest mountain in the world, reflecting our belief that larger and more powerful geoscience language models will be created in the future. Moreover, in training a model to shift to a discipline with a high domain barrier, we encountered many difficulties *(collecting the corpus, cleaning academic data, computing power, ...)*, which echoes the fact that climbing K2 is no less difficult than climbing Mount Everest 🏔️.
## Contributors

This project was founded by Acemap at Shanghai Jiao Tong University, including [Cheng Deng](https://github.com/davendw49), [Tianhang Zhang](https://github.com/zthang), [Zhongmou He](https://github.com/twelfth-star), [Qiyuan Chen](), [Yuanyuan Shi](), and [Le Zhou](), supervised by Weinan Zhang, Luoyi Fu, Zhouhan Lin, Junxian He, and Xinbing Wang. The whole project is supported by Chenghu Zhou and the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, and the [Deep-time Digital Earth Big Science Project](https://www.iugs.org/dde).
## Acknowledgements

K2 has referred to the following open-source projects. We would like to express our gratitude and respect to their researchers.

- Facebook LLaMA: https://github.com/facebookresearch/llama
- Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
- alpaca-lora by @tloen: https://github.com/tloen/alpaca-lora

K2 is supported by the [Deep-time Digital Earth Big Science Project](https://www.iugs.org/dde).
## TO-DO

- [ ] Release the full version of GeoSignal.
- [ ] Release the evaluation code for GeoBenchmark.
- [ ] A series of applications built on K2.

## License

K2 is a research preview intended for non-commercial use only, subject to the model license of LLaMA and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations. The code is released under the Apache License 2.0.
## Citation

If you use the code or data of **K2**, please cite:

```
@misc{deng2023k2,
      title={Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization},
      author={Cheng Deng and Tianhang Zhang and Zhongmou He and Qiyuan Chen and Yuanyuan Shi and Le Zhou and Luoyi Fu and Weinan Zhang and Xinbing Wang and Chenghu Zhou and Zhouhan Lin and Junxian He},
      year={2023}
}
```

apply_delta.py

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
"""
Apply the delta weights on top of a base model.

Usage:
python3 apply_delta.py --base-model-path /path/to/llama-7b --target-model-path /path/to/k2_fp --delta-path daven3/k2_fp_delta
"""
import argparse
import gc
import glob
import json
import os
import shutil
import tempfile

from huggingface_hub import snapshot_download
import torch
from torch import nn
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig


GB = 1 << 30


def split_files(model_path, tmp_path, split_size):
    # Split large checkpoint shards into pieces no larger than split_size bytes.
    if not os.path.exists(model_path):
        model_path = snapshot_download(repo_id=model_path)
    if not os.path.exists(tmp_path):
        os.makedirs(tmp_path)

    file_pattern = os.path.join(model_path, "pytorch_model-*.bin")
    files = glob.glob(file_pattern)

    part = 0
    try:
        for file_path in tqdm(files):
            state_dict = torch.load(file_path)
            new_state_dict = {}

            current_size = 0
            for name, param in state_dict.items():
                param_size = param.numel() * param.element_size()

                if current_size + param_size > split_size:
                    new_file_name = f"pytorch_model-{part}.bin"
                    new_file_path = os.path.join(tmp_path, new_file_name)
                    torch.save(new_state_dict, new_file_path)
                    current_size = 0
                    new_state_dict = None
                    gc.collect()
                    new_state_dict = {}
                    part += 1

                new_state_dict[name] = param
                current_size += param_size

            new_file_name = f"pytorch_model-{part}.bin"
            new_file_path = os.path.join(tmp_path, new_file_name)
            torch.save(new_state_dict, new_file_path)
            new_state_dict = None
            gc.collect()
            new_state_dict = {}
            part += 1
    except Exception as e:
        print(f"An error occurred during split_files: {e}")
        shutil.rmtree(tmp_path)
        raise


def apply_delta_low_cpu_mem(base_model_path, target_model_path, delta_path):
    # Apply the delta shard-by-shard, using disk as swap to keep CPU memory low.
    delta_tokenizer = AutoTokenizer.from_pretrained(delta_path, use_fast=False)
    delta_config = AutoConfig.from_pretrained(delta_path)

    if os.path.exists(target_model_path):
        shutil.rmtree(target_model_path)
    os.makedirs(target_model_path)

    split_size = 4 * GB

    with tempfile.TemporaryDirectory() as tmp_base_path, tempfile.TemporaryDirectory() as tmp_delta_path:
        print(f"Split files for the base model to {tmp_base_path}")
        split_files(base_model_path, tmp_base_path, split_size)
        print(f"Split files for the delta weights to {tmp_delta_path}")
        split_files(delta_path, tmp_delta_path, split_size)

        base_pattern = os.path.join(tmp_base_path, "pytorch_model-*.bin")
        base_files = glob.glob(base_pattern)
        delta_pattern = os.path.join(tmp_delta_path, "pytorch_model-*.bin")
        delta_files = glob.glob(delta_pattern)
        delta_state_dict = torch.load(delta_files[0])

        print("Applying the delta")
        weight_map = {}
        total_size = 0

        for i, base_file in tqdm(enumerate(base_files)):
            state_dict = torch.load(base_file)
            file_name = f"pytorch_model-{i}.bin"
            for name, param in state_dict.items():
                if name not in delta_state_dict:
                    # Search the other delta shards for this parameter.
                    for delta_file in delta_files:
                        delta_state_dict = torch.load(delta_file)
                        gc.collect()
                        if name in delta_state_dict:
                            break

                state_dict[name] += delta_state_dict[name]
                weight_map[name] = file_name
                total_size += param.numel() * param.element_size()
                gc.collect()
            torch.save(state_dict, os.path.join(target_model_path, file_name))

    with open(
        os.path.join(target_model_path, "pytorch_model.bin.index.json"), "w"
    ) as f:
        json.dump(
            {"weight_map": weight_map, "metadata": {"total_size": total_size}}, f
        )

    print(f"Saving the target model to {target_model_path}")
    delta_tokenizer.save_pretrained(target_model_path)
    delta_config.save_pretrained(target_model_path)


def apply_delta(base_model_path, target_model_path, delta_path):
    # Load base and delta fully into memory, add them, and save the result.
    print(f"Loading the delta weights from {delta_path}")
    delta_tokenizer = AutoTokenizer.from_pretrained(delta_path, use_fast=False)
    delta = AutoModelForCausalLM.from_pretrained(
        delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
    )

    print(f"Loading the base model from {base_model_path}")
    base = AutoModelForCausalLM.from_pretrained(
        base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
    )

    print("Applying the delta")
    for name, param in tqdm(base.state_dict().items(), desc="Applying delta"):
        assert name in delta.state_dict()
        param.data += delta.state_dict()[name]

    print(f"Saving the target model to {target_model_path}")
    base.save_pretrained(target_model_path)
    delta_tokenizer.save_pretrained(target_model_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-model-path", type=str, required=True)
    parser.add_argument("--target-model-path", type=str, required=True)
    parser.add_argument("--delta-path", type=str, required=True)
    parser.add_argument(
        "--low-cpu-mem",
        action="store_true",
        help="Lower the cpu memory usage. This will split large files and use "
        "disk as swap to reduce the memory usage below 10GB.",
    )
    args = parser.parse_args()

    if args.low_cpu_mem:
        apply_delta_low_cpu_mem(
            args.base_model_path, args.target_model_path, args.delta_path
        )
    else:
        apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
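For completeness, a hedged example of calling the functions above from Python rather than the command line; the local paths are placeholders and the delta repo id is the one listed in the README.

```python
# Hedged usage sketch for apply_delta.py; run from the repository root.
from apply_delta import apply_delta

apply_delta(
    base_model_path="/path/to/llama-7b",  # original LLaMA-7B weights in Hugging Face format
    target_model_path="/path/to/k2_fp",   # output directory for the merged, further-pretrained model
    delta_path="daven3/k2_fp_delta",      # delta weights from the README
)
```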
