[TorchAcc][Experimental] Integrate more models in torchacc #683

Merged May 22, 2024 (49 commits)

Commits:
e2d3b44  [TorchAcc] Integrate TorchAcc and provide a sft example of qwen-72b-c… (baoleai, Jan 25, 2024)
6c899c9  Enhance TorchAcc support for dynamic sequence. (#382) (baoleai, Feb 6, 2024)
1321592  [TorchAcc] Add support for save/load checkpoint. (#444) (baoleai, Feb 23, 2024)
dba0c65  baichuan_patch (Zhikaiiii, Feb 23, 2024)
ef6e2d6  patch baichuan (Feb 23, 2024)
432c070  modify baichuan (Zhikaiiii, Feb 26, 2024)
1d6f719  Merge branch 'torchacc' into torchacc2 (Zhikaiiii, Feb 26, 2024)
a5a6fdc  [TorchAcc] Fix batch split when padding_to is not None. (#480) (baoleai, Mar 3, 2024)
37c4787  Merge branch 'torchacc' of https://github.com/modelscope/swift into t… (Zhikaiiii, Mar 4, 2024)
32cc090  metric warmup calculate (Zhikaiiii, Mar 4, 2024)
64246d3  fix conflict (Zhikaiiii, Mar 4, 2024)
290c2dd  fix (Zhikaiiii, Mar 4, 2024)
7e6b197  model patch (Zhikaiiii, Mar 6, 2024)
f0c7c8a  add profiler (Zhikaiiii, Mar 15, 2024)
c8dbfc6  add yi (Zhikaiiii, Mar 28, 2024)
669f5d9  [TorchAcc] Integrate TorchAcc and provide a sft example of qwen-72b-c… (baoleai, Jan 25, 2024)
b140759  Enhance TorchAcc support for dynamic sequence. (#382) (baoleai, Feb 6, 2024)
6faa7b3  [TorchAcc] Add support for save/load checkpoint. (#444) (baoleai, Feb 23, 2024)
1c3a258  fix patch (baoleai, Apr 2, 2024)
160a9d5  fix lint (baoleai, Apr 2, 2024)
e0fe1d4  code clean (baoleai, Apr 2, 2024)
0bb1797  add argument:fsdp num (Zhikaiiii, Apr 8, 2024)
f03aa00  [TorchAcc] rebase master (Zhikaiiii, Apr 8, 2024)
661def1  [TorchAcc] Integrate TorchAcc and provide a sft example of qwen-72b-c… (baoleai, Jan 25, 2024)
da6c94a  Enhance TorchAcc support for dynamic sequence. (#382) (baoleai, Feb 6, 2024)
73a843a  [TorchAcc] Add support for save/load checkpoint. (#444) (baoleai, Feb 23, 2024)
ee012b1  fix patch (baoleai, Apr 2, 2024)
f1b19a6  fix lint (baoleai, Apr 2, 2024)
0457fa4  code clean (baoleai, Apr 2, 2024)
d10901f  fix comments (baoleai, Apr 8, 2024)
30ad8c8  rebase (baoleai, Apr 9, 2024)
cd6e799  clean code (Zhikaiiii, Apr 9, 2024)
4400ea5  Merge remote-tracking branch 'origin_balole/features/rebase_0401' int… (Zhikaiiii, Apr 9, 2024)
f92274c  clean code (Zhikaiiii, Apr 9, 2024)
8e3cf24  Merge remote-tracking branch 'origin/main' into rebase_acc (Zhikaiiii, Apr 11, 2024)
8ee4bbf  format code (Zhikaiiii, Apr 11, 2024)
c3284ed  [fix]add mark_step to optimize speed (Zhikaiiii, Apr 27, 2024)
e38fc2e  add script (Zhikaiiii, Apr 28, 2024)
aa61d6f  add torchacc trim graph (Zhikaiiii, Apr 28, 2024)
40d18e9  remove useless code (Zhikaiiii, Apr 28, 2024)
0a173f1  remove useless files (Zhikaiiii, Apr 28, 2024)
6226edb  add qwen72b full script (Zhikaiiii, Apr 28, 2024)
5da5649  Merge branch 'main' into rebase_acc (Zhikaiiii, Apr 28, 2024)
6d68d29  fix bugs (Zhikaiiii, Apr 28, 2024)
a508e21  qwen15 and llama3 support (Zhikaiiii, May 17, 2024)
bf2a440  Merge branch 'main' into rebase_acc (Zhikaiiii, May 17, 2024)
c5c310a  remove prof callback (Zhikaiiii, May 17, 2024)
bd8d072  fix default value and add switch (Zhikaiiii, May 21, 2024)
df84f3f  update script (Zhikaiiii, May 22, 2024)

Changes from all commits:
@@ -0,0 +1,34 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc dp
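# Notes on the environment below (inferred from the flag names and public
# torch_xla conventions; TorchAcc itself is internal):
# - USE_TORCHACC=1 routes swift training onto the TorchAcc (XLA) backend.
# - XLA_FLAGS: 32-way parallel XLA compilation, a 4.5 GiB per-heap constraint,
#   and several HLO combiner/rematerialization passes disabled.
# - XLA_IR_SHAPE_CACHE_SIZE: large shape cache to curb recompilation under
#   dynamic sequence lengths.
# - XLA_ALLOCATOR_FRACTION: the XLA allocator may claim up to 95% of GPU memory.
# - XLA_EXPERIMENTAL: enables experimental nonzero/masked_select lowerings.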
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
--model_id_or_path baichuan-inc/Baichuan2-13B-Chat \
--model_layer_cls_name BaichuanLayer \
--dataset codefuse-python-en \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 12 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
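A few recurring choices in these scripts, for readers skimming the diff: --model_layer_cls_name names the model's decoder-layer class (BaichuanLayer, GLMBlock, LlamaDecoderLayer), presumably so TorchAcc knows which module to treat as the FSDP/checkpointing unit; --eval_steps 2000000 and --save_steps 2000000 together with --save_strategy no effectively disable evaluation and checkpointing, marking these as throughput-benchmark runs; and --metric_warmup_step 0.1 (from commit 32cc090, "metric warmup calculate") appears to exclude the first 10% of steps from the speed metrics so XLA compilation time does not skew them.

Because TorchAcc is only available internally, a quick import check before launching can save a failed run. A minimal sketch, assuming the internal wheel exposes a torchacc module on top of the public torch_xla API:

# Verify TorchAcc and its XLA backend import cleanly and a device is visible.
python -c "import torchacc; import torch_xla.core.xla_model as xm; print(xm.xla_device())"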
@@ -0,0 +1,34 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc fsdp
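# Same as the dp script above except for --fsdp_num 2 (shard model state
# across both ranks; argument added in commit 0bb1797) and a larger batch.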
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_id_or_path baichuan-inc/Baichuan2-13B-Chat \
--model_layer_cls_name BaichuanLayer \
--dataset codefuse-python-en \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 16 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--fsdp_num 2 \
--report_to 'none'
@@ -0,0 +1,27 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
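# Baseline without TorchAcc: no USE_TORCHACC/XLA variables are set, so this
# runs the stock swift path. Note the much smaller per-device batch (2 vs.
# 12/16 in the TorchAcc variants above).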

# MASTER_ADDR=127.0.0.1 \

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_id_or_path baichuan-inc/Baichuan2-13B-Chat \
--dataset codefuse-python-en \
--sft_type lora \
--dtype AUTO \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 2 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
@@ -0,0 +1,35 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc dp
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select


NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
--model_id_or_path ZhipuAI/chatglm3-6b \
--model_layer_cls_name GLMBlock \
--dataset codefuse-python-en \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 16 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
@@ -0,0 +1,35 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc fsdp
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select


NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_id_or_path ZhipuAI/chatglm3-6b \
--model_layer_cls_name GLMBlock \
--dataset codefuse-python-en \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 16 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--fsdp_num 2 \
--report_to 'none'
@@ -0,0 +1,27 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

# MASTER_ADDR=127.0.0.1 \
# MASTER_PORT=12356 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_id_or_path ZhipuAI/chatglm3-6b \
--dataset codefuse-python-en \
--sft_type lora \
--dtype AUTO \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 4 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
@@ -0,0 +1,35 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

export USE_TORCHACC=1
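# TORCHACC_TRIM_GRAPH=1 enables TorchAcc's graph-trimming optimization
# (introduced in commit aa61d6f, "add torchacc trim graph"); its details
# are internal to TorchAcc.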
export TORCHACC_TRIM_GRAPH=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_id_or_path modelscope/Llama-2-13b-chat-ms \
--model_layer_cls_name LlamaDecoderLayer \
--dataset codefuse-python-en \
--template_type llama \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 16 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
@@ -0,0 +1,36 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
export USE_TORCHACC=1
export TORCHACC_TRIM_GRAPH=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
--model_id_or_path modelscope/Llama-2-13b-chat-ms \
--model_layer_cls_name LlamaDecoderLayer \
--dataset codefuse-python-en \
--template_type llama \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 24 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--fsdp_num 2 \
--report_to 'none'
@@ -0,0 +1,27 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

# MASTER_ADDR=127.0.0.1 \

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_id_or_path modelscope/Llama-2-13b-chat-ms \
--dataset codefuse-python-en \
--sft_type lora \
--dtype AUTO \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 16 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
@@ -0,0 +1,37 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

export USE_TORCHACC=1
export TORCHACC_TRIM_GRAPH=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select
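# XLA_COORDINATOR_PORT pins the port of the XLA distributed coordinator
# (an assumption from the variable name), keeping it clear of the
# MASTER_PORT chosen below.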
export XLA_COORDINATOR_PORT=12457

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=21779 \
swift sft \
--model_id_or_path LLM-Research/Meta-Llama-3-8B-Instruct \
--model_layer_cls_name LlamaDecoderLayer \
--dataset codefuse-python-en \
--template_type llama3 \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 12 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--report_to 'none'
@@ -0,0 +1,37 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
export USE_TORCHACC=1
export TORCHACC_TRIM_GRAPH=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select
# export XLA_COORDINATOR_PORT=12457

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
--model_id_or_path LLM-Research/Meta-Llama-3-8B-Instruct \
--model_layer_cls_name LlamaDecoderLayer \
--dataset codefuse-python-en \
--template_type llama3 \
--sft_type lora \
--output_dir output \
--num_train_epochs 1 \
--max_length 2048 \
--batch_size 12 \
--use_flash_attn true \
--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--dataset_test_ratio 0 \
--save_strategy no \
--eval_steps 2000000 \
--save_steps 2000000 \
--logging_steps 100 \
--preprocess_num_proc 1 \
--metric_warmup_step 0.1 \
--fsdp_num 2 \
--report_to 'none'