Modular Embedding (#366)
* Modular Embedding

1. Transform uer/embeddings into a modular design
2. Modify the corresponding config files and opts settings
3. Modify the embedding construction in the model classes that use embeddings, and in uer/model_builder (a hedged sketch of the new container follows)
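The fine-tuning diffs below all replace a single `str2embedding[args.embedding](...)` lookup with a composite `Embedding` container that is filled one named sub-embedding at a time. The following is a minimal sketch of what such a container could look like; the attribute-based registration and the summation in `forward` are inferred from the usage visible in the diffs (`Embedding(args)`, `update(tmp_emb, embedding_name)`, `embedding.word.embedding.weight`), not copied from uer/embeddings.

```python
import torch.nn as nn


class Embedding(nn.Module):
    """Hedged sketch of a composite embedding container (inferred from usage, not the repo code)."""

    def __init__(self, args):
        super(Embedding, self).__init__()
        self.embedding_name_list = []

    def update(self, embedding, embedding_name):
        # Register a sub-embedding (e.g. "word", "pos", "seg") as a child module,
        # so it appears in parameters() and state_dict() under its name.
        setattr(self, embedding_name, embedding)
        self.embedding_name_list.append(embedding_name)

    def forward(self, src, seg):
        # Assumed contract: each sub-embedding maps (src, seg) to a tensor of shape
        # [batch_size, seq_length, emb_size]; the container sums the results.
        emb = None
        for embedding_name in self.embedding_name_list:
            sub_emb = getattr(self, embedding_name)(src, seg)
            emb = sub_emb if emb is None else emb + sub_emb
        return emb
```

Under this reading, a model's word lookup table becomes reachable as `embedding.word.embedding.weight`, which is what the weight-tying change in finetune/run_classifier_prompt.py below relies on.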

* Update lm_target.py

Modify lm_target to take ignore_index, label_smoothing, and seg into account (a hedged loss sketch follows)
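For reference, a language-model loss that honors all three options could be wired roughly as below. This is a hedged sketch under assumed tensor shapes (`seg == 0` taken to mark padding), not the actual LmTarget code; `label_smoothing` in `nn.CrossEntropyLoss` requires PyTorch 1.10 or newer.

```python
import torch.nn as nn


def lm_loss(logits, tgt, seg, ignore_index=None, label_smoothing=0.0):
    """Hypothetical LM cross-entropy.

    logits: [batch_size, seq_length, vocab_size]
    tgt:    [batch_size, seq_length] gold token ids
    seg:    [batch_size, seq_length], 0 marks padding positions
    """
    vocab_size = logits.size(-1)
    logits = logits.contiguous().view(-1, vocab_size)
    tgt = tgt.contiguous().view(-1)
    seg = seg.contiguous().view(-1)

    if ignore_index is not None:
        # Fold the padding mask into the targets so CrossEntropyLoss skips them.
        tgt = tgt.masked_fill(seg == 0, ignore_index)
        criterion = nn.CrossEntropyLoss(ignore_index=ignore_index,
                                        label_smoothing=label_smoothing)
        return criterion(logits, tgt)

    # Without an ignore_index, mask the per-token losses manually.
    criterion = nn.CrossEntropyLoss(label_smoothing=label_smoothing, reduction="none")
    loss = criterion(logits, tgt)
    mask = (seg != 0).float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```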

* Eliminate DeepSpeed

Remove DeepSpeed support to keep the UER-py codebase clearer

* Update README.md

* Update github-actions.yml

* Update github-actions.yml

* Update github-actions.yml

* Update target.py

* target

* Update model.py

* Update lm_target.py

* default_args for encoder_decoder_model

* Update model.py

* scripts/convert

Conversion scripts modified to match the modular embedding component
Eric8932 authored Jul 15, 2023
1 parent 7c1f731 commit b80a5e2
Showing 118 changed files with 586 additions and 1,007 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/github-actions.yml
@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
-python-version: [3.6.13]
+python-version: [3.7.13]

steps:
- uses: actions/checkout@v2
@@ -40,7 +40,7 @@ jobs:
python pretrain.py --dataset_path cls_dataset.pt --vocab_path models/google_zh_vocab.txt --config_path models/bert/mini_config.json --output_model_path models/cls_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2 --labels_num 2 --data_processor cls --target cls
mv models/cls_model.bin-10 models/cls_model.bin
python preprocess.py --corpus_path corpora/parallel_corpus_en_zh.txt --vocab_path models/google_uncased_en_vocab.txt --tgt_vocab_path models/google_zh_vocab.txt --dataset_path mt_dataset.pt --processes_num 8 --seq_length 64 --tgt_seq_length 64 --data_processor mt
-python pretrain.py --dataset_path mt_dataset.pt --vocab_path models/google_uncased_en_vocab.txt --tgt_vocab_path models/google_zh_vocab.txt --config_path models/encoder_decoder_config.json --output_model_path models/mt_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2
+python pretrain.py --dataset_path mt_dataset.pt --vocab_path models/google_uncased_en_vocab.txt --tgt_vocab_path models/google_zh_vocab.txt --config_path models/transformer/base_config.json --output_model_path models/mt_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2
mv models/mt_model.bin-10 models/mt_model.bin
python preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines_bert.txt --vocab_path models/google_zh_vocab.txt --dataset_path pegasus_dataset.pt --processes_num 8 --seq_length 128 --tgt_seq_length 128 --dup_factor 1 --sentence_selection_strategy random --data_processor gsg
python pretrain.py --dataset_path pegasus_dataset.pt --vocab_path models/google_zh_vocab.txt --config_path models/pegasus/base_config.json --output_model_path models/pegasus_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2
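The machine-translation test above now points at models/transformer/base_config.json instead of the old models/encoder_decoder_config.json. Under the modular-embedding scheme, such a config would plausibly name the embedding stacks explicitly. The dict below only mirrors an assumed JSON layout with illustrative values (including the sub-embedding names); it is not copied from the repository.

```python
# Assumed shape of a post-change encoder-decoder config (illustrative, not the repo file).
transformer_base_config = {
    "emb_size": 512,
    "hidden_size": 512,
    "feedforward_size": 2048,
    "heads_num": 8,
    "layers_num": 6,
    "embedding": ["word", "sinusoidalpos"],       # encoder-side sub-embeddings (assumed names)
    "tgt_embedding": ["word", "sinusoidalpos"],   # decoder-side sub-embeddings (assumed names)
    "encoder": "transformer",
    "decoder": "transformer",
    "target": ["lm"],
}
```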
3 changes: 1 addition & 2 deletions README.md
@@ -37,7 +37,7 @@ Table of Contents
UER-py has the following features:
- __Reproducibility__ UER-py has been tested on many datasets and should match the performances of the original pre-training model implementations such as BERT, GPT-2, ELMo, and T5.
- __Model modularity__ UER-py is divided into the following components: embedding, encoder, target embedding (optional), decoder (optional), and target. Ample modules are implemented in each component. Clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.
-- __Model training__ UER-py supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.
+- __Model training__ UER-py supports CPU mode, single GPU mode and distributed training mode.
- __Model zoo__ With the help of UER-py, we pre-train and release models of different properties. Proper selection of pre-trained models is important to the performances of downstream tasks.
- __SOTA results__ UER-py supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides winning solutions of many NLP competitions.
- __Abundant functions__ UER-py provides abundant functions related with pre-training, such as feature extractor and text generation.
@@ -58,7 +58,6 @@ UER-py has the following features:
* For developing a stacking model you will need LightGBM and [BayesianOptimization](https://github.com/fmfn/BayesianOptimization)
* For the pre-training with whole word masking you will need word segmentation tool such as [jieba](https://github.com/fxsjy/jieba)
* For the use of CRF in sequence labeling downstream task you will need [pytorch-crf](https://github.com/kmkurn/pytorch-crf)
-* For the gigantic model training you will need [DeepSpeed](https://github.com/microsoft/DeepSpeed)


<br/>
3 changes: 1 addition & 2 deletions README_ZH.md
@@ -30,7 +30,7 @@
UER-py has the following advantages:
- __Reproducibility__ UER-py has been tested on many datasets and matches the performance of the original pre-training model implementations (e.g. BERT, GPT-2, ELMo, T5)
- __Modularity__ UER-py uses a decoupled, modular design. The framework is divided into components such as Embedding, Encoder, and Target. The components have clear interfaces, each contains a rich set of modules, and different modules can be combined to build pre-training models with different properties
-- __Model training__ UER-py supports CPU, single-machine single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training modes, and supports gigantic model training with the DeepSpeed optimization library
+- __Model training__ UER-py supports CPU, single-machine single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training modes
- __Model zoo__ We maintain and continuously release pre-trained models. Users can choose suitable pre-trained models according to the requirements of their specific tasks
- __SOTA results__ UER-py supports comprehensive downstream tasks, including text classification, text-pair classification, sequence labeling, and reading comprehension, and provides winning solutions of several competitions
- __Pre-training related functions__ UER-py provides abundant pre-training related functions and optimizations, including feature extraction, synonym retrieval, pre-trained model conversion, model ensemble, and text generation
@@ -51,7 +51,6 @@ UER-py has the following advantages:
* For the stacking model ensemble you will need LightGBM and [BayesianOptimization](https://github.com/fmfn/BayesianOptimization)
* For pre-training with whole word masking you will need a word segmentation tool such as [jieba](https://github.com/fxsjy/jieba)
* For the use of CRF in the sequence labeling downstream task you will need [pytorch-crf](https://github.com/kmkurn/pytorch-crf)
-* For gigantic model training you will need [DeepSpeed](https://github.com/microsoft/DeepSpeed)


<br/>
5 changes: 4 additions & 1 deletion finetune/run_c3.py
@@ -28,7 +28,10 @@
class MultipleChoice(nn.Module):
    def __init__(self, args):
        super(MultipleChoice, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.dropout = nn.Dropout(args.dropout)
        self.output_layer = nn.Linear(args.hidden_size, 1)
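The three added lines recur verbatim in every fine-tuning script below, so `args.embedding` is evidently a list of embedding names rather than a single string, while the container keeps the old call signature. As a result the forward passes are untouched; the sketch below shows how such a classifier would consume the composite embedding, with the pooling and shapes assumed for illustration (the forward methods are not part of this diff).

```python
def forward(self, src, tgt, seg):
    # Hedged sketch, not repository code: the composite embedding behaves like the old
    # single embedding from the encoder's point of view.
    emb = self.embedding(src, seg)       # sub-embeddings summed into [batch, seq, hidden]
    output = self.encoder(emb, seg)      # encoder is unaware that the embedding changed
    output = output[:, 0, :]             # assumed first-token pooling, for illustration
    logits = self.output_layer(output)
    return logits
```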
5 changes: 4 additions & 1 deletion finetune/run_classifier.py
@@ -28,7 +28,10 @@
class Classifier(nn.Module):
    def __init__(self, args):
        super(Classifier, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.pooling_type = args.pooling
212 changes: 0 additions & 212 deletions finetune/run_classifier_deepspeed.py

This file was deleted.

5 changes: 4 additions & 1 deletion finetune/run_classifier_mt.py
@@ -28,7 +28,10 @@
class MultitaskClassifier(nn.Module):
    def __init__(self, args):
        super(MultitaskClassifier, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.pooling_type = args.pooling
        self.output_layers_1 = nn.ModuleList([nn.Linear(args.hidden_size, args.hidden_size) for _ in args.labels_num_list])
5 changes: 4 additions & 1 deletion finetune/run_classifier_multi_label.py
@@ -32,7 +32,10 @@
class MultilabelClassifier(nn.Module):
    def __init__(self, args):
        super(MultilabelClassifier, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.pooling_type = args.pooling
7 changes: 5 additions & 2 deletions finetune/run_classifier_prompt.py
@@ -20,11 +20,14 @@
class ClozeTest(nn.Module):
    def __init__(self, args):
        super(ClozeTest, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.target = MlmTarget(args, len(args.tokenizer.vocab))
        if args.tie_weights:
-            self.target.mlm_linear_2.weight = self.embedding.word_embedding.weight
+            self.target.mlm_linear_2.weight = self.embedding.word.embedding.weight
        self.answer_position = args.answer_position
        self.device = args.device

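One knock-on effect of the container shows up in this hunk: the tied MLM projection no longer reads `self.embedding.word_embedding.weight` but follows the named sub-module path `self.embedding.word.embedding.weight`. The standalone snippet below illustrates that tying path; the nesting mirrors what `update(tmp_emb, "word")` would register, and the sizes are illustrative.

```python
import torch.nn as nn

# Hypothetical nesting matching the attribute path embedding.word.embedding.weight.
word_emb = nn.Module()
word_emb.embedding = nn.Embedding(21128, 768)           # vocab_size x hidden_size (illustrative)

embedding = nn.Module()
embedding.word = word_emb                               # what update(tmp_emb, "word") would register

mlm_linear_2 = nn.Linear(768, 21128, bias=False)
mlm_linear_2.weight = embedding.word.embedding.weight   # projection and lookup now share one tensor
```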
5 changes: 4 additions & 1 deletion finetune/run_classifier_siamese.py
@@ -31,7 +31,10 @@
class SiameseClassifier(nn.Module):
    def __init__(self, args):
        super(SiameseClassifier, self).__init__()
-        self.embedding = DualEmbedding(args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = DualEncoder(args)

        self.classifier = nn.Linear(4 * args.stream_0["hidden_size"], args.labels_num)
5 changes: 4 additions & 1 deletion finetune/run_cmrc.py
@@ -29,7 +29,10 @@
class MachineReadingComprehension(nn.Module):
    def __init__(self, args):
        super(MachineReadingComprehension, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.output_layer = nn.Linear(args.hidden_size, 2)

5 changes: 4 additions & 1 deletion finetune/run_ner.py
@@ -30,7 +30,10 @@
class NerTagger(nn.Module):
    def __init__(self, args):
        super(NerTagger, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.output_layer = nn.Linear(args.hidden_size, self.labels_num)
5 changes: 4 additions & 1 deletion finetune/run_regression.py
@@ -18,7 +18,10 @@
class Regression(nn.Module):
    def __init__(self, args):
        super(Regression, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.pooling_type = args.pooling
        self.output_layer_1 = nn.Linear(args.hidden_size, args.hidden_size)
5 changes: 4 additions & 1 deletion finetune/run_simcse.py
@@ -33,7 +33,10 @@
class SimCSE(nn.Module):
    def __init__(self, args):
        super(SimCSE, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)

        self.pooling_type = args.pooling
