Modular Embedding (#366)
* Modular Embedding

1. Transform uer/embeddings into a modular design
2. Modify the corresponding config files and opts settings
3. Modify the embedding construction in the model classes that use embeddings, and in uer/model_builder (a hedged sketch of the new container follows)
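The fine-tuning diffs below all replace a single `str2embedding[args.embedding](...)` lookup with a composite `Embedding` container that is filled one named sub-embedding at a time. The following is a minimal sketch of what such a container could look like; the attribute-based registration and the summation in `forward` are inferred from the usage visible in the diffs (`Embedding(args)`, `update(tmp_emb, embedding_name)`, `embedding.word.embedding.weight`), not copied from uer/embeddings.

```python
import torch.nn as nn


class Embedding(nn.Module):
    """Hedged sketch of a composite embedding container (inferred from usage, not the repo code)."""

    def __init__(self, args):
        super(Embedding, self).__init__()
        self.embedding_name_list = []

    def update(self, embedding, embedding_name):
        # Register a sub-embedding (e.g. "word", "pos", "seg") as a child module,
        # so it appears in parameters() and state_dict() under its name.
        setattr(self, embedding_name, embedding)
        self.embedding_name_list.append(embedding_name)

    def forward(self, src, seg):
        # Assumed contract: each sub-embedding maps (src, seg) to a tensor of shape
        # [batch_size, seq_length, emb_size]; the container sums the results.
        emb = None
        for embedding_name in self.embedding_name_list:
            sub_emb = getattr(self, embedding_name)(src, seg)
            emb = sub_emb if emb is None else emb + sub_emb
        return emb
```

Under this reading, a model's word lookup table becomes reachable as `embedding.word.embedding.weight`, which is what the weight-tying change in finetune/run_classifier_prompt.py below relies on.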

* Update lm_target.py

Modify lm_target to take ignore_index, label_smoothing, and seg into account (a hedged loss sketch follows)
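For reference, a language-model loss that honors all three options could be wired roughly as below. This is a hedged sketch under assumed tensor shapes (`seg == 0` taken to mark padding), not the actual LmTarget code; `label_smoothing` in `nn.CrossEntropyLoss` requires PyTorch 1.10 or newer.

```python
import torch.nn as nn


def lm_loss(logits, tgt, seg, ignore_index=None, label_smoothing=0.0):
    """Hypothetical LM cross-entropy.

    logits: [batch_size, seq_length, vocab_size]
    tgt:    [batch_size, seq_length] gold token ids
    seg:    [batch_size, seq_length], 0 marks padding positions
    """
    vocab_size = logits.size(-1)
    logits = logits.contiguous().view(-1, vocab_size)
    tgt = tgt.contiguous().view(-1)
    seg = seg.contiguous().view(-1)

    if ignore_index is not None:
        # Fold the padding mask into the targets so CrossEntropyLoss skips them.
        tgt = tgt.masked_fill(seg == 0, ignore_index)
        criterion = nn.CrossEntropyLoss(ignore_index=ignore_index,
                                        label_smoothing=label_smoothing)
        return criterion(logits, tgt)

    # Without an ignore_index, mask the per-token losses manually.
    criterion = nn.CrossEntropyLoss(label_smoothing=label_smoothing, reduction="none")
    loss = criterion(logits, tgt)
    mask = (seg != 0).float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```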

* Eliminate DeepSpeed

Remove DeepSpeed support to keep the UER-py codebase clearer

* Update README.md

* Update github-actions.yml

* Update github-actions.yml

* Update github-actions.yml

* Update target.py

* target

* Update model.py

* Update lm_target.py

* default_args for encoder_decoder_model

* Update model.py

* scripts/convert

Conversion scripts modified to match the modular embedding component
Eric8932 authored Jul 15, 2023
1 parent 7c1f731 commit b80a5e2
Showing 118 changed files with 586 additions and 1,007 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/github-actions.yml
@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
-python-version: [3.6.13]
+python-version: [3.7.13]

steps:
- uses: actions/checkout@v2
@@ -40,7 +40,7 @@ jobs:
python pretrain.py --dataset_path cls_dataset.pt --vocab_path models/google_zh_vocab.txt --config_path models/bert/mini_config.json --output_model_path models/cls_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2 --labels_num 2 --data_processor cls --target cls
mv models/cls_model.bin-10 models/cls_model.bin
python preprocess.py --corpus_path corpora/parallel_corpus_en_zh.txt --vocab_path models/google_uncased_en_vocab.txt --tgt_vocab_path models/google_zh_vocab.txt --dataset_path mt_dataset.pt --processes_num 8 --seq_length 64 --tgt_seq_length 64 --data_processor mt
-python pretrain.py --dataset_path mt_dataset.pt --vocab_path models/google_uncased_en_vocab.txt --tgt_vocab_path models/google_zh_vocab.txt --config_path models/encoder_decoder_config.json --output_model_path models/mt_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2
+python pretrain.py --dataset_path mt_dataset.pt --vocab_path models/google_uncased_en_vocab.txt --tgt_vocab_path models/google_zh_vocab.txt --config_path models/transformer/base_config.json --output_model_path models/mt_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2
mv models/mt_model.bin-10 models/mt_model.bin
python preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines_bert.txt --vocab_path models/google_zh_vocab.txt --dataset_path pegasus_dataset.pt --processes_num 8 --seq_length 128 --tgt_seq_length 128 --dup_factor 1 --sentence_selection_strategy random --data_processor gsg
python pretrain.py --dataset_path pegasus_dataset.pt --vocab_path models/google_zh_vocab.txt --config_path models/pegasus/base_config.json --output_model_path models/pegasus_model.bin --total_steps 10 --save_checkpoint_steps 10 --report_steps 2 --batch_size 2
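The machine-translation test above now points at models/transformer/base_config.json instead of the old models/encoder_decoder_config.json. Under the modular-embedding scheme, such a config would plausibly name the embedding stacks explicitly. The dict below only mirrors an assumed JSON layout with illustrative values (including the sub-embedding names); it is not copied from the repository.

```python
# Assumed shape of a post-change encoder-decoder config (illustrative, not the repo file).
transformer_base_config = {
    "emb_size": 512,
    "hidden_size": 512,
    "feedforward_size": 2048,
    "heads_num": 8,
    "layers_num": 6,
    "embedding": ["word", "sinusoidalpos"],       # encoder-side sub-embeddings (assumed names)
    "tgt_embedding": ["word", "sinusoidalpos"],   # decoder-side sub-embeddings (assumed names)
    "encoder": "transformer",
    "decoder": "transformer",
    "target": ["lm"],
}
```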
3 changes: 1 addition & 2 deletions README.md
@@ -37,7 +37,7 @@ Table of Contents
UER-py has the following features:
- __Reproducibility__ UER-py has been tested on many datasets and should match the performances of the original pre-training model implementations such as BERT, GPT-2, ELMo, and T5.
- __Model modularity__ UER-py is divided into the following components: embedding, encoder, target embedding (optional), decoder (optional), and target. Ample modules are implemented in each component. Clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.
-- __Model training__ UER-py supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.
+- __Model training__ UER-py supports CPU mode, single GPU mode and distributed training mode.
- __Model zoo__ With the help of UER-py, we pre-train and release models of different properties. Proper selection of pre-trained models is important to the performances of downstream tasks.
- __SOTA results__ UER-py supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides winning solutions of many NLP competitions.
- __Abundant functions__ UER-py provides abundant functions related with pre-training, such as feature extractor and text generation.
@@ -58,7 +58,6 @@ UER-py has the following features:
* For developing a stacking model you will need LightGBM and [BayesianOptimization](https://github.com/fmfn/BayesianOptimization)
* For the pre-training with whole word masking you will need word segmentation tool such as [jieba](https://github.com/fxsjy/jieba)
* For the use of CRF in sequence labeling downstream task you will need [pytorch-crf](https://github.com/kmkurn/pytorch-crf)
-* For the gigantic model training you will need [DeepSpeed](https://github.com/microsoft/DeepSpeed)


<br/>
3 changes: 1 addition & 2 deletions README_ZH.md
@@ -30,7 +30,7 @@
UER-py has the following advantages:
- __Reproducibility__ UER-py has been tested on many datasets and matches the performance of the original pre-training model implementations (e.g. BERT, GPT-2, ELMo, T5)
- __Modularity__ UER-py uses a decoupled, modular design. The framework is divided into components such as Embedding, Encoder, and Target. The components have clear interfaces, each contains a rich set of modules, and different modules can be combined to build pre-training models with different properties
-- __Model training__ UER-py supports CPU, single-machine single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training modes, and supports gigantic model training with the DeepSpeed optimization library
+- __Model training__ UER-py supports CPU, single-machine single-GPU, single-machine multi-GPU, and multi-machine multi-GPU training modes
- __Model zoo__ We maintain and continuously release pre-trained models. Users can choose suitable pre-trained models according to the requirements of their specific tasks
- __SOTA results__ UER-py supports comprehensive downstream tasks, including text classification, text-pair classification, sequence labeling, and reading comprehension, and provides winning solutions of several competitions
- __Pre-training related functions__ UER-py provides abundant pre-training related functions and optimizations, including feature extraction, synonym retrieval, pre-trained model conversion, model ensemble, and text generation
@@ -51,7 +51,6 @@ UER-py has the following advantages:
* For the stacking model ensemble you will need LightGBM and [BayesianOptimization](https://github.com/fmfn/BayesianOptimization)
* For pre-training with whole word masking you will need a word segmentation tool such as [jieba](https://github.com/fxsjy/jieba)
* For the use of CRF in the sequence labeling downstream task you will need [pytorch-crf](https://github.com/kmkurn/pytorch-crf)
-* For gigantic model training you will need [DeepSpeed](https://github.com/microsoft/DeepSpeed)


<br/>
5 changes: 4 additions & 1 deletion finetune/run_c3.py
@@ -28,7 +28,10 @@
class MultipleChoice(nn.Module):
    def __init__(self, args):
        super(MultipleChoice, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.dropout = nn.Dropout(args.dropout)
        self.output_layer = nn.Linear(args.hidden_size, 1)
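The three added lines recur verbatim in every fine-tuning script below, so `args.embedding` is evidently a list of embedding names rather than a single string, while the container keeps the old call signature. As a result the forward passes are untouched; the sketch below shows how such a classifier would consume the composite embedding, with the pooling and shapes assumed for illustration (the forward methods are not part of this diff).

```python
def forward(self, src, tgt, seg):
    # Hedged sketch, not repository code: the composite embedding behaves like the old
    # single embedding from the encoder's point of view.
    emb = self.embedding(src, seg)       # sub-embeddings summed into [batch, seq, hidden]
    output = self.encoder(emb, seg)      # encoder is unaware that the embedding changed
    output = output[:, 0, :]             # assumed first-token pooling, for illustration
    logits = self.output_layer(output)
    return logits
```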
5 changes: 4 additions & 1 deletion finetune/run_classifier.py
@@ -28,7 +28,10 @@
class Classifier(nn.Module):
    def __init__(self, args):
        super(Classifier, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.pooling_type = args.pooling
212 changes: 0 additions & 212 deletions finetune/run_classifier_deepspeed.py

This file was deleted.

5 changes: 4 additions & 1 deletion finetune/run_classifier_mt.py
@@ -28,7 +28,10 @@
class MultitaskClassifier(nn.Module):
    def __init__(self, args):
        super(MultitaskClassifier, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.pooling_type = args.pooling
        self.output_layers_1 = nn.ModuleList([nn.Linear(args.hidden_size, args.hidden_size) for _ in args.labels_num_list])
5 changes: 4 additions & 1 deletion finetune/run_classifier_multi_label.py
@@ -32,7 +32,10 @@
class MultilabelClassifier(nn.Module):
    def __init__(self, args):
        super(MultilabelClassifier, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.pooling_type = args.pooling
7 changes: 5 additions & 2 deletions finetune/run_classifier_prompt.py
@@ -20,11 +20,14 @@
class ClozeTest(nn.Module):
    def __init__(self, args):
        super(ClozeTest, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.target = MlmTarget(args, len(args.tokenizer.vocab))
        if args.tie_weights:
-            self.target.mlm_linear_2.weight = self.embedding.word_embedding.weight
+            self.target.mlm_linear_2.weight = self.embedding.word.embedding.weight
        self.answer_position = args.answer_position
        self.device = args.device

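One knock-on effect of the container shows up in this hunk: the tied MLM projection no longer reads `self.embedding.word_embedding.weight` but follows the named sub-module path `self.embedding.word.embedding.weight`. The standalone snippet below illustrates that tying path; the nesting mirrors what `update(tmp_emb, "word")` would register, and the sizes are illustrative.

```python
import torch.nn as nn

# Hypothetical nesting matching the attribute path embedding.word.embedding.weight.
word_emb = nn.Module()
word_emb.embedding = nn.Embedding(21128, 768)           # vocab_size x hidden_size (illustrative)

embedding = nn.Module()
embedding.word = word_emb                               # what update(tmp_emb, "word") would register

mlm_linear_2 = nn.Linear(768, 21128, bias=False)
mlm_linear_2.weight = embedding.word.embedding.weight   # projection and lookup now share one tensor
```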
5 changes: 4 additions & 1 deletion finetune/run_classifier_siamese.py
@@ -31,7 +31,10 @@
class SiameseClassifier(nn.Module):
    def __init__(self, args):
        super(SiameseClassifier, self).__init__()
-        self.embedding = DualEmbedding(args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = DualEncoder(args)

        self.classifier = nn.Linear(4 * args.stream_0["hidden_size"], args.labels_num)
5 changes: 4 additions & 1 deletion finetune/run_cmrc.py
@@ -29,7 +29,10 @@
class MachineReadingComprehension(nn.Module):
    def __init__(self, args):
        super(MachineReadingComprehension, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.output_layer = nn.Linear(args.hidden_size, 2)

5 changes: 4 additions & 1 deletion finetune/run_ner.py
@@ -30,7 +30,10 @@
class NerTagger(nn.Module):
    def __init__(self, args):
        super(NerTagger, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.labels_num = args.labels_num
        self.output_layer = nn.Linear(args.hidden_size, self.labels_num)
5 changes: 4 additions & 1 deletion finetune/run_regression.py
@@ -18,7 +18,10 @@
class Regression(nn.Module):
    def __init__(self, args):
        super(Regression, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)
        self.pooling_type = args.pooling
        self.output_layer_1 = nn.Linear(args.hidden_size, args.hidden_size)
5 changes: 4 additions & 1 deletion finetune/run_simcse.py
@@ -33,7 +33,10 @@
class SimCSE(nn.Module):
    def __init__(self, args):
        super(SimCSE, self).__init__()
-        self.embedding = str2embedding[args.embedding](args, len(args.tokenizer.vocab))
+        self.embedding = Embedding(args)
+        for embedding_name in args.embedding:
+            tmp_emb = str2embedding[embedding_name](args, len(args.tokenizer.vocab))
+            self.embedding.update(tmp_emb, embedding_name)
        self.encoder = str2encoder[args.encoder](args)

        self.pooling_type = args.pooling
