🎉🎉🎉 We are proud to announce that we have entirely rewritten Kashgari with tf.keras. Kashgari now comes with an easier-to-understand API and is faster! 🎉🎉🎉
Kashgari is a simple and powerful NLP transfer learning framework. Build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.
- Human-friendly. Kashgari's code is straightforward, well documented and tested, which makes it very easy to understand and modify.
- Powerful and simple. Kashgari allows you to apply state-of-the-art natural language processing (NLP) models to your text for tasks such as named entity recognition (NER), part-of-speech tagging (PoS), and classification.
- Built-in transfer learning. Kashgari has built-in support for pre-trained BERT and Word2vec embedding models, which makes it simple to use transfer learning when training your model.
- Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation: train your models and try new approaches using different embeddings and model structures.
- Production ready. Kashgari can export models in the `SavedModel` format for TensorFlow Serving, so you can deploy them directly on the cloud.
- Academic users: easier experimentation to prove their hypotheses without coding from scratch.
- NLP beginners: learn how to build an NLP project with production-level code quality.
- NLP developers: build a production-level classification or labeling model within minutes.
Task | Language | Dataset | Score | Detail |
---|---|---|---|---|
Named Entity Recognition | Chinese | People's Daily Ner Corpus | 94.46 (F1) | Text Labeling Performance Report |
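The table reports an F1 score without defining it. As a rough illustration only, and assuming the score is an entity-level F1 over BIO-style tags (an assumption; the linked performance report is the authority), such a metric could be computed as sketched below. The helpers `extract_entities` and `entity_f1` are hypothetical and not part of Kashgari.

```python
from typing import List, Set, Tuple

# (sentence index, start, end_exclusive, entity type)
Entity = Tuple[int, int, int, str]


def extract_entities(tags: List[str], sent_idx: int = 0) -> Set[Entity]:
    """Collect entity spans from one BIO tag sequence.

    Stray 'I-' tags that do not continue an open entity are ignored.
    """
    entities: Set[Entity] = set()
    start, ent_type = None, None
    # A trailing 'O' sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ['O']):
        inside = tag.startswith('I-') and ent_type == tag[2:]
        if not inside:
            if ent_type is not None:
                entities.add((sent_idx, start, i, ent_type))
                start, ent_type = None, None
            if tag.startswith('B-'):
                start, ent_type = i, tag[2:]
    return entities


def entity_f1(y_true: List[List[str]], y_pred: List[List[str]]) -> float:
    """Entity-level F1: a prediction counts only if span and type both match."""
    true_ents: Set[Entity] = set()
    pred_ents: Set[Entity] = set()
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        true_ents |= extract_entities(t, idx)
        pred_ents |= extract_entities(p, idx)
    correct = len(true_ents & pred_ents)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_ents)
    recall = correct / len(true_ents)
    return 2 * precision * recall / (precision + recall)


# Toy example: the model finds the person but misses the location.
y_true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
y_pred = [['B-PER', 'I-PER', 'O', 'O']]
score = entity_f1(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Entity-level scoring is stricter than per-token accuracy: a partially matched span earns no credit, which is why an F1 figure and the `acc` values in a training log are not directly comparable.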
Here is a set of quick tutorials to get you started with the library:
- Tutorial 1: Text Classification
- Tutorial 2: Text Labeling
- Tutorial 3: Text Scoring
- Tutorial 4: Language Embedding
There are also articles and posts that illustrate how to use Kashgari:
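- Build a Chinese text classification model in 15 minutes (article in Chinese)
- Chinese named entity recognition (NER) based on BERT (article in Chinese)
- BERT/ERNIE text classification and deployment (article in Chinese)
- Build a BERT-based NER model in five minutes (article in Chinese)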
- Multi-Class Text Classification with Kashgari in 15 minutes
🎉🎉🎉 We renamed again for consistency and clarity. From now on, it is all `kashgari`. 🎉🎉🎉
The project is based on Python 3.6+, because it is 2019 and type hinting is cool.
Backend | PyPI version | Description |
---|---|---|
TensorFlow 2.x | pip install 'kashgari>=2.0.0' | coming soon |
TensorFlow 1.14+ | pip install 'kashgari>=1.0.0,<2.0.0' | current version |
Keras | pip install 'kashgari<1.0.0' | legacy version |
Find more information about the name change.
Let's run an NER labeling model with `BiLSTM_Model`.
```python
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BiLSTM_Model

# Load the train, test, and validation splits of the People's Daily NER corpus
train_x, train_y = ChineseDailyNerCorpus.load_data('train')
test_x, test_y = ChineseDailyNerCorpus.load_data('test')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')

# Build a BiLSTM labeling model with default settings and train it
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, epochs=50)

# Example output:
"""
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 97)                0
_________________________________________________________________
layer_embedding (Embedding)  (None, 97, 100)           320600
_________________________________________________________________
layer_blstm (Bidirectional)  (None, 97, 256)           235520
_________________________________________________________________
layer_dropout (Dropout)      (None, 97, 256)           0
_________________________________________________________________
layer_time_distributed (Time (None, 97, 8)             2056
_________________________________________________________________
activation_7 (Activation)    (None, 97, 8)             0
=================================================================
Total params: 558,176
Trainable params: 558,176
Non-trainable params: 0
_________________________________________________________________
Train on 20864 samples, validate on 2318 samples
Epoch 1/50
20864/20864 [==============================] - 9s 417us/sample - loss: 0.2508 - acc: 0.9333 - val_loss: 0.1240 - val_acc: 0.9607
"""
```
Run with a GPT-2 embedding:

```python
from kashgari.embeddings import GPT2Embedding
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BiGRU_Model

train_x, train_y = ChineseDailyNerCorpus.load_data('train')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')

# Point the embedding at a local GPT-2 model folder and fix the input length
gpt2_embedding = GPT2Embedding('<path-to-gpt-model-folder>', sequence_length=30)
model = BiGRU_Model(gpt2_embedding)
model.fit(train_x, train_y, valid_x, valid_y, epochs=50)
```
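The `sequence_length=30` argument above fixes the input length the model sees. The following is a minimal sketch of what a fixed length implies for each sentence, assuming plain right-side truncation and padding. The helper `pad_or_truncate` and the `[PAD]` token are illustrative assumptions, not Kashgari's internal implementation.

```python
from typing import List


def pad_or_truncate(tokens: List[str],
                    sequence_length: int,
                    pad_token: str = '[PAD]') -> List[str]:
    """Return exactly `sequence_length` items (hypothetical helper).

    Longer inputs lose their tail; shorter inputs are padded on the right.
    """
    clipped = tokens[:sequence_length]
    return clipped + [pad_token] * (sequence_length - len(clipped))


short = pad_or_truncate(['北', '京'], 5)
long_ = pad_or_truncate(list('abcdefgh'), 5)
print(short)  # ['北', '京', '[PAD]', '[PAD]', '[PAD]']
print(long_)  # ['a', 'b', 'c', 'd', 'e']
```

In a labeling task the tag sequence has to be clipped and padded in the same way so that tokens and tags stay aligned. Because sentences longer than the chosen length lose their tail, it is worth checking the length distribution of your corpus before picking a value.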
Run with a BERT embedding:

```python
from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.labeling import BiGRU_Model
from kashgari.corpus import ChineseDailyNerCorpus

# Point the embedding at a local pre-trained BERT model folder
bert_embedding = BERTEmbedding('<bert-model-folder>', sequence_length=30)
model = BiGRU_Model(bert_embedding)

# load_data() without arguments returns the training split
train_x, train_y = ChineseDailyNerCorpus.load_data()
model.fit(train_x, train_y)
```
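The layer summary printed by the plain `BiLSTM_Model` run earlier in this section can be checked by hand, which helps when judging how a different embedding size or label set changes model size. The sketch below assumes a vocabulary of 3,206 tokens, 128 LSTM units per direction, and a CuDNN-style LSTM with two bias vectors per gate. These values are inferred from the printed numbers, not taken from Kashgari's source.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One `dim`-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int, bias_vectors: int = 2) -> int:
    """Four gates, each with an input kernel, a recurrent kernel, and biases.

    bias_vectors=2 matches a CuDNN-style LSTM (assumed here);
    a standard Keras LSTM layer would use bias_vectors=1.
    """
    return 4 * (units * (input_dim + units) + bias_vectors * units)


def dense_params(input_dim: int, output_dim: int) -> int:
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim


# Assumed sizes, inferred from the summary: vocab 3206, embedding dim 100,
# 128 units per direction, 8 output labels.
embedding = embedding_params(3206, 100)  # 320600
blstm = 2 * lstm_params(100, 128)        # 235520 (forward + backward)
dense = dense_params(2 * 128, 8)         # 2056
total = embedding + blstm + dense

print(embedding, blstm, dense, total)  # 320600 235520 2056 558176
```

The total matches the 558,176 parameters in the summary. The dropout and activation layers contribute nothing, and the final dimension of 8 is presumably the size of the label vocabulary for this corpus.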
Support this project by becoming a sponsor. Your issues and feature requests will be prioritized. [Become a sponsor]
Thanks go to these wonderful people. There are many ways to get involved: start with the contributor guidelines and then check the open issues for specific tasks.
Eliyar Eziz 📖 |
Alex Wang 💻 |
Yusup 💻 |
Feel free to join the Slack group if you want to be more involved in Kashgari's development.
This library is inspired by and references the following frameworks and papers.
- flair - A very simple framework for state-of-the-art Natural Language Processing (NLP)
- anago - Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging
- Chinese-Word-Vectors
This project follows the all-contributors specification. Contributions of any kind are welcome!
This project exists thanks to all the people who contribute. [Contribute].
Become a financial contributor and help us sustain our community. [Contribute]
Support this project with your organization. Your logo will show up here with a link to your website. [Contribute]