Kashgari provides several embeddings for language representation. An embedding layer converts the input sequence into tensors for the downstream task. Available embeddings:
| class name | description |
| --- | --- |
| BareEmbedding | randomly initialized tf.keras.layers.Embedding layer for text sequence embedding |
| WordEmbedding | pre-trained Word2Vec embedding |
| BERTEmbedding | pre-trained BERT embedding |
| GPT2Embedding | pre-trained GPT-2 embedding |
| NumericFeaturesEmbedding | randomly initialized tf.keras.layers.Embedding layer for numeric feature embedding |
| StackedEmbedding | stacks other embeddings for a multi-input model |
All embedding classes inherit from the Embedding class and implement the embed() function, which embeds your input sequence, and the embed_model property, which you need to build your own model. By providing the embed() function and the embed_model property, Kashgari hides the complexity of the different language embeddings from users; all you need to care about is which language embedding you need.

You can check out the Embedding API here: link
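Because every embedding class shares this interface, swapping one embedding for another only changes the constructor call. Here is a minimal sketch, assuming the Kashgari 1.x constructor signatures; the Word2Vec file path is a placeholder:

```python
import kashgari
from kashgari.embeddings import BareEmbedding, WordEmbedding

# randomly initialized embedding; no pre-trained weights needed
bare = BareEmbedding(task=kashgari.CLASSIFICATION,
                     sequence_length=100,
                     embedding_size=100)

# pre-trained Word2Vec embedding loaded from a gensim-compatible file
w2v = WordEmbedding('<WORD2VEC_FILE_PATH>',
                    task=kashgari.CLASSIFICATION,
                    sequence_length=100)
```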
Feature extraction is one of the major ways to use a pre-trained language embedding. Kashgari provides a simple API for this task: all you need to do is initialize an embedding object, then call the embed function. Here is an example; all embeddings share the same embed API.
```python
import kashgari
from kashgari.embeddings import BERTEmbedding

# You need to specify the task for the downstream model.
# If you only use the embedding for feature extraction,
# just set `task=kashgari.CLASSIFICATION`.
bert = BERTEmbedding('<BERT_MODEL_FOLDER>',
                     task=kashgari.CLASSIFICATION,
                     sequence_length=100)

# embed a batch of sequences
embed_tensor = bert.embed([['语', '言', '模', '型']])

# embed a single sequence
embed_tensor = bert.embed_one(['语', '言', '模', '型'])
print(embed_tensor)
# array([[-0.5001117 ,  0.9344998 , -0.55165815, ...,  0.49122602,
#         -0.2049343 ,  0.25752577],
#        [-1.05762   , -0.43353617,  0.54398274, ..., -0.61096823,
#          0.04312163,  0.03881482],
#        [ 0.14332692, -0.42566583,  0.68867105, ...,  0.42449307,
#          0.41105768,  0.08222893],
#        ...,
#        [-0.86124015,  0.08591427, -0.34404194, ...,  0.19915134,
#         -0.34176797,  0.06111742],
#        [-0.73940575, -0.02692179, -0.5826528 , ...,  0.26934686,
#         -0.29708537,  0.01855129],
#        [-0.85489404,  0.007399  , -0.26482674, ...,  0.16851354,
#         -0.36805922, -0.0052386 ]], dtype=float32)
```
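The returned value is a numpy array of token embeddings. As a quick sanity check (a sketch; the exact embedding size depends on the checkpoint, e.g. 768 for BERT-Base):

```python
print(embed_tensor.shape)
# embed_one returns (sequence_length, embedding_size), e.g. (100, 768)
# embed returns (batch_size, sequence_length, embedding_size)
```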
See details in the classification and labeling tutorials.
You can access the tf.keras model of the embedding and add your own layers or any other kind of customization; just access the embed_model property of the embedding object.
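For example, here is a minimal sketch, assuming the bert embedding from above and a hypothetical 2-class task, that stacks a pooling layer and a dense classifier on top of embed_model:

```python
import tensorflow as tf

# embed_model is a plain tf.keras model whose output is the sequence of
# token embeddings, shape (batch_size, sequence_length, embedding_size)
base_model = bert.embed_model

# add your own layers on top of the embedding output
pooled = tf.keras.layers.GlobalAveragePooling1D()(base_model.output)
logits = tf.keras.layers.Dense(2, activation='softmax')(pooled)

# `base_model.inputs` keeps all of the embedding's inputs
# (BERT uses both token ids and segment ids)
model = tf.keras.Model(base_model.inputs, logits)
model.summary()
```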