Generation of Realistic Tabular Data
with Pretrained Transformer-Based Language Models
Our GReaT framework leverages pretrained Transformer-based language models to produce high-quality synthetic tabular data. New data samples can be generated in just a few lines of code with our user-friendly API. Please see our publication for more details.
The GReaT framework can be installed with pip and requires Python version >= 3.9:
pip install be-great
In the example below, we show how the GReaT approach is used to generate synthetic tabular data for the California Housing dataset.
from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Initialize GReaT with a pretrained language model and fit it to the data
model = GReaT(llm='distilgpt2', batch_size=32, epochs=50, fp16=True)
model.fit(data)

# Generate 100 synthetic samples
synthetic_data = model.sample(n_samples=100)
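As an optional sanity check, the generated samples can be compared against the original data with plain pandas. The snippet below is a minimal sketch, not part of the GReaT API, and assumes the data and synthetic_data objects from the example above.

import pandas as pd

# Roughly compare the per-column means of the real and synthetic data
comparison = pd.DataFrame({
    'real_mean': data.mean(numeric_only=True),
    'synthetic_mean': synthetic_data.mean(numeric_only=True),
})
print(comparison)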
GReaT also features an interface to impute, i.e., fill in, missing values in arbitrary combinations. This requires a trained model, for instance one obtained using the code snippet above, and a pd.DataFrame in which the missing values are set to NaN.
A minimal example is provided below:
import numpy as np

# test_data: pd.DataFrame with samples from the data distribution
# model: GReaT model trained on the data distribution that should be imputed

# Randomly drop about half of the values from test_data
for clm in test_data.columns:
    test_data[clm] = test_data[clm].apply(lambda x: x if np.random.rand() > 0.5 else np.nan)

# Fill in the missing values with the trained model
imputed_data = model.impute(test_data, max_length=200)
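If a copy of test_data is kept before the values are dropped, the imputation quality can be checked with a simple error metric. The sketch below uses only pandas/numpy; the name original_data is an assumed copy of that kind and is not produced by the snippet above.

# original_data: assumed copy of test_data taken before the values were dropped
mask = test_data.isna()  # positions that were imputed
for clm in test_data.select_dtypes('number').columns:
    abs_error = (imputed_data[clm] - original_data[clm]).abs()
    print(clm, abs_error[mask[clm]].mean())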
GReaT provides methods for saving a model checkpoint (in addition to the checkpoints stored by the Hugging Face Transformers Trainer) and for loading the checkpoint again.
model = GReaT(llm='distilgpt2', batch_size=32, epochs=50, fp16=True)
model.fit(data)
model.save("my_directory") # saves a "model.pt" and a "config.json" file
model = GReaT.load_from_dir("my_directory") # loads the model again
# supports remote file systems via fsspec
model.save("s3://my_bucket")
model = GReaT.load_from_dir("s3://my_bucket")
If you use GReaT, please link or cite our work:
@inproceedings{borisov2023language,
title={Language Models are Realistic Tabular Data Generators},
author={Vadim Borisov and Kathrin Sessler and Tobias Leemann and Martin Pawelczyk and Gjergji Kasneci},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=cEygmQNOeI}
}
We sincerely thank the HuggingFace 🤗 framework.