Welcome to LINDA

MELINDA is a Python library for creating synthetic tabular data. It uses generative models to learn the statistical properties of your real data and then samples synthetic data that follows them.

Installation

git clone https://github.com/hse-cs/LINDA.git
cd LINDA
pip install -e .

or, if you use Poetry, run this from the repository root instead of the pip command:

poetry install

Basic usage

The following code snippet builds a small random dataset that stands in for real data, fits a generative model to it, and samples synthetic data.

import numpy as np
import pandas as pd
from melinda.models import ProbaformsSynthesizer
from probaforms.models import CVAE

# generate an example of real data
n = 100
data_real = pd.DataFrame()
data_real['col_1'] = np.random.rand(n)
data_real['col_2'] = np.random.rand(n)
data_real['col_3'] = [str(i) for i in np.random.randint(0, 10, n)]
data_real['col_4'] = [str(i) for i in np.random.randint(0, 5, n)]

num_cols = ['col_1', 'col_2']
cat_cols = ['col_3', 'col_4']
lab_cols = None

# fit a generative model
model = CVAE(latent_dim=10, hidden=(10,), lr=0.001, n_epochs=10)
gen = ProbaformsSynthesizer(model, num_cols, cat_cols, lab_cols, cat_transform='OneHotEncoder')
gen.fit(data_real)

# sample synthetic data
data_synthetic = gen.sample(n_samples=10)
data_synthetic.head()
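
Because the goal is for the synthetic data to follow the statistical properties of the real data, a quick sanity check after sampling can be useful. The sketch below is not part of the library: `compare_columns` is a hypothetical helper written with plain pandas, and it assumes that both frames share the column names used above. It only compares means, standard deviations, and category frequencies, which is a rough check and not a full quality metric.

```python
import pandas as pd


def compare_columns(real, synthetic, num_cols, cat_cols):
    """Hypothetical helper (not part of the library): summarize a real and a
    synthetic DataFrame column by column so they can be compared by eye."""
    # Numerical columns: mean and standard deviation side by side.
    numeric = pd.concat(
        {
            "real_mean": real[num_cols].mean(),
            "synthetic_mean": synthetic[num_cols].mean(),
            "real_std": real[num_cols].std(),
            "synthetic_std": synthetic[num_cols].std(),
        },
        axis=1,
    )

    # Categorical columns: relative frequency of each category.
    categorical = {}
    for col in cat_cols:
        freqs = pd.concat(
            {
                "real": real[col].astype(str).value_counts(normalize=True),
                "synthetic": synthetic[col].astype(str).value_counts(normalize=True),
            },
            axis=1,
        )
        # A category that is missing from one frame has frequency 0 there.
        categorical[col] = freqs.fillna(0.0).sort_index()

    return numeric, categorical
```

With the frames from the snippet above, `numeric, categorical = compare_columns(data_real, data_synthetic, num_cols, cat_cols)` returns one table for the numerical columns and a dictionary with one frequency table per categorical column. Large gaps between the real and synthetic values suggest the model needs more training epochs or more data.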
