This is a `dbt` package for creating synthetic data. Currently it supports Snowflake, Postgres, DuckDB, and SQLite (with the stats extension). Other backends may be added eventually. All the magic happens in `macros/*`.
- About
- Installation
- Architecture
- Simple example
- Distributions
- Column types
- Advanced usage
- Datasets
- Performance
- Changelog
- Contributing
- License
Robert Fehrmann (CTO at Snowflake) has a couple good blog posts about generating random integers and strings or dates and times in Snowflake, which the base column types in this package emulate.
However, creating more realistic synthetic data requires more complex column types, advanced random distributions, supporting datasets, and references to other models.
`dbt_synth_data` provides many macros to facilitate building out realistic synthetic data. It builds up a series of CTEs and joins from a base of randomly-generated values - see Architecture for details. `dbt_synth_data` is powerful, especially on Snowflake - it can create billions of rows and hundreds of GB of data. See Performance for details.
There are generally two approaches to creating synthetic or "fake" data:
- start with real data, de-identify it, and possibly "fuzz" or "jitter" some values
- start with nothing and synthesize data by describing it, including the distributions and correlations in the data
(Recent research has proposed a hybrid approach, where a "nearby" or similar synthetic data row (2) is selected for each row of real, de-identified data (1)... but adequately defining "nearby" is difficult.)
Approach (1) can be dangerous, as it is susceptible to re-identification and other adversarial attacks. `dbt_synth_data` implements approach (2) only.
Synthetic data generated with `dbt_synth_data` can be useful for testing user interfaces, demoing applications, performance-tuning operational systems, preparing training and other materials with realistic data, and potentially other uses.
The synthetic data created using `dbt_synth_data` should not be mistaken for fully realistic data reflecting all the correlations present in the real world. Therefore, please do not use data generated with this package to train ML models!
This package currently supports the following backends:
- `snowflake` (with `pip install dbt-snowflake`)
- `postgres` (with `pip install dbt-postgres`)
- `sqlite` (with `pip install dbt-sqlite`)
- `duckdb` (with `pip install dbt-duckdb`)
- add `dbt_synth_data` to your `packages.yml`
- run `dbt deps`
- run `dbt seed`
- add `"dbt_packages/dbt_synth/macros"` to your `dbt_project.yml`'s `macro-paths`
- build your synthetic models as documented below
- `dbt run`
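For reference, a minimal sketch of the relevant configuration files (the package location and version below are placeholders; substitute the actual repository URL and release):

# packages.yml -- the git URL and revision here are placeholders
packages:
  - git: "https://github.com/<org>/dbt_synth_data.git"
    revision: <release-tag>

# dbt_project.yml -- expose the package's macros to your project
macro-paths:
  - "macros"
  - "dbt_packages/dbt_synth/macros"

Note that overriding `macro-paths` replaces dbt's default, so it should include your project's own macro directory (commonly `macros`) in addition to the package's.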
CTEs, joins, and fields defined by `synth_column_*()` are temporarily stored in `dbt`'s `target` object during parse/run time, as this is one of the few dbt objects that persists and is scoped across macros. Finally, `synth_table()` stitches everything together into a query of the general form:
-- [various CTEs as required for selecting seed data or values from other models]
base as (
select
-- base CTE includes a row_number, which facilitates generating integer or date sequences, primary keys, and more
row_number() over (order by 1) as __row_number
from table(generator( rowcount => [rows] )) -- snowflake
-- from generate_series( 1, [rows] ) as s(idx) -- postgres, sqlite
),
join0 as (
select
base.__row_number,
-- randomness source fields, such as
UNIFORM(0::float, 1::float, RANDOM()) as field1__rand, -- snowflake
-- RANDOM() as field1__rand, -- postgres
-- [similar *__rand fields for other columns as required]
from base
),
-- arbitrarily many further joins to the CTEs defined above
joinN as (
select
join[N-1].*, -- all fields from prior joins, plus:
CTEx.field2,
CTEx.field3
from join[N-1]
left join CTEx on ... -- something involving join[N-1].*__rand
),
synth_table as (
select
field1,
field2,
field3
-- only the fields we actually want to keep in the final table
-- (intermediate fields, including *__rand, are dropped)
from joinN
)
Note: with SQLite, the behavior of `random()` within CTEs and joins is non-deterministic, due to how the query optimizer works - see this link for details. Therefore, on SQLite only, temporary tables are created (`CREATE TEMP TABLE ...`) instead of most of the CTEs mentioned above. Only the final `synth_table` CTE is created, so the `with` syntax shown below still works. Temporary tables are deleted when the `dbt run` completes.
Consider the example model `orders.sql` below:
with
{{ synth_column_primary_key(name='order_id') }}
{{ synth_column_foreign_key(name='product_id', model_name='products', column='product_id') }}
{{ synth_column_distribution(name='status',
distribution=synth_distribution(class='discrete', type='probabilities',
probabilities={"New":0.2, "Shipped":0.5, "Returned":0.2, "Lost":0.1}
)
) }}
{{ synth_column_integer(name='quantity', min=1, max=10) }}
{{ synth_table(rows = 5000) }}
select * from synth_table
The model begins by defining the columns we want in the table, including:
- `order_id` is the primary key on the table - it will contain a unique hash value per row
- `product_id` is a foreign key to the `products` table - values in this column will be uniformly-distributed, valid primary keys of the `products` table
- each order has a `status` with several possible values, whose prevalence/likelihoods are given by a discrete probability distribution
- `quantity` is the count of how many of the product were ordered, a uniformly-distributed integer from 1-10

Then a CTE called `synth_table` with 5000 rows of synthetic data is created and we select the results.
Note that the user must provide the opening `with` for CTEs and a final `select * from synth_table` - this allows flexibility to add your own CTEs at the top or bottom of the model, as well as arbitrary post-processing of columns produced by `dbt_synth_data` - see Advanced Usage for more details.
This package provides the following distributions:
uniform
Generates uniformly-distributed real numbers.
synth_distribution_continuous_uniform(min=0.6, max=7.9)
Default `min` is `0.0`. Default `max` is `1.0`. `min` and `max` are inclusive.
Above: Histogram of a continuous uniform distribution (1M values).
normal
Generates normally-distributed (Gaussian) real numbers.
synth_distribution_continuous_normal(mean=5, stddev=0.5)
Default `mean` is `0.0`, default `stddev` is `1.0`.
Above: Histogram of a continuous normal distribution (1M values).
exponential
Generates exponentially-distributed real numbers.
synth_distribution_continuous_exponential(lambda=5.0)
Default `lambda` is `1.0`.
Above: Histogram of a continuous exponential distribution (1M values).
bernoulli
Generates integers (`0` and `1`) according to a Bernoulli distribution.
synth_distribution_discrete_bernoulli(p=0.3)
Default `p` is `0.5`.
binomial
Generates integers according to a Binomial distribution.
synth_distribution_discrete_binomial(n=100, p=0.3)
Default `n` is `10`, default `p` is `0.5`.
Note that the implementation is approximate, based on a normal distribution (see here). For small `n`, or `p` near `0` or `1`, normally-distributed values may be `< 0` or `> n`, which is impossible in a binomial distribution. These long-tail values are rare, so, while not completely correct, we use
- `abs()` to shift those `< 0`
- `mod(..., n+1)` to shift those `> n`

This may artificially increase small values. However, the approximation is close if `n*p` and `n*(1-p)` are large.
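For intuition, here is a Snowflake-flavored sketch of the approximation for `n=100, p=0.3` (illustrative only - not necessarily the exact SQL this package generates): a normal sample with mean `n*p` and standard deviation `sqrt(n*p*(1-p))` is rounded, then folded back into `[0, n]`:

select mod(
    abs( round( normal(100 * 0.3, sqrt(100 * 0.3 * (1 - 0.3)), random()) ) ),  -- ~N(np, sqrt(np(1-p)))
    100 + 1  -- mod by n+1 folds values > n back into range
) as binomial_approx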
weights
Generates discrete values according to a user-defined set of relative weights.
synth_distribution_discrete_weights(values=[1,3,5,7,9], weights=[1,1,6,3,1])
`values` is a required list of strings, floats, or integers; it has no default.
`weights` is an optional list of integers. Its length should be the same as the length of `values`. If `weights` is omitted, each of the `values` will be equally likely. Otherwise, the integers indicate relative likelihood; in the example above, the value `5` will be about six times as prevalent as the value `9`.
Avoid using `weights` with a large sum; this will generate long `case` statements which can run slowly.
probabilities
Generates discrete values according to a user-defined probability set.
synth_distribution_discrete_probabilities(probabilities={"1":0.15, "5":0.5, "8": 0.35})
`probabilities` is required and has no default. It may be
- a list (array) such as `[0.05, 0.8, 0.15]`, in which case the (zero-based) indices are the integer values generated
- a dictionary (key-value) structure such as `{ "1":0.05, "3":0.8, "7":0.15 }` with integer keys (specified as strings in order to be valid JSON), in which case the keys are the integers generated

You may actually specify string or float keys in your `probabilities` dict to generate those values instead of integers; however, you must specify the additional parameter `keys_type="varchar"` (or similar) so that the value types are correct. For example:
synth_distribution_discrete_probabilities(probabilities={"cat":0.3, "dog":0.5, "parrot":0.2}, keys_type="varchar")

`probabilities` must sum to `1.0`.
Note that, because values are generated using `case` statements, supplying `probabilities` with many digits of specificity will run slower. For example, `probabilities=[0.1, 0.3, 0.6]` will generate something like
case floor( 10*random() )
when 0 then 0
when 1 then 1
when 2 then 1
when 3 then 1
when 4 then 2
when 5 then 2
when 6 then 2
when 7 then 2
when 8 then 2
when 9 then 2
end
while `probabilities=[0.101, 0.301, 0.598]` will generate something like
case floor( 1000*random() )
when 0 then 0
when ...
when 99 then 0
when 100 then 0
when 101 then 1
when ...
when 400 then 1
when 401 then 1
when 402 then 2
when 403 then 2
...
when 998 then 2
when 999 then 2
end
which takes longer for the database engine to evaluate.
You should really avoid specifying `probabilities` with more than 4 digits of precision.
Any of the continuous distributions listed above can be made discrete using the following mechanisms:
discretize_floor
Converts values from continuous distributions to (discrete) integers by applying the `floor()` function.
synth_distribution_discretize_floor(
distribution=synth_distribution(class='...', type='...', ...),
)
discretize_ceil
Converts values from continuous distributions to (discrete) integers by applying the `ceil()` function.
synth_distribution_discretize_ceil(
distribution=synth_distribution(class='...', type='...', ...),
)
discretize_round
Converts values from continuous distributions to discrete values by applying the `round()` function.
synth_distribution_discretize_round(
distribution=synth_distribution(class='...', type='...', ...),
precision=0
)
`precision` indicates the number of digits to round to.
discretize_width_bucket
Note that SQLite doesn't support `width_bucket()`; you will get an error if you try to use this function on SQLite.
Converts values from continuous distributions to discrete values by bucketing them. Buckets are specified by `from` and `to` bounds and either `count` (the number of buckets) or `size` (the target bucket size).
For some distributions (like `uniform`), the bounds may be strict - values outside the bounds are impossible. For other distributions (like `exponential`), specifying strict `from` and `to` bounds may be difficult. For this reason, if `strict_bounds=False`, the first bucket (index `0`) will represent values below `from`. Likewise, the last bucket (index `count`) will represent values above `to`. (`strict_bounds` defaults to `True`.) It is up to you to choose reasonable and useful `from` and `to` bounds for discretization.
`labels` may be
- unspecified, in which case values will be mapped to the (1-based) bucket index (0-based if `strict_bounds=False`)
- the string "lower_bound", in which case values will be mapped to the lower bound of the bucket (or `-Infinity` for the first bucket, if `strict_bounds=False`)
- the string "upper_bound", in which case values will be mapped to the upper bound of the bucket (or `+Infinity` for the last bucket, if `strict_bounds=False`)
- the string "bucket_range", in which case values will be mapped to a string of the format "[lower_bound] - [upper_bound]" for each bucket (`lower_bound` may be `-Infinity` and `upper_bound` may be `Infinity` if `strict_bounds=False`); you may optionally specify the `bucket_range_separator` string that separates the lower and upper bucket bounds (default is " - ")
- the string "bucket_average", in which case values will be mapped to the bucket middle or average (or `from` for the first bucket and `to` for the last bucket, if `strict_bounds=False`)
- a list of (string or numeric) bucket labels (the list must be equal in length to the number of buckets)
For all but the last option, you may optionally specify a `label_precision`, which is the number of digits bounds get rounded to. (Default is `4`.)
Examples:
synth_distribution_discretize_width_bucket(
distribution=synth_distribution(class='...', type='...', ...),
from=0.0, to=1.5, count=20, labels='lower_bound'
)
synth_distribution_discretize_width_bucket(
distribution=synth_distribution(class='...', type='...', ...),
from=0.0, to=1.5, size=0.1
)
synth_distribution_discretize_width_bucket(
distribution=synth_distribution(class='...', type='...', ...),
from=0.0, to=1.5, count=5, strict_bounds=False,
labels=['< 0.0', '0.0 to 0.5', '0.5 to 1.0', '1.0 to 1.5', '> 1.5']
)
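As one more sketch (the distribution parameters here are illustrative), `bucket_range` labels can be combined with a custom separator and label precision:

synth_distribution_discretize_width_bucket(
    distribution=synth_distribution(class='continuous', type='normal', mean=0.75, stddev=0.25),
    from=0.0, to=1.5, count=15, labels='bucket_range',
    bucket_range_separator=' to ', label_precision=2
)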
This package provides the following mechanisms for composing several distributions:
union
Generates values from several distributions, with optional `weights`. If `weights` is omitted, each distribution is equally likely.
{{ synth_distribution_union(
synth_distribution(class='...', type='...', ...),
synth_distribution(class='...', type='...', ...),
weights=[1, 2, ...]
) }}
Up to 10 distributions may be unioned. (Compose the macro to union more.)
For example, make a bimodal distribution as follows:
{{ synth_table(
rows = 100000,
columns = [
synth_column_distribution(name="continuous_bimodal",
distribution=synth_distribution_union(
synth_distribution(class='continuous', type='normal', mean=5.0, stddev=1.0),
synth_distribution(class='continuous', type='normal', mean=8.0, stddev=1.0),
weights=[1, 2]
)
),
]
) }}
{{ config(post_hook=synth_get_post_hooks())}}
Here, values will come from the union of the two normal distributions, with the second distribution twice as likely as the first.
Above: Histogram of a continuous bimodal distribution composed of the union of two normal distributions (1M values).
Above: Histogram of the union of three continuous normal distributions (1M values).
average
Generates values from the (optionally weighted) average of values from several distributions. If `weights` is omitted, each distribution contributes equally to the average.
{{ synth_distribution_average(
synth_distribution(class='...', type='...', ...),
synth_distribution(class='...', type='...', ...),
weights=[1, 2, ...]
) }}
Up to 10 distributions may be averaged. (Compose the macro to average more.)
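For example, a sketch of averaging a normal and an exponential distribution, as in the histogram below (the parameter values are illustrative):

{{ synth_column_distribution(name="continuous_average",
    distribution=synth_distribution_average(
        synth_distribution(class='continuous', type='normal', mean=5.0, stddev=1.0),
        synth_distribution(class='continuous', type='exponential', lambda=0.5),
        weights=[2, 1]
    )
) }}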
Above: Histogram of a continuous average distribution composed of a normal and an exponential distribution (1M values).
`dbt` doesn't allow macro calls in project `vars`, but `dbt_synth_data` gets around this limitation and allows you to configure distributions in your `vars` and then parse and use them in your models. Consider the following example:
...
vars:
teacher_student_ratio:
synth_distribution_union():
d0:
synth_distribution_continuous_normal():
mean: 15
stddev: 5
d1:
synth_distribution_continuous_normal():
mean: 20
stddev: 5
weights: [1, 2]
You can use this distribution via `synth_var()` in `models/schools.sql` as follows:
with
{{ synth_column_primary_key(name='school_id') }}
{{ synth_column_integer(name="current_enrollment", min=100, max=2000) }}
{{ synth_column_distribution(name='teacher_student_ratio',
distribution=synth_var('teacher_student_ratio')
) }}
{{ synth_column_integer(name='year_founded', min=1937, max=2022) }}
{{ synth_table(rows = 500) }}
select * from synth_table
Besides using `synth_distribution_union()` and `synth_distribution_average()`, you can also combine and compose distributions using `synth_expression()` like so:
...
vars:
teacher_student_ratio:
synth_expression:
expression: greatest( 5, 10 + $0 + ln($1) )
p0:
synth_distribution_continuous_normal():
mean: 5
stddev: 1.5
p1:
synth_distribution_continuous_normal():
mean: 10
stddev: 2
This package provides the following column types:
Basic column types, which are quite performant.
boolean
Generates boolean values.
{{ synth_column_boolean(name="is_complete", pct_true=0.2) }}
integer
Generates integer values.
For uniformly-distributed values, simply specify `min` and `max`:
{{ synth_column_integer(name="event_year", min=2000, max=2020) }}
For non-uniformly-distributed values, specify a discretized distribution:
{{ synth_column_distribution(name="event_year",
distribution=synth_distribution_discretize_floor(
distribution=synth_distribution_continuous_normal(mean=2010, stddev=2.5,)
)
) }}
numeric
Generates numeric values.
{{ synth_column_numeric(name="price", min=1.99, max=999.99, precision=2) }}
For non-uniformly-distributed values, specify a distribution rounded to the desired `precision`:
{{ synth_column_distribution(name="price",
distribution=synth_distribution_discretize_round(
distribution=synth_distribution_continuous_normal(mean=500, stddev=180,),
precision=2
)
) }}
string
Generates random strings.
{{ synth_column_string(name="password", min_length=10, max_length=20) }}
String characters will include `A-Z`, `a-z`, and `0-9`.
date
Generates date values.
{{ synth_column_date(name="birth_date", min='1938-01-01', max='1994-12-31') }}
integer sequence
Generates an integer sequence (value is incremented at each row).
{{ synth_column_integer_sequence(name="day_of_year", step=1, start=1) }}
date sequence
Generates a date sequence.
{{ synth_column_date_sequence(name="calendar_date", start_date='2020-08-10', step=3)}}
primary key
Generates a primary key column. (Values are distinct hash strings.)
{{ synth_column_primary_key(name="product_id") }}
value
Generates the same (single, static) value for every row.
{{ synth_column_value(name="is_registered", value='Yes') }}
values
Generates values from a list of possible values, with optional probability weighting.
{{ synth_column_values(name="academic_subject",
values=['Mathematics', 'Science', 'English Language Arts', 'Social Studies'],
probabilities=[0.2, 0.3, 0.15, 0.35]
) }}
If `probabilities` is omitted, every value is equally likely. (This uses `synth_distribution_discrete_probabilities()` under the hood.)
expression
Generates values based on an expression (which may refer to other columns, or invoke SQL functions).
{{ synth_column_expression(name='week_of_calendar_year',
expression="DATE_PART('week', calendar_date)::int"
) }}
mapping
Generates values by mapping from an `expression` involving existing columns to values in a dictionary.
{{ synth_column_mapping(name='day_type', expression='is_school_day',
mapping=({ true:'Instructional day', false:'Non-instructional day' })
) }}
Statistical column types can be used to make advanced statistical relationships between tables and columns.
correlation
Generates two or more columns with correlated values.
{% set birthyear_grade_correlations = ({
"columns": {
"birth_year": [ 2010, 2009, 2008, 2007, 2006, 2005, 2004 ],
"grade": [ 'Eighth grade', 'Ninth grade', 'Tenth grade', 'Eleventh grade', 'Twelfth grade' ]
},
"probabilities": [
[ 0.02, 0.00, 0.00, 0.00, 0.00 ],
[ 0.15, 0.02, 0.00, 0.00, 0.00 ],
[ 0.03, 0.15, 0.02, 0.00, 0.00 ],
[ 0.00, 0.03, 0.15, 0.02, 0.00 ],
[ 0.00, 0.00, 0.03, 0.15, 0.02 ],
[ 0.00, 0.00, 0.00, 0.03, 0.15 ],
[ 0.00, 0.00, 0.00, 0.00, 0.03 ]
]
})
%}
with
{{ synth_column_primary_key(name='k_student') }}
{{ synth_column_correlation(data=birthyear_grade_correlations, column='birth_year') }}
{{ synth_column_correlation(data=birthyear_grade_correlations, column='grade') }}
{{ synth_table(rows=var('num_students')) }}
select * from synth_table
To create correlated columns, you must specify a `data` object representing the correlation, which contains:
- `columns`, a dictionary mapping column names to their possible values
- `probabilities`, a hypercube with dimension equal to the number of `columns`, the elements of which sum to `1.0`, indicating the probability of each possible combination of values for the `columns`. The outermost elements of the `probabilities` hypercube correspond to the values of the first column; the innermost elements correspond to the values of the last column. Each dimension of the hypercube must have the same size as the number of values for its corresponding column.
Constructing a `probabilities` hypercube of dimension more than two or three can be difficult; we recommend adding (temporary) comments and using indentation to keep track of columns, values, and dimensions.
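For instance, a minimal sketch of a three-column `data` object (the column names, values, and probabilities are invented for illustration), indented so each dimension is visible:

{% set example_correlations = ({
    "columns": {
        "is_enrolled": [ true, false ],
        "grade": [ 'Ninth grade', 'Tenth grade' ],
        "has_iep": [ true, false ]
    },
    "probabilities": [
        [ [ 0.30, 0.10 ],
          [ 0.35, 0.05 ] ],
        [ [ 0.05, 0.05 ],
          [ 0.05, 0.05 ] ]
    ]
}) %}

Here the outer dimension (two elements) corresponds to `is_enrolled`, the middle dimension to `grade`, and the inner dimension to `has_iep`; all eight probabilities sum to `1.0`.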
Column types which reference values in other columns of the same or different table.
foreign key
Generates values that are a primary key of another table.
{{ synth_column_foreign_key(name='product_id', model_name='products', column='id') }}
lookup
Generates values by looking up values from one column in another table.
{{ synth_column_lookup(name='gender', model_name='synth_firstnames', value_cols='first_name', from_col='name', to_col='gender', do_ref=True) }}
`do_ref` defaults to true, meaning that `model_name` will be wrapped in dbt's `{{ ref(model_name) }}`. However, you can set `do_ref=False` to reference a local CTE instead.
select
Generates values by selecting them from another table, optionally weighted using a specified column of the other table.
{{ synth_column_select(
name='random_adjective',
model_name="synth_words",
value_cols="word",
distribution="weighted",
weight_col="prevalence",
filter="part_of_speech like '%ADJ%'",
do_ref=True
) }}
The above will generate randomly-chosen adjectives (based on the specified `filter`), weighted by prevalence.
`do_ref` defaults to true, meaning that `model_name` will be wrapped in dbt's `{{ ref(model_name) }}`. However, you can set `do_ref=False` to reference a local CTE instead.
Advanced column types use real-world data which is maintained in the `seeds/` directory. Some effort has been made to make these data sets:
- Generalized, rather than specific to a particular country, region, language, etc. For example, the words dictionary contains common words from many common languages, not just English.
- Statistically rich, with associated metadata which makes the data more useful by capturing various distributions embedded in the data. For example, the countries list includes the (approximate) population and land area of each country, which facilitates generating country lists weighted according to these features. Likewise, the cities list has the latitude and longitude coordinates for each city, which facilitates generating fairly realistic coordinates for synthetic addresses.
Advanced column types may all specify `distribution="weighted"` and a `weight_col` (such as `weight_col="population"`) to skew value distributions. They may also specify `filter`, which is a SQL `where` expression narrowing down the pool of data values that will be used. Finally, they may specify a `filter_expressions` dictionary which allows dynamic filtering based on expressions that can involve row values from other columns. If, for example, we are creating a country column and pass `filter_expressions` as
{
"country_name": "INITCAP(my_country_col)",
"geo_region_code": "my_geo_region_col"
}
then a `WHERE` clause like this will result:
synth_countries.country_name=INITCAP(my_country_col)
AND synth_countries.geo_region_code=my_geo_region_col
(`filter_expressions` and `filter` - if any - are combined via logical `AND`.)
city
Generates a city, selected from the `synth_cities` seed table.
{{ synth_column_city(name='city', distribution="weighted", weight_col="population", filter="timezone like 'Europe/%'") }}
geo region
Generates a geo region (state, province, or territory), selected from the `synth_geo_regions` seed table.
{{ synth_column_geo_region(name='geo_region', distribution="weighted", weight_col="population", filter="country='United States'") }}
country
Generates a country, selected from the `synth_countries` seed table.
{{ synth_column_country(name='country', distribution="weighted", weight_col="population", filter="continent='Europe'") }}
first name
Generates a first name, selected from the `synth_firstnames` seed table.
{{ synth_column_firstname(name='first_name', filter="gender='Male'") }}
last name
Generates a last name, selected from the `synth_lastnames` seed table.
{{ synth_column_lastname(name='last_name') }}
word
Generates a single word, selected from the `synth_words` seed table.
{{ synth_column_word(name='random_word', language_code="en", distribution="weighted", pos=["NOUN", "VERB"], filter="LENGTH(word)>3") }}
The above generates a randomly-selected English noun or verb, weighted according to frequency, of at least four characters.
Rather than `language_code`, you may specify `language` (such as `language="English"`), but a language must be specified with one of these parameters. See Words (Datasets) for a list of supported languages and parts of speech.
words
Generates several words, selected from the `synth_words` seed table.
{{ synth_column_words(name='random_phrase', language_code="en", distribution="uniform", n=5) }}
The above generates a random string of five words, uniformly distributed, with the first letter of each word capitalized.
Alternatively, you can generate words using format strings, for example
{{ synth_column_words(name='course_title', language_code="en", distribution="uniform", format_strings=[
"{ADV} learning for {ADJ} {NOUN}s",
"{ADV} {VERB} {NOUN} course"
]) }}
This will generate sets of words according to one of the format strings you specify.
Note that this data type is constructed by separately generating a single word `n` times (or, for `format_strings`, the set union of all word instances from any format string) and then concatenating them together, which can be slow if `n` is large (or you have many tokens in your `format_strings`).
Rather than `language_code`, you may specify `language` (such as `language="English"`), but a language must be specified with one of these parameters. See Words (Datasets) for a list of supported languages and parts of speech.
language
Generates a spoken language (name or 2- or 3-letter code), selected from the `synth_languages` seed table.
{{ synth_column_language(name='random_lang', type="name", distribution="weighted") }}
The optional `type` (which defaults to `name`) can take the values `name` (the full English name of the language, e.g. Spanish), `code2` (the two-letter ISO 639-1 code for the language, e.g. `es`), or `code3` (the three-letter ISO 639-3 code for the language, e.g. `spa`).
Composite column types put together several other column types into a more complex data type.
address
Generates an address, based on `city`, `geo region`, `country`, `words`, and other values.
Creating a column `myaddress` using this macro will also create intermediate columns `myaddress__street_address`, `myaddress__city`, `myaddress__geo_region`, and `myaddress__postal_code` (or whatever `parts` you specify). You can then add `update_hook`s that reference these intermediate columns if you'd like. For example:
{{ synth_column_primary_key(name='k_person') }}
{{ synth_column_firstname(name='first_name') }}
{{ synth_column_lastname(name='last_name') }}
{{ synth_column_address(name='home_address', countries=['United States'],
parts=['street_address', 'city', 'geo_region', 'country', 'postal_code']) }}
{{ synth_column_expression(name='home_address_street', expression="home_address__street_address") }}
{{ synth_column_expression(name='home_address_city', expression="home_address__city") }}
{{ synth_column_expression(name='home_address_geo_region', expression="home_address__geo_region") }}
{{ synth_column_expression(name='home_address_country', expression="home_address__country") }}
{{ synth_column_expression(name='home_address_postal_code', expression="home_address__postal_code") }}
{{ synth_table(rows = 100) }}
{{ synth_add_cleanup_hook("alter table {{this}} drop column home_address") or "" }}
Alternatively, you may use something like
{{ synth_column_primary_key(name='k_person') }}
{{ synth_column_firstname(name='first_name') }}
{{ synth_column_lastname(name='last_name') }}
{{ synth_column_address(name='home_address_street', countries=['United States'], parts=['street_address']) }}
{{ synth_column_address(name='home_address_city', countries=['United States'], parts=['city']) }}
{{ synth_column_address(name='home_address_geo_region', countries=['United States'], parts=['geo_region']) }}
{{ synth_column_address(name='home_address_country', countries=['United States'], parts=['country']) }}
{{ synth_column_address(name='home_address_postal_code', countries=['United States'], parts=['postal_code']) }}
{{ synth_table(rows = 100) }}
phone_number
Generates a phone number in the format `(123) 456-7890`.
{{ synth_column_phone_number(name="phone_number") }}
Occasionally you may want to build up a more complex column's values from several simpler ones. This is easily done with an expression column, for example
{{ synth_column_primary_key(name="k_person") }}
{{ synth_column_firstname(name='first_name') }}
{{ synth_column_lastname(name='last_name') }}
{{ synth_column_expression(name='full_name', expression="first_name || ' ' || last_name") }}
{{ synth_remove(collection="final_fields", key="first_name") }}
{{ synth_remove(collection="final_fields", key="last_name") }}
{{ synth_table(rows = 100) }}
Note that you may want to "clean up" by suppressing some of your intermediate columns, as shown with the `synth_remove()` calls in the example above.
You may also want to modify another table only after this one is built. This is also possible using cleanup hooks.
For example, suppose you want to create `products` and `orders`, but you want some `products` to be exponentially more popular (receive more `orders`) than others. This is possible by:
- creating a `products` model with an extra popularity column:

  {{ synth_column_primary_key(name="k_product") }}
  {{ synth_column_string(name="name", min_length=10, max_length=20) }}
  {{ synth_column_distribution(name="popularity",
      distribution=synth_distribution(class='continuous', type='exponential', lambda=0.05)
  ) }}
  {{ synth_table(rows=50) }}
- creating an `orders` model with a `synth_column_select()` against `products` using your popularity column, then using a cleanup hook to drop the `popularity` column:

  {{ synth_column_primary_key(name="k_order") }}
  {{ synth_column_select(name="k_product", lookup_table="products", value_col="k_product", distribution="weighted", weight_col="popularity") }}
  {{ synth_column_distribution(name="status",
      distribution=synth_distribution(class='discrete', type='probabilities',
          probabilities={"New":0.2, "Shipped":0.5, "Returned":0.2, "Lost":0.1}
      )
  ) }}
  {{ synth_column_integer(name="num_ordered", min=1, max=10) }}
  {{ synth_add_cleanup_hook( 'alter table {{target.database}}.{{target.schema}}.products drop column popularity' ) }}
  {{ synth_table(rows=5000) }}

Note that the cleanup hook must go after any column definitions that rely on the `popularity` column, and before the `synth_table()` call.
With Snowflake only (not other backends), you can specify a random seed. This package uses the dbt var `{{ var("synth_randseed") }}` (which defaults to `10000`) and increments it each time `random()` is called. Snowflake asserts that even with a fixed seed, "there is no guarantee that RANDOM will generate the same set of values each time"; however, in our testing it generally does. This means that (1) repeated `dbt run`s with the same seed will likely generate the same or similar data, and (2) if you want new/different data, you should consider changing the `synth_randseed` var.
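For example, you might override the seed for a single run using dbt's standard `--vars` mechanism (the seed value `20000` here is arbitrary; any integer works):

dbt run --vars '{"synth_randseed": 20000}'

or set it persistently in your `dbt_project.yml`:

vars:
  synth_randseed: 20000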
`dbt` allows configuration to be defined in the `vars` section of your `dbt_project.yml`, but dynamic values are not supported (they must be numbers, strings, lists, or dictionaries - not macro invocations). However, it can be very useful to make various distributions in your synthetic data configurable. This is possible by defining them in the `vars` section using a specific format and then referencing them with the `synth_var()` macro provided by this package.
For example, in your `dbt_project.yml`:
...
vars:
my_complicated_custom_distribution:
synth_distribution_discretize_ceil():
distribution:
synth_expression():
# this ensures that the value is >= 1
expression: greatest(1, 1 + $0)
p0:
# average of an exponential and normal distribution
# result is a skewed distribution, peaking around 1000
synth_distribution_average():
d0:
synth_distribution_continuous_exponential():
lambda: 0.0002
d1:
synth_distribution_continuous_normal():
mean: 1100
stddev: 400
weights: [1,2]
and then in your model:
with
...
{{ synth_column_distribution(name="my_column",
distribution=synth_var('my_complicated_custom_distribution')
) }}
{{ synth_table(rows=1000) }}
When defining `vars` this way:
- reference a macro by name, with `()` at the end
- you may only reference macros for available distributions and discretizations
- macro parameters must be passed by name
- macro invocations may be nested arbitrarily deep
- values may be combined using `synth_expression()` with parameters `expression` and `p0` up to `p9`, which `expression` references as `$0` up to `$9`
The word list in `seeds/synth_words.csv` contains 70k words: the top 5k most common words from each of the following 14 languages:
- Bulgarian (`bg`)
- Czech (`cs`)
- Danish (`da`)
- Dutch (`nl`)
- English (`en`)
- Finnish (`fi`)
- French (`fr`)
- German (`de`)
- Hungarian (`hu`)
- Indonesian (`id`)
- Italian (`it`)
- Portuguese (`pt`)
- Slovenian (`sl`)
- Spanish (`es`)
Each word has an associated frequency, a value between 0 and 1 representing how often the word appears in common usage of the language, and a part of speech, which is one of:
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
Some words may functionally belong to multiple parts of speech; this dataset uses only the single most common.
The dataset is constructed based on word lists and frequencies from `wordfreq` and part-of-speech tagging from `polyglot`. Language availability is based on the set intersection of the languages supported by these two libraries.
You may run into an error when loading this data using `dbt seed` on SQLite - an issue has been raised with the `dbt-sqlite` adapter to solve this; in the meantime, you'd have to manually edit the seed batch size (make it smaller) to load `synth_words` in SQLite.
The language list in `seeds/synth_languages.csv` contains 222 commonly-spoken (living) languages, with, for each, the ISO 639-1 and ISO 639-3 language codes, the approximate number of speakers, and a list of countries in which the language is predominantly spoken. Country names are consistent with those in the countries dataset at `seeds/synth_countries.csv`.
The dataset is assembled primarily from Wikipedia, including this list of official languages by country, and the specific pages for each individual language.
Here we provide approximate benchmarks for synthetic data generation, using the models found in `example_models/*.sql`, for the various supported backends.
Model | Columns | Rows | Snowflake runtime, size | Postgres runtime, size | SQLite runtime, size | DuckDB runtime, size |
---|---|---|---|---|---|---|
distributions | 13-15 | 10k | 1.95s, 804KB | 0.77s, 1.7MB | 0.29s, 1.13MB | 0.20s, 1.76MB |
distributions | 13-15 | 1M | 7.15s, 73MB | 8.93s, 166MB | 8.70s, 115MB | 16.0s, 189MB |
distributions | 13-15 | 100M | 66.19s, 7.2GB | 14.76min, 16GB | 16.6min, 11.2GB | - |
distributions | 13-15 | 10B | 95.5min, 765GB | - | - | - |
columns | 28 | 10k | 20.2s, 2.2MB | 6.5min, 4.6MB | 37.26s, 3.92MB | 0.82s, 2.25MB |
columns | 28 | 100k | 69.0s, 21.2MB | 64.9min, 46MB | 9.3min, 39.1MB | 12.44s, 18.2MB |
columns | 28 | 1M | 10.2min, 109MB | - | 77.3min, 392MB | - |
columns | 28 | 10M | 27.3min, 654MB | - | - | - |
customers | 8 | 100 | 7.07s, 36.5KB | 1.34s, 32KB | 0.67s, 10KB | 0.20s, 1.0MB |
products | 3 | 50 | 4.01s, 16.0KB | 1.09s, 16KB | 0.43s, 4KB | 0.11s, 256KB |
stores | 5 | 2 | 4.96s, 4.0KB | 0.68s, 16KB | 0.45s, 4KB | 0.11s, 256KB |
orders | 4 | 1000 | 5.26s, 59.5KB | 0.66s, 120KB | 0.26s, 24KB | 0.14s, 256KB |
inventory | 4 | 100 | 2.76s, 21.5KB | 0.58s, 24KB | 0.20s, 4KB | 0.13s, 256KB |
customers | 8 | 10k | 4.89s, 960KB | 58.11s, 1.7MB | 8.09s, 1.16MB | 0.43s, 2.0MB |
products | 3 | 5k | 2.57s, 275KB | 41.33s, 544KB | 3.63s, 248KB | 0.25s, 1.0MB |
stores | 5 | 200 | 2.25s, 32KB | 1.84s, 40KB | 0.79s, 20KB | 0.18s, 1.3MB |
orders | 4 | 100k | 3.63s, 5.3MB | 36.2min, 10MB | 19.52s, 2.2MB | 0.76s, 2.3MB |
inventory | 4 | 1M | 18.75s, 60.3MB | 35.9min, 134MB | 3.6min, 18.7MB | 19.3s, 25.9MB |
customers | 8 | 1M | 58.75s, 57.6MB | 1.55hr, 163MB | 11.0min, 118MB | 67.09s, 68.5MB |
products | 3 | 50k | 11.51s, 2.4MB | 6.76min, 4.9MB | 33.54s, 2.49MB | 0.56s, 2.75MB |
stores | 5 | 20k | 3.54s, 1.3MB | 1.86min, 2.5MB | 12.82s, 1.56MB | 0.28s, 2.0MB |
orders | 4 | 50M | 2.24hr, 1.0GB | - | - | - |
inventory | 4 | 100M | 6.3hr, 2.5GB | - | - | - |
Missing values in the table above denote either failed runs (DuckDB kills a process that uses too much memory) or runs that took too long (much more than a couple of hours).
Snowflake runtimes use a single X-Small warehouse. Postgres runtimes use an AWS RDS small instance. SQLite and DuckDB runtimes use a Lenovo laptop with an Intel i5 2.6GHz processor, 16GB RAM, and a 500GB SSD.
Some takeaways from the above data include
- generating large data (> 50 GB) is really only possible using Snowflake
- generating small data (< 1GB) is usually fastest using DuckDB or SQLite
- model complexity (number of columns, and especially joins/references to other tables) significantly influences runtime
Coming soon!
Bugfixes and new features (such as additional transformation operations) are gratefully accepted via pull requests here on GitHub.
- Cover image created with DALL·E mini
See License.
- fix address so it selects a city, then uses the country (and geo_region) for that city, rather than a (different) random country (and geo_region)
- implement other distributions... Poisson, Gamma, Power law/Pareto, Multinomial?
- flesh out more seeds (and corresponding data columns) and composite columns (email address, IP address, user agent strings, file_name, URL, etc.)