
lea is a minimalist alternative to SQL orchestrators like dbt and SQLMesh.
lea aims to be simple and provides sane defaults. We happily use it every day at Carbonfact to manage our BigQuery data warehouse. We will actively maintain it and add features, while welcoming contributions.
- Jaffle shop 🥪
- Incremental
- School 🏫
- Compare development to production 👯‍♀️
- Using MotherDuck 🦆
lea can be installed with pip:

```
pip install lea-cli
```

This installs the `lea` command. It also makes the `lea` Python library available.
lea is configured via environment variables.

For DuckDB:

```
LEA_WAREHOUSE=duckdb
LEA_DUCKDB_PATH=duckdb.db
```

For BigQuery:

```
# Required
LEA_WAREHOUSE=bigquery
# Required
LEA_BQ_LOCATION=EU
# Required
LEA_BQ_DATASET_NAME=kaya
# Required, the project where the dataset is located
LEA_BQ_PROJECT_ID=carbonfact-dwh
# Optional, allows using a different project for compute
LEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute
# Not necessary if you're logged in with the gcloud CLI
LEA_BQ_SERVICE_ACCOUNT=<JSON dump of the service account file>  # not a path ⚠️
# Defaults to https://www.googleapis.com/auth/bigquery
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
# LOGICAL or PHYSICAL, defaults to PHYSICAL
LEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL
```
These parameters can be provided in an `.env` file, or directly in the shell. Each command also has an `--env` flag to provide a path to an `.env` file.
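For example, assuming you keep separate settings in a file named `.env.staging` (a hypothetical name), you could run:

```
lea run --env .env.staging
```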
`lea run` is the main command. It runs the SQL queries stored in the `scripts` directory:

```
lea run
```
You can indicate the directory where the scripts are stored:

```
lea run --scripts /path/to/scripts
```
The scripts are run concurrently. They are organized in a DAG, which is traversed in topological order. The DAG's structure is determined automatically by analyzing the dependencies between queries.
Each query is expected to be placed under a schema, represented by a directory. Schemas can have sub-schemas. Here's an example:
```
scripts/
    schema_1/
        table_1.sql
        table_2.sql
    schema_2/
        table_3.sql
        table_4.sql
        sub_schema_2_1/
            table_5.sql
            table_6.sql
```
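As a sketch of how dependencies are inferred (the query body here is made up for illustration), a script that references another table automatically depends on the script that produces it:

```sql
-- scripts/schema_2/table_3.sql
-- Referencing schema_1.table_1 makes this script depend on it,
-- so lea runs table_1.sql before table_3.sql
SELECT
    id,
    COUNT(*) AS n_rows
FROM schema_1.table_1
GROUP BY id
```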
Each script is materialized into a table. The table is named according to the script's name, following the warehouse convention: for instance, `core/users.sql` becomes `core.users` in DuckDB, while in BigQuery the schema is folded into the table name, as in `core__users`.
SQL queries can be templated with Jinja. A `.sql.jinja` extension is necessary for lea to recognise them. You have access to an `env` variable within the template context, which is simply an access point to `os.environ`.
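As a minimal sketch, assuming an environment variable named `COUNTRY_CODE` is set (both the variable and the table names here are made up), a templated script could look like this:

```sql
-- scripts/core/users.sql.jinja
SELECT *
FROM staging.users
-- env proxies os.environ, so this injects the value of COUNTRY_CODE
WHERE country_code = '{{ env.COUNTRY_CODE }}'
```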
By default, lea creates an isolation layer with production. The way this is done depends on your warehouse:

- BigQuery: by appending a `_<user>` suffix to schema names
- DuckDB: by adding a `_<user>` suffix to the database file name

In other words, a development environment is used by default. Use the `--production` flag when executing `lea run` to disable this behaviour and target the production environment instead:
```
lea run --production
```
The `<user>` is determined automatically from the login name. It can be overridden by setting the `LEA_USERNAME` environment variable. For example, with the DuckDB settings above and `LEA_USERNAME=max`, development runs write to a user-suffixed file such as `duckdb_max.db`, while production runs write to `duckdb.db`.
A single script can be run:

```
lea run --select core.users
```

Several scripts can be run:

```
lea run --select core.users --select core.orders
```
Similar to dbt, lea also supports graph operators:

```
lea run --select core.users+   # users and everything that depends on it
lea run --select +core.users   # users and everything it depends on
lea run --select +core.users+  # users with both its ancestors and its descendants
```
You can select all scripts in a schema:

```
lea run --select core/  # the trailing slash matters
```

This also works with sub-schemas:

```
lea run --select analytics.finance/
```
There are thus 8 possible operators:

```
schema.table    (table by itself)
schema.table+   (table with its descendants)
+schema.table   (table with its ancestors)
+schema.table+  (table with its ancestors and descendants)
schema/         (all tables in schema)
schema/+        (all tables in schema with their descendants)
+schema/        (all tables in schema with their ancestors)
+schema/+       (all tables in schema with their ancestors and descendants)
```
Combinations are possible:

```
lea run --select core.users+ --select +core.orders
```
There's an Easter egg that allows selecting scripts that have been committed or modified in the current Git branch:

```
lea run --select git
lea run --select git+  # includes all descendants
```
This becomes very handy when using lea in continuous integration.
WAP (Write-Audit-Publish) is a data engineering pattern that ensures data consistency and reliability. It's the data engineering equivalent of blue-green deployment in the software engineering world.
lea follows the WAP pattern by default. When you execute `lea run`, it actually creates temporary tables with an `___audit` suffix. Once every script has been materialized without errors, these audit tables are promoted to replace the existing tables.
This is a good default behavior. Say you refresh table `foo`, and then refresh table `bar`, which depends on `foo`. If the refresh of `bar` fails, you're left with a corrupt state. This is what the WAP pattern solves. In WAP mode, running `foo`'s script creates a `foo___audit` table. If `bar`'s script then fails, the run stops and `foo` is not modified.
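Conceptually, the write and publish steps amount to the following (an illustration, not lea's literal implementation; the query body is made up):

```sql
-- Write: the script's result lands in an audit table
CREATE OR REPLACE TABLE core.foo___audit AS
SELECT order_id, amount
FROM staging.orders;

-- Audit: tests run against core.foo___audit

-- Publish: happens only once every script in the run has succeeded
CREATE OR REPLACE TABLE core.foo AS
SELECT * FROM core.foo___audit;
```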
There is no `lea test` command. Tests are run together with the regular scripts when `lea run` is executed. The run stops whenever a test fails.
There are two types of tests:

- Singular tests: queries which return failing rows. They are stored in a `tests` directory; a sketch is shown after this list.
- Assertion tests: comment annotations in the queries themselves:
  - `#NO_NULLS`: checks that all values in a column are not null.
  - `#UNIQUE`: checks that a column's values are unique.
  - `#UNIQUE_BY(<by>)`: checks that a column's values are unique within a group.
  - `#SET{<elements>}`: checks that a column's values are in a set of values.
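As a minimal sketch of a singular test, here's what a file such as `tests/check_n_users.sql` could contain (the file name is borrowed from the command example further down; the query body is made up), where every returned row counts as a failure:

```sql
-- tests/check_n_users.sql
-- The test fails if core.users is empty
SELECT n_users
FROM (SELECT COUNT(*) AS n_users FROM core.users) AS t
WHERE n_users = 0
```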
Here's an example of a query annotated with assertion tests:
```sql
SELECT
    -- #UNIQUE
    -- #NO_NULLS
    user_id,
    -- #NO_NULLS
    address,
    -- #UNIQUE_BY(address)
    full_name,
    -- #SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
```
You can run a single test via the `--select` flag:

```
lea run --select tests.check_n_users
```
Or run all the tests at once:

```
lea run --select tests/  # the trailing slash matters
```
Note that when you run a script that is not a test, all the applicable tests are run as well. For instance, the following command runs the `core.users` script and all the tests that apply to it:

```
lea run --select core.users
```
You may decide to run all scripts without executing tests, which is obviously not advisable:

```
lea run --unselect tests/
lea run --select core.users --unselect tests/
```
When you call `lea run`, it generates audit tables, which are then promoted to replace the original tables. This is done to ensure that the data is consistent and reliable. lea doesn't run a script when its audit table already exists and the script hasn't been modified since that audit table was created. This avoids unnecessary re-runs of scripts that haven't changed.
For instance:

1. You execute `lea run` to sync all tables from sources; there are no errors, and all tables are materialized.
2. You modify a script named `core/expenses.sql`, which depends on `staging/customers.sql` and `staging/orders.sql`.
3. You execute `lea run --select core.expenses+` to run all impacted tables again. `core__expenses___audit` is materialized in your data warehouse, but the `-- #NO_NULLS` assertion test on one of its columns fails.
4. After reviewing the data in `core__expenses___audit`, you edit `core/expenses.sql` to filter out the rows where NULLs appear.
5. You execute `lea run`.
6. The `staging/customers.sql` and `staging/orders.sql` scripts are skipped, because they were last modified before `staging__customers` and `staging__orders` were last materialized.
7. The `core/expenses.sql` script is run, because it was modified after `core__expenses` was last materialized.
8. All audit tables are wiped from the database, as the whole DAG has run successfully 🎉
You can disable this behavior altogether:

```
lea run --restart
```
At Carbonfact, we cluster most of our tables by customer. This is done to optimize query performance and reduce costs. lea allows you to automatically cluster tables that contain a given field:

```
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug
```

You can also specify multiple fields, meaning that tables which contain both fields will be clustered:

```
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug
```
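Clustering maps onto BigQuery's `CLUSTER BY` clause. As a rough illustration (the table name and query body are made up), materializing a script whose result contains both configured fields is akin to:

```sql
CREATE OR REPLACE TABLE kaya.core__orders
CLUSTER BY account_slug, brand_slug
AS
SELECT account_slug, brand_slug, amount
FROM staging.orders;
```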
Big Blue is a SaaS product to monitor and optimize BigQuery costs. As part of their offering, they provide a Pick API. The idea is that some queries should be run on-demand, while others should be run on a reservation. Big Blue's Pick API suggests which billing model to use for each query.
We use this at Carbonfact, and so this API is available out of the box in lea. You can enable it by setting the following environment variables:
```
LEA_BQ_BIG_BLUE_PICK_API_KEY=<get it from https://your-company.biq.blue/settings.html>
LEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue
LEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id
LEA_BQ_BIG_BLUE_PICK_API_REVERVATION_PROJECT_ID=reservation-compute-project-id
```
Feel free to reach out to [email protected] if you want to know more and/or contribute 😊
We have suggested some issues as good places to get started.
lea is free and open-source software licensed under the Apache License, Version 2.0.