Snowflake dosen't provide any out of box solutions to verify or check data lineage, and this project establish Data Lineage between Snowflake Database Objects. This project uses Query History Metadata extended by Snowflake in SNOWFLAKE.ACCOUNT_USAGE_QUERY_HISTORY
table to build Data Lineage Diagram.
For example:
insert into table1 select * from table2
From the above query, it is easy infer that
table1
is build using the data available intable2
.
Extrapolation of same idea, we can build entire data lineage.
- Read DML Queries from Query History of Snowflake Metadata Views
- Parse Query (dependend on thirdparty library sqlparse)
- Identify Source and Target tables for Each Query
- Remove Duplicates Targets (Out of available source sets for a target select most recent source set)
- Consolidate Source tables for each target table (Remove Deleted Table, View and Stage references, Remove Duplicates, Trim, Build Fully Qualified Object Names)
- Build JSON
- Build dotGraph visual
- Save HTML file and Check output
# python -m venv env # (env is the environment name)
# \env\Scripts\activate
# pip install snowflake-connector-pyhton, sqlparse
# python app.py
#
# (after the script ends, check outout html file.
On visualization part this project is inspired by the ideas in the post written by @Cristian Scutaru How to Display Object Dependencies in Snowflake