Very humbled to have you all here to check out what we've been working on these past few months. Hopefully y'all will think it's as cool and useful as we do!
This is the FIRST official training, so apologies in advance for any rough edges. There are already some success stories: four people on the team now know dbt (three of whom are here today to help out and learn more).
- Download & Install Software
- Environment Set Up (30 min)
- learn dbt!
- Anders's spiel
Ahead of the call if you could please download and install the following (if you haven’t already):
The local environment set up is the biggest hurdle to using dbt right now; IMHO, it isn't that big. Fishtown Analytics also has a web-based IDE tool that you'll see being used in the training course. The advantage there is that you can write dbt models and run them in the same window.
Here's a series of videos where I show the environment set up details from the moment you finish downloading and installing the software:
- Env Set Up Part 1: Set up VSCode (*Note: I created my environment with `python==3.7.9`*)
- Env Set Up Part 2: Connect to the database with dbt in VSCode using Azure CLI
- Env Set Up Part 3: Azure Data Studio & more intro
There may be an issue where your terminal can't find Git or Anaconda, in which case we may need to add them to your system environment's PATH variable.
Here we'll be using a version of the same repo that Fishtown uses for their training.
- Clone this repo in VSCode
  - Open VSCode and type `CTRL + Shift + P` to access the command palette located at the top.
  - Type `git: clone` and select the option.
  - Paste the git URL `https://github.com/dbt-msft/jaffle_shop_mssql` and hit enter.
  - Choose which folder you want your code to go in and click "Select Repository Location".
  - A pop-up message should appear on the lower right of VSCode asking if you would like to open the cloned repository. Click "Open".
- After cloning, you should be prompted to install some extensions -- you should! The extensions are:
  - python extension
  - better jinja
  - vscode-dbt
  - rainbow csv
- Now we need to set up `profiles.yml` to connect to our database. In our repo, this is called `profiles_example.yml`. (A rough sketch of what the finished file looks like follows this step.)
  - Create a new directory called `.dbt` under your user folder (`C:\Users\your_user_folder`).
  - Open `profiles_example.yml`.
  - Save As the file into the `.dbt` directory you just made and rename it to `profiles.yml`.
  - Edit the file in VSCode and change the `schema` field on line 5 to your initials or anything that serves as an identifier for your work.
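For orientation, here is a rough sketch of what the finished `profiles.yml` might look like for this project. The profile name, keys, and values below are illustrative assumptions -- copy the real ones from `profiles_example.yml` rather than from this snippet:

```yaml
# Illustrative sketch only -- use the keys/values from profiles_example.yml.
# The schema value is the field you change to your own identifier.
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: sqlserver
      driver: 'ODBC Driver 17 for SQL Server'
      server: dbtavatrain.database.windows.net
      port: 1433
      database: sandbox
      schema: your_initials_here
      authentication: CLI
```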
- Create your conda environment and install the necessary packages here once so that you can reuse this environment for future dbt work.
  - Open a new terminal by clicking on "Terminal" at the top and "New Terminal".
  - Click "Allow" when a pop-up appears asking if you allow the workspace to modify your terminal shell.
  - If you see "PS" next to your current directory on the command line, that means you're using PowerShell and we don't want to use that. The fix: open a new terminal again and you should now be using cmd.
  - Since you already installed Anaconda, you should see `conda activate base` run automatically and `(base)` should be next to your current directory.
  - Create a new, empty environment by running `conda create -n dbtenv python=3.7.9`. The name of your environment is `dbtenv` and we are using Python version 3.7.9.
  - Activate the new environment by running `conda activate dbtenv` so the packages below install into it rather than into base.
  - Install packages
    - Run `pip install dbt-sqlserver`
    - Run `pip install azure-cli`
- Log into Azure. We want to connect to the database via Azure CLI.
  - Run `az login`
  - In the window that pops up, sign in with your Avanade credentials. Once signed in, you can close out of that window. If the window says that an error occurred and is directing you to a localhost URL, you can try logging in using `az login --use-device-code`. This will give you instructions to go to a general URL and paste in the code provided on the command line.
  - If you belong to multiple subscriptions, you must specify the subscription by running `az account set --subscription ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e`. This is the ID for AzureCloud.
- Let's verify that we can connect to the database successfully.
  - Run `dbt debug`. This command tries to connect to the database using the parameters from `profiles.yml` and `dbt_project.yml`.
  - You connected successfully if you see all green and no error messages!
- Last step in the set-up is to log in to the database.
  - Open Azure Data Studio and click on the first icon on the left side panel called "Connections".
  - Click "Add Connection" and paste in the connection info below in the relevant fields. This info can also be found in `profiles.yml`.
    - Server: dbtavatrain.database.windows.net
    - Authentication type: Azure Active Directory
    - Account: Click the drop down menu, then "Add an account" and sign in like before
    - Azure AD tenant: Avanade
    - Database: Manually type "sandbox"
  - Click Connect
- You are now connected to the database!
- Please note that there may be many tables in the database already. These are other people's tables that they created using their schema name. Yours will start to show up as you start creating your dbt models.
- Now you are ready to start the exercises in the next section below!
- If you are not familiar with Azure Data Studio, here are some pro tips that might be useful when going through the exercises:
- The server you are connected to and the Tables and Views folders are located on the left side of the page
- To quickly query a table or view: Go into the relevant folder, right click the name, and click "Select Top 1000".
- To write a new query, right click the server, and click "New Query"
- If you're wondering why you're not seeing your tables/views show up after running dbt commands in VSCode, you can right click on the Tables or Views folder and click "Refresh".
- You can also do this to the server itself as well, but this might take longer.
Try the walk-through tutorial of using dbt below. There are additional resources at the end of this section as well. Happy modeling!
- Create an Azure SQL database. Go through the steps listed here; we recommend using the Portal to create it.
- Note: we will be creating our own data source with pre-populated tables so select None for the "Use existing data" section (Step 14 for Portal instructions).
- Caveats:
- If you choose the General Purpose option, you can choose between provisioned and serverless compute tiers. Expect this connectivity limitation when choosing serverless.
- If you pick less than the standard tier database, you will have to edit dbt_project.yml and change the as_columnstore field on line 23 to "false" like below. For more information on tier comparisons click here.
```yaml
models:
  jaffle_shop:
    +as_columnstore: false
```
- Authenticate to the database with Azure Active Directory
- In Portal, navigate to the SQL server you created your database in.
- On the left side panel, click on "Active Directory admin".
- Click "Set admin", then select your email that you created your Azure credentials with.
Before we jump in and start creating dbt models, we need to locate our source tables. If you are making your own database, then you need to create an external data source and materialize the raw tables from the external data source. These are named `raw.customers`, `raw.orders`, and `raw.payments`. Follow the instructions below to connect to our blob storage called `JaffleShopBlob` and pull the data from there. Otherwise, if you already see the source tables under Tables in Azure Data Studio, skip this and go to the next section.
- Open Azure Data Studio and create a "New Query" by right clicking on the server name and choosing the option.
- Paste the Azure SQL code below into the new query and click "Run" at the top. To view the Synapse version of the code click here.
```sql
CREATE EXTERNAL DATA SOURCE JaffleShopBlob
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://dbtmsftblob.blob.core.windows.net'
);
go
SELECT *
INTO raw.customers
FROM OPENROWSET(
BULK 'jaffle-shop/raw_customers.csv',
DATA_SOURCE = 'JaffleShopBlob',
FORMAT ='CSV',
FORMATFILE = 'jaffle-shop/raw_customer.xml',
FORMATFILE_DATA_SOURCE = 'JaffleShopBlob'
) as temp
go
SELECT *
INTO raw.orders
FROM OPENROWSET(
BULK 'jaffle-shop/raw_orders.csv',
DATA_SOURCE = 'JaffleShopBlob',
FORMAT ='CSV',
FORMATFILE = 'jaffle-shop/raw_orders.xml',
FORMATFILE_DATA_SOURCE = 'JaffleShopBlob'
) as temp
go
SELECT *
INTO raw.payments
FROM OPENROWSET(
BULK 'jaffle-shop/raw_payments.csv',
DATA_SOURCE = 'JaffleShopBlob',
FORMAT ='CSV',
FORMATFILE = 'jaffle-shop/raw_payments.xml',
FORMATFILE_DATA_SOURCE = 'JaffleShopBlob'
) as temp
go
```
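Note: the `SELECT ... INTO raw.*` statements above assume a schema named `raw` already exists in your database. If you created a brand-new database and it doesn't, you may need to create the schema first before running the block above -- something like:

```sql
-- Only needed if the raw schema does not exist yet in your own database
CREATE SCHEMA raw;
go
```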
Let's make some dbt models! The following steps will show how `dim_customers`, an end table, gets created and transformed from raw data sources. From the previous step, we currently have the following tables in our database: `raw.customers`, `raw.orders`, and `raw.payments` (try viewing these tables in Azure Data Studio).
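Before running the steps, it helps to know what a dbt model actually is: a single `SELECT` statement saved as a `.sql` file, where upstream tables are referenced through Jinja functions such as `source()` and `ref()`. As a rough, illustrative sketch only (the column names here are made up -- the real files live under `models\staging` in the repo), a staging model looks something like this:

```sql
-- Illustrative sketch of a staging model (not the repo's exact code):
-- it selects from a source declared in the project's source YML
-- (see models\source\src_schema.yml) and renames/cleans columns
-- for downstream models to use.
select
    id as customer_id,
    first_name,
    last_name
from {{ source('shopify_raw', 'customers') }}
```

dbt compiles the `{{ source(...) }}` call into the actual raw table name when you run the model.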
- We need to create staging tables from the tables listed above. The SQL files that create these tables are located in `models\staging`. Take a quick look at each of the files to understand the transformations. In the command line, individually run:
  - `dbt run --model stg_customers`
  - `dbt run --model stg_orders`
  - `dbt run --model stg_payments`
- Take a look at each of the tables you just created in Azure Data Studio. These can be found under Views as `your_schema_name.stg_customers` and so on.
- Optional: take a look at the compiled SQL code for each of these tables in `target\compiled\jaffle_shop\models\staging`
- Let's create more tables! Now we will build another layer of intermediate tables on top of the ones we just created in step 1. These tables can be found in `models\marts\core\intermediate`. Again, individually run:
  - `dbt run --model customer_orders`
  - `dbt run --model customer_payments`
  - `dbt run --model order_payments`
- Take a look at each of the tables in Azure Data Studio under Tables.
- Optional: take a look at the compiled SQL code for each of these tables in `target\compiled\jaffle_shop\models\marts\core\intermediate`
- All the tables we created up to now will be used to create `dim_customers`. The SQL code for this table can be found in `models\marts\core` as `dim_customers.sql`.
  - Run `dbt run --model dim_customers`
  - Take a look at this table in Azure Data Studio under Tables.
  - Note: If you're wondering what dictates whether a dbt model gets turned into a table or a view, check out `dbt_project.yml`. You can see on lines 21-27 that we can choose which SQL files get materialized as a table or a view at the model level. In our case, we set all the files in the `staging` folder to be views, and the rest tables.
- Congrats! You finished creating the end table, `dim_customers`, along with its parent tables. It is important to note that there is another end table in this database called `fct_orders`, along with its parent tables.
  - Run `dbt run`. Without the model parameter used in the previous examples, this materializes all the SQL files under the `models` folder in SQL Server. Now `dim_customers` and `fct_orders` are materialized, along with their parent tables. (For a feel of how an end table is stitched together from its parents, see the sketch below.)
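To see how `dim_customers` is actually assembled, open `models\marts\core\dim_customers.sql`. Simplified heavily (the snippet below illustrates the pattern, not the repo's exact code or column names), an end table joins its upstream models through `ref()`:

```sql
-- Simplified illustration of an end-table model (not the repo's exact code):
-- ref() points at other dbt models, and is also how dbt works out the
-- build order and the lineage graph you'll see in the docs site later.
with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('customer_orders') }}
)
select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    orders.number_of_orders
from customers
left join orders
    on customers.customer_id = orders.customer_id
```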
We can document all table and column descriptions. Under the `models` folder, there is a YML file inside each folder with the documentation information. The files are as follows:
- Source tables: `models\source\src_schema.yml`
- Staging tables: `models\staging\stg_schema.yml`
- Intermediate and end tables: `models\marts\core\schema.yml`
When looking at each of them, you will notice that there are fields for table description, column description, and column data type. These can be filled out as needed. To visualize the documentation better:
- Run `dbt docs generate`
- Run `dbt docs serve`
- Those two commands will take you to a website where you can easily visualize the documentation of each table.
- From the menu on the left, click on "Sources", then "shopify_raw", then "customers" to see the documentation for the raw customers table.
- Click on some other tables to view documentation for those.
- Click on the teal circle icon on the bottom right to view the DAG that dbt generates for us. Click on the full screen icon. At the bottom, delete the entry in the `--select` field and press enter to view the full DAG for the entire database you built. Here you can see the path of each table and understand how the tables relate to each other.
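For reference, a documentation entry in one of these YML files looks roughly like the sketch below (the model name, column name, and description text here are illustrative -- see `models\marts\core\schema.yml` for the real entries):

```yaml
version: 2

models:
  - name: dim_customers
    description: Illustrative example -- one record per customer.
    columns:
      - name: customer_id
        description: Illustrative example -- the primary key for the customer.
```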
After creating tables, we can run tests on them to see if each column is behaving as expected.
- Open `models\marts\core\schema.yml`
- Here you can see that there is only one test for `dim_customers` and many tests for `fct_orders`. You can look at a list of all tests that can be applied here.
  - Some tests to note from the file are:
    - `unique`: this test passes when there are no duplicates in the column
    - `not_null`: this test passes when there are no null values in the column
    - `relationships`: this test passes when all records in a child table have a corresponding record in a parent table
- Run `dbt test` to see which tests pass or fail. A failure tells you where in your data you need to start debugging. (The syntax for declaring these tests is sketched below.)
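Tests live in the same YML files as the documentation, declared directly under a column. The snippet below is a rough illustration of the syntax only (the column names are assumptions, not the exact contents of `schema.yml`):

```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          # passes only when every customer_id here exists in dim_customers
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```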
- Claire's classic text + video walkthrough - check this out if you want to learn how to set up and deploy a dbt project using BigQuery (flip through the modules under "Getting Started" on the left)
- Kyle's amazing course - this course shows you how to use dbt cloud and goes over some fundamental concepts around dbt
- thorough docs worth going through - these are great overviews about modeling, testing, documentation, sources, and other additional topics we did not cover (flip through the topics under "Building a dbt Project" on the left)
- the dbt viewpoint
- dbt-utils
- production workflow