The data cleansing tasks covered are:
- remove the first row from the staging table
event_raw
(where rowid <> 1
) as it holds the csv header row. - remove leading and trailing spaces (
trim(column_name)
). - make text columns lowercase.
- remove the
UTC
suffix from timestamps (substr(event_time, 1, 19)
). - convert to integer/float the columns expected to be so, and set them as
null
if convertion fails (case when cast(price as float) = price then price else null end
).
The incremantal loading is implemented with the following assumptions on the source csv files:
- they do not overlap (i.e. given an event, it can be found in one-and-only-one of the provided files).
- they are ingested in chronological order from the oldest to the newest (e.g. 2020-Feb.csv is ingested only after 2020-Jan.csv was ingested).
Within each of the .sql script, the incremental loading is achieved by:
- re-creating the table
event_raw
- within the 00-create-tables.sql - before ingesting any file
drop table if exists event_raw;
create table event_raw (
[...]
while creating all the other tables only once, that is if they don't exist yet
create table if not exists [...]
- fetching, in the
import CTEs
code-block, the latest timestamp for which data is available in the target table (i.e. the one the script is aiming to load (in the example belowevent_clean
).
target_table as (
select
datetime(coalesce(max(event_time), '1900-01-01')) as latest_date
from event_clean
),
- filtering out, in the
final CTE
, the rows with timestamp older than the latest timestamp, in order to prevent loading the same data twice.
final as (
select * from clean_data_types
where event_time > (
select latest_date from target_table
)
)
- snake_case is adopted for database objects and sql commands
- CTEs are structured in the following blocks:
import CTEs
: where all source tables are collected in a separate CTE, and preferably using aselect *
logical CTEs
: where theimport CTEs
are subject to transformations, including aggregationsfinal CTE
: where thelogical CTEs
are joined together to produce the final data set to be loadedsimple insert from select statement
: where the target table is loaded usinginsert into ... select ... from final
statement
- There are no comments in the code, other than those referring to the CTE blocks as above, since the code is written to be self-explanatory ("One of the more common motivations for writing comments is bad code" - R.C.Martins, 'Clean Code').