-
Notifications
You must be signed in to change notification settings - Fork 15
Home
Anton Parkhomenko edited this page Nov 12, 2018
·
14 revisions
This is a project to load Snowplow enriched events into the Google BigQuery.
This application consists of two independent apps:
-
Snowplow BigQuery Loader, an Apache Beam job which reads Snowplow enriched data from Google PubSub, transform into BigQuery-friendly format and loads it. Also it writes data into auxiliary
types
PubSub topic -
Snowplow BigQuery Mutator, a Scala app which reads
types
PubSub topic and performs necessary table mutations to add new columns -
Snoplow BigQuery Forwarder, an auxiliary Apache Beam job used to recover data from
failedInserts
topic
This Apache Beam job is written in Scala and makes use of the Snowplow Scala Analytics SDK. It is intended to run on Google Dataflow
The Loader:
- Reads Snowplow enriched events from PubSub
- Uses the JSON transformer from the Snowplow Scala Analytics SDK to convert those enriched events into JSONs
- Uses Iglu Client to fetch JSON Schemas for contexts and self-describing events
- Uses Iglu Schema DDL to transform contexts and self-describing events into BigQuery format
- Writes transformed data into BigQuery
- Writes all encountered iglu types into
types
PubSub topic - Writes all data failed to be processed into
badRows
PubSub topic - Write data that succeeded to be transformed, but failed to be loaded into
failedInserts
topic
This is a Scala app which reads data from types
PubSub topic and performs table mutations.
The algorithm is as follows:
- Read messages from
types
topic - Find out if message contains a type that has not been encountered yet (by checking internal cache)
- If message contains new type - double-check it with connected BigQuery table
- If type is not in the table - fetch its JSON Schema from Iglu Registry
- Transform JSON Schema into BigQuery column definition
- Add column to connected BigQuery table
Snowplow BigQuery Loader uses Google PubSub topics and subscriptions to store intermeidate data and communicate between applications.
-
enriched
subscription - data enriched by Beam Enrich, in canonical TSV+JSON format -
types
topic - mirror