This is a project to load Snowplow enriched events into Google BigQuery.
It consists of three applications:
- Snowplow BigQuery Loader, an Apache Beam job which reads Snowplow enriched data from Google PubSub, transforms it into a BigQuery-friendly format and loads it. It also writes data into the auxiliary `types` PubSub topic
- Snowplow BigQuery Mutator, a Scala app which reads the `types` PubSub topic and performs the necessary table mutations to add new columns
- Snowplow BigQuery Forwarder, an auxiliary Apache Beam job used to recover data from the `failedInserts` topic (e.g. due to mutation lag)
Snowplow BigQuery Loader is an Apache Beam job intended to run on Google Dataflow. It loads enriched data from the `enriched` PubSub topic into Google BigQuery.
The Loader:
- Reads Snowplow enriched events from the `enriched` PubSub subscription
- Uses the JSON transformer from the Snowplow Scala Analytics SDK to convert those enriched events into JSONs
- Uses the Iglu Client to fetch JSON Schemas for contexts and self-describing events
- Uses Iglu Schema DDL to transform contexts and self-describing events into BigQuery format (a simplified sketch of the column naming follows this list)
- Writes transformed data into BigQuery
- Writes all encountered Iglu types into the `types` PubSub topic
- Writes all data that failed to be processed into the `badRows` PubSub topic
- Writes data that was successfully transformed, but failed to be loaded, into the `failedInserts` topic
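As a rough illustration of the transformation step, the snippet below shows how a self-describing schema key could map onto a BigQuery column name. This is a hypothetical helper, not the actual Iglu Schema DDL code; the real naming and type-mapping rules live in that library.

```scala
// Hypothetical illustration of BigQuery column naming, not the actual Schema DDL code.
object ColumnName {
  // columnFor("unstruct_event", "iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-0")
  // => "unstruct_event_com_snowplowanalytics_snowplow_link_click_1_0_0"
  def columnFor(prefix: String, igluUri: String): String = {
    val Array(vendor, name, _, version) = igluUri.stripPrefix("iglu:").split("/")
    (prefix :: List(vendor, name, version).map(_.toLowerCase.replaceAll("[.-]", "_")))
      .mkString("_")
  }
}
```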
Snowplow BigQuery Mutator is a Scala app which reads data from the `types` PubSub topic and performs table mutations.
The algorithm is as follows:
- Reads messages from the `types` topic
- Finds out if a message contains a type that has not been encountered yet (by checking an internal cache)
- If the message contains a new type, double-checks it against the connected BigQuery table
- If the type is not in the table, fetches its JSON Schema from the Iglu Registry
- Transforms the JSON Schema into a BigQuery column definition
- Adds the column to the connected BigQuery table (see the sketch after this list)
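The final step amounts to patching the table schema with an extra field. The sketch below is a minimal, hypothetical example of doing that with the google-cloud-bigquery Java client from Scala; the real Mutator derives the `Field` definition from the JSON Schema via Iglu Schema DDL, and the dataset, table and field names here are placeholders.

```scala
import com.google.cloud.bigquery.{BigQueryOptions, Field, LegacySQLTypeName, Schema, StandardTableDefinition, TableId}

object AddColumnSketch {
  // Appends a single column to an existing table's schema (placeholder names).
  def addColumn(dataset: String, table: String, newField: Field): Unit = {
    val bigQuery = BigQueryOptions.getDefaultInstance.getService
    val existing = bigQuery.getTable(TableId.of(dataset, table))
    val schema   = existing.getDefinition[StandardTableDefinition].getSchema

    // Copy the current fields and add the new one at the end
    val fields = new java.util.ArrayList[Field](schema.getFields)
    fields.add(newField)

    existing.toBuilder
      .setDefinition(StandardTableDefinition.of(Schema.of(fields)))
      .build()
      .update()
  }
}

// e.g. AddColumnSketch.addColumn("atomic", "events",
//   Field.of("unstruct_event_com_acme_my_event_1_0_0", LegacySQLTypeName.STRING))
```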
Snowplow BigQuery Forwarder is an auxiliary Apache Beam job intended to run on Google Dataflow. It loads data from the `failedInserts` topic into BigQuery.
BigQuery Loader inserts data into BigQuery in near real time. At the same time, it sinks `shredded_type` payloads into the `types` topic approximately every 5 seconds. It can also take up to 10-15 seconds for Mutator to fetch and parse a message and to execute an alter table statement against the table. If a new type arrives from the `enriched` subscription within this window and Mutator has not yet handled it, BigQuery rejects the rows containing it and they are sent to the `failedInserts` topic. This topic contains JSON objects ready to be loaded into BigQuery (i.e. not in the canonical Snowplow enriched event format).

In order to load this data from `failedInserts` into BigQuery again, you can use the Forwarder job. It reads the `failedInserts` subscription and simply performs insert statements.

Note: this is a streaming job and hence does not finish when the `failedInserts` subscription is drained. It should be shut down manually after it has processed enough data.
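A minimal sketch of what such a forwarding job can look like, written against the Beam Java SDK from Scala, is shown below. The subscription and table names are placeholders, and this is not the project's actual Forwarder code.

```scala
import com.google.api.client.json.jackson2.JacksonFactory
import com.google.api.services.bigquery.model.TableRow
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.{CreateDisposition, WriteDisposition}
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{MapElements, SimpleFunction}

object ForwarderSketch {
  def main(args: Array[String]): Unit = {
    val pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    pipeline
      // Read the JSON objects that previously failed to be inserted
      .apply("ReadFailedInserts",
        PubsubIO.readStrings()
          .fromSubscription("projects/my-project/subscriptions/failed-inserts"))
      // Parse each JSON string into a BigQuery TableRow
      .apply("ToTableRow", MapElements.via(new SimpleFunction[String, TableRow] {
        override def apply(json: String): TableRow =
          JacksonFactory.getDefaultInstance.fromString(json, classOf[TableRow])
      }))
      // Insert the rows into the (already mutated) events table
      .apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
          .to("my-project:atomic.events")
          .withCreateDisposition(CreateDisposition.CREATE_NEVER)
          .withWriteDisposition(WriteDisposition.WRITE_APPEND))

    pipeline.run()
  }
}
```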
Snowplow BigQuery Loader uses Google PubSub topics and subscriptions to store intermediate data and to communicate between applications.
- `enriched` subscription - data enriched by Beam Enrich, in the canonical TSV+JSON format
- `types` topic - all shredded types encountered by Snowplow BigQuery Loader are sunk here as `iglu:com.snowplowanalytics.snowplow/shredded_type/jsonschema/1-0-0` self-describing payloads, with a ~5 second interval
- `typesSubscription` subscription - a subscription to the `types` topic used by Mutator, with `iglu:com.snowplowanalytics.snowplow/shredded_type/jsonschema/1-0-0` self-describing payloads
- `badRows` topic - data that could not be processed by Loader due to Iglu Registry unavailability, in the bad rows format
- `failedInserts` topic - data that has been successfully transformed by Loader, but failed loading into BigQuery (usually due to mutation lag), in BigQuery JSON format
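These topics and subscriptions have to exist before the apps are started. Below is a minimal, hypothetical sketch of creating them with the Google Cloud PubSub admin clients from Scala; the project id and resource names are placeholders, and the same resources can equally be created through the console or the gcloud CLI.

```scala
import com.google.cloud.pubsub.v1.{SubscriptionAdminClient, TopicAdminClient}
import com.google.pubsub.v1.PushConfig

object CreatePubSubResources {
  def main(args: Array[String]): Unit = {
    val project = "my-project" // placeholder project id

    val topics = TopicAdminClient.create()
    val subs   = SubscriptionAdminClient.create()
    try {
      // Topics written to by the Loader
      List("types", "badRows", "failedInserts")
        .foreach(t => topics.createTopic(s"projects/$project/topics/$t"))

      // Pull subscription on the types topic, consumed by Mutator
      subs.createSubscription(
        s"projects/$project/subscriptions/typesSubscription",
        s"projects/$project/topics/types",
        PushConfig.getDefaultInstance, // empty push config => pull subscription
        10                             // ack deadline in seconds
      )
    } finally {
      subs.close()
      topics.close()
    }
  }
}
```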