Skip to content
Anton Parkhomenko edited this page Nov 12, 2018 · 14 revisions

Overview

This is a project to load Snowplow enriched events into the Google BigQuery.

Technical architecture

This application consists of two independent apps:

  1. Snowplow BigQuery Loader, an Apache Beam job which reads Snowplow enriched data from Google PubSub, transform into BigQuery-friendly format and loads it. Also it writes data into auxiliary types PubSub topic
  2. Snowplow BigQuery Mutator, a Scala app which reads types PubSub topic and performs necessary table mutations to add new columns
  3. Snoplow BigQuery Forwarder, an auxiliary Apache Beam job used to recover data from failedInserts topic

Snowplow BigQuery Loader

Overview

This Apache Beam job is written in Scala and makes use of the Snowplow Scala Analytics SDK. It is intended to run on Google Dataflow

Algorithm

The Loader:

  • Reads Snowplow enriched events from PubSub
  • Uses the JSON transformer from the Snowplow Scala Analytics SDK to convert those enriched events into JSONs
  • Uses Iglu Client to fetch JSON Schemas for contexts and self-describing events
  • Uses Iglu Schema DDL to transform contexts and self-describing events into BigQuery format
  • Writes transformed data into BigQuery
  • Writes all encountered iglu types into types PubSub topic
  • Writes all data failed to be processed into badRows PubSub topic
  • Write data that succeeded to be transformed, but failed to be loaded into failedInserts topic

Snowplow BigQuery Mutator

Overview

This is a Scala app which reads data from types PubSub topic and performs table mutations.

Algorithm

The algorithm is as follows:

  • Read messages from types topic
  • Find out if message contains a type that has not been encountered yet (by checking internal cache)
  • If message contains new type - double-check it with connected BigQuery table
  • If type is not in the table - fetch its JSON Schema from Iglu Registry
  • Transform JSON Schema into BigQuery column definition
  • Add column to connected BigQuery table

Topics and message formats

Snowplow BigQuery Loader uses Google PubSub topics and subscriptions to store intermeidate data and communicate between applications.

  • enriched subscription - data enriched by Beam Enrich, in canonical TSV+JSON format
  • types topic - mirror
Clone this wiki locally