Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support GEOGRAPHY types #7

Open
miike opened this issue Oct 3, 2018 · 2 comments
Open

Support GEOGRAPHY types #7

miike opened this issue Oct 3, 2018 · 2 comments

Comments

@miike
Copy link

miike commented Oct 3, 2018

Given that BigQuery now has some geography support does it make sense to convert the latitude / longitude columns to a single GEOGRAPHY column of type ST_GEOGPOINT?

@chuwy
Copy link
Contributor

chuwy commented Oct 3, 2018

I guess BigQuery added this type very recently (it is also in beta status), because I noticed it just last week. I definitely believe we do need to support it.

Just converting geo_latitude/geo_longitude fields in enriched event into new GEOGRAPHY column would not be a generic enough solution (although this is what we do for Elasticsearch) - we want to support this type across many other schemas (com.snowplowanalytics.snowplow/geolocation_context, com.clearbit.enrichment/person, com.mandrill/message_opened and many other schemas from Iglu Central contain geo information). Also, this sounds like a common enough task - many DBs can support different kinds of custom types.

I see it as some kind of "Loader enrichment" which transforms complex JSON types into DB's native, i.e. we configure that in iglu:com.acme/event/jsonschema/1-*-*:#property/lat is latitude and iglu:com.acme/event/jsonschema/1-*-*:#property/long is longitude and Loader (BigQuery or other) transforms it into native X:Y. But main questions is where should we encode this? I don't want to encode this in JSON Schema because:

  • This will involve changes into JSON metaschema. How many other similar use cases we can have to justify this change?
  • This won't be coherent with JSON Schema main goal, which is validation, not complex structuring
  • It will be hard to do it once and for everything, e.g. other DBs can require altitude - should we make it required?

On the other hard we can add it to configuration of loader, but we're mixing concerns here: analysts should not ask DevOps to change configuration every time they've added new schema with geo data. Also, we don't want to restart the job just to reload this minor piece of configuration.

Last option is to try to make Loader smart enough to figure out these columns by itself:

  1. Schema contains both "lat"/"lon" or "latitude"/"longitude" properties
  2. Both have type number
  3. Both have appropriate constraints (minimum/maximum)

So far I prefer this option, its main disadvantage is that it is slightly implicit and magical. What if 3rd party schema encodes geo data with some completely unexpected properties, such as lt/ln? What's the chance that we'll turn into GEOGRAPHY something that is not geo data really?

@chuwy chuwy changed the title Should we support GEOGRAPHY types? Support GEOGRAPHY types Oct 3, 2018
@miike
Copy link
Author

miike commented Oct 3, 2018

Agreed - it feels awkward putting logic that is essentially transformations into a JSON schema but you've raised the excellent point that haven't this hard coded in to the codebase makes this tricky to maintain from an analyst point of view.

The magic option sounds desirable but it still feels like it needs some sort of high level configuration so whatever type inferences we are making don't incorrectly cast objects (whether that's GEOGRAPHY or another type in the future).

Does having a system that allows UDFs make sense in terms of striking a balance between flexibility and configuration? e.g., something like the Javascript enrichment for transformation - though it wouldn't necessarily need to be Javascript.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants