
[EPIC] [YAML] Well-structured lake & analytics #447

Closed
14 tasks done
trentmc opened this issue Dec 13, 2023 · 2 comments
trentmc commented Dec 13, 2023

Background / motivation

This is an epic to mature our pipeline for data going into the data `lake/`, and for consuming it in `analytics/`. By the end, the lake, ETL, and analytics should only fetch/update what's needed.

Most calculations and aggregations should be done at the ETL level, yielding tables that have been tested and verified. Core analytic tables (parquet) will be created as a result of data_factory + ETL doing their work.

Analytics and other modules should then consume from the local lake tables (like a database). Analytics, reports, and streamlit should mostly just consume/report from the work done by the ETL.

The proposed steps are:

  1. Lake Preparation
  2. ETL + Bronze Data Workflow
  3. Cleanup Table Interface


1. Lake Preparation

  • Rename/move files & dirs for proper separation among lake, AI models, analytics #446
  • predictoor_stats.py - remove `get_endpoint_statistics` #477
  • Move logic from subgraph_slot.py #483
  • Integrate pdr_subscriptions into GQL Data Factory #468
  • Update gql_data_factory tests/logic to handle more complex cases: multiple queries/parquet files, no records to fetch, non-existing parquet, different start/end records. #468
  • Re-write get_cli_statistics() into 2 fns and to use polars #453
  • predictoor_stats.py - remove `aggregate_prediction_statistics` (completed in #453)

At a later date: update accuracy/app.py to use the data lake + ETL, and review peripheral utilities that might be good candidates for using lake/ETL data.
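The `get_cli_statistics()` rewrite mentioned above (splitting one function into two) could separate computation from presentation. A minimal sketch, with illustrative field names (the real implementation uses polars over the lake tables):

```python
# Hedged sketch: split get_cli_statistics() into a pure computation function
# and a presentation function. The "correct" field is an assumed record shape,
# not the real prediction schema.

def compute_stats(predictions):
    """Compute aggregate stats from a list of prediction records."""
    n = len(predictions)
    n_correct = sum(1 for p in predictions if p["correct"])
    return {"n_preds": n, "accuracy": n_correct / n if n else 0.0}

def format_stats(stats):
    """Render computed stats for CLI output."""
    return f"{stats['n_preds']} predictions, accuracy {stats['accuracy']:.1%}"
```

Keeping the computation pure makes it easy to test against verified lake data, while formatting stays a thin layer on top.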

2. ETL + Bronze Data Workflow

Due to how the subgraph works, we need to be smart about keeping our local records up-to-date. The simplest (dumbest) way is to fetch everything: predictions, truevals, and payouts, and join them into a table `<bronze_post_pdr_predictions_table>`.
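The join described above could be sketched as follows. Record shapes and field names here are illustrative assumptions, not the real schema; the actual pipeline does this with polars over parquet tables:

```python
# Hedged sketch: denormalize raw prediction, trueval, and payout records into
# one "bronze" row per prediction. Truevals are keyed per (pair, slot), while
# payouts are keyed per (pair, slot, user) -- both assumed shapes.

def build_bronze_predictions(predictions, truevals, payouts):
    """Left-join truevals and payouts onto predictions."""
    truevals_by_key = {(t["pair"], t["slot"]): t for t in truevals}
    payouts_by_key = {(p["pair"], p["slot"], p["user"]): p for p in payouts}

    bronze = []
    for pred in predictions:
        trueval = truevals_by_key.get((pred["pair"], pred["slot"]))
        payout = payouts_by_key.get((pred["pair"], pred["slot"], pred["user"]))
        bronze.append({
            **pred,
            "trueval": trueval["trueval"] if trueval else None,
            "payout": payout["payout"] if payout else None,
        })
    return bronze
```

A left join keeps predictions whose trueval/payout hasn't arrived yet, so the bronze table can be re-built as later data lands.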

Part A - Integrate all raw data

  • Modify `data_factory._update()` to only fetch/update predictions that have changed. `data_factory._update()` fetches new predictPredictions (prediction state) that were computed by the subgraph.
  • Integrate truevals into gql_data_factory #480
  • Integrate payouts into gql_data_factory #481
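The "only fetch what's changed" behavior in Part A amounts to resuming from the newest timestamp already on disk. A minimal sketch, assuming records carry a `timestamp` field and `fetch_fn` is a stand-in for a subgraph query (both are illustrative names):

```python
# Hedged sketch of incremental updating: fetch only records with a timestamp
# newer than the newest one already saved. The real gql_data_factory logic
# also handles multiple queries/parquet files and empty/missing files.

def incremental_update(saved_records, fetch_fn, end_ts):
    """Append only records in (last_saved_ts, end_ts] to the local store."""
    last_ts = max((r["timestamp"] for r in saved_records), default=0)
    if last_ts >= end_ts:
        return saved_records  # already up-to-date; nothing to fetch
    new_records = fetch_fn(start_ts=last_ts + 1, end_ts=end_ts)
    return saved_records + new_records
```

The `default=0` covers the "non-existing parquet" case from the test list above: with no saved records, the fetch starts from the beginning.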

Part B - Do ETL + Bronze Tables

  • Create etl.py such that GQLDataFactory() creates the raw data, while ETL() is responsible for the steps required to evolve the data from raw, to bronze, and beyond. => #482
  • Create a clean "bronze_predictions" table using all source raw tables: (1) predictions, (2) truevals, (3) payouts by slot) => #482
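The factory/ETL split in Part B could be shaped roughly as below. Class and method names mirror the ticket's description but the bodies are illustrative placeholders, not the real implementation:

```python
# Hedged sketch of the separation of concerns: GQLDataFactory materializes
# raw tables; ETL evolves them from raw to bronze (and beyond). In the real
# pipeline the bronze step joins truevals/payouts; here it just copies.

class GQLDataFactory:
    def __init__(self):
        self.raw_tables = {}

    def update(self, table_name, records):
        """Append fetched raw records to a named table."""
        self.raw_tables.setdefault(table_name, []).extend(records)

class ETL:
    def __init__(self, factory):
        self.factory = factory
        self.bronze_tables = {}

    def do_bronze_step(self):
        """Build bronze tables from the factory's raw tables."""
        preds = self.factory.raw_tables.get("pdr_predictions", [])
        self.bronze_tables["bronze_pdr_predictions"] = list(preds)
```

The point of the split is that the factory owns fetching/freshness while the ETL owns transformations, so each can be tested independently.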

3. Cleanup Table Interface

  • Create a Table base object, so save/load/query functionality isn't duplicated across GQLDataFactory/ETL #593
  • Move Tables() out of GQLDataFactory/ETL, so other classes/actors can access them. #593
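A shared `Table` base object like the one proposed in #593 could centralize save/load/query. The sketch below uses CSV purely for self-containedness (the project stores parquet via polars), and the interface is an assumption about the design, not the merged code:

```python
# Hedged sketch: one Table object owning persistence, so GQLDataFactory and
# ETL (and other actors) share it instead of duplicating save/load/query.
import csv
import os

class Table:
    def __init__(self, name, base_dir):
        self.name = name
        self.path = os.path.join(base_dir, f"{name}.csv")

    def save(self, rows):
        """Persist a list of dict rows to disk."""
        with open(self.path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    def load(self):
        """Load all rows back as dicts (CSV yields string values)."""
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))

    def query(self, predicate):
        """Return rows matching a predicate, database-style."""
        return [r for r in self.load() if predicate(r)]
```

With tables factored out this way, analytics code can query the lake without importing factory or ETL internals.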
idiom-bytes commented Feb 21, 2024

Last PR has been updated and is in review.
Merging #482 should complete this epic.

@kdetry @KatunaNorbert

idiom-bytes commented Mar 5, 2024

We have completed the core pieces of this ticket by ingesting data from subgraph, building our local lake, and then creating our initial ETL tables.

We're now focused on completing the DuckDB work and the dapp/analytics work, in addition to the "well-structured lake & analytics" part, including improving tools & SLAs so the ETL work/tables are easier to follow and manage.

We now have ticket #685 for continuing data-engineering / data-pipeline work w/ DuckDB.

We also have ticket #618 for continuing the work w/ aggregating revenue (Predictoor Income), creating the plot, and getting the first dapp page working.
