
[EPIC] [YAML] Well-structured lake & analytics #447

Closed
14 tasks done
trentmc opened this issue Dec 13, 2023 · 2 comments
trentmc commented Dec 13, 2023

Background / motivation

This is an epic to mature our pipeline for data going into the data `lake/`, and for consuming it in `analytics/`. By the end, the lake, ETL, and analytics should only fetch/update what's needed.

Most calculations and aggregations should be done at the ETL level, yielding tables that have been tested and verified. Core analytic tables (parquet) will be created as a result of data_factory + ETL doing their work.

Analytics and other modules should then consume from the local lake tables (like a database). Analytics, reports, and streamlit should mostly just consume/report from the work done by the ETL.

The proposed steps are:

  1. Lake Preparation
  2. ETL + Bronze Data Workflow
  3. Cleanup Table Interface


1. Lake Preparation

  • Rename/move files & dirs for proper separation among lake, AI models, analytics #446
  • predictoor_stats.py - remove `get_endpoint_statistics` #477
  • Move logic from subgraph_slot.py #483
  • Integrate pdr_subscriptions into GQL Data Factory #468
  • Update gql_data_factory tests/logic to handle more complex cases: multiple queries/parquet files, no records to fetch, non-existing parquet, different start/end records. #468
  • Re-write get_cli_statistics() into 2 fns and to use polars #453
  • predictoor_stats.py - remove `aggregate_prediction_statistics` (completed in #453)

At a later date: update accuracy/app.py to use the data lake + ETL, and review peripheral utilities that might be good candidates for using lake/ETL data.
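The `get_cli_statistics()` rewrite mentioned above (splitting one function into two) could separate computation from presentation. A minimal sketch, with illustrative field names (the real implementation uses polars over the lake tables):

```python
# Hedged sketch: split get_cli_statistics() into a pure computation function
# and a presentation function. The "correct" field is an assumed record shape,
# not the real prediction schema.

def compute_stats(predictions):
    """Compute aggregate stats from a list of prediction records."""
    n = len(predictions)
    n_correct = sum(1 for p in predictions if p["correct"])
    return {"n_preds": n, "accuracy": n_correct / n if n else 0.0}

def format_stats(stats):
    """Render computed stats for CLI output."""
    return f"{stats['n_preds']} predictions, accuracy {stats['accuracy']:.1%}"
```

Keeping the computation pure makes it easy to test against verified lake data, while formatting stays a thin layer on top.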

2. ETL + Bronze Data Workflow

Due to how the subgraph works, we need to be smart about keeping our local records up-to-date. The simplest (dumbest) way is to fetch everything: predictions, truevals, and payouts, and join them into a table `<bronze_post_pdr_predictions_table>`.
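The join described above could be sketched as follows. Record shapes and field names here are illustrative assumptions, not the real schema; the actual pipeline does this with polars over parquet tables:

```python
# Hedged sketch: denormalize raw prediction, trueval, and payout records into
# one "bronze" row per prediction. Truevals are keyed per (pair, slot), while
# payouts are keyed per (pair, slot, user) -- both assumed shapes.

def build_bronze_predictions(predictions, truevals, payouts):
    """Left-join truevals and payouts onto predictions."""
    truevals_by_key = {(t["pair"], t["slot"]): t for t in truevals}
    payouts_by_key = {(p["pair"], p["slot"], p["user"]): p for p in payouts}

    bronze = []
    for pred in predictions:
        trueval = truevals_by_key.get((pred["pair"], pred["slot"]))
        payout = payouts_by_key.get((pred["pair"], pred["slot"], pred["user"]))
        bronze.append({
            **pred,
            "trueval": trueval["trueval"] if trueval else None,
            "payout": payout["payout"] if payout else None,
        })
    return bronze
```

A left join keeps predictions whose trueval/payout hasn't arrived yet, so the bronze table can be re-built as later data lands.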

Part A - Integrate all raw data

  • Modify `data_factory._update()` to only fetch/update predictions that have changed. `data_factory._update()` fetches new predictPredictions (prediction state) that were computed by the subgraph.
  • Integrate truevals into gql_data_factory #480
  • Integrate payouts into gql_data_factory #481
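The "only fetch what's changed" behavior in Part A amounts to resuming from the newest timestamp already on disk. A minimal sketch, assuming records carry a `timestamp` field and `fetch_fn` is a stand-in for a subgraph query (both are illustrative names):

```python
# Hedged sketch of incremental updating: fetch only records with a timestamp
# newer than the newest one already saved. The real gql_data_factory logic
# also handles multiple queries/parquet files and empty/missing files.

def incremental_update(saved_records, fetch_fn, end_ts):
    """Append only records in (last_saved_ts, end_ts] to the local store."""
    last_ts = max((r["timestamp"] for r in saved_records), default=0)
    if last_ts >= end_ts:
        return saved_records  # already up-to-date; nothing to fetch
    new_records = fetch_fn(start_ts=last_ts + 1, end_ts=end_ts)
    return saved_records + new_records
```

The `default=0` covers the "non-existing parquet" case from the test list above: with no saved records, the fetch starts from the beginning.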

Part B - Do ETL + Bronze Tables

  • Create etl.py such that GQLDataFactory() creates the raw data, while ETL() is responsible for the steps required to evolve the data from raw, to bronze, and beyond. => #482
  • Create a clean "bronze_predictions" table using all source raw tables: (1) predictions, (2) truevals, (3) payouts by slot) => #482
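The factory/ETL split in Part B could be shaped roughly as below. Class and method names mirror the ticket's description but the bodies are illustrative placeholders, not the real implementation:

```python
# Hedged sketch of the separation of concerns: GQLDataFactory materializes
# raw tables; ETL evolves them from raw to bronze (and beyond). In the real
# pipeline the bronze step joins truevals/payouts; here it just copies.

class GQLDataFactory:
    def __init__(self):
        self.raw_tables = {}

    def update(self, table_name, records):
        """Append fetched raw records to a named table."""
        self.raw_tables.setdefault(table_name, []).extend(records)

class ETL:
    def __init__(self, factory):
        self.factory = factory
        self.bronze_tables = {}

    def do_bronze_step(self):
        """Build bronze tables from the factory's raw tables."""
        preds = self.factory.raw_tables.get("pdr_predictions", [])
        self.bronze_tables["bronze_pdr_predictions"] = list(preds)
```

The point of the split is that the factory owns fetching/freshness while the ETL owns transformations, so each can be tested independently.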

3. Cleanup Table Interface

  • Create a Table base object, so save/load/query functionality isn't duplicated across GQLDataFactory/ETL #593
  • Move Tables() out of GQLDataFactory/ETL, so other classes/actors can access them. #593
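A shared `Table` base object like the one proposed in #593 could centralize save/load/query. The sketch below uses CSV purely for self-containedness (the project stores parquet via polars), and the interface is an assumption about the design, not the merged code:

```python
# Hedged sketch: one Table object owning persistence, so GQLDataFactory and
# ETL (and other actors) share it instead of duplicating save/load/query.
import csv
import os

class Table:
    def __init__(self, name, base_dir):
        self.name = name
        self.path = os.path.join(base_dir, f"{name}.csv")

    def save(self, rows):
        """Persist a list of dict rows to disk."""
        with open(self.path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    def load(self):
        """Load all rows back as dicts (CSV yields string values)."""
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))

    def query(self, predicate):
        """Return rows matching a predicate, database-style."""
        return [r for r in self.load() if predicate(r)]
```

With tables factored out this way, analytics code can query the lake without importing factory or ETL internals.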
idiom-bytes commented Feb 21, 2024

Last PR has been updated and is in review.
Merging #482 should complete this epic.

@kdetry @KatunaNorbert

idiom-bytes commented Mar 5, 2024

We have completed the core pieces of this ticket by ingesting data from subgraph, building our local lake, and then creating our initial ETL tables.

We're now focused on completing the DuckDB work and the dapp/analytics work, in addition to the "well-structured lake & analytics" part, including improving tools & SLAs so the ETL work/tables are easier to follow and manage.

We now have ticket #685 for continuing data-engineering / data-pipeline work w/ DuckDB.

We also have ticket #618 for continuing the work w/ aggregating revenue (Predictoor Income), creating the plot, and getting the first dapp page working.
