
Fix #1001: [Lake][ETL] Implement incremental ETL pipeline #1423

Merged: 85 commits into main from issue1001-incremental-etl, Jul 31, 2024

Conversation

@idiom-bytes (Member) commented Jul 19, 2024

Pull Request Description

By cherry-picking all changes from PR #1000, this PR incorporates all updates required to implement the incremental ETL pipeline.

What happened?
Branch 685 was deleted. I did not realize this PR was branched from there, so the original PR was closed along with it. This PR rescues all that work and prepares the changes for merging.

Commit messages (truncated in the timeline view):

  • …er-to-implementation. Will begin re-working ETL tests so I can run and get them updated.
  • …ilable. Created tickets to address various fetching and ETL-related work.
  • …w records but no update records. Need to test the incremental logic now to make sure it's working.
  • …uch that we can get a better understanding of what's happening.
  • … follow. ETL output is now easier to read, and lets me follow the majority of the work being done.
  • …that ETL can rebuild historical data without relying on subgraph.
  • …nternal calculations. Configured tests to do the first step of processing 1/4 of the data, such that I can control the ETL/lake and update manually.
  • … make sure that clamping is working as expected.
  • … 1, as I believe to be expected... debugging issues with the update step.
@calina-c (Contributor)

I can't tell which comments are new and which are old. I trust that you resolved only the comments you addressed. I would also recommend a second look from another dev on the team, just to be sure. Otherwise LGTM.

@idiom-bytes (Member, Author) commented Jul 25, 2024

> I can't tell which comments are new and which are old. I trust that you resolved only the comments you addressed. I would also recommend a second look from another dev on the team, just to be sure. Otherwise LGTM.

I have unresolved the conversation on all items where I didn't implement any change.

Why? The comments either:

  • do not apply, i.e. I am already using row_count() and move_data_from_table_to_table() (a hedged sketch of such helpers follows below)
  • seemed subjective enough that I strongly disagree, e.g. introducing terminology that isn't used elsewhere (like source and destination, instead of the from and to used everywhere)
  • should not be implemented due to architectural or engineering considerations, e.g. moving core logic to other files, or changing component logic in a way that doesn't make sense relative to its function
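For context, here is a minimal sketch of what helpers with those names could look like against DuckDB. The signatures, connection handling, and SQL are assumptions for illustration, not the PR's actual implementation:

```python
import duckdb

def row_count(conn: duckdb.DuckDBPyConnection, table_name: str) -> int:
    """Return the number of rows in table_name (hypothetical signature)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]

def move_data_from_table_to_table(
    conn: duckdb.DuckDBPyConnection,
    from_table: str,
    to_table: str,
) -> None:
    """Append all rows from from_table into to_table, then empty from_table.
    Assumes both tables share the same schema (hypothetical helper)."""
    conn.execute(f"INSERT INTO {to_table} SELECT * FROM {from_table}")
    conn.execute(f"DELETE FROM {from_table}")
```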

@KatunaNorbert @kdetry can you please review, test, or provide any feedback/response?
I'll address everything applicable and leave the rest for a final chat/sign-off.
Cheers!

@KatunaNorbert (Member)

Tested it, and I might have found an issue: the trueval field is NULL for all the rows inside the bronze_pdr_predictions table.

@idiom-bytes (Member, Author)

You did find an issue. Nice work.

truevalue had never been implemented in the subgraph fetch, so it was missing from payout.
[screenshot attached]

It's now being fetched from the subgraph -> saved to CSV -> processed in the ETL -> updated in the bronze table (sketched below).

I have updated the ETL tests to validate that it's working
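As a rough illustration of that final "update the bronze table" step, a backfill like this could be expressed as a single DuckDB UPDATE. The staging-table name, join key, and column names here are assumptions, not the PR's exact schema:

```python
import duckdb

conn = duckdb.connect("lake.duckdb")  # hypothetical lake DB path

# Backfill trueval on the bronze table from freshly-fetched payout rows.
# "_update_pdr_payouts" and the join key "ID" are illustrative names.
conn.execute(
    """
    UPDATE bronze_pdr_predictions AS b
    SET trueval = u.truevalue
    FROM _update_pdr_payouts AS u
    WHERE b.ID = u.ID
      AND b.trueval IS NULL
    """
)
```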

@trizin (Contributor) commented Jul 30, 2024

I noticed the mermaid diagram that was added. Not sure if this is in the scope of the PR, but here's how the table structure could be improved:

  • pair, timeframe, source: duplicated data, should be derived from contract address
  • remove PDR_PREDICTIONS and PDR_PAYOUTS, just a single table BRONZE_PDR_PREDICTIONS should be enough
  • slot_id is literally {contract}-{slot}, it's redundant.

@idiom-bytes (Member, Author) commented Jul 30, 2024

Answering trizin's comments here.

> slot_id is literally {contract}-{slot}, it's redundant.

I have a PR to update slot-related queries; I can implement it there -> #1466.

slot_id will be required for certain joins. Having it pre-processed (a string split done once) and available for joins (a string-equality match instead of a string LIKE operation) will save computation time when it's needed most.
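To make the cost argument concrete, here are the two join shapes side by side (table and column names are illustrative, not the actual schema): with slot_id materialized, the join is a plain hash join on string equality; without it, the key has to be rebuilt per row at query time:

```python
import duckdb

conn = duckdb.connect()

# With a pre-processed slot_id column: a cheap string-equality hash join.
join_on_slot_id = """
    SELECT p.*, s.*
    FROM bronze_pdr_predictions p
    JOIN pdr_slots s ON p.slot_id = s.ID
"""

# Without it: the {contract}-{slot} key is rebuilt for every row probed,
# which costs the most at exactly the joins where slot_id is needed.
join_on_derived_key = """
    SELECT p.*, s.*
    FROM bronze_pdr_predictions p
    JOIN pdr_slots s
      ON (p.contract || '-' || CAST(p.slot AS VARCHAR)) = s.ID
"""
```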

> pair, timeframe, source: duplicated data, should be derived from contract address
100% agree; I've wanted to do this for a while. I'm working towards it and have added it to the ETL & Analytics backlog.

> remove PDR_PREDICTIONS and PDR_PAYOUTS, just a single table BRONZE_PDR_PREDICTIONS should be enough

This is the goal (to only have bronze), and it would also reduce DB size. The challenge is that we need to build all the bronze tables before removing all raw data from the DB.

Now that we have Incremental ETL, this should be possible.

I'm proposing we deprecate the raw tables once we have more bronze tables in place, and then start working towards silver (aggregate) tables.

Challenge:

  • If the raw tables aren't in the DB, any ETL work is going to be slow (scanning + reading CSVs into memory for processing).
  • If the raw tables are in the DB, DuckDB is very fast and we can do queries/aggregates/summaries efficiently.
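A quick way to see that trade-off (paths and table names assumed for illustration): DuckDB can query CSVs directly via read_csv_auto, but every such query re-scans and re-parses the files, whereas a table kept in the DB is read from DuckDB's own columnar storage:

```python
import duckdb

conn = duckdb.connect("lake.duckdb")  # hypothetical lake DB path

# Raw tables dropped from the DB: every query re-scans + parses the CSVs.
csv_count = conn.execute(
    "SELECT COUNT(*) FROM read_csv_auto('lake_data/pdr_predictions/*.csv')"
).fetchone()[0]

# Raw tables kept in the DB: DuckDB reads its native columnar storage.
db_count = conn.execute(
    "SELECT COUNT(*) FROM pdr_predictions"
).fetchone()[0]
```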

@idiom-bytes dismissed calina-c's stale review July 31, 2024 03:50

because we have discussed all the feedback and there were no more comments

@idiom-bytes (Member, Author)

We had multiple reviewers, discussed feedback, and agreed on how to move forward across the board. Merging.

@idiom-bytes merged commit 3307967 into main on Jul 31, 2024
5 checks passed
@idiom-bytes deleted the issue1001-incremental-etl branch July 31, 2024 03:52