
Fix #1001: [Lake][ETL] Implement incremental ETL pipeline #1423

Merged: 85 commits into main from issue1001-incremental-etl, Jul 31, 2024

Conversation

@idiom-bytes (Member) commented Jul 19, 2024

Pull Request Description

By cherry-picking all changes from PR #1000, this PR incorporates all updates required to implement the incremental ETL pipeline.

What happened?
Branch 685 was deleted. I did not realize this PR was branched from there, so the original PR was closed along with it. This PR rescues all that work and prepares the changes for merging.

Commit messages (truncated in the timeline view):

  • …er-to-implementation. Will begin re-working ETL tests so I can run and get them updated.
  • …ilable. Created tickets to address various fetching and ETL-related work.
  • …w records but no update records. Need to test the incremental logic now to make sure it's working.
  • …uch that we can get a better understanding of what's happening.
  • … follow. ETL output is now easier to read, and lets me follow the majority of the work being done.
  • …that ETL can rebuild historical data without relying on subgraph.
  • …nternal calculations. Configured tests to do the first step of processing 1/4 of the data, such that I can control the ETL/lake and update manually.
  • … make sure that clamping is working as expected.
  • … 1, as I believe to be expected... debugging issues with the update step.
@calina-c (Contributor)

I can't tell which comments are new and which are old. I trust that you resolved only the comments you addressed. I would also recommend a second look from another dev on the team, just to be sure. Otherwise LGTM.

@idiom-bytes (Member, Author) commented Jul 25, 2024

> I can't tell which comments are new and which are old. I trust that you resolved only the comments you addressed. I would also recommend a second look from another dev on the team, just to be sure. Otherwise LGTM.

I have unresolved the conversation on all items where I didn't implement any change.

Why? The comments either:

  • do not apply, i.e. I am already using row_count() and move_data_from_table_to_table() (a hedged sketch of such helpers follows below)
  • seemed subjective enough that I strongly disagree, e.g. introducing terminology that isn't used elsewhere (like source and destination, instead of the from and to used everywhere)
  • should not be implemented due to architectural or engineering considerations, e.g. moving core logic to other files, or changing component logic in a way that doesn't make sense relative to its function
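For context, here is a minimal sketch of what helpers with those names could look like against DuckDB. The signatures, connection handling, and SQL are assumptions for illustration, not the PR's actual implementation:

```python
import duckdb

def row_count(conn: duckdb.DuckDBPyConnection, table_name: str) -> int:
    """Return the number of rows in table_name (hypothetical signature)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]

def move_data_from_table_to_table(
    conn: duckdb.DuckDBPyConnection,
    from_table: str,
    to_table: str,
) -> None:
    """Append all rows from from_table into to_table, then empty from_table.
    Assumes both tables share the same schema (hypothetical helper)."""
    conn.execute(f"INSERT INTO {to_table} SELECT * FROM {from_table}")
    conn.execute(f"DELETE FROM {from_table}")
```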

@KatunaNorbert @kdetry can you please review, test, or provide any feedback/response?
I'll address everything applicable and leave the rest for a final chat/sign-off.
Cheers!

@KatunaNorbert (Member)

Tested it, and I might have found an issue: the trueval field is NULL for all the rows inside the bronze_pdr_predictions table.

@idiom-bytes (Member, Author)

You did find an issue. Nice work.

truevalue had never been implemented in the subgraph fetch, so it was missing from payout.
[screenshot attached]

It's now being fetched from the subgraph -> saved to CSV -> processed in the ETL -> updated in the bronze table (sketched below).

I have updated the ETL tests to validate that it's working
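As a rough illustration of that final "update the bronze table" step, a backfill like this could be expressed as a single DuckDB UPDATE. The staging-table name, join key, and column names here are assumptions, not the PR's exact schema:

```python
import duckdb

conn = duckdb.connect("lake.duckdb")  # hypothetical lake DB path

# Backfill trueval on the bronze table from freshly-fetched payout rows.
# "_update_pdr_payouts" and the join key "ID" are illustrative names.
conn.execute(
    """
    UPDATE bronze_pdr_predictions AS b
    SET trueval = u.truevalue
    FROM _update_pdr_payouts AS u
    WHERE b.ID = u.ID
      AND b.trueval IS NULL
    """
)
```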

@trizin (Contributor) commented Jul 30, 2024

I noticed the mermaid diagram that was added. Not sure if this is in the scope of the PR, but here's how the table structure could be improved:

  • pair, timeframe, source: duplicated data, should be derived from contract address
  • remove PDR_PREDICTIONS and PDR_PAYOUTS, just a single table BRONZE_PDR_PREDICTIONS should be enough
  • slot_id is literally {contract}-{slot}, it's redundant.

@idiom-bytes (Member, Author) commented Jul 30, 2024

Answering trizin's comments here.

> slot_id is literally {contract}-{slot}, it's redundant.

I have a PR to update slot-related queries; I can implement it there -> #1466.

slot_id will be required for certain joins. Having it pre-processed (a string split done once) and available for joins (a string-equality match instead of a string LIKE operation) will save computation time when it's needed most.
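To make the cost argument concrete, here are the two join shapes side by side (table and column names are illustrative, not the actual schema): with slot_id materialized, the join is a plain hash join on string equality; without it, the key has to be rebuilt per row at query time:

```python
import duckdb

conn = duckdb.connect()

# With a pre-processed slot_id column: a cheap string-equality hash join.
join_on_slot_id = """
    SELECT p.*, s.*
    FROM bronze_pdr_predictions p
    JOIN pdr_slots s ON p.slot_id = s.ID
"""

# Without it: the {contract}-{slot} key is rebuilt for every row probed,
# which costs the most at exactly the joins where slot_id is needed.
join_on_derived_key = """
    SELECT p.*, s.*
    FROM bronze_pdr_predictions p
    JOIN pdr_slots s
      ON (p.contract || '-' || CAST(p.slot AS VARCHAR)) = s.ID
"""
```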

> pair, timeframe, source: duplicated data, should be derived from contract address
100% agree; I've wanted to do this for a while. I'm working towards it and have added it to the ETL & Analytics backlog.

> remove PDR_PREDICTIONS and PDR_PAYOUTS, just a single table BRONZE_PDR_PREDICTIONS should be enough

This is the goal (to only have bronze), and it would also reduce DB size. The challenge is that we need to build all the bronze tables before removing all raw data from the DB.

Now that we have Incremental ETL, this should be possible.

I'm proposing we deprecate the raw tables once we have more bronze tables in place, and then start working towards silver (aggregate) tables.

Challenge:

  • If the raw tables aren't in the DB, any ETL work is going to be slow (scanning + reading CSVs into memory for processing).
  • If the raw tables are in the DB, DuckDB is very fast and we can do queries/aggregates/summaries efficiently.
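A quick way to see that trade-off (paths and table names assumed for illustration): DuckDB can query CSVs directly via read_csv_auto, but every such query re-scans and re-parses the files, whereas a table kept in the DB is read from DuckDB's own columnar storage:

```python
import duckdb

conn = duckdb.connect("lake.duckdb")  # hypothetical lake DB path

# Raw tables dropped from the DB: every query re-scans + parses the CSVs.
csv_count = conn.execute(
    "SELECT COUNT(*) FROM read_csv_auto('lake_data/pdr_predictions/*.csv')"
).fetchone()[0]

# Raw tables kept in the DB: DuckDB reads its native columnar storage.
db_count = conn.execute(
    "SELECT COUNT(*) FROM pdr_predictions"
).fetchone()[0]
```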

@idiom-bytes dismissed calina-c's stale review July 31, 2024 03:50

because we have discussed all the feedback and there were no more comments

@idiom-bytes (Member, Author)

We had multiple reviewers, discussed feedback, and agreed on how to move forward across the board. Merging.

@idiom-bytes merged commit 3307967 into main on Jul 31, 2024
5 checks passed
@idiom-bytes deleted the issue1001-incremental-etl branch July 31, 2024 03:52