
[Lake][Analytics] Re-write get_cli_statistics() into 2 fns and to use polars #453

Closed · 7 tasks done
idiom-bytes opened this issue Dec 14, 2023 · 3 comments · Fixed by #463
Labels: Type: Enhancement (New feature or request)

Comments

idiom-bytes (Member) commented Dec 14, 2023

Motivation

We're going to be re-writing all of our core tables and analytics with dataframes and polars.

All of the logic inside of predictoor_stats will eventually be re-written (#447).

Re-write get_cli_statistics() into 2 different fns

This is the PR with the GQL factory => #438

Please fork it and work towards updating get_cli_statistics() so that it's broken up into 2 functions:

  1. get_feed_summary_stats()
  2. get_predictoor_summary_stats()

Re-write both fns to use polars

Both functions should take in a List[Prediction] and return a dataframe with all the stats that are currently there. The final dataframes should have the following schemas.

feed_summary_df_schema = {
    "timeframe": str,
    "pair": str,
    "source": str,
    "accuracy": float,
    "sum_stake": float,
    "sum_payout": float,
    "n_predictions": int,
}

predictoor_summary_df_schema = {
    "timeframe": str,
    "pair": str,
    "source": str,
    "accuracy": float,
    "sum_stake": float,
    "sum_payout": float,
    "n_predictions": int,
    "predictions": json,
    "user": str,
}
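
For illustration, here is a minimal sketch of how get_feed_summary_stats() could build the feed summary with polars, assuming Prediction exposes timeframe, pair, source, stake, payout, prediction, and trueval attributes (attribute names are assumptions, not the actual model); get_predictoor_summary_stats() would follow the same pattern with user added to the grouping:

import polars as pl
from typing import List

def get_feed_summary_stats(predictions: List["Prediction"]) -> pl.DataFrame:
    # Build a dataframe from the Prediction objects
    # (attribute names here are assumptions; adjust to the real model)
    df = pl.DataFrame(
        {
            "timeframe": [p.timeframe for p in predictions],
            "pair": [p.pair for p in predictions],
            "source": [p.source for p in predictions],
            "stake": [p.stake for p in predictions],
            "payout": [p.payout for p in predictions],
            "correct": [p.prediction == p.trueval for p in predictions],
        }
    )
    # Aggregate per (timeframe, pair, source) into the feed_summary_df_schema columns
    summary_df = df.group_by(["timeframe", "pair", "source"]).agg(
        [
            pl.col("correct").mean().alias("accuracy"),
            pl.col("stake").sum().alias("sum_stake"),
            pl.col("payout").sum().alias("sum_payout"),
            pl.len().alias("n_predictions"),  # pl.count() on older polars
        ]
    )
    return summary_df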

Outputting

Once you have the final dataframes, print all records and return the dataframes.
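
One way to make sure every record actually gets printed (polars truncates long frames by default) is to lift the display limits; a minimal sketch, assuming feed_summary_df and predictoor_summary_df are the dataframes returned by the two functions:

import polars as pl

# Temporarily lift the row/column display limits so all records are printed
with pl.Config(tbl_rows=-1, tbl_cols=-1):
    print(feed_summary_df)
    print(predictoor_summary_df)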

DoD:

  • re-write get_cli_statistics() to be 2 functions
  • functions should transform predictions into 2 final summaries: (1) feed_summary_df, (2) predictoor_summary_df
  • functions should print the dataframes
  • functions should return the summary pl.DataFrame objects
  • everything downstream that's using get_cli_statistics() should be correctly updated
  • Both get_predictions_info_main and get_traction_info_main should work with DataFactory.
  • tests should be written to cover both paths to verify they are being used correctly (see the sketch below)
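
For the last item, a minimal test sketch (the sample_predictions fixture and the commented import path are assumptions, not existing code):

import polars as pl
# from pdr_backend.analytics.predictoor_stats import (  # module path is an assumption
#     get_feed_summary_stats,
#     get_predictoor_summary_stats,
# )

FEED_COLS = {
    "timeframe", "pair", "source", "accuracy",
    "sum_stake", "sum_payout", "n_predictions",
}

def test_summary_stats(sample_predictions):
    # sample_predictions: hypothetical fixture with a handful of Prediction objects
    feed_summary_df = get_feed_summary_stats(sample_predictions)
    predictoor_summary_df = get_predictoor_summary_stats(sample_predictions)

    # both paths should return polars dataframes with the agreed schemas
    assert isinstance(feed_summary_df, pl.DataFrame)
    assert isinstance(predictoor_summary_df, pl.DataFrame)
    assert FEED_COLS.issubset(feed_summary_df.columns)
    assert (FEED_COLS | {"user"}).issubset(predictoor_summary_df.columns)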

KatunaNorbert (Member) commented Dec 19, 2023

What should the predictions field of predictoor_summary_df_schema be: a list of Prediction objects, or should we define a different object?

idiom-bytes (Member, Author) commented Dec 19, 2023

After thinking about it, I believe we could skip the predictions field.

We can easily get them with:

wallet_list = ["w1", "w2", "w3"]
user_predictions_df = predictions_df.filter(
    pl.col("user").str.contains("|".join(wallet_list))
)

This will reduce data duplication and schema complexity, and keep the summary dfs clean.

idiom-bytes (Member, Author) commented Dec 19, 2023

To completely remove the dependency on the subgraph fetch_filtered_predictions() and to lean further on polars dataframes:

  1. get_predictions_info() should use gql_data_factory, like get_traction_info.py does:
    gql_data_factory = GQLDataFactory(ppss)
    gql_dfs = gql_data_factory.get_gql_dfs()

    if len(gql_dfs) == 0:
        print("No records found. Please adjust start and end times.")
        return

    predictions_df = gql_dfs["pdr_predictions"]

    # calculate predictoor traction statistics and draw plots
    stats_df = get_traction_statistics(predictions_df)
    plot_traction_cum_sum_statistics(stats_df, pq_dir)
    plot_traction_daily_statistics(stats_df, pq_dir)

    # calculate slot statistics and draw plots
    slots_df = get_slot_statistics(predictions_df)
    plot_slot_daily_statistics(slots_df, pq_dir)
  2. Then, assume that rather than passing a List[Prediction] into the summary functions, you are passing in the predictions_df. The schema/logic can be found in table_pdr_predictions.py
  3. Then, assume that rather than calling fetch_filtered_predictions() with the params payout_only=False, trueval_only=False, you already have all predictions inside predictions_df (including those without a payout or trueval). Using polars dataframes, apply the right payout + trueval filters inside get_feed_summary_stats() and get_predictoor_summary_stats() so that you only operate on the predictions you're looking for (i.e. predictions with a trueval and a payout).

Example pseudocode:

def get_feed_summary_stats(predictions_df: pl.DataFrame) -> pl.DataFrame:
    # 1 - filter from the lake only the rows that you're looking for
    df = predictions_df.filter(
        pl.col("trueval").is_not_null() & pl.col("payout").is_not_null()
    )
    # 2 - do the transform/aggregation with polars
    df = df.with_columns([
        # do transforms & aggregates
    ])
    # 3 - return the final dataframe
    return df
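
Building on that, a minimal sketch of get_predictoor_summary_stats() with the aggregation filled in, assuming the predictions_df column names roughly follow table_pdr_predictions.py ("user", "timeframe", "pair", "source", "prediction", "stake", "payout", "trueval" are assumptions to double-check against the actual schema):

import polars as pl

def get_predictoor_summary_stats(predictions_df: pl.DataFrame) -> pl.DataFrame:
    # 1 - keep only predictions that have both a trueval and a payout
    df = predictions_df.filter(
        pl.col("trueval").is_not_null() & pl.col("payout").is_not_null()
    )
    # 2 - aggregate per predictoor as well as per feed
    #     (column names are assumptions; align them with table_pdr_predictions.py)
    return df.group_by(["user", "timeframe", "pair", "source"]).agg(
        [
            (pl.col("prediction") == pl.col("trueval")).mean().alias("accuracy"),
            pl.col("stake").sum().alias("sum_stake"),
            pl.col("payout").sum().alias("sum_payout"),
            pl.len().alias("n_predictions"),  # pl.count() on older polars
        ]
    )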

trentmc changed the title from "[DataEng][Analytics] Re-write get_cli_statistics() into 2 fns and to use polars" to "[Lake][Analytics] Re-write get_cli_statistics() into 2 fns and to use polars" on Jan 11, 2024