
[Lake][Analytics] Re-write get_cli_statistics() into 2 fns and to use polars #453

Closed · 7 tasks done
idiom-bytes opened this issue Dec 14, 2023 · 3 comments · Fixed by #463
Labels: Type: Enhancement (New feature or request)

Comments

idiom-bytes (Member) commented Dec 14, 2023

Motivation

We're going to be re-writing all of our core tables and analytics with dataframes and polars.

All of the logic inside of predictoor_stats will eventually be re-written (#447).

Re-write get_cli_statistics() into 2 different fns

This is the PR with the GQL factory => #438

Please fork it and work towards updating get_cli_statistics() so that it's broken up into 2 functions:

  1. get_feed_summary_stats()
  2. get_predictoor_summary_stats()

Re-write both fns to use polars

Both functions should take in a List[Prediction] and return a dataframe with all the stats that are currently there. The final dataframes should have the following schemas.

feed_summary_df_schema = {
    "timeframe": str,
    "pair": str,
    "source": str,
    "accuracy": float,
    "sum_stake": float,
    "sum_payout": float,
    "n_predictions": int,
}

predictoor_summary_df_schema = {
    "timeframe": str,
    "pair": str,
    "source": str,
    "accuracy": float,
    "sum_stake": float,
    "sum_payout": float,
    "n_predictions": int,
    "predictions": json,
    "user": str,
}
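
For illustration, here is a minimal sketch of how get_feed_summary_stats() could build the feed summary with polars, assuming Prediction exposes timeframe, pair, source, stake, payout, prediction, and trueval attributes (attribute names are assumptions, not the actual model); get_predictoor_summary_stats() would follow the same pattern with user added to the grouping:

import polars as pl
from typing import List

def get_feed_summary_stats(predictions: List["Prediction"]) -> pl.DataFrame:
    # Build a dataframe from the Prediction objects
    # (attribute names here are assumptions; adjust to the real model)
    df = pl.DataFrame(
        {
            "timeframe": [p.timeframe for p in predictions],
            "pair": [p.pair for p in predictions],
            "source": [p.source for p in predictions],
            "stake": [p.stake for p in predictions],
            "payout": [p.payout for p in predictions],
            "correct": [p.prediction == p.trueval for p in predictions],
        }
    )
    # Aggregate per (timeframe, pair, source) into the feed_summary_df_schema columns
    summary_df = df.group_by(["timeframe", "pair", "source"]).agg(
        [
            pl.col("correct").mean().alias("accuracy"),
            pl.col("stake").sum().alias("sum_stake"),
            pl.col("payout").sum().alias("sum_payout"),
            pl.len().alias("n_predictions"),  # pl.count() on older polars
        ]
    )
    return summary_df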

Outputting

Once you have the final dataframes, print all records and return the dataframes.
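
One way to make sure every record actually gets printed (polars truncates long frames by default) is to lift the display limits; a minimal sketch, assuming feed_summary_df and predictoor_summary_df are the dataframes returned by the two functions:

import polars as pl

# Temporarily lift the row/column display limits so all records are printed
with pl.Config(tbl_rows=-1, tbl_cols=-1):
    print(feed_summary_df)
    print(predictoor_summary_df)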

DoD:

  • re-write get_cli_statistics() to be 2 functions
  • functions should transform predictions into 2 final summaries: (1) feed_summary_df, (2) predictoor_summary_df
  • functions should print the dataframes
  • functions should return the summary pl.DataFrame objects
  • everything downstream that's using get_cli_statistics() should be correctly updated
  • Both get_predictions_info_main and get_traction_info_main should work with DataFactory.
  • tests should be written to cover both paths to verify they are being used correctly (see the sketch below)
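
For the last item, a minimal test sketch (the sample_predictions fixture and the commented import path are assumptions, not existing code):

import polars as pl
# from pdr_backend.analytics.predictoor_stats import (  # module path is an assumption
#     get_feed_summary_stats,
#     get_predictoor_summary_stats,
# )

FEED_COLS = {
    "timeframe", "pair", "source", "accuracy",
    "sum_stake", "sum_payout", "n_predictions",
}

def test_summary_stats(sample_predictions):
    # sample_predictions: hypothetical fixture with a handful of Prediction objects
    feed_summary_df = get_feed_summary_stats(sample_predictions)
    predictoor_summary_df = get_predictoor_summary_stats(sample_predictions)

    # both paths should return polars dataframes with the agreed schemas
    assert isinstance(feed_summary_df, pl.DataFrame)
    assert isinstance(predictoor_summary_df, pl.DataFrame)
    assert FEED_COLS.issubset(feed_summary_df.columns)
    assert (FEED_COLS | {"user"}).issubset(predictoor_summary_df.columns)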

KatunaNorbert (Member) commented Dec 19, 2023

What should the predictions field of predictoor_summary_df_schema be: a list of Prediction objects, or should we define a different object?

idiom-bytes (Member, Author) commented Dec 19, 2023

After thinking about it, I believe we could skip the predictions field.

We can easily get them with:

wallet_list = ["w1", "w2", "w3"]
user_predictions_df = predictions_df.filter(
    pl.col("user").str.contains("|".join(wallet_list))
)

This will reduce data duplication and schema complexity, and keep the summary dfs clean.

idiom-bytes (Member, Author) commented Dec 19, 2023

To completely remove the dependency on the subgraph fetch_filtered_predictions() and to lean further on polars dataframes:

  1. get_predictions_info() should use gql_data_factory, like get_traction_info.py does:
    gql_data_factory = GQLDataFactory(ppss)
    gql_dfs = gql_data_factory.get_gql_dfs()

    if len(gql_dfs) == 0:
        print("No records found. Please adjust start and end times.")
        return

    predictions_df = gql_dfs["pdr_predictions"]

    # calculate predictoor traction statistics and draw plots
    stats_df = get_traction_statistics(predictions_df)
    plot_traction_cum_sum_statistics(stats_df, pq_dir)
    plot_traction_daily_statistics(stats_df, pq_dir)

    # calculate slot statistics and draw plots
    slots_df = get_slot_statistics(predictions_df)
    plot_slot_daily_statistics(slots_df, pq_dir)
  2. Then, assume that rather than passing a List[Prediction] into the summary functions, you are passing in the predictions_df. The schema/logic can be found in table_pdr_predictions.py
  3. Then, assume that rather than calling fetch_filtered_predictions() with the params payout_only=False, trueval_only=False, you already have all predictions inside predictions_df (including those without a payout or trueval). Using polars dataframes, apply the right payout + trueval filters inside get_feed_summary_stats() and get_predictoor_summary_stats() so that you only operate on the predictions you're looking for (i.e. predictions with a trueval and a payout).

Example pseudocode:

def get_feed_summary_stats(predictions_df: pl.DataFrame) -> pl.DataFrame:
    # 1 - filter from the lake only the rows that you're looking for
    df = predictions_df.filter(
        pl.col("trueval").is_not_null() & pl.col("payout").is_not_null()
    )
    # 2 - do the transform/aggregation with polars
    df = df.with_columns([
        # do transforms & aggregates
    ])
    # 3 - return the final dataframe
    return df
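
Building on that, a minimal sketch of get_predictoor_summary_stats() with the aggregation filled in, assuming the predictions_df column names roughly follow table_pdr_predictions.py ("user", "timeframe", "pair", "source", "prediction", "stake", "payout", "trueval" are assumptions to double-check against the actual schema):

import polars as pl

def get_predictoor_summary_stats(predictions_df: pl.DataFrame) -> pl.DataFrame:
    # 1 - keep only predictions that have both a trueval and a payout
    df = predictions_df.filter(
        pl.col("trueval").is_not_null() & pl.col("payout").is_not_null()
    )
    # 2 - aggregate per predictoor as well as per feed
    #     (column names are assumptions; align them with table_pdr_predictions.py)
    return df.group_by(["user", "timeframe", "pair", "source"]).agg(
        [
            (pl.col("prediction") == pl.col("trueval")).mean().alias("accuracy"),
            pl.col("stake").sum().alias("sum_stake"),
            pl.col("payout").sum().alias("sum_payout"),
            pl.len().alias("n_predictions"),  # pl.count() on older polars
        ]
    )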

trentmc changed the title from "[DataEng][Analytics] Re-write get_cli_statistics() into 2 fns and to use polars" to "[Lake][Analytics] Re-write get_cli_statistics() into 2 fns and to use polars" on Jan 11, 2024