Scalable timeseries metrics #1908

ccerv1 · 2024-08-02T18:46:26Z

ccerv1
Aug 2, 2024
Maintainer

This is a thread on some of the metrics modeling discussions we've been having, as we move from a small number of static metrics to a large number of metrics that can be applied on a timeseries.

Metric schema

@ryscheng proposed the following as a metrics_v0 yesterday:

{{ oso_id('"OSO"', '"oso"', 'metric') }} as metric_id,
metric_source,
metric_namespace,
metric_name,
display_name,
description,
raw_definition,
definition_ref,
aggregation_function
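
To make the column list concrete, here is a rough Python sketch of what one metrics_v0 row could look like. The field names come from the proposal above. `stand_in_id` is a hypothetical placeholder for the oso_id macro (its real logic isn't shown in this thread), and the sample values are made up.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def stand_in_id(source: str, namespace: str, name: str) -> str:
    """Hypothetical stand-in for oso_id: hashes the three naming parts."""
    key = "/".join([source, namespace, name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class MetricV0:
    metric_source: str
    metric_namespace: str
    metric_name: str
    display_name: str
    description: str
    raw_definition: Optional[str] = None
    definition_ref: Optional[str] = None
    aggregation_function: Optional[str] = None

    @property
    def metric_id(self) -> str:
        # Derived from source, namespace, and name (assumed, mirrors the macro call).
        return stand_in_id(self.metric_source, self.metric_namespace, self.metric_name)


# Made-up example row.
gas_fees = MetricV0(
    metric_source="OSO",
    metric_namespace="oso",
    metric_name="gas_fees",
    display_name="Gas Fees",
    description="Sum of gas fees by project, event source, and time interval.",
    aggregation_function="SUM",
)
```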

Sample metrics

Here are some examples of different types of metrics:

Gas Fees

This is a static metric that simply sums gas fees by project / event_source / time_interval. (The event_source represents the chain and the time_interval options are "7 DAYS", "30 DAYS", ... , "ALL".)

select
  events.project_id,
  events.event_source,
  time_intervals.time_interval,
  'gas_fees' as metric,
  SUM(events.amount / 1e18) as amount
from {{ ref('int_events_daily_to_project') }} as events
cross join {{ ref('int_time_intervals') }} as time_intervals
where
  events.event_type = 'CONTRACT_INVOCATION_DAILY_L2_GAS_USED'
  and events.bucket_day >= time_intervals.start_date
group by
  events.project_id,
  events.event_source,
  time_intervals.time_interval
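
To illustrate how the cross join behaves, here is a small Python sketch of the same logic. The rows are made-up stand-ins for int_events_daily_to_project and int_time_intervals, the interval start dates assume a reference date of 2024-08-02, and the events are assumed to be already filtered to the gas event type.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows standing in for int_events_daily_to_project.
events = [
    {"project_id": "p1", "event_source": "OPTIMISM", "bucket_day": date(2024, 7, 30), "amount": 2e18},
    {"project_id": "p1", "event_source": "OPTIMISM", "bucket_day": date(2024, 7, 10), "amount": 1e18},
    {"project_id": "p1", "event_source": "OPTIMISM", "bucket_day": date(2023, 1, 1), "amount": 4e18},
]

# Hypothetical rows standing in for int_time_intervals.
time_intervals = [
    {"time_interval": "7 DAYS", "start_date": date(2024, 7, 26)},
    {"time_interval": "30 DAYS", "start_date": date(2024, 7, 3)},
    {"time_interval": "ALL", "start_date": date(1970, 1, 1)},
]


def gas_fees(events, time_intervals):
    totals = defaultdict(float)
    for event in events:
        for interval in time_intervals:  # the cross join
            if event["bucket_day"] >= interval["start_date"]:  # the where clause
                key = (event["project_id"], event["event_source"], interval["time_interval"])
                totals[key] += event["amount"] / 1e18  # wei-style units to whole units
    return dict(totals)


result = gas_fees(events, time_intervals)
```

Each event is counted once per interval it falls into, so the most recent event contributes to all three totals while the 2023 event only contributes to "ALL".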

Contributors

This is another static metric that counts the number of unique contributors by project / event_source / time_interval. (The event_source will always be GitHub for now and the time_interval options are "7 DAYS", "30 DAYS", ... , "ALL".)

select
  events.project_id,
  events.event_source,
  time_intervals.time_interval,
  'contributor_count' as metric,
  COUNT(distinct events.from_artifact_id) as amount
from {{ ref('int_events_daily_to_project') }} as events
cross join {{ ref('int_time_intervals') }} as time_intervals
where
  events.event_type in (
    'COMMIT_CODE',
    'PULL_REQUEST_OPENED',
    'ISSUE_OPENED'
  )
  and events.bucket_day >= time_intervals.start_date
group by
  events.project_id,
  events.event_source,
  time_intervals.time_interval

New Contributors

This is a more complex static metric that calculates the number of new contributors by project / event_source / time_interval. (The event_source will always be GitHub for now and the time_interval options are "7 DAYS", "30 DAYS", ... , "ALL".)

with user_stats as (
  select
    from_artifact_id,
    event_source,
    project_id,
    min(bucket_day) as first_day
  from {{ ref('int_events_daily_to_project') }}
  where
    event_type in (
      'COMMIT_CODE',
      'PULL_REQUEST_OPENED',
      'ISSUE_OPENED'
    )
  group by
    from_artifact_id,
    event_source,
    project_id
)

select
  events.project_id,
  events.event_source,
  time_intervals.time_interval,
  'new_contributor_count' as metric,
  count(
    distinct
    case
      when user_stats.first_day >= time_intervals.start_date
        then events.from_artifact_id
    end
  ) as amount
from {{ ref('int_events_daily_to_project') }} as events
inner join user_stats
  on
    events.from_artifact_id = user_stats.from_artifact_id
    and events.project_id = user_stats.project_id
    and events.event_source = user_stats.event_source
cross join {{ ref('int_time_intervals') }} as time_intervals
where
  events.event_type in (
    'COMMIT_CODE',
    'PULL_REQUEST_OPENED',
    'ISSUE_OPENED'
  )
  and events.bucket_day >= time_intervals.start_date
group by
  events.project_id,
  events.event_source,
  time_intervals.time_interval
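
The key idea in the query above is that a contributor is "new" in an interval only if their first-ever contribution to the project falls inside it. A minimal Python sketch of that logic, using made-up events that are assumed to be already filtered to the three contribution event types (event_source is dropped for brevity):

```python
from datetime import date

# Hypothetical (developer, project, bucket_day) rows.
events = [
    ("alice", "p1", date(2023, 5, 1)),
    ("alice", "p1", date(2024, 7, 29)),
    ("bob", "p1", date(2024, 7, 28)),
    ("carol", "p1", date(2024, 7, 5)),
]

# Hypothetical interval start dates, assuming a reference date of 2024-08-02.
time_intervals = {
    "7 DAYS": date(2024, 7, 26),
    "30 DAYS": date(2024, 7, 3),
    "ALL": date(1970, 1, 1),
}


def new_contributor_counts(events, time_intervals):
    # Equivalent of the user_stats CTE: first contribution day per developer and project.
    first_day = {}
    for dev, project, day in events:
        key = (dev, project)
        first_day[key] = min(day, first_day.get(key, day))

    new_devs = {}
    for dev, project, day in events:
        for name, start in time_intervals.items():
            if day < start:
                continue  # event falls outside this interval
            bucket = new_devs.setdefault((project, name), set())
            if first_day[(dev, project)] >= start:
                bucket.add(dev)  # count distinct via a set
    return {key: len(devs) for key, devs in new_devs.items()}


counts = new_contributor_counts(events, time_intervals)
```

Here alice is active in the last 7 days but first contributed in 2023, so she only counts as new under "ALL".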

Bus Factor

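This is an even more complex static metric that measures how concentrated commit activity is among contributors, by project / event_source / time_interval. It ranks contributors by commit count and returns the highest rank whose cumulative commits stay at or below 50% of the project total (defaulting to 1). (The event_source will always be GitHub for now and the time_interval options are "7 DAYS", "30 DAYS", ... , "ALL".)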

with all_contributions as (
  select
    project_id,
    from_artifact_id,
    event_source,
    bucket_month,
    SUM(amount) as amount
  from {{ ref('int_events_monthly_to_project') }}
  where event_type = 'COMMIT_CODE'
  group by
    project_id,
    from_artifact_id,
    event_source,
    bucket_month
),

contributions as (
  select *
  from all_contributions
  where amount < 1000 -- BOT FILTER
),

aggregated_contributions as (
  select
    contributions.project_id,
    contributions.from_artifact_id,
    contributions.event_source,
    time_intervals.time_interval,
    SUM(contributions.amount) as amount
  from contributions
  cross join {{ ref('int_time_intervals') }} as time_intervals
  where
    contributions.bucket_month
    >= TIMESTAMP_TRUNC(time_intervals.start_date, month)
  group by
    contributions.project_id,
    contributions.from_artifact_id,
    contributions.event_source,
    time_intervals.time_interval
),

ranked_contributions as (
  select
    project_id,
    event_source,
    time_interval,
    from_artifact_id,
    amount,
    RANK()
      over (
        partition by project_id, event_source, time_interval
        order by amount desc
      ) as rank,
    SUM(amount)
      over (
        partition by project_id, event_source, time_interval
      ) as total_project_amount,
    SUM(amount)
      over (
        partition by project_id, event_source, time_interval
        order by amount desc
        rows between unbounded preceding and current row
      ) as cumulative_amount
  from aggregated_contributions
)

select
  project_id,
  event_source,
  time_interval,
  'bus_factor' as metric,
  MAX(
    case
      when cumulative_amount <= total_project_amount * 0.5
        then rank
      else 1
    end
  ) as amount
from
  ranked_contributions
group by
  project_id,
  event_source,
  time_interval
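
For a single project / event_source / time_interval group, the ranking and cumulative-sum logic above can be sketched in Python roughly as follows. This assumes the bot filter has already been applied and that no two contributors have the same commit count (the SQL uses RANK(), which treats ties differently from a simple position). The input numbers are made up.

```python
def bus_factor(commits_by_contributor):
    """Mirror of the final select: highest rank whose running total is <= 50%."""
    total = sum(commits_by_contributor.values())
    ranked = sorted(commits_by_contributor.values(), reverse=True)

    cumulative = 0
    result = 1  # the "else 1" branch of the case expression
    for rank, amount in enumerate(ranked, start=1):
        cumulative += amount
        if cumulative <= total * 0.5:
            result = max(result, rank)
    return result


# Made-up commit counts: the top two contributors hold 49 of 100 commits.
spread_out = bus_factor({"a": 25, "b": 24, "c": 21, "d": 16, "e": 14})
# One dominant contributor holds more than half of the commits.
concentrated = bus_factor({"a": 80, "b": 15, "c": 5})
```

Note that with this formulation the first example yields 2, even though it takes three contributors to exceed half of the commits.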

Full-time active developers

This is a v0 timeseries metric that counts the number of developers who have committed code to a project on 10+ days in a 30 day period. It constructs a synthetic calendar and applies a 30 day rolling window.

{% set fulltime_dev_days = 10 %}

with commits as (
  select
    from_artifact_id as developer_id,
    project_id,
    event_source,
    bucket_day,
    CAST(SUM(amount) > 0 as int64) as commit_count
  from {{ ref('int_events_daily_to_project') }}
  where event_type = 'COMMIT_CODE'
  group by
    from_artifact_id,
    project_id,
    event_source,
    bucket_day
),

project_start_dates as (
  select
    project_id,
    event_source,
    MIN(bucket_day) as first_commit_date
  from commits
  group by
    project_id,
    event_source
),

calendar as (
  select
    project_id,
    event_source,
    TIMESTAMP_ADD(first_commit_date, interval day_offset day) as bucket_day
  from
    project_start_dates,
    UNNEST(
      GENERATE_ARRAY(
        0,
        TIMESTAMP_DIFF(
          (select MAX(bucket_day) as last_commit_date from commits),
          first_commit_date, day
        )
      )
    ) as day_offset
),

devs as (
  select distinct developer_id
  from commits
),

developer_project_dates as (
  select
    devs.developer_id,
    calendar.project_id,
    calendar.bucket_day,
    calendar.event_source
  from calendar
  cross join devs
),

filled_data as (
  select
    dpd.bucket_day,
    dpd.developer_id,
    dpd.project_id,
    dpd.event_source,
    COALESCE(c.commit_count, 0) as commit_count
  from developer_project_dates as dpd
  left join commits as c
    on
      dpd.bucket_day = c.bucket_day
      and dpd.developer_id = c.developer_id
      and dpd.project_id = c.project_id
      and dpd.event_source = c.event_source
),

rolling_commit_days as (
  select
    bucket_day,
    developer_id,
    project_id,
    event_source,
    SUM(commit_count) over (
      partition by developer_id, project_id, event_source
      order by bucket_day
      rows between 29 preceding and current row
    ) as num_commit_days
  from filled_data
)

select
  project_id,
  event_source,
  bucket_day,
  'fulltime_developers' as metric,
  COUNT(distinct developer_id) as amount
from rolling_commit_days
where num_commit_days >= {{ fulltime_dev_days }}
group by
  project_id,
  event_source,
  bucket_day
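
As a cross-check on the windowing, here is a rough Python sketch for a single project. It checks date ranges directly instead of materializing the zero-filled calendar, which should give the same counts as "rows between 29 preceding and current row" on a gap-free series. Unlike the query, it also emits days where the count is zero. The developers and dates are made up.

```python
from datetime import date, timedelta

FULLTIME_DEV_DAYS = 10  # same threshold as fulltime_dev_days above
WINDOW_DAYS = 30


def fulltime_developers(commit_days, first_day, last_day):
    """commit_days maps developer_id -> set of dates with at least one commit."""
    counts = {}
    day = first_day
    while day <= last_day:
        window_start = day - timedelta(days=WINDOW_DAYS - 1)
        active = 0
        for days in commit_days.values():
            num_commit_days = sum(1 for d in days if window_start <= d <= day)
            if num_commit_days >= FULLTIME_DEV_DAYS:
                active += 1
        counts[day] = active
        day += timedelta(days=1)
    return counts


# alice commits on ten consecutive days, bob on three scattered days.
commit_days = {
    "alice": {date(2024, 7, d) for d in range(1, 11)},
    "bob": {date(2024, 7, 2), date(2024, 7, 15), date(2024, 7, 20)},
}
counts = fulltime_developers(commit_days, date(2024, 7, 1), date(2024, 8, 5))
```

In this example alice qualifies from July 10 (her tenth commit day) through July 30, after which July 1 drops out of the window.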

Transformation Steps

For each of these metrics, there appears to be a general pattern of transformation steps:

0. From staging to raw events

Currently, the int_events table has both raw events (eg, COMMIT_CODE) and bucket events (eg, CONTRACT_INVOCATION_SUCCESS_DAILY_COUNT). The int_events table also has fields that are not strictly necessary (eg, to_artifact_name, to_artifact_type).

A proposal would be to remove all the superfluous fields and just have:
time, from_artifact_id, to_artifact_id, event_source, event_type, amount
Then, we should keep the raw times instead of bucketed ones, eg, CONTRACT_INVOCATION_SUCCESS with a specific timestamp.
One downside is that this will multiply the number of events we have, ie, a single token transfer could produce events for gas, contract_invocation, usd_amount, donation, etc.
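
A small Python sketch of the slimmed-down event shape and the fan-out it implies. The six fields are the ones proposed above. The transaction dict, its keys, and the CONTRACT_INVOCATION_GAS_USED type name are hypothetical, used only to show one source record turning into several raw events.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    time: datetime
    from_artifact_id: str
    to_artifact_id: str
    event_source: str
    event_type: str
    amount: float


def events_from_transaction(tx: dict) -> list:
    """Fan one (hypothetical) transaction record out into raw, unbucketed events."""
    base = (tx["time"], tx["from"], tx["to"], tx["chain"])
    return [
        Event(*base, "CONTRACT_INVOCATION_SUCCESS", 1.0),
        # Hypothetical event type name for the gas component.
        Event(*base, "CONTRACT_INVOCATION_GAS_USED", float(tx["gas_used"])),
    ]


tx = {
    "time": datetime(2024, 8, 2, 18, 46, 26, tzinfo=timezone.utc),
    "from": "0xuser",
    "to": "0xcontract",
    "chain": "OPTIMISM",
    "gas_used": 21000,
}
raw_events = events_from_transaction(tx)
```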

1. Filtering events

Every metric has a filtering step, which could easily be parametrized in the metric definition, eg:

event_sources:
  - github
event_types:
  - commit_code
to_artifact_types:
  - repository
from_artifact_types:
  - git_user

These could be expanded upon to include both types (set by the event source provider) and tags (set by different models, eg, from_artifact_ids associated with trusted farcaster users).
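
One way such a filter block could be applied, sketched in Python. The filter keys match the YAML above. The mapping from filter keys to event fields is an assumption, as is the case-insensitive comparison (the config uses commit_code while the event tables use COMMIT_CODE). The events are assumed to already carry their artifact types, eg, via a join.

```python
# The filter block from the metric definition, loaded as a dict.
metric_filter = {
    "event_sources": ["github"],
    "event_types": ["commit_code"],
    "to_artifact_types": ["repository"],
    "from_artifact_types": ["git_user"],
}

# Assumed mapping from each filter key to the event field it constrains.
FILTER_FIELDS = {
    "event_sources": "event_source",
    "event_types": "event_type",
    "to_artifact_types": "to_artifact_type",
    "from_artifact_types": "from_artifact_type",
}


def matches(event: dict, metric_filter: dict) -> bool:
    """Return True when the event satisfies every key in the filter."""
    for key, allowed in metric_filter.items():
        value = str(event[FILTER_FIELDS[key]]).lower()
        if value not in {item.lower() for item in allowed}:
            return False
    return True


# Made-up events.
commit = {
    "event_source": "GITHUB",
    "event_type": "COMMIT_CODE",
    "to_artifact_type": "REPOSITORY",
    "from_artifact_type": "GIT_USER",
}
issue = dict(commit, event_type="ISSUE_OPENED")
kept = [e for e in (commit, issue) if matches(e, metric_filter)]
```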

2. Deriving intermediate metrics

Once events have been filtered, there is usually a step where some intermediate transformation is needed. For instance:

gas: gas_fees / 1e18
active_developer_day: cast(amount > 0 as int64)
new_contributor: case when user_stats.first_day >= time_intervals.start_date then events.from_artifact_id end
This is usually an important part of the business logic.
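
These per-event derivations could be kept as a small registry of functions keyed by name. A hedged Python sketch, where the `derive` helper, the registry, and the `ctx` dict (carrying first_day and start_date) are all hypothetical names:

```python
from datetime import date

# Each derivation takes one event plus optional context and returns the derived value.
DERIVATIONS = {
    "gas": lambda event, ctx: event["amount"] / 1e18,
    "active_developer_day": lambda event, ctx: int(event["amount"] > 0),
    "new_contributor": lambda event, ctx: (
        event["from_artifact_id"] if ctx["first_day"] >= ctx["start_date"] else None
    ),
}


def derive(name, event, ctx=None):
    return DERIVATIONS[name](event, ctx or {})


# Made-up inputs.
gas = derive("gas", {"amount": 3e18})
active = derive("active_developer_day", {"amount": 5})
idle = derive("active_developer_day", {"amount": 0})
newcomer = derive(
    "new_contributor",
    {"from_artifact_id": "alice"},
    {"first_day": date(2024, 7, 28), "start_date": date(2024, 7, 26)},
)
veteran = derive(
    "new_contributor",
    {"from_artifact_id": "bob"},
    {"first_day": date(2023, 1, 1), "start_date": date(2024, 7, 26)},
)
```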

3. Building a timeseries

For metrics that have rolling windows (eg, fulltime_developers), it may be necessary to create a utility calendar and add ephemeral events with 0 amounts. There's some logic around defining a window_interval and a sampling_interval, eg:

window_interval:
  - interval: day
    size: 30
    missing_dates: fill_with_zero
sampling_interval: daily

An alternative implementation for a related metric might be:

window_interval:
  - interval: month
    size: 1
    missing_dates: fill_with_zero
sampling_interval: monthly
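
A rough Python sketch of how these two settings might be interpreted: bucket events by the interval, fill empty buckets with zero, then sum over the last `size` buckets. The function names are hypothetical, events are assumed to fall between the first and last dates, and only "day" and "month" intervals are handled.

```python
from datetime import date, timedelta


def bucket_start(day, interval):
    if interval == "day":
        return day
    if interval == "month":
        return day.replace(day=1)
    raise ValueError(f"unsupported interval: {interval}")


def next_bucket(start, interval):
    if interval == "day":
        return start + timedelta(days=1)
    # Month: jump past the end of the month, then snap back to the 1st.
    return (start.replace(day=28) + timedelta(days=4)).replace(day=1)


def build_series(events, interval, first, last):
    """events: iterable of (date, amount). Returns a zero-filled {bucket: amount}."""
    series = {}
    cursor, end = bucket_start(first, interval), bucket_start(last, interval)
    while cursor <= end:
        series[cursor] = 0  # missing_dates: fill_with_zero
        cursor = next_bucket(cursor, interval)
    for day, amount in events:
        series[bucket_start(day, interval)] += amount
    return series


def rolling_sum(series, size):
    """Sum over the current bucket and the (size - 1) buckets before it."""
    keys = sorted(series)
    return {
        key: sum(series[k] for k in keys[max(0, i - size + 1): i + 1])
        for i, key in enumerate(keys)
    }


# Made-up events with a gap in February.
events = [(date(2024, 1, 15), 3), (date(2024, 1, 20), 2), (date(2024, 3, 2), 1)]
monthly = build_series(events, "month", date(2024, 1, 15), date(2024, 3, 2))
two_month = rolling_sum(monthly, 2)
```

With a window size of 1 the rolling sum is just the bucketed series, which is why the monthly variant above needs no extra windowing step.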

4. Aggregating by entity type and applying remaining business logic

We should avoid having to define every metric for every artifact / project / collection. Thus, we'd like some generalized version of:

interval_time, from_id, to_id, event_source, metric, amount

... where the to_id could be a project_id or collection_id.

Then we perform our remaining business logic operations.

For example, with bus_factor we have:

select
  project_id,
  event_source,
  time_interval,
  'bus_factor' as metric,
  MAX(
    case
      when cumulative_amount <= total_project_amount * 0.5
        then rank
      else 1
    end
  ) as amount
from
  ranked_contributions
group by
  project_id,
  event_source,
  time_interval
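
To sketch the generalized row shape (interval_time, from_id, to_id, event_source, metric, amount): the same artifact-level rows can be re-keyed to projects or collections through lookup tables. The lookup dicts and the `rekey` helper below are hypothetical. Summing like this only works for additive amounts, and keeping from_id in the key is what would let distinct-count metrics be recomputed at each level.

```python
from collections import defaultdict

# Hypothetical lookups: artifact -> project, project -> collections.
artifact_to_project = {"repo_a": "proj_1", "repo_b": "proj_1", "repo_c": "proj_2"}
project_to_collections = {"proj_1": ["coll_x"], "proj_2": ["coll_x", "coll_y"]}


def rekey(rows, entity):
    """rows: (interval_time, from_id, artifact_id, event_source, metric, amount)."""
    out = defaultdict(float)
    for interval_time, from_id, artifact_id, event_source, metric, amount in rows:
        project = artifact_to_project[artifact_id]
        to_ids = [project] if entity == "project" else project_to_collections[project]
        for to_id in to_ids:
            out[(interval_time, from_id, to_id, event_source, metric)] += amount
    return dict(out)


# Made-up artifact-level rows.
rows = [
    ("2024-07", "alice", "repo_a", "GITHUB", "commits", 3.0),
    ("2024-07", "alice", "repo_b", "GITHUB", "commits", 2.0),
    ("2024-07", "bob", "repo_c", "GITHUB", "commits", 4.0),
]
by_project = rekey(rows, "project")
by_collection = rekey(rows, "collection")
```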

5. Agg functions

Finally, we can apply standard agg and limit functions to the raw metric models. These will mostly be min, max, avg, std, and limit 1 since the sum and count / count_unique agg funcs will already have been done upstream.
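
A minimal Python sketch of this last step, applied to one already-computed metric timeseries. The registry and function names are hypothetical. "latest" stands in for "limit 1" on a date-descending sort, and std is taken as the population standard deviation, which is an assumption.

```python
import statistics
from datetime import date

AGG_FUNCS = {
    "min": min,
    "max": max,
    "avg": statistics.fmean,
    "std": statistics.pstdev,  # population std (assumed; sample std is also plausible)
    "latest": lambda values: values[-1],
}


def aggregate(series, func):
    """series: list of (bucket_day, amount) tuples for one metric and entity."""
    values = [amount for _, amount in sorted(series)]  # oldest first
    return AGG_FUNCS[func](values)


# Made-up daily values for one metric.
series = [
    (date(2024, 8, 3), 6),
    (date(2024, 8, 1), 2),
    (date(2024, 8, 2), 4),
]
summary = {name: aggregate(series, name) for name in AGG_FUNCS}
```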

Curious what @ryscheng @ravenac95 think!

ccerv1 · 2024-08-05T11:29:45Z

ccerv1
Aug 5, 2024
Maintainer Author

Copying some Discord chat with @davidgasquez over:

have you all looked at things like https://cube.dev/ ?

dbt basically acquired Transform and then killed their metrics layer product by making it cloud customer only. SQLMesh might be the winner here but still very green (played with it last month). They're working on integrating with Dagster though!

other projects to keep an eye on are:

https://www.sdf.com/ (not sure how semantic-layer-ish they are going to go)

https://github.com/carbonfact/lea


ccerv1 · 2024-08-13T13:41:42Z

ccerv1
Aug 13, 2024
Maintainer Author

Here's a reworked and hopefully similar set of steps:

1. Select and filter event data

  - Choose relevant event data model
  - Filter by event type
  - Filter by artifact type (to/from)
  - (Optional) Join artifacts with entity tables (e.g., projects, users)

2. Prepare time series

  - Define time bucket (e.g., daily, monthly)
  - (Optional) Fill missing dates with zero values for complete series

3. Calculate metric

  - Group data by relevant dimensions (e.g., project)
  - Apply first-level aggregation (e.g., sum amounts)
  - Apply second-level aggregation over desired time interval (e.g., 30-day average)

4. Label metric

  - Assign a name to the metric
  - Specify unit of measure
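
Strung together, the four steps above might look something like this in Python. This is only a sketch: the `spec` keys, the `compute_metric` helper, and the sample events are hypothetical, events are assumed to fall inside the date range, and a 3-day window is used instead of 30 to keep the example small.

```python
from collections import defaultdict
from datetime import date, timedelta


def compute_metric(events, spec, first_day, last_day):
    # 1. Select and filter event data.
    selected = [e for e in events if e["event_type"] in spec["event_types"]]

    # 2. Prepare the time series: daily buckets, zero-filled per project.
    days = [first_day + timedelta(days=i) for i in range((last_day - first_day).days + 1)]
    daily = defaultdict(lambda: dict.fromkeys(days, 0))
    for e in selected:
        # 3a. First-level aggregation: sum amounts per project and day.
        daily[e["project_id"]][e["bucket_day"]] += e["amount"]

    rows = []
    for project_id, series in daily.items():
        for i, day in enumerate(days):
            # 3b. Second-level aggregation: trailing average (partial at the start).
            window = days[max(0, i - spec["window_days"] + 1): i + 1]
            average = sum(series[d] for d in window) / len(window)
            # 4. Label the metric with a name and unit.
            rows.append({
                "project_id": project_id,
                "bucket_day": day,
                "metric": spec["name"],
                "unit": spec["unit"],
                "amount": average,
            })
    return rows


spec = {"name": "commits_3d_avg", "unit": "commits", "event_types": {"COMMIT_CODE"}, "window_days": 3}
events = [
    {"project_id": "p1", "event_type": "COMMIT_CODE", "bucket_day": date(2024, 8, 1), "amount": 3},
    {"project_id": "p1", "event_type": "COMMIT_CODE", "bucket_day": date(2024, 8, 3), "amount": 6},
    {"project_id": "p1", "event_type": "ISSUE_OPENED", "bucket_day": date(2024, 8, 2), "amount": 1},
]
rows = compute_metric(events, spec, date(2024, 8, 1), date(2024, 8, 4))
```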

