E2E Test: Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics #85

Open
maxnovawind opened this issue Mar 1, 2025 · 0 comments
Labels
enhancement New feature or request

Context

Our Prometheus metrics for test status currently have two design issues:

  1. Using numeric values (0 for failure, 1 for success) instead of status labels
  2. Creating separate metrics for each test (e.g., test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status)

These approaches:

  • Make it difficult to query for failed tests over time periods
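  • Cause parse errors when a comparison is applied to a range vector, because PromQL binary operators accept only scalars and instant vectors
  • Add a new metric name for every test, so the metric namespace grows with the test suite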
  • Don't follow Prometheus best practices for metric design

Example error when trying to query for failed tests over time:

```
bad_data: invalid parameter "query": 2:21: parse error: binary expression must contain only scalar and instant vector types
```

Impact

  • Engineers are unable to efficiently monitor and alert on test failures over time
  • Poor scalability as we add more tests
  • Difficult to create aggregated views across tests

Proposed Solution

Refactor our metrics collection with two key changes:

  1. Use status labels instead of numeric values
  2. Use test name labels instead of separate metrics per test

Current Implementation:

    # Current approach - different metrics per test, numeric value shows status
    test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status{job="test_job"} 0  # failed
    test_another_feature_should_work_status{job="test_job"} 1  # success

Proposed Implementation:

    # New approach - consolidated metric with test_name and status labels
    test_execution_status{job="test_job", test_name="input_creation_should_create_an_input_and_send_it_to_the_gateway", status="failed"} 1
    test_execution_status{job="test_job", test_name="another_feature_should_work", status="success"} 1
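
As a rough illustration of the mapping, here is a minimal Python sketch that turns a legacy per-test sample into the consolidated form. The helper names (`escape_label_value`, `render_sample`, `convert_legacy`) are hypothetical and not part of the existing code base; the sketch assumes legacy names always follow the `test_<name>_status` pattern shown above. The `job` label is emitted only to mirror the sample lines; Prometheus normally attaches it at scrape time.

```python
from __future__ import annotations

import re

# Assumption: every legacy metric is named test_<test_name>_status.
LEGACY_NAME = re.compile(r"^test_(?P<test_name>.+)_status$")


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(metric: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line: metric{label="value",...} value."""
    label_text = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
    return f"{metric}{{{label_text}}} {value:g}"


def convert_legacy(metric: str, value: int, job: str) -> str:
    """Map a legacy sample (0 = failed, 1 = success) to the consolidated metric."""
    match = LEGACY_NAME.match(metric)
    if match is None:
        raise ValueError(f"not a legacy test status metric: {metric}")
    status = "success" if value == 1 else "failed"
    labels = {"job": job, "test_name": match["test_name"], "status": status}
    # The sample value is always 1; the outcome is carried by the status label.
    return render_sample("test_execution_status", labels, 1)


print(convert_legacy("test_another_feature_should_work_status", 1, "test_job"))
```

Keeping the value fixed at 1 and moving the outcome into a label is what lets a plain label matcher such as `status="failed"` select failures without any comparison operator.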

This change will enable simple, effective queries like:

    # Count samples that reported a failure over the last hour
    # (count_over_time counts scraped samples, not individual test runs)
    sum(count_over_time(test_execution_status{status="failed"}[1h]))

    # Count failed samples for a specific test
    sum(count_over_time(test_execution_status{test_name="input_creation_should_create_an_input_and_send_it_to_the_gateway", status="failed"}[1h]))

    # Calculate the failure ratio by test
    # (tests with no failed samples in the window are absent from the result)
    sum by (test_name) (count_over_time(test_execution_status{status="failed"}[1d])) /
    sum by (test_name) (count_over_time(test_execution_status[1d]))
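
To make the semantics of the last query concrete, the following Python sketch reproduces the same arithmetic on an in-memory list of samples. It assumes each scrape yields one sample per exposed series; the function name `failure_ratio_by_test` and the test names in the example data are hypothetical.

```python
from collections import Counter


def failure_ratio_by_test(samples):
    """Compute failed samples / all samples per test.

    samples: iterable of (test_name, status) pairs, one per scraped sample
    inside the query window.
    """
    total = Counter()
    failed = Counter()
    for test_name, status in samples:
        total[test_name] += 1
        if status == "failed":
            failed[test_name] += 1
    # Like PromQL vector matching, only tests that have at least one
    # failed sample appear in the result.
    return {name: failed[name] / total[name] for name in failed}


# Hypothetical window: test "a" failed in 2 of 4 samples, "b" never failed.
window = [
    ("a", "failed"),
    ("a", "success"),
    ("a", "success"),
    ("a", "failed"),
    ("b", "success"),
]
print(failure_ratio_by_test(window))  # {'a': 0.5}
```

The ratio is over scraped samples, so a test that stays in the failed state across several scrapes contributes several failed samples even if it ran only once.
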
@maxnovawind maxnovawind added the enhancement New feature or request label Mar 1, 2025
@maxnovawind maxnovawind changed the title Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics E2E Test: Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics Mar 1, 2025
@maxnovawind maxnovawind changed the title E2E Test: Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics e2e Test: Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics Mar 1, 2025
@maxnovawind maxnovawind changed the title e2e Test: Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics E2E Test: Refactor Prometheus metrics to use status labels and test name labels instead of separate metrics Mar 1, 2025