Context

Our Prometheus metrics for test status currently have two design issues:

1. Using numeric values (`0` for failure, `1` for success) instead of status labels
2. Creating a separate metric for each test (e.g., `test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status`)
These approaches:

- Make it difficult to query for failed tests over time periods
- Cause syntax errors when range queries are combined with boolean expressions
- Create metric explosion as we add more tests
- Don't follow Prometheus best practices for metric design
Example error when trying to query for failed tests over time:
```
bad_data: invalid parameter "query": 2:21: parse error: binary expression must contain only scalar and instant vector types
```
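This error typically comes from applying a comparison operator to a range vector, which PromQL does not allow (binary expressions accept only scalars and instant vectors). A query of roughly this shape, reconstructed here for illustration, fails to parse:

```
# Hypothetical failing query: a binary operator cannot be applied to a range vector
test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status[1h] == 0
```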
Impact

- Engineers cannot efficiently monitor and alert on test failures over time
- Poor scalability as more tests are added
- Difficult to create aggregated views across tests
Proposed Solution

Refactor our metrics collection with two key changes:

1. Use status labels instead of numeric values
2. Use a test name label instead of a separate metric per test
Current Implementation:

```
# Current approach - a separate metric per test; the numeric value encodes status
test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status{job="test_job"} 0 # failed
test_another_feature_should_work_status{job="test_job"} 1 # success
```
Proposed Implementation:

```
# New approach - one consolidated metric with test_name and status labels
test_execution_status{job="test_job", test_name="input_creation_should_create_an_input_and_send_it_to_the_gateway", status="failed"} 1
test_execution_status{job="test_job", test_name="another_feature_should_work", status="success"} 1
```
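As a rough sketch of how a test runner could emit the consolidated metric, the helper below renders one sample in Prometheus text exposition format. The function name and signature are illustrative, not part of the proposal:

```python
def render_test_metric(test_name: str, passed: bool, job: str = "test_job") -> str:
    """Render one sample of the consolidated test_execution_status metric
    in Prometheus text exposition format (illustrative helper)."""
    status = "success" if passed else "failed"
    return (
        f'test_execution_status{{job="{job}", '
        f'test_name="{test_name}", status="{status}"}} 1'
    )


print(render_test_metric("another_feature_should_work", passed=True))
# test_execution_status{job="test_job", test_name="another_feature_should_work", status="success"} 1
```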
This change will enable simple, effective queries like:

```
# Count all failures over the last hour
sum(count_over_time(test_execution_status{status="failed"}[1h]))

# Count failures for a specific test
sum(count_over_time(test_execution_status{test_name="input_creation_should_create_an_input_and_send_it_to_the_gateway", status="failed"}[1h]))

# Calculate failure rate by test
sum by (test_name) (count_over_time(test_execution_status{status="failed"}[1d]))
/
sum by (test_name) (count_over_time(test_execution_status[1d]))
```
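To make the failure-rate query concrete, the snippet below computes the same per-test ratio over a small set of hypothetical scraped samples, mirroring the `sum by (test_name)` division in plain Python (the sample data and short test names are made up for illustration):

```python
from collections import Counter

# Hypothetical samples of test_execution_status collected over a window
samples = [
    {"test_name": "another_feature_should_work", "status": "success"},
    {"test_name": "another_feature_should_work", "status": "failed"},
    {"test_name": "another_feature_should_work", "status": "success"},
    {"test_name": "input_creation", "status": "failed"},
]

# Analogue of: sum by (test_name) (count_over_time(...{status="failed"}[1d]))
failed = Counter(s["test_name"] for s in samples if s["status"] == "failed")
# Analogue of: sum by (test_name) (count_over_time(...[1d]))
total = Counter(s["test_name"] for s in samples)

# Failure fraction per test_name
failure_rate = {name: failed[name] / total[name] for name in total}
print(failure_rate)
```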