Context

Our Prometheus metrics for test status currently have two design issues:

1. Using numeric values (`0` for failure, `1` for success) instead of status labels
2. Creating a separate metric for each test (e.g., `test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status`)
These approaches:

- Make it difficult to query for failed tests over time periods
- Cause syntax errors when range queries are combined with boolean expressions
- Create metric explosion as we add more tests
- Don't follow Prometheus best practices for metric design
Example error when trying to query for failed tests over time:
```
bad_data: invalid parameter "query": 2:21: parse error: binary expression must contain only scalar and instant vector types
```
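This error typically comes from applying a comparison operator to a range vector, which PromQL does not allow (binary expressions accept only scalars and instant vectors). A query of roughly this shape, reconstructed here for illustration, fails to parse:

```
# Hypothetical failing query: a binary operator cannot be applied to a range vector
test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status[1h] == 0
```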
Impact

- Engineers cannot efficiently monitor and alert on test failures over time
- Poor scalability as more tests are added
- Difficult to create aggregated views across tests
Proposed Solution

Refactor our metrics collection with two key changes:

1. Use status labels instead of numeric values
2. Use a test name label instead of a separate metric per test
Current Implementation:

```
# Current approach - a separate metric per test; the numeric value encodes status
test_input_creation_should_create_an_input_and_send_it_to_the_gateway_status{job="test_job"} 0 # failed
test_another_feature_should_work_status{job="test_job"} 1 # success
```
Proposed Implementation:

```
# New approach - one consolidated metric with test_name and status labels
test_execution_status{job="test_job", test_name="input_creation_should_create_an_input_and_send_it_to_the_gateway", status="failed"} 1
test_execution_status{job="test_job", test_name="another_feature_should_work", status="success"} 1
```
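As a rough sketch of how a test runner could emit the consolidated metric, the helper below renders one sample in Prometheus text exposition format. The function name and signature are illustrative, not part of the proposal:

```python
def render_test_metric(test_name: str, passed: bool, job: str = "test_job") -> str:
    """Render one sample of the consolidated test_execution_status metric
    in Prometheus text exposition format (illustrative helper)."""
    status = "success" if passed else "failed"
    return (
        f'test_execution_status{{job="{job}", '
        f'test_name="{test_name}", status="{status}"}} 1'
    )


print(render_test_metric("another_feature_should_work", passed=True))
# test_execution_status{job="test_job", test_name="another_feature_should_work", status="success"} 1
```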
This change will enable simple, effective queries like:

```
# Count all failures over the last hour
sum(count_over_time(test_execution_status{status="failed"}[1h]))

# Count failures for a specific test
sum(count_over_time(test_execution_status{test_name="input_creation_should_create_an_input_and_send_it_to_the_gateway", status="failed"}[1h]))

# Calculate failure rate by test
sum by (test_name) (count_over_time(test_execution_status{status="failed"}[1d]))
/
sum by (test_name) (count_over_time(test_execution_status[1d]))
```
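To make the failure-rate query concrete, the snippet below computes the same per-test ratio over a small set of hypothetical scraped samples, mirroring the `sum by (test_name)` division in plain Python (the sample data and short test names are made up for illustration):

```python
from collections import Counter

# Hypothetical samples of test_execution_status collected over a window
samples = [
    {"test_name": "another_feature_should_work", "status": "success"},
    {"test_name": "another_feature_should_work", "status": "failed"},
    {"test_name": "another_feature_should_work", "status": "success"},
    {"test_name": "input_creation", "status": "failed"},
]

# Analogue of: sum by (test_name) (count_over_time(...{status="failed"}[1d]))
failed = Counter(s["test_name"] for s in samples if s["status"] == "failed")
# Analogue of: sum by (test_name) (count_over_time(...[1d]))
total = Counter(s["test_name"] for s in samples)

# Failure fraction per test_name
failure_rate = {name: failed[name] / total[name] for name in total}
print(failure_rate)
```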