
[Ray Core] Many metrics stop working after the 2.52 upgrade (OTEL enabled by default) #59968

@kanwang

What happened + What you expected to happen

When I start a Ray cluster with an SSL context, many metrics are missing after the 2.52 upgrade, which enables OTEL by default.

I am aware of #59361, but 1) we didn't enable the auth token, and 2) I am running 2.53, where that issue should already be fixed.

What I observed:

```
# start a cluster with 2.53
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort | uniq

ray_component_cpu_percentage
ray_component_mem_shared_bytes
ray_component_num_fds
ray_component_rss_mb
ray_component_uss_mb
ray_node_cpu_count
ray_node_cpu_utilization
ray_node_disk_free
ray_node_disk_io_read
ray_node_disk_io_read_count
ray_node_disk_io_read_speed
ray_node_disk_io_write
ray_node_disk_io_write_count
ray_node_disk_io_write_speed
ray_node_disk_read_iops
ray_node_disk_usage
ray_node_disk_utilization_percentage
ray_node_disk_write_iops
ray_node_gpus_available
ray_node_gpus_utilization
ray_node_gram_available
ray_node_gram_used
ray_node_mem_available
ray_node_mem_shared_bytes
ray_node_mem_total
ray_node_mem_used
ray_node_network_receive_speed
ray_node_network_received
ray_node_network_send_speed
ray_node_network_sent
```
```
# restart the cluster with `RAY_enable_open_telemetry=false`
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort | uniq

ray_component_cpu_percentage
ray_component_mem_shared_bytes
ray_component_num_fds
ray_component_rss_mb
ray_component_uss_mb
ray_finished_jobs_total
ray_gcs_actors_count
ray_gcs_placement_group_count
ray_gcs_storage_operation_count_total
ray_gcs_storage_operation_latency_ms_bucket
ray_gcs_storage_operation_latency_ms_count
ray_gcs_storage_operation_latency_ms_sum
ray_gcs_task_manager_task_events_dropped
ray_gcs_task_manager_task_events_reported
ray_gcs_task_manager_task_events_stored
ray_grpc_client_req_failed_total
ray_grpc_server_req_finished_total
ray_grpc_server_req_handling_total
ray_grpc_server_req_new_total
ray_grpc_server_req_process_time_ms_bucket
ray_grpc_server_req_process_time_ms_count
ray_grpc_server_req_process_time_ms_sum
ray_grpc_server_req_succeeded_total
ray_health_check_rpc_latency_ms_bucket
ray_health_check_rpc_latency_ms_count
ray_health_check_rpc_latency_ms_sum
ray_internal_num_infeasible_scheduling_classes
ray_internal_num_spilled_tasks
ray_io_context_event_loop_lag_ms
ray_local_resource_view_node_count
ray_node_cpu_count
ray_node_cpu_utilization
ray_node_disk_free
ray_node_disk_io_read
ray_node_disk_io_read_count
ray_node_disk_io_read_speed
ray_node_disk_io_write
ray_node_disk_io_write_count
ray_node_disk_io_write_speed
ray_node_disk_read_iops
ray_node_disk_usage
ray_node_disk_utilization_percentage
ray_node_disk_write_iops
ray_node_gpus_available
ray_node_gpus_utilization
ray_node_gram_available
ray_node_gram_used
ray_node_mem_available
ray_node_mem_shared_bytes
ray_node_mem_total
ray_node_mem_used
ray_node_network_receive_speed
ray_node_network_received
ray_node_network_send_speed
ray_node_network_sent
ray_object_directory_added_locations
ray_object_directory_lookups
ray_object_directory_removed_locations
ray_object_directory_subscriptions
ray_object_directory_updates
ray_object_manager_bytes
ray_object_manager_num_pull_requests
ray_object_manager_received_chunks
ray_object_store_available_memory
ray_object_store_fallback_memory
ray_object_store_memory
ray_object_store_num_local_objects
ray_object_store_used_memory
ray_operation_active_count
ray_operation_count_total
ray_operation_queue_time_ms_bucket
ray_operation_queue_time_ms_count
ray_operation_queue_time_ms_sum
ray_operation_run_time_ms_bucket
ray_operation_run_time_ms_count
ray_operation_run_time_ms_sum
ray_pull_manager_active_bundles
ray_pull_manager_num_object_pins
ray_pull_manager_requested_bundles
ray_pull_manager_requests
ray_pull_manager_retries_total
ray_pull_manager_usage_bytes
ray_push_manager_chunks
ray_push_manager_num_pushes_remaining
ray_resources
ray_running_jobs
ray_scheduler_failed_worker_startup_total
ray_scheduler_tasks
ray_scheduler_unscheduleable_tasks
ray_spill_manager_objects
ray_spill_manager_objects_bytes
ray_spill_manager_request_total
```
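Comparing the two runs: only the `ray_component_*` and `ray_node_*` stats survive under OTEL; everything exported by the C++ processes (GCS, raylet) is gone. To enumerate the missing metrics mechanically, a small sketch (the file names `otel.txt` and `legacy.txt` are placeholders):

```shell
# capture metric names from an OTEL run, then from a legacy run, and diff them
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort -u > otel.txt    # OTEL (default)
# ... restart the cluster with RAY_enable_open_telemetry=false ...
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort -u > legacy.txt
# print names present only in the legacy run, i.e. missing under OTEL
comm -13 otel.txt legacy.txt
```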

Some initial troubleshooting. I see the following in the logs:

```
# from raylet
[2026-01-08 05:56:18,925 W1449 2326] (raylet) open_telemetry_metric_recorder.cc:66: Failed to export metrics to the metrics agent. Result: 1
# from dashboard_agent
I0000 00:00:1767851741.605813 1609 ssl_transport_security.cc:1884] Handshake failed with error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER: Invalid certificate verification context
```

The WRONG_VERSION_NUMBER error on the dashboard agent side is what a TLS server reports when a client speaks plaintext to it, so I suspect the OTEL exporter is missing the SSL context setup when it calls the metrics agent.
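One way to check that half of the theory is to probe the agent's gRPC endpoint directly; a sketch, where `<agent-grpc-port>` and the CA path are placeholders to fill in for your cluster:

```shell
# If the handshake completes (or fails only because the server requests a
# client certificate), the agent is serving TLS, and a plaintext OTLP
# exporter would produce exactly the WRONG_VERSION_NUMBER error above.
openssl s_client -connect localhost:<agent-grpc-port> -CAfile /path/to/ca.crt </dev/null
```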

I asked Claude to trace the code, and it generated the fix in the diff below. It does look like we are simply missing the SSL context, but I haven't had a chance to test it locally.

```diff
diff --git a/src/ray/observability/open_telemetry_metric_recorder.cc b/src/ray/observability/open_telemetry_metric_recorder.cc
index 961c9d0c4b..ca28ce53c0 100644
--- a/src/ray/observability/open_telemetry_metric_recorder.cc
+++ b/src/ray/observability/open_telemetry_metric_recorder.cc
@@ -27,6 +27,8 @@
 #include <opentelemetry/sdk/metrics/view/view_registry.h>
 
 #include <cassert>
+#include <fstream>
+#include <sstream>
 #include <utility>
 
 #include "ray/common/constants.h"
@@ -96,6 +98,47 @@ void OpenTelemetryMetricRecorder::Start(const std::string &endpoint,
   // counting.
   exporter_options.aggregation_temporality =
       opentelemetry::exporter::otlp::PreferredAggregationTemporality::kDelta;
+
+  // Configure TLS/SSL credentials to match how Ray's gRPC servers are configured.
+  // When USE_TLS is enabled, the dashboard agent's gRPC server uses SSL, so the
+  // OpenTelemetry exporter must also use SSL to connect successfully.
+  if (RayConfig::instance().USE_TLS()) {
+    exporter_options.use_ssl_credentials = true;
+
+    // Helper lambda to read certificate file contents
+    auto read_cert_file = [](const std::string &filepath) -> std::string {
+      std::ifstream file(filepath);
+      std::stringstream buffer;
+      buffer << file.rdbuf();
+      return buffer.str();
+    };
+
+    // Load CA certificate for server verification
+    std::string ca_cert_file = std::string(RayConfig::instance().TLS_CA_CERT());
+    if (!ca_cert_file.empty()) {
+      exporter_options.ssl_credentials_cacert_as_string = read_cert_file(ca_cert_file);
+    }
+
+#ifdef ENABLE_OTLP_GRPC_SSL_MTLS_PREVIEW
+    // Load client certificate and key for mutual TLS (mTLS).
+    // Ray's gRPC server requires client authentication when CA cert is configured.
+    // Note: mTLS support requires the OpenTelemetry SDK to be built with
+    // ENABLE_OTLP_GRPC_SSL_MTLS_PREVIEW defined.
+    std::string client_cert_file = std::string(RayConfig::instance().TLS_SERVER_CERT());
+    std::string client_key_file = std::string(RayConfig::instance().TLS_SERVER_KEY());
+    if (!client_cert_file.empty()) {
+      exporter_options.ssl_client_cert_string = read_cert_file(client_cert_file);
+    }
+    if (!client_key_file.empty()) {
+      exporter_options.ssl_client_key_string = read_cert_file(client_key_file);
+    }
+    RAY_LOG(INFO) << "OpenTelemetry metric exporter configured with TLS and mTLS enabled";
+#else
+    RAY_LOG(INFO) << "OpenTelemetry metric exporter configured with TLS enabled "
+                  << "(mTLS not available - SDK built without ENABLE_OTLP_GRPC_SSL_MTLS_PREVIEW)";
+#endif
+  }
+
   // Add authentication token to metadata if auth is enabled
   if (rpc::RequiresTokenAuthentication()) {
     auto token = rpc::AuthenticationTokenLoader::instance().GetToken();
```

Can someone confirm whether the above theory is correct? I can iterate on unit tests / local testing if this seems like the right path forward.

Versions / Dependencies

Ray 2.53 + KubeRay 1.5

Reproduction script

I don't have a particular reproduction script, but starting any Ray cluster with an SSL context shows this problem.
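For reference, a minimal sketch of a setup that should trigger it (the cert paths are placeholders; `RAY_USE_TLS` / `RAY_TLS_*` are Ray's standard TLS variables):

```shell
# single-node cluster with TLS enabled
export RAY_USE_TLS=1
export RAY_TLS_SERVER_CERT=/path/to/server.crt   # placeholder paths
export RAY_TLS_SERVER_KEY=/path/to/server.key
export RAY_TLS_CA_CERT=/path/to/ca.crt
ray start --head --metrics-export-port=8080
# only the node/component stats show up; the raylet/GCS metrics are missing
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort -u
```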

Issue Severity

High: it blocks me from completing my task. We can't move to a newer version of Ray, though for now we have a workaround: setting RAY_enable_open_telemetry=false.
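For completeness, the workaround we are running with today (a sketch; under KubeRay the same variable goes into the container env):

```shell
# fall back to the legacy (pre-2.52) metrics exporter
export RAY_enable_open_telemetry=false
ray start --head --metrics-export-port=8080
```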

Labels

bug, community-backlog, core, observability, security, stability, triage
