Description
What happened + What you expected to happen
When I start a Ray cluster with an SSL/TLS context, a lot of metrics appear to be missing after the 2.52 upgrade, which defaults to OTel.
I am aware of #59361, but 1) we did not enable the auth token, and 2) I am running 2.53, so that issue should already be fixed.
What I observed:
# start a cluster with 2.53
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort | uniq
ray_component_cpu_percentage
ray_component_mem_shared_bytes
ray_component_num_fds
ray_component_rss_mb
ray_component_uss_mb
ray_node_cpu_count
ray_node_cpu_utilization
ray_node_disk_free
ray_node_disk_io_read
ray_node_disk_io_read_count
ray_node_disk_io_read_speed
ray_node_disk_io_write
ray_node_disk_io_write_count
ray_node_disk_io_write_speed
ray_node_disk_read_iops
ray_node_disk_usage
ray_node_disk_utilization_percentage
ray_node_disk_write_iops
ray_node_gpus_available
ray_node_gpus_utilization
ray_node_gram_available
ray_node_gram_used
ray_node_mem_available
ray_node_mem_shared_bytes
ray_node_mem_total
ray_node_mem_used
ray_node_network_receive_speed
ray_node_network_received
ray_node_network_send_speed
ray_node_network_sent
# restart the cluster with `RAY_enable_open_telemetry=false`
curl -s localhost:8080/metrics | grep '^ray' | cut -d{ -f1 | sort | uniq
ray_component_cpu_percentage
ray_component_mem_shared_bytes
ray_component_num_fds
ray_component_rss_mb
ray_component_uss_mb
ray_finished_jobs_total
ray_gcs_actors_count
ray_gcs_placement_group_count
ray_gcs_storage_operation_count_total
ray_gcs_storage_operation_latency_ms_bucket
ray_gcs_storage_operation_latency_ms_count
ray_gcs_storage_operation_latency_ms_sum
ray_gcs_task_manager_task_events_dropped
ray_gcs_task_manager_task_events_reported
ray_gcs_task_manager_task_events_stored
ray_grpc_client_req_failed_total
ray_grpc_server_req_finished_total
ray_grpc_server_req_handling_total
ray_grpc_server_req_new_total
ray_grpc_server_req_process_time_ms_bucket
ray_grpc_server_req_process_time_ms_count
ray_grpc_server_req_process_time_ms_sum
ray_grpc_server_req_succeeded_total
ray_health_check_rpc_latency_ms_bucket
ray_health_check_rpc_latency_ms_count
ray_health_check_rpc_latency_ms_sum
ray_internal_num_infeasible_scheduling_classes
ray_internal_num_spilled_tasks
ray_io_context_event_loop_lag_ms
ray_local_resource_view_node_count
ray_node_cpu_count
ray_node_cpu_utilization
ray_node_disk_free
ray_node_disk_io_read
ray_node_disk_io_read_count
ray_node_disk_io_read_speed
ray_node_disk_io_write
ray_node_disk_io_write_count
ray_node_disk_io_write_speed
ray_node_disk_read_iops
ray_node_disk_usage
ray_node_disk_utilization_percentage
ray_node_disk_write_iops
ray_node_gpus_available
ray_node_gpus_utilization
ray_node_gram_available
ray_node_gram_used
ray_node_mem_available
ray_node_mem_shared_bytes
ray_node_mem_total
ray_node_mem_used
ray_node_network_receive_speed
ray_node_network_received
ray_node_network_send_speed
ray_node_network_sent
ray_object_directory_added_locations
ray_object_directory_lookups
ray_object_directory_removed_locations
ray_object_directory_subscriptions
ray_object_directory_updates
ray_object_manager_bytes
ray_object_manager_num_pull_requests
ray_object_manager_received_chunks
ray_object_store_available_memory
ray_object_store_fallback_memory
ray_object_store_memory
ray_object_store_num_local_objects
ray_object_store_used_memory
ray_operation_active_count
ray_operation_count_total
ray_operation_queue_time_ms_bucket
ray_operation_queue_time_ms_count
ray_operation_queue_time_ms_sum
ray_operation_run_time_ms_bucket
ray_operation_run_time_ms_count
ray_operation_run_time_ms_sum
ray_pull_manager_active_bundles
ray_pull_manager_num_object_pins
ray_pull_manager_requested_bundles
ray_pull_manager_requests
ray_pull_manager_retries_total
ray_pull_manager_usage_bytes
ray_push_manager_chunks
ray_push_manager_num_pushes_remaining
ray_resources
ray_running_jobs
ray_scheduler_failed_worker_startup_total
ray_scheduler_tasks
ray_scheduler_unscheduleable_tasks
ray_spill_manager_objects
ray_spill_manager_objects_bytes
ray_spill_manager_request_total
Some initial troubleshooting. I see these in the logs:
# from raylet
[2026-01-08 05:56:18,925 W1449 2326] (raylet) open_telemetry_metric_recorder.cc:66: Failed to export metrics to the metrics agent. Result: 1
# from dashboard_agent
I0000 00:00:1767851741.605813 1609 ssl_transport_security.cc:1884] Handshake failed with error SSL_ERROR_SSL: error:100000f7:SSLroutines:OPENSSL_internal:WRONG_VERSION_NUMBER: Invalid certificate verification context
So I suspect the errors above come from the OTel client not setting up an SSL context when it calls the metrics agent.
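To illustrate the suspected mismatch, here is a minimal standalone sketch (not Ray code; the port and certificate paths are hypothetical). It starts a gRPC listener with SSL credentials, roughly the way the dashboard agent's server is configured when TLS is enabled, and then opens the kind of plaintext channel the OTLP gRPC exporter uses when use_ssl_credentials is left at its default. The TLS server reads a cleartext HTTP/2 preface where it expects a TLS ClientHello, which is what OpenSSL reports as WRONG_VERSION_NUMBER in the dashboard_agent log above.

#include <fstream>
#include <memory>
#include <sstream>
#include <string>

#include <grpcpp/grpcpp.h>

// Read a PEM file into a string (same pattern as the proposed fix below).
static std::string ReadFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  std::stringstream buf;
  buf << in.rdbuf();
  return buf.str();
}

int main() {
  // Server side: a TLS-terminating gRPC listener, standing in for the metrics
  // agent when Ray runs with TLS enabled. Paths and port are placeholders.
  grpc::SslServerCredentialsOptions server_opts;
  server_opts.pem_root_certs = ReadFile("ca.crt");
  server_opts.pem_key_cert_pairs.push_back({ReadFile("server.key"), ReadFile("server.crt")});
  grpc::ServerBuilder builder;
  builder.AddListeningPort("127.0.0.1:50051", grpc::SslServerCredentials(server_opts));
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();

  // Client variant 1: a plaintext channel, which is what the exporter opens
  // when use_ssl_credentials is false. Handshaking against the TLS listener
  // above fails with WRONG_VERSION_NUMBER on the server side.
  auto plaintext_channel =
      grpc::CreateChannel("127.0.0.1:50051", grpc::InsecureChannelCredentials());

  // Client variant 2: an SSL channel that trusts the same CA, which is the
  // kind of channel the exporter would need for the handshake to succeed.
  grpc::SslCredentialsOptions client_opts;
  client_opts.pem_root_certs = ReadFile("ca.crt");
  auto tls_channel =
      grpc::CreateChannel("127.0.0.1:50051", grpc::SslCredentials(client_opts));

  server->Shutdown();
  return 0;
}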
I asked Claude to trace the code, and it generated a fix with the diff below. It does look like we are simply missing the SSL context, but I haven't had a chance to test it locally.
diff --git a/src/ray/observability/open_telemetry_metric_recorder.cc b/src/ray/observability/open_telemetry_metric_recorder.cc
index 961c9d0c4b..ca28ce53c0 100644
--- a/src/ray/observability/open_telemetry_metric_recorder.cc
+++ b/src/ray/observability/open_telemetry_metric_recorder.cc
@@ -27,6 +27,8 @@
 #include <opentelemetry/sdk/metrics/view/view_registry.h>
 
 #include <cassert>
+#include <fstream>
+#include <sstream>
 #include <utility>
 
 #include "ray/common/constants.h"
@@ -96,6 +98,47 @@ void OpenTelemetryMetricRecorder::Start(const std::string &endpoint,
   // counting.
   exporter_options.aggregation_temporality =
       opentelemetry::exporter::otlp::PreferredAggregationTemporality::kDelta;
+
+  // Configure TLS/SSL credentials to match how Ray's gRPC servers are configured.
+  // When USE_TLS is enabled, the dashboard agent's gRPC server uses SSL, so the
+  // OpenTelemetry exporter must also use SSL to connect successfully.
+  if (RayConfig::instance().USE_TLS()) {
+    exporter_options.use_ssl_credentials = true;
+
+    // Helper lambda to read certificate file contents
+    auto read_cert_file = [](const std::string &filepath) -> std::string {
+      std::ifstream file(filepath);
+      std::stringstream buffer;
+      buffer << file.rdbuf();
+      return buffer.str();
+    };
+
+    // Load CA certificate for server verification
+    std::string ca_cert_file = std::string(RayConfig::instance().TLS_CA_CERT());
+    if (!ca_cert_file.empty()) {
+      exporter_options.ssl_credentials_cacert_as_string = read_cert_file(ca_cert_file);
+    }
+
+#ifdef ENABLE_OTLP_GRPC_SSL_MTLS_PREVIEW
+    // Load client certificate and key for mutual TLS (mTLS).
+    // Ray's gRPC server requires client authentication when CA cert is configured.
+    // Note: mTLS support requires the OpenTelemetry SDK to be built with
+    // ENABLE_OTLP_GRPC_SSL_MTLS_PREVIEW defined.
+    std::string client_cert_file = std::string(RayConfig::instance().TLS_SERVER_CERT());
+    std::string client_key_file = std::string(RayConfig::instance().TLS_SERVER_KEY());
+    if (!client_cert_file.empty()) {
+      exporter_options.ssl_client_cert_string = read_cert_file(client_cert_file);
+    }
+    if (!client_key_file.empty()) {
+      exporter_options.ssl_client_key_string = read_cert_file(client_key_file);
+    }
+    RAY_LOG(INFO) << "OpenTelemetry metric exporter configured with TLS and mTLS enabled";
+#else
+    RAY_LOG(INFO) << "OpenTelemetry metric exporter configured with TLS enabled "
+                  << "(mTLS not available - SDK built without ENABLE_OTLP_GRPC_SSL_MTLS_PREVIEW)";
+#endif
+  }
+
   // Add authentication token to metadata if auth is enabled
   if (rpc::RequiresTokenAuthentication()) {
     auto token = rpc::AuthenticationTokenLoader::instance().GetToken();
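One caveat with the helper above (my observation, not part of the generated diff): read_cert_file silently returns an empty string when the configured path is missing or unreadable, which the exporter would then treat as "no certificate" and fail in a way that is hard to diagnose. A more defensive variant, sketched here as a standalone function rather than as Ray code (in Ray it would presumably use RAY_LOG or RAY_CHECK instead of an exception), could look like this:

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical defensive replacement for read_cert_file: fail loudly when a
// configured TLS file cannot be opened instead of handing the exporter an
// empty string.
std::string ReadCertFileOrThrow(const std::string &filepath) {
  std::ifstream file(filepath, std::ios::binary);
  if (!file.is_open()) {
    throw std::runtime_error("Unable to open TLS file: " + filepath);
  }
  std::stringstream buffer;
  buffer << file.rdbuf();
  return buffer.str();
}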
Can someone confirm whether the theory above is correct? I can iterate on unit testing/local testing if this seems like the right path forward.
Versions / Dependencies
Ray 2.53 + KubeRay 1.5
Reproduction script
I don't have a particular reproduction script, but starting any Ray cluster with an SSL/TLS context shows this problem.
Issue Severity
High: It blocks me from completing my task. We can't move to a newer version of Ray, though for now we have a workaround of setting RAY_enable_open_telemetry=false.