[Doc] Logging: Add Fluent Bit DaemonSet and Grafana Loki to Persist KubeRay operator logs (#48725)

Signed-off-by: win5923 <[email protected]>
win5923 authored Nov 25, 2024
1 parent fe52a25 commit ba8674a
Showing 10 changed files with 180 additions and 15 deletions.
2 changes: 1 addition & 1 deletion doc/source/cluster/configure-manage-dashboard.md
@@ -5,7 +5,7 @@
Dashboard configurations may differ depending on how you launch Ray Clusters (e.g., local Ray Cluster vs. KubeRay). Integrations with Prometheus and Grafana are optional for an enhanced Dashboard experience.

:::{note}
Ray Dashboard is only intended for interactive development and debugging because the Dashboard UI and the underlying data are not accessible after Clusters are terminated. For production monitoring and debugging, users should rely on [persisted logs](../cluster/kubernetes/user-guides/logging.md), [persisted metrics](./metrics.md), [persisted Ray states](../ray-observability/user-guides/cli-sdk.rst), and other observability tools.
Ray Dashboard is only intended for interactive development and debugging because the dashboard UI and the underlying data are no longer accessible after clusters terminate. For production monitoring and debugging, you should rely on [persisted logs](../cluster/kubernetes/user-guides/persist-kuberay-custom-resource-logs.md), [persisted metrics](./metrics.md), [persisted Ray states](../ray-observability/user-guides/cli-sdk.rst), and other observability tools.
:::

## Changing the Ray Dashboard port
46 changes: 46 additions & 0 deletions doc/source/cluster/kubernetes/configs/loki.log.yaml
@@ -0,0 +1,46 @@
# Fluent Bit Config
config:
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
  outputs: |
    [OUTPUT]
        Name loki
        Match *
        Host loki-gateway
        Port 80
        Labels job=fluent-bit,namespace=$kubernetes['namespace_name'],pod=$kubernetes['pod_name'],container=$kubernetes['container_name']
        Auto_Kubernetes_Labels Off
        tenant_id test
---
# Grafana Datasource Config
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      editable: true
      url: http://loki-gateway.default
      jsonData:
        timeout: 60
        maxLines: 1000
        httpHeaderName1: "X-Scope-OrgID"
      secureJsonData:
        httpHeaderValue1: "test"
@@ -35,11 +35,12 @@ kubectl get pods
# kuberay-operator-7fbdbf8c89-pt8bk 1/1 Running 0 27s
```

KubeRay offers multiple options for operator installations, such as Helm, Kustomize, and a single-namespaced operator. For further information, please refer to [the installation instructions in the KubeRay documentation](https://ray-project.github.io/kuberay/deploy/installation/).
KubeRay offers multiple options for operator installations, such as Helm, Kustomize, and a single-namespaced operator. For further information, see [the installation instructions in the KubeRay documentation](https://ray-project.github.io/kuberay/deploy/installation/).

(raycluster-deploy)=
## Step 3: Deploy a RayCluster custom resource

Once the KubeRay operator is running, we are ready to deploy a RayCluster. To do so, we create a RayCluster Custom Resource (CR) in the `default` namespace.
Once the KubeRay operator is running, you're ready to deploy a RayCluster. Create a RayCluster Custom Resource (CR) in the `default` namespace.

::::{tab-set}

6 changes: 4 additions & 2 deletions doc/source/cluster/kubernetes/user-guides.md
@@ -15,7 +15,8 @@ user-guides/config
user-guides/configuring-autoscaling
user-guides/kuberay-gcs-ft
user-guides/gke-gcs-bucket
user-guides/logging
user-guides/persist-kuberay-custom-resource-logs
user-guides/persist-kuberay-operator-logs
user-guides/gpu
user-guides/tpu
user-guides/rayserve-dev-doc
@@ -45,7 +46,8 @@ at the {ref}`introductory guide <kuberay-quickstart>` first.
* {ref}`kuberay-gpu`
* {ref}`kuberay-tpu`
* {ref}`kuberay-gcs-ft`
* {ref}`kuberay-logging`
* {ref}`persist-kuberay-custom-resource-logs`
* {ref}`persist-kuberay-operator-logs`
* {ref}`kuberay-dev-serve`
* {ref}`kuberay-pod-command`
* {ref}`kuberay-pod-security`
2 changes: 1 addition & 1 deletion doc/source/cluster/kubernetes/user-guides/config.md
@@ -126,7 +126,7 @@ Here are some of the subfields of the pod `template` to pay attention to:
#### containers
A Ray pod template specifies at minimum one container, namely the container
that runs the Ray processes. A Ray pod template may also specify additional sidecar
containers, for purposes such as {ref}`log processing <kuberay-logging>`. However, the KubeRay operator assumes that
containers, for purposes such as {ref}`log processing <persist-kuberay-custom-resource-logs>`. However, the KubeRay operator assumes that
the first container in the containers list is the main Ray container.
Therefore, make sure to specify any sidecar containers
**after** the main Ray container. In other words, the Ray container should be the **first**
@@ -1,6 +1,6 @@
(kuberay-logging)=
(persist-kuberay-custom-resource-logs)=

# Log Persistence
# Persist KubeRay custom resource logs

Logs (both system and application logs) are useful for troubleshooting Ray applications and Clusters. For example, you may want to access system logs if a node terminates unexpectedly.

@@ -0,0 +1,116 @@
(persist-kuberay-operator-logs)=

# Persist KubeRay Operator Logs

The KubeRay Operator plays a vital role in managing Ray clusters on Kubernetes. Persisting its logs is essential for effective troubleshooting and monitoring. This guide describes methods to set up centralized logging for KubeRay Operator logs.

## Grafana Loki

[Grafana Loki][GrafanaLoki] is a log aggregation system optimized for Kubernetes, providing efficient log storage and querying. The following steps set up [Fluent Bit][FluentBit] as a DaemonSet to collect logs from Kubernetes containers and send them to Loki for centralized storage and analysis.

### Deploy Loki monolithic mode

Loki's Helm chart supports three deployment modes to fit different scalability and performance needs: Monolithic, Simple Scalable, and Microservices. This guide demonstrates the monolithic mode. For details on each mode, see the [Loki deployment modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/) documentation.

Deploy Loki with the [Helm chart repository](https://github.com/grafana/loki/tree/main/production/helm/loki).

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki with single replica mode
helm install loki grafana/loki -f https://raw.githubusercontent.com/grafana/loki/refs/heads/main/production/helm/loki/single-binary-values.yaml
```
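
Before moving on, you can check that the Loki release is up. This is a quick sanity check, assuming the `loki` release name used above and the chart's standard labels; exact pod names vary by chart version.

```shell
# List the pods created by the Loki Helm release.
# With the single-binary values, expect pods such as loki-0 and loki-gateway in the Running state.
kubectl get pods -l "app.kubernetes.io/instance=loki"
```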

### Configure log processing

Create a `fluent-bit-config.yaml` file, which configures Fluent Bit to:

* Tail log files from Kubernetes containers.
* Parse multi-line logs for Docker and Container Runtime Interface (CRI) formats.
* Enrich logs with Kubernetes metadata such as namespace, pod, and container names.
* Send the logs to Loki for centralized storage and querying.
```{literalinclude} ../configs/loki.log.yaml
:language: yaml
:start-after: Fluent Bit Config
:end-before: ---
```

A few notes on the above config:

* Inputs: The `tail` input reads log files from `/var/log/containers/*.log`, with `multiline.parser` to handle complex log messages across multiple lines.
* Filters: The `kubernetes` filter adds metadata like namespace, pod, and container names to each log, enabling more efficient log management and querying in Loki.
* Outputs: The `loki` output block specifies Loki as the target. The `Host` and `Port` define the Loki service endpoint, and `Labels` adds metadata for easier querying in Grafana. Additionally, `tenant_id` allows for multi-tenancy if required by the Loki setup.

Deploy Fluent Bit with the [Helm chart repository](https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit).

```shell
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit -f fluent-bit-config.yaml
```
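
Fluent Bit now runs as a DaemonSet, with one pod per node tailing container logs and shipping them to Loki. As a quick check, assuming the `fluent-bit` release name used above and the chart's default labels:

```shell
# Fluent Bit is a DaemonSet, so expect one Running pod per Kubernetes node.
kubectl get daemonset fluent-bit
kubectl get pods -l "app.kubernetes.io/name=fluent-bit"
```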

### Install the KubeRay Operator

Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the KubeRay operator.
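
The LogQL query at the end of this guide filters on the operator pod, so it's convenient to note the operator pod name now. The label selector below assumes the default labels set by the KubeRay Helm chart:

```shell
# Note the operator pod name, for example kuberay-operator-7fbdbf8c89-pt8bk.
kubectl get pods -l "app.kubernetes.io/name=kuberay-operator"
```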


### Deploy a RayCluster

Follow [Deploy a RayCluster custom resource](raycluster-deploy) to deploy a RayCluster.
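
You can confirm that the operator reconciled the cluster before wiring up Grafana. The cluster name `raycluster-kuberay` below assumes the default Helm release name from the quickstart:

```shell
# Verify that the RayCluster exists and that its head and worker pods are starting.
kubectl get rayclusters
kubectl get pods -l "ray.io/cluster=raycluster-kuberay"
```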


### Deploy Grafana

Create a `datasource-config.yaml` file with the following configuration to set up Grafana's Loki datasource:
```{literalinclude} ../configs/loki.log.yaml
:language: yaml
:start-after: Grafana Datasource Config
```

Deploy Grafana with the [Helm chart repository](https://github.com/grafana/helm-charts/tree/main/charts/grafana).

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install grafana grafana/grafana -f datasource-config.yaml
```

### Check the Grafana Dashboard

```shell
# Verify that the Grafana pod is running in the `default` namespace.
kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana"
# NAME READY STATUS RESTARTS AGE
# grafana-54d5d747fd-5fldc 1/1 Running 0 8m21s
```

To access Grafana from your local machine, set up port forwarding by running:
```shell
export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 3000
```

This command makes Grafana available locally at `http://localhost:3000`.

* Username: "admin"
* Password: Get the password using the following command:

```shell
kubectl get secret --namespace default grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```

Finally, use a LogQL query to view logs for a specific pod, such as the KubeRay Operator, and filter logs by the `RayCluster_name`:

```
{pod="kuberay-operator-xxxxxxxx-xxxxx"} | json | RayCluster_name = `raycluster-kuberay`
```

![Loki Logs](images/loki-logs.png)

You can use LogQL's JSON syntax to filter logs based on specific fields, such as `RayCluster_name`. See [Log query language doc](https://grafana.com/docs/loki/latest/query/) for more information about LogQL filtering.
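
For example, because the Fluent Bit `Labels` setting attaches `job`, `namespace`, `pod`, and `container` labels to every stream, you can also narrow the stream selector before applying the JSON filter. The following query is a sketch that assumes the KubeRay operator runs in the `default` namespace:

```
{job="fluent-bit", namespace="default"} | json | RayCluster_name = `raycluster-kuberay`
```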

[GrafanaLoki]: https://grafana.com/oss/loki/
[FluentBit]: https://docs.fluentbit.io/manual
12 changes: 6 additions & 6 deletions doc/source/ray-observability/user-guides/configure-logging.md
@@ -28,7 +28,7 @@ A new Ray session creates a new folder to the temp directory. The latest session

Usually, temp directories are cleared up whenever the machines reboot. As a result, log files may get lost whenever your cluster or some of the nodes are stopped or terminated.

If you need to inspect logs after the clusters are stopped or terminated, you need to store and persist the logs. View the instructions for how to process and export logs for {ref}`clusters on VMs <vm-logging>` and {ref}`KubeRay Clusters <kuberay-logging>`.
If you need to inspect logs after the clusters stop or terminate, you need to store and persist the logs. See the instructions for how to process and export logs for {ref}`clusters on VMs <vm-logging>` and {ref}`KubeRay clusters <persist-kuberay-custom-resource-logs>`.

(logging-directory-structure)=
## Log files in logging directory
@@ -131,12 +131,12 @@ ray.get([task.remote() for _ in range(100)])
The output is as follows:

```bash
2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(task pid=534172) Hello there, I am a task 0.20583517821231412
(task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication)
```

This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them.
This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them.

Configure the following environment variables on the driver process **before importing Ray** to customize log deduplication:

@@ -247,8 +247,8 @@ ray_tune_logger.addHandler(logging.FileHandler("extra_ray_tune_log.log"))
Implement structured logging to enable downstream users and applications to consume the logs efficiently.

### Application logs
A Ray applications include both driver and worker processes. For Python applications, use Python loggers to format and structure your logs.
As a result, Python loggers need to be set up for both driver and worker processes.
A Ray app includes both driver and worker processes. For Python apps, use Python loggers to format and structure your logs.
As a result, you need to set up Python loggers for both driver and worker processes.

::::{tab-set}

@@ -472,4 +472,4 @@ The max size of a log file, including its backup, is `RAY_ROTATION_MAX_BYTES * R

## Log persistence

To process and export logs to external stroage or management systems, view {ref}`log persistence on Kubernetes <kuberay-logging>` and {ref}`log persistence on VMs <vm-logging>` for more details.
To process and export logs to external storage or management systems, see {ref}`log persistence on Kubernetes <persist-kuberay-custom-resource-logs>` and {ref}`log persistence on VMs <vm-logging>` for more details.
2 changes: 1 addition & 1 deletion doc/source/serve/production-guide/kubernetes.md
@@ -238,7 +238,7 @@ Monitor your Serve application using the Ray Dashboard.
- Learn more about how to configure and manage Dashboard [here](observability-configure-manage-dashboard).
- Learn about the Ray Serve Dashboard [here](serve-monitoring).
- Learn how to set up [Prometheus](prometheus-setup) and [Grafana](grafana) for Dashboard.
- Learn about the [Ray Serve logs](serve-logging) and how to [persistent logs](kuberay-logging) on Kubernetes.
- Learn about the [Ray Serve logs](serve-logging) and how to [persist logs](persist-kuberay-custom-resource-logs) on Kubernetes.

:::{note}
- To troubleshoot application deployment failures in Serve, you can check the KubeRay operator logs by running `kubectl logs -f <kuberay-operator-pod-name>` (e.g., `kubectl logs -f kuberay-operator-7447d85d58-lv7pf`). The KubeRay operator logs contain information about the Serve application deployment event and Serve application health checks.