dev/pd: update metrics document
Signed-off-by: disksing <[email protected]>
disksing committed Jul 9, 2019
1 parent 8af0888 commit 8ef70a0
Showing 11 changed files with 42 additions and 31 deletions.
73 changes: 42 additions & 31 deletions dev/reference/key-monitoring-metrics/pd-dashboard.md
@@ -18,20 +18,35 @@ category: reference
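- PD role: the role of the current PD instance
- Storage capacity: the total available storage capacity of the TiDB cluster
- Current storage size: the storage space currently used by the TiDB cluster
- Number of Regions: the total number of Regions in the cluster
- Leader balance ratio: the percentage difference in leader count between the nodes with the most and the fewest leaders; usually below 5%, though it fluctuates considerably when a node restarts
- Region balance ratio: the percentage difference in Region count between the nodes with the most and the fewest Regions; usually below 5%, though the gap is larger when nodes are added or taken offline
- Current storage usage: the storage space utilization of the TiDB cluster
- Normal stores: the number of nodes in a normal state
- Number of Regions: the total number of Regions in the cluster
- PD scheduler config: the list of PD scheduling configuration items
- Region label isolation level: the number of Regions at each label isolation level
- Label distribution: the distribution of labels across the TiKV nodes in the cluster
- Abnormal stores: the number of nodes in an abnormal state; should be 0 under normal conditions
- Current storage usage: the storage space utilization of the TiDB cluster
- pd_cluster_metadata: records the cluster ID, the timestamp, and the generated ID
- Current peer count: the total number of peers in the cluster
- Metadata information: records the cluster ID, the timestamp, and the generated ID
- Region label isolation level: the number of Regions at each label isolation level
- Region health: the state of each Region; normally, the number of pending peers should be less than 100, and the number of miss peers should not stay above 0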

![PD Dashboard - Cluster metrics](/media/pd-dashboard-cluster.png)
![PD Dashboard - Cluster metrics](/media/pd-dashboard-cluster-v2.png)
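
The balance ratio and Region health guidance above can be sketched as simple checks. This is only an illustration: the source says "percentage difference" without giving a formula, so `1 - min/max` is an assumption, and the function names and sample numbers are hypothetical.

```python
def balance_ratio(counts):
    """Relative gap between the most and least loaded stores.

    Assumed formula: 1 - min/max. The document only says
    "percentage difference", so treat this as one plausible reading.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    highest = max(counts)
    if highest == 0:
        return 0.0  # no leaders/Regions anywhere: nothing to balance
    return 1.0 - min(counts) / highest


def region_health_warnings(pending_peers, miss_peer_samples):
    """Apply the two rules of thumb stated for the Region health panel."""
    warnings = []
    if pending_peers >= 100:
        warnings.append("pending peers >= 100")
    # "Should not stay above 0": warn only if every recent sample is > 0.
    if miss_peer_samples and all(n > 0 for n in miss_peer_samples):
        warnings.append("miss peers stayed above 0")
    return warnings


# Hypothetical leader counts for three TiKV stores.
leaders_per_store = [1000, 980, 990]
print(f"leader balance ratio: {balance_ratio(leaders_per_store):.1%}")
print(region_health_warnings(pending_peers=5, miss_peer_samples=[0, 3, 0]))
```

With these sample counts the ratio is 2%, which is within the "usually below 5%" range the list describes.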

## Operator

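- Schedule operator create: the number of newly created operators, by type
- Schedule operator check: the number of checked operators; the check mainly verifies whether the current step has finished and, if so, proceeds to the next step
- Schedule operator finish: the number of finished operators
- Schedule operator timeout: the number of timed-out operators
- Schedule operator replaced or canceled: the number of canceled or replaced operators
- Schedule operators count by state: the number of operators in each state
- 99% Operator finish duration: the time within which 99% of finished operators completed
- 50% Operator finish duration: the time within which 50% of finished operators completed
- 99% Operator step duration: the time within which 99% of the steps of finished operators completed
- 50% Operator step duration: the time within which 50% of the steps of finished operators completed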

## Balance
![PD Dashboard - Operator metrics](/media/pd-dashboard-operator-v2.png)
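
To make the "99%" and "50%" duration panels concrete, here is a minimal sketch of a nearest-rank percentile over raw samples. It is an assumption-laden illustration: dashboards of this kind usually estimate quantiles from histogram buckets rather than raw values, and the `percentile` helper and the sample durations are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical operator finish durations, in seconds.
durations = [2, 3, 3, 4, 5, 5, 6, 8, 12, 95]
print("50% finish duration:", percentile(durations, 50))
print("99% finish duration:", percentile(durations, 99))
```

A single slow operator dominates the 99% value while leaving the 50% value untouched, which is why the dashboard shows both.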

## Statistics - Balance

- Store capacity: the total storage capacity of each TiKV instance
- Store available: the available storage space of each TiKV instance
@@ -45,9 +60,9 @@ category: reference
- Store leader count: the number of leaders on each TiKV instance
- Store Region count: the number of Regions on each TiKV instance

![PD Dashboard - Balance metrics](/media/pd-dashboard-balance.png)
![PD Dashboard - Balance metrics](/media/pd-dashboard-balance-v2.png)

## HotRegion
## Statistics - hotspot

- Hot write Region's leader distribution: the number of leaders that are write hotspots on each TiKV instance
- Hot write Region's peer distribution: the number of peers that are write hotspots on each TiKV instance
@@ -58,7 +73,7 @@ category: reference
- Hot read Region's leader read bytes: the bytes read by hotspot leaders on each TiKV instance
- Hot read Region's peer read bytes: the bytes read by hotspot peers on each TiKV instance

![PD Dashboard - HotRegion metrics](/media/pd-dashboard-hot-region.png)
![PD Dashboard - Hotspot metrics](/media/pd-dashboard-hotspot.png)

## Scheduler

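@@ -72,30 +87,19 @@ category: reference
- Namespace checker: the status of the namespace checker
- Replica checker: the status of the replica checker
- Region merge checker: the status of the merge checker
- Filter target: the number of times a Store was considered as a scheduling target but did not pass the Filter
- Filter source: the number of times a Store was considered as a scheduling source but did not pass the Filter
- Balance Direction: the number of times a Store was selected as a scheduling target or source
- Store Limit: statistics on whether the scheduling generated for a Store passed the limit check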

![PD Dashboard - Scheduler metrics](/media/pd-dashboard-scheduler.png)

## Operator

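- Schedule operator create: the number of newly created operators, by type
- Schedule operator check: the number of checked operators; the check mainly verifies whether the current step has finished and, if so, proceeds to the next step
- Schedule operator finish: the number of finished operators
- Schedule operator timeout: the number of timed-out operators
- Schedule operator replaced or canceled: the number of canceled or replaced operators
- Schedule operators count by state: the number of operators in each state
- 99% Operator finish duration: the time within which 99% of finished operators completed
- 50% Operator finish duration: the time within which 50% of finished operators completed
- 99% Operator step duration: the time within which 99% of the steps of finished operators completed
- 50% Operator step duration: the time within which 50% of the steps of finished operators completed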

![PD Dashboard - Operator metrics](/media/pd-dashboard-operator.png)
![PD Dashboard - Scheduler metrics](/media/pd-dashboard-scheduler-v2.png)

## gRPC

- Completed commands rate: the rate at which gRPC commands complete
- 99% Completed commands duration: the time within which 99% of commands complete

![PD Dashboard - gRPC metrics](/media/pd-dashboard-grpc.png)
![PD Dashboard - gRPC metrics](/media/pd-dashboard-grpc-v2.png)

## etcd

@@ -108,14 +112,14 @@ category: reference
- Raft committed index: the Raft index of the most recent commit
- Raft applied index: the Raft index of the most recent apply

![PD Dashboard - etcd metrics](/media/pd-dashboard-etcd.png)
![PD Dashboard - etcd metrics](/media/pd-dashboard-etcd-v2.png)

## TiDB

- Handle requests count: the number of TiDB requests
- Handle requests duration: the time spent on each request; the 99th percentile should be less than 100 ms

![PD Dashboard - TiDB metrics](/media/pd-dashboard-tidb.png)
![PD Dashboard - TiDB metrics](/media/pd-dashboard-tidb-v2.png)

## Heartbeat

@@ -125,4 +129,11 @@ category: reference
- Region schedule push: the number of scheduling commands that PD sends to TiKV
- 99% Region heartbeat latency: the heartbeat latency at the 99th percentile

![PD Dashboard - Heartbeat metrics](/media/pd-dashboard-heartbeat.png)
![PD Dashboard - Heartbeat metrics](/media/pd-dashboard-heartbeat-v2.png)

## Region storage

- Syncer Index: the maximum index of the Region change history recorded by the Leader
- history last index: the index of the Region change history that the Follower has successfully synced

![PD Dashboard - Region storage](/media/pd-dashboard-region-storage.png)
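
One way to read the two Region storage indexes together is as a replication lag: the leader's index minus each follower's index. The sketch below assumes that interpretation, which the source does not state explicitly; the `sync_lag` helper, the member names, and the index values are all hypothetical.

```python
def sync_lag(leader_index, follower_indexes):
    """Return, per follower, how many Region history entries it is behind.

    Assumes lag = leader's Syncer Index - follower's history last index,
    clamped at 0 in case a follower sample is newer than the leader sample.
    """
    return {
        name: max(leader_index - index, 0)
        for name, index in follower_indexes.items()
    }


# Hypothetical readings taken from the two panels.
lag = sync_lag(1200, {"pd-2": 1200, "pd-3": 1185})
print(lag)
```

Under this reading, a lag that keeps growing for one follower would suggest that it is not keeping up with the leader's Region change history.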
Binary file added media/pd-dashboard-balance-v2.png
Binary file added media/pd-dashboard-cluster-v2.png
Binary file added media/pd-dashboard-etcd-v2.png
Binary file added media/pd-dashboard-grpc-v2.png
Binary file added media/pd-dashboard-heartbeat-v2.png
Binary file added media/pd-dashboard-hotspot.png
Binary file added media/pd-dashboard-operator-v2.png
Binary file added media/pd-dashboard-region-storage.png
Binary file added media/pd-dashboard-scheduler-v2.png
Binary file added media/pd-dashboard-tidb-v2.png