Merge pull request #28932 from taosdata/docs/dclow-dual-mode
docs: update active-active doc
guanshengliang authored Nov 26, 2024
2 parents 135131f + c30883b commit 2281014
Showing 1 changed file with 99 additions and 122 deletions.
221 changes: 99 additions & 122 deletions docs/en/08-operation/18-dual.md
@@ -1,37 +1,66 @@
---
title: Active-Standby Deployment
slug: /operations-and-maintenance/active-standby-deployment
title: Active-Active Deployment
slug: /operations-and-maintenance/active-active-deployment
---

import Image from '@theme/IdealImage';
import imgDual from '../assets/active-standby-deployment-01.png';

This section introduces the configuration and usage of the TDengine Active-Active System.
:::info[Version Note]

1. Some users can only deploy two servers due to the uniqueness of their deployment environment, while also hoping to achieve a certain level of service high availability and data high reliability. This article primarily describes the product behavior of the TDengine Active-Active System based on two key technologies: data replication and client failover. This includes the architecture, configuration, and operation and maintenance of the Active-Active System. The TDengine Active-Active feature can be used in resource-constrained environments, as previously mentioned, as well as in disaster recovery scenarios between two TDengine clusters (regardless of resources). The Active-Active feature is unique to TDengine Enterprise and was first released in version 3.3.0.0. It is recommended to use the latest version.
This feature is available only in TDengine Enterprise 3.3.0.0 and later.

2. The definition of an Active-Active system is: there are only two servers in the business system, each deploying a set of services. From the business layer's perspective, these two machines and two sets of services constitute a complete system, with the details of the underlying system not needing to be perceived by the business layer. The two nodes in the Active-Active system are usually referred to as Master-Slave, meaning "primary-secondary" or "primary-backup," and this document may mix these terms.
:::

You can deploy TDengine in active-active mode to achieve high availability and reliability with limited resources. Active-active mode is also used in disaster recovery strategies to maintain offsite replicas of the database.

In active-active mode, you create two separate TDengine deployments, one acting as the primary node and the other as the secondary node. Data is replicated in real time between the primary and secondary nodes via TDengine's built-in data subscription component. Note that each node in an active-active deployment can be a single TDengine instance or a cluster.

3. The deployment architecture diagram of the TDengine Active-Active System is as follows, involving three key points:
1. Failover of the dual system is implemented by the Client Driver, meaning the switch between primary and secondary nodes when the primary node goes down.
2. Data replication is achieved from the (current) primary node to the secondary node via taosX.
3. The write interface of data subscriptions adds a special mark in the WAL when writing replicated data, while the read interface of data subscriptions automatically filters out the data with that special mark during reads to avoid infinite loops caused by repeated replication.
In the event that the primary node cannot provide service, the client driver fails over to the secondary node. This failover is automatic and transparent to the business layer.

Note: The diagram below uses a single TDengine instance as an example, but in actual deployment, one host in the diagram can be replaced by any number of TDengine clusters.
Replicated data is specially marked to avoid infinite loops. The architecture of an active-active deployment is described in the following figure.

<figure>
<Image img={imgDual} alt=""/>
<figcaption>Figure 1. TDengine in active-active mode</figcaption>
</figure>

## Configuration
## Limitations

The following limitations apply to active-active deployments:

1. You cannot use the data subscription APIs when active-active mode is enabled.
2. You cannot use the parameter binding interface while active-active mode is enabled.
3. The primary and secondary nodes must be identical. Database names, all configuration parameters, usernames, passwords, and permission settings must be exactly the same.
4. You can connect to an active-active deployment only through the Java client library in WebSocket mode.
5. Do not use the `USE <database>` statement to set a context. Instead, specify the database in the connection parameters.

## Cluster Configuration

It is not necessary to configure your cluster specifically for active-active mode. However, note that the WAL retention period affects the fault tolerance of an active-active deployment: data loss will occur if the secondary node is unreachable for longer than the configured WAL retention period. Data lost in this manner can only be recovered manually.
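
As a hedged sketch of how the retention period might be raised before replication is enabled, the snippet below builds an `ALTER DATABASE ... WAL_RETENTION_PERIOD` statement and prints the `taos` CLI invocation instead of running it. The database name `opc` and the 24-hour value are assumptions chosen for illustration; choose a period longer than the longest outage you expect the secondary node to experience.

```shell
#!/bin/sh
# Build the SQL that sets the WAL retention period (in seconds) for one database.
# Both arguments are illustrative; substitute your own database and period.
wal_retention_sql() {
    db="$1"
    seconds="$2"
    printf 'ALTER DATABASE %s WAL_RETENTION_PERIOD %s;\n' "$db" "$seconds"
}

# Dry run: print the command rather than executing it.
# 86400 seconds = 24 hours (an assumed value, not a recommendation from this document).
echo "taos -s \"$(wal_retention_sql opc 86400)\""
```

Because the two nodes must be configured identically, the same statement would need to be applied on both deployments.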

## Enable Active-Active Mode

1. Create two identical TDengine deployments. For more information, see [Get Started](../../get-started/).
2. Ensure that the taosd and taosx services are running on both deployments.
3. On the deployment that you have designated as the primary node, run the following command to start the replication service:

```shell
taosx replica start -f <source-endpoint> -t <sink-endpoint> [database]
```

### Cluster Configuration
- The source endpoint is the FQDN of TDengine on the primary node.
- The sink endpoint is the FQDN of TDengine on the secondary node.
- You can use the native connection (port 6030) or WebSocket connection (port 6041).
- You can specify one or more databases to replicate only the data contained in those databases. If you do not specify a database, all databases on the node are replicated except for `information_schema`, `performance_schema`, `log`, and `audit`.

When the command is successful, the replica ID is displayed. You can use this ID to add other databases to the replication task if necessary.

The Active-Active feature imposes no specific requirements on the configuration of the TDengine cluster itself, but there is a certain requirement regarding the WAL retention period for databases to be synchronized between the Active-Active systems. A longer WAL retention period increases the fault tolerance of the Active-Active system; if the backup node is down for a period exceeding the WAL retention period on the primary node, data loss on the backup node is inevitable. Even if the downtime of the backup node does not exceed the WAL retention period on the primary node, there is still a certain probability of data loss, depending on the proximity and speed of data synchronization.
4. Run the same command on the secondary node, specifying the FQDN of TDengine on the secondary node as the source endpoint and the FQDN of TDengine on the primary node as the sink endpoint.
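
The two `taosx replica start` invocations from steps 3 and 4 can be sketched as follows. The host names `td1` and `td2` are assumptions, as is the use of the native port 6030. The script only prints the commands so that they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical endpoints; replace with the FQDNs of your own deployments.
PRIMARY="td1:6030"
SECONDARY="td2:6030"

# Print the replication command for one direction (source -> sink).
# Any extra arguments are treated as database names to replicate.
replica_start_cmd() {
    src="$1"
    sink="$2"
    shift 2
    printf 'taosx replica start -f %s -t %s' "$src" "$sink"
    for db in "$@"; do
        printf ' %s' "$db"
    done
    printf '\n'
}

echo "# run on the primary node:"
replica_start_cmd "$PRIMARY" "$SECONDARY"
echo "# run on the secondary node:"
replica_start_cmd "$SECONDARY" "$PRIMARY"
```

Swapping the source and sink arguments is the only difference between the two directions, which is why a single helper covers both nodes.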

### Client Configuration
## Client Configuration

Currently, only the Java connector supports Active-Active in WebSocket connection mode. The configuration example is as follows:
Active-active mode is supported in the Java client library in WebSocket connection mode. The following is an example configuration:

```java
url = "jdbc:TAOS-RS://" + host + ":6041/?user=root&password=taosdata";
@@ -45,136 +74,84 @@ properties.setProperty(TSDBDriver.PROPERTY_KEY_RECONNECT_RETRY_COUNT, "3");
connection = DriverManager.getConnection(url, properties);
```

The configuration properties and their meanings are as follows:
These parameters are described as follows:

| Property Name | Meaning |
| ---------------------------------- | ------------------------------------------------------------ |
| PROPERTY_KEY_SLAVE_CLUSTER_HOST | Hostname or IP of the second node; defaults to empty |
| PROPERTY_KEY_SLAVE_CLUSTER_PORT | Port number of the second node; defaults to empty |
| PROPERTY_KEY_ENABLE_AUTO_RECONNECT | Whether to enable automatic reconnection; effective only in WebSocket mode. true: enable, false: disable; default is false. In Active-Active scenarios, please set to true. |
| PROPERTY_KEY_RECONNECT_INTERVAL_MS | Interval for reconnection in milliseconds; default is 2000 milliseconds (2 seconds); minimum is 0 (immediate retry); no maximum limit. |
| PROPERTY_KEY_RECONNECT_RETRY_COUNT | Maximum number of retries per node; default is 3; minimum is 0 (no retries); no maximum limit. |

### Constraints

1. Applications cannot use the subscription interface; if Active-Active parameters are configured, it will cause the creation of consumers to fail.
2. It is not recommended for applications to use parameter binding for writes and queries; if used, the application must address the issue of invalidated related objects after a connection switch.
3. In Active-Active scenarios, it is not recommended for user applications to explicitly call `use database`; the database should be specified in the connection parameters.
4. The clusters at both ends of the Active-Active configuration must be homogeneous (i.e., the naming of databases, all configuration parameters, usernames, passwords, and permission settings must be exactly the same).
5. Only WebSocket connection mode is supported.

## Operation and Maintenance Commands

The TDengine Active-Active System provides several operation and maintenance tools that can automate the configuration of taosX, and allow one-click starting, restarting, and stopping (on single-node environments) of all Active-Active components.

### Starting the Active-Active Task

```shell
taosx replica start
```

This command is used to start the data replication task in the Active-Active system, where both the taosd and taosX on the specified two hosts are in an online state.

1. Method One

```shell
- taosx replica start -f source_endpoint -t sink_endpoint [database...]
```

Establish a synchronization task from `source_endpoint` to `sink_endpoint` in the taosx service on the current machine. After successfully running this command, the replica ID will be printed to the console (referred to as `id` later).
The input parameters `source_endpoint` and `sink_endpoint` are mandatory, formatted as `td2:6030`. For example:

```shell
taosx replica start -f td1:6030 -t td2:6030
```

This example command will automatically create a synchronization task for all databases except `information_schema`, `performance_schema`, `log`, and `audit`. You can specify the endpoint using `http://td2:6041` to use the WebSocket interface (default is the native interface). You can also specify database synchronization: `taosx replica start -f td1:6030 -t td2:6030 db1` will create synchronization tasks only for the specified database.

2. Method Two

```shell
taosx replica start -i id [database...]
```

Use the already created Replica ID (`id`) to add other databases to that synchronization task.

:::note
| PROPERTY_KEY_SLAVE_CLUSTER_HOST | Enter the hostname or IP address of the secondary node. |
| PROPERTY_KEY_SLAVE_CLUSTER_PORT | Enter the port number of the secondary node. |
| PROPERTY_KEY_ENABLE_AUTO_RECONNECT | Specify whether to enable automatic reconnection. For active-active mode, set the value of this parameter to true. |
| PROPERTY_KEY_RECONNECT_INTERVAL_MS | Enter the interval in milliseconds at which reconnection is attempted. The default value is 2000. You can enter 0 to attempt to reconnect immediately. There is no maximum limit. |
| PROPERTY_KEY_RECONNECT_RETRY_COUNT | Enter the maximum number of retries per node. The default value is 3. There is no maximum limit. |

- Repeated use of this command will not create duplicate tasks; it will only add the specified databases to the corresponding task.
- The replica ID is globally unique within a taosX instance and is independent of the `source/sink` combination.
- For ease of memory, the replica ID is a randomly chosen common word, and the system automatically maps the `source/sink` combination to a word list to obtain a unique available word.
## Command Reference

:::

### Checking Task Status
You can manage your active-active deployment with the following commands:

```shell
taosx replica status [id...]
```
1. Use an existing replica ID to add databases to an existing replication task:

This returns the list and status of Active-Active synchronization tasks created on the current machine. You can specify one or more replica IDs to obtain their task lists and status. An example output is as follows:

```shell
+---------+------+----------+----------+----------+-------------+--------------+
| replica | task | source   | sink     | database | status      | note         |
+---------+------+----------+----------+----------+-------------+--------------+
| a       | 2    | td1:6030 | td2:6030 | opc      | running     |              |
| a       | 3    | td2:6030 | td2:6030 | test     | interrupted | Error reason |
```
```shell
taosx replica start -i <id> [database...]
```

### Stopping Active-Active Tasks
:::note
- This command cannot create duplicate tasks. It only adds the specified databases to the specified task.
- The replica ID is globally unique within a taosX instance and is independent of the source/sink combination.

```shell
taosx replica stop id [db...]
```
:::

This command has the following effects:
2. Check the status of a task:

- Stops all or specified database synchronization tasks under the specified Replica ID.
- Using `taosx replica stop id1 db1` indicates stopping the synchronization task for `db1` under the `id1` replica.
```shell
taosx replica status [id...]
```

### Restarting Active-Active Tasks
This command returns the list and status of active-active synchronization tasks created on the current machine. You can specify one or more replica IDs to obtain their task lists and status. An example output is as follows:

```shell
taosx replica restart id [db...]
```
```shell
+---------+------+----------+----------+----------+-------------+--------------+
| replica | task | source   | sink     | database | status      | note         |
+---------+------+----------+----------+----------+-------------+--------------+
| a       | 2    | td1:6030 | td2:6030 | opc      | running     |              |
| a       | 3    | td2:6030 | td2:6030 | test     | interrupted | Error reason |
```

This command has the following effects:
3. Stop a replication task:

- Restarts all or specified database synchronization tasks under the specified Replica ID.
- Using `taosx replica restart id1 db1` only restarts the synchronization task for the specified database `db1`.
```shell
taosx replica stop [id [db...]]
```
### Checking Synchronization Progress
If you specify a database, replication for that database is stopped. If you do not specify a database, all replication tasks on the ID are stopped. If you do not specify an ID, all replication tasks on the instance are stopped.
```shell
taosx replica diff id [db...]
```
4. Restart a replication task:
This command outputs the difference between the subscribed offset in the current dual-replica synchronization task and the latest WAL (not representing row counts), for example:
```shell
taosx replica restart [id [db...]]
```
```shell
+---------+----------+----------+----------+-----------+---------+---------+------+
| replica | database | source   | sink     | vgroup_id | current | latest  | diff |
+---------+----------+----------+----------+-----------+---------+---------+------+
| a       | opc      | td1:6030 | td2:6030 | 2         | 17600   | 17600   | 0    |
| ad      | opc      | td2:6030 | td2:6030 | 3         | 17600   | 17600   | 0    |
```
If you specify a database, replication for that database is restarted. If you do not specify a database, all replication tasks on the ID are restarted. If you do not specify an ID, all replication tasks on the instance are restarted.
### Deleting Active-Active Tasks
5. Check the progress of a replication task:
```shell
taosx replica remove id [--force]
```
```shell
taosx replica diff [id [db...]]
```
This deletes all current Active-Active synchronization tasks. Under normal circumstances, to delete a synchronization task, you need to first stop that task; however, when `--force` is enabled, it will forcibly stop and clear the task.
This command outputs the difference between the subscribed offset in the current active-active replication task and the latest WAL (not representing row counts), for example:
### Recommended Usage Steps
```shell
+---------+----------+----------+----------+-----------+---------+---------+------+
| replica | database | source   | sink     | vgroup_id | current | latest  | diff |
+---------+----------+----------+----------+-----------+---------+---------+------+
| a       | opc      | td1:6030 | td2:6030 | 2         | 17600   | 17600   | 0    |
| ad      | opc      | td2:6030 | td2:6030 | 3         | 17600   | 17600   | 0    |
```
1. Assuming running on machine A, you need to first use `taosx replica start` to configure taosX, with input parameters being the addresses of the source and target servers to synchronize. After configuration, the synchronization service and tasks will automatically start. It is assumed that the taosx service uses the standard port and the synchronization task uses the native connection.
2. The steps on machine B are the same.
3. After starting services on both machines, the Active-Active system can provide services.
4. After the configuration is completed, if you want to restart the Active-Active system, please use the restart subcommand.
6. Delete a replication task:
## Exception Cases
```shell
taosx replica remove [id] [--force]
```
If the downtime recovery time exceeds the WAL retention duration, data loss may occur. In this case, the automatic data synchronization of the taosX service in the Active-Active system cannot handle the situation. Manual judgment is required to identify which data is lost, followed by starting additional taosX tasks to replicate the missing data.
This command deletes all stopped replication tasks on the specified ID. If you do not specify an ID, all stopped replication tasks on the instance are deleted. You can include the `--force` argument to delete all tasks without stopping them first.
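
Before stopping or removing a task, it can be useful to confirm that the sink has caught up. The following is a minimal sketch, assuming the table layout shown in the `taosx replica diff` example in this section: the hypothetical helper `max_replica_diff` reads that output on standard input and prints the largest value in the `diff` column.

```shell
#!/bin/sh
# Read `taosx replica diff` table output on stdin and print the largest value
# found in the last column (diff). Border lines and the header row are skipped.
max_replica_diff() {
    awk -F'|' '
        /^[ \t]*[|]/ {
            value = $(NF - 1)            # last cell before the trailing "|"
            gsub(/[ \t]/, "", value)
            if (value ~ /^[0-9]+$/ && value + 0 > max) max = value + 0
        }
        END { print max + 0 }
    '
}

# Example with the sample output from this section; in practice you might pipe
# `taosx replica diff` into the function instead (assumes the same table layout).
max_replica_diff <<'EOF'
+---------+----------+----------+----------+-----------+---------+---------+------+
| replica | database | source | sink | vgroup_id | current | latest | diff |
+---------+----------+----------+----------+-----------+---------+---------+------+
| a | opc | td1:6030 | td2:6030 | 2 | 17600 | 17600 | 0 |
| ad | opc | td2:6030 | td2:6030 | 3 | 17600 | 17600 | 0 |
EOF
```

For the sample table the helper prints `0`, meaning that every vgroup's subscribed offset matches the latest WAL position. Note that the value is an offset difference, not a row count.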
