# [v1.29] Add async replication docs #2974

Merged · 4 commits · Feb 14, 2025

## Write operations

On a write operation, the client’s request is sent to any node in the cluster. The first node to receive the request is assigned as the coordinator. The coordinator node sends the request to a number of predefined replicas and returns the result to the client. Any node in the cluster can therefore be a coordinator node, and the client has direct contact only with this coordinator node. Before sending the result back to the client, the coordinator node waits for a number of write acknowledgments from different nodes, depending on the configuration. How many acknowledgments Weaviate waits for depends on the [consistency configuration](./consistency.md).

**Steps**
1. The client sends data to any node, which will be assigned as the coordinator node.
2. The coordinator node sends the data to more than one replica node in the cluster.
3. The coordinator node waits for acknowledgment from a specified proportion (let's call it `x`) of cluster nodes. Starting with v1.18, `x` is [configurable](./consistency.md), and defaults to `ALL` nodes.
4. When `x` ACKs are received by the coordinator node, the write is successful.

As an example, consider a cluster of 3 nodes with a replication factor of 3: every node in the distributed setup contains a copy of the data. When the client sends new data, it is replicated to all three nodes.
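
As a minimal sketch of what this looks like from a client, assuming the Python client v4, a local instance, and a hypothetical `Article` collection:

```python
# A hedged sketch, assuming the Python client v4, a local instance,
# and a pre-existing (hypothetical) "Article" collection.
import weaviate
from weaviate.classes.config import ConsistencyLevel

client = weaviate.connect_to_local()
try:
    articles = client.collections.get("Article")
    # The insert returns once a QUORUM (n / 2 + 1) of replicas has acknowledged the write.
    articles.with_consistency_level(ConsistencyLevel.QUORUM).data.insert(
        {"title": "Hello, replication"}
    )
finally:
    client.close()
```

The coordinator behavior described above is transparent to the client; only the consistency level is chosen here.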
---

The replication factor in Weaviate determines how many copies of shards (also called replicas) will be stored across a Weaviate cluster.

<p align="center"><img src="/img/docs/replication-architecture/replication-factor.png" alt="Replication factor" width="80%"/></p>

When the replication factor is > 1, consistency models balance the system's reliability, scalability, and/or performance requirements.

Weaviate uses multiple consistency models. One for its cluster metadata and another for its data objects.

### Consistency models in Weaviate


Weaviate uses eventual consistency to improve availability. Read and write consistency are tunable, so you can trade off availability against consistency to match your application's needs.

*The animation below is an example of how a write or a read is performed with Weaviate with a replication factor of 3 and 8 nodes. The blue node acts as the coordinator node. The consistency level is set to `QUORUM`, so the coordinator node only waits for two out of three responses before sending the result back to the client.*

<p align="center"><img src="/img/docs/replication-architecture/replication-quorum-animation.gif" alt="Write consistency QUORUM" width="75%"/></p>

Adding or changing data objects are **write** operations.

:::note
Starting with Weaviate v1.18, write operations are tunable to `ONE`, `QUORUM` (default), or `ALL`. In v1.17, write operations are always set to `ALL` (highest consistency).
:::

The main reason for introducing configurable write consistency in v1.18 is that automatic repairs were introduced in the same release. A write is always sent to n (replication factor) nodes, regardless of the chosen consistency level. The coordinator node, however, waits for acknowledgments from `ONE`, `QUORUM` or `ALL` nodes before it returns. Without repairs on read requests available (as in v1.17), write consistency is set to `ALL` to guarantee that a write is applied everywhere. Possible settings in v1.18+ are:
* **ONE** - a write must receive an acknowledgment from at least one replica node. This is the fastest (most available), but least consistent option.
* **QUORUM** - a write must receive an acknowledgment from at least `QUORUM` replica nodes. `QUORUM` is calculated as _n / 2 + 1_, where _n_ is the number of replicas (replication factor). For example, using a replication factor of 6, the quorum is 4, which means the cluster can tolerate 2 replicas down.
* **ALL** - a write must receive an acknowledgment from all replica nodes. This is the most consistent, but 'slowest' (least available) option.
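
To make the `QUORUM` arithmetic concrete, here is a small worked sketch (ours, not part of the docs):

```python
def quorum(replication_factor: int) -> int:
    """Acknowledgments required at consistency level QUORUM: n / 2 + 1 (integer division)."""
    return replication_factor // 2 + 1

# Replication factor 6: quorum is 4, so the cluster tolerates 2 replicas being down.
assert quorum(6) == 4
# Replication factor 3: quorum is 2, tolerating 1 replica down.
assert quorum(3) == 2
```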


*Figure below: a replicated Weaviate setup with write consistency of ONE. There are 8 nodes in total, of which 3 are replicas.*
### Async replication
:::info Added in `v1.26`
:::

Async replication is a background synchronization process in Weaviate that ensures eventual consistency across nodes storing the same shard. When a collection is partitioned into multiple shards, each shard is replicated across several nodes (as defined by the replication factor `REPLICATION_MINIMUM_FACTOR`). Async replication guarantees that all nodes holding the same shard remain in sync by periodically comparing and propagating data.

It uses a Merkle tree (hash tree) algorithm to monitor and compare the state of nodes within a cluster. If the algorithm identifies an inconsistency, it resyncs the data on the inconsistent node.
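
The docs don't prescribe an implementation, but the following toy Python sketch (entirely illustrative) shows the core idea of hash-tree comparison: replicas compare digests top-down and descend only into subtrees whose digests differ, so a single out-of-sync object is located with a logarithmic number of comparisons instead of a full scan.

```python
# Toy illustration only: Weaviate's actual implementation differs in detail.
import hashlib

def digest(objects: list[str]) -> str:
    """Hash a range of objects into a single digest (a stand-in for a subtree digest)."""
    h = hashlib.sha256()
    for obj in objects:
        h.update(obj.encode())
    return h.hexdigest()

def diff_ranges(a: list[str], b: list[str], lo: int, hi: int) -> list[tuple[int, int]]:
    """Walk the implicit hash tree and return index ranges where two replicas disagree."""
    if digest(a[lo:hi]) == digest(b[lo:hi]):
        return []  # digests match: the whole subtree is in sync, skip it
    if hi - lo == 1:
        return [(lo, hi)]  # leaf level: found an out-of-sync object
    mid = (lo + hi) // 2
    return diff_ranges(a, b, lo, mid) + diff_ranges(a, b, mid, hi)

node_a = ["obj0", "obj1", "obj2", "obj3"]
node_b = ["obj0", "obj1-updated", "obj2", "obj3"]
print(diff_ranges(node_a, node_b, 0, len(node_a)))  # [(1, 2)]: only obj1 needs resyncing
```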

Repair-on-read works well with one or two isolated repairs. Async replication is effective in situations where there are many inconsistencies. For example, if an offline node misses a series of updates, async replication quickly restores consistency when the node returns to service.

Async replication supplements the repair-on-read mechanism. If a node becomes inconsistent between sync checks, the repair-on-read mechanism catches the problem at read time.

To activate async replication, set `asyncEnabled` to `true` in the [`replicationConfig` section of your collection definition](../../manage-data/collections.mdx#replication-settings). Visit the [How-to: Replication](/developers/weaviate/configuration/replication#async-replication-settings) page to learn more about all the available async replication settings.
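
For example, a hedged Python client v4 sketch, assuming a hypothetical `Article` collection and the `Configure.replication` helper:

```python
# A hedged sketch, assuming the Python client v4 and a hypothetical "Article" collection.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()
try:
    client.collections.create(
        "Article",
        replication_config=Configure.replication(
            factor=3,            # three replicas of each shard
            async_enabled=True,  # enable background async replication
        ),
    )
finally:
    client.close()
```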

#### Memory consumption of async replication

Async replication uses a hash tree to compare and synchronize data between nodes. The memory required for this process is determined by the height of the hash tree (`H`). A higher hash tree uses more memory but allows async replication to compare data more precisely.

The trade-offs can be summarized like this:
- **Higher** `H`: More memory per shard/tenant but potentially more performant comparisons (finer granularity).
- **Lower** `H`: Less memory usage but coarser data comparisons.


Use the following formulas and examples as a quick reference:

##### Memory calculation

- **Total number of nodes in the hash tree:**
For a hash tree with height `H`, the total number of nodes is:
```
Number of nodes = 2^(H+1) - 1
```

- **Total memory required (per shard/tenant on each node):**
Each node uses approximately **8 bytes** of memory.
```
Memory Required = (2^(H+1) - 1) * 8 bytes
```

##### Examples

- Hash tree with height `16`:
- `Total Nodes = 2^(16+1) - 1 = 2^17 - 1 = 131072 - 1 = 131071`
- `Memory Required ≈ 131071 * 8 bytes ≈ 1,048,568 bytes (~1 MB)`

- Hash tree with height `6`:
- `Total Nodes = 2^(6+1) - 1 = 2^7 - 1 = 128 - 1 = 127`
- `Memory Required ≈ 127 * 8 bytes ≈ 1,016 bytes (~1 KB)`

##### Performance consideration: number of leaves

The objects in a shard or tenant are distributed among the leaves of the hash tree.
A larger hash tree allows for more granular and efficient comparisons, which can improve replication performance.

- **Number of Leaves in the hash tree:**
```
Number of leaves = 2^H
```

##### Examples

- Hash tree with height `16`:
- `Number of Leaves = 2^16 = 65,536`

- Hash tree with height `6`:
- `Number of Leaves = 2^6 = 64`
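
These formulas are easy to script; the helper below (our convenience sketch, not part of Weaviate) reproduces the examples above:

```python
def hashtree_stats(height: int) -> dict:
    """Size a hash tree of height H: 2^(H+1) - 1 nodes at ~8 bytes each, 2^H leaves."""
    nodes = 2 ** (height + 1) - 1
    return {"nodes": nodes, "leaves": 2 ** height, "memory_bytes": nodes * 8}

print(hashtree_stats(16))  # {'nodes': 131071, 'leaves': 65536, 'memory_bytes': 1048568}
print(hashtree_stats(6))   # {'nodes': 127, 'leaves': 64, 'memory_bytes': 1016}
```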

:::note Default settings
The default hash tree height of `16` is chosen to balance memory consumption with replication performance. Adjust this value based on your node’s available resources and performance requirements.
:::

### Deletion resolution strategies

**`developers/weaviate/config-refs/env-vars.md`** (13 additions, 0 deletions)

| Variable | Description | Type | Example Value |
| --- | --- | --- | --- |
| `ASYNC_REPLICATION_DISABLED` | Disable async replication. Default: `false` | `boolean` | `false` |
| `ASYNC_REPLICATION_HASHTREE_HEIGHT` | Height of the hash tree used for data comparison between nodes. If the height is `0`, each node stores just one digest per shard. Default: `16`, Min: `0`, Max: `20` | `string - number` | `10` |
| `ASYNC_REPLICATION_FREQUENCY` | How often, in seconds, nodes run the periodic data comparison. Default: `30` | `string - number` | `60` |
| `ASYNC_REPLICATION_FREQUENCY_WHILE_PROPAGATING` | How often, in milliseconds, nodes compare data after a node has been synced. Default: `10` | `string - number` | `20` |
| `ASYNC_REPLICATION_ALIVE_NODES_CHECKING_FREQUENCY` | How often, in seconds, the background process checks for changes in the availability of nodes. Default: `5` | `string - number` | `20` |
| `ASYNC_REPLICATION_LOGGING_FREQUENCY` | How often, in seconds, the background process logs events. Default: `5` | `string - number` | `7` |
| `ASYNC_REPLICATION_DIFF_BATCH_SIZE` | The maximum size of each propagated batch. Default: `1000`, Min: `1`, Max: `10000` | `string - number` | `2000` |
| `ASYNC_REPLICATION_DIFF_PER_NODE_TIMEOUT` | Time limit, in seconds, for a node to provide a comparison response. Default: `10` | `string - number` | `30` |
| `ASYNC_REPLICATION_PROPAGATION_TIMEOUT` | Time limit, in seconds, for a node to provide a propagation response. Default: `30` | `string - number` | `60` |
| `ASYNC_REPLICATION_PROPAGATION_LIMIT` | The maximum number of objects propagated per iteration. Default: `10000`, Min: `1`, Max: `1000000` | `string - number` | `5000` |
| `ASYNC_REPLICATION_PROPAGATION_DELAY` | Delay before recently received objects are propagated, giving in-flight writes time to arrive at their target nodes. Default: `30` | `string - number` | `5000` |
| `ASYNC_REPLICATION_PROPAGATION_CONCURRENCY` | The maximum number of propagation workers that run concurrently. Default: `5`, Min: `1`, Max: `20` | `string - number` | `10` |
| `ASYNC_REPLICATION_PROPAGATION_BATCH_SIZE` | The maximum number of objects propagated per batch. Default: `100`, Min: `1`, Max: `1000` | `string - number` | `200` |
| `CLUSTER_DATA_BIND_PORT` | Port for exchanging data. | `string - number` | `7103` |
| `CLUSTER_GOSSIP_BIND_PORT` | Port for exchanging network state information. | `string - number` | `7102` |
| `CLUSTER_HOSTNAME` | Hostname of a node. Always set this value if the default OS hostname might change over time. | `string` | `node1` |
**`developers/weaviate/configuration/replication.md`** (46 additions, 2 deletions)

The replication factor can be modified after you add data to a collection. If you modify the replication factor afterwards, new data is copied across the new and pre-existing replica nodes.

The example data schema has a [write consistency](../concepts/replication-architecture/consistency.md#tunable-write-consistency) level of `ALL`. When you upload or update a schema, the changes are sent to `ALL` nodes (via a coordinator node). The coordinator node waits for a successful acknowledgment from `ALL` nodes before sending a success message back to the client. This ensures a highly consistent schema in your distributed Weaviate setup.

## Data consistency


<ReplicationConfigWithAsyncRepair />

### Configure async replication settings {#async-replication-settings}

:::info Added in `v1.29`
Async replication support was added in `v1.26`, while the [environment variables](/developers/weaviate/config-refs/env-vars#multi-node-instances) for configuring async replication (`ASYNC_*`) were introduced in `v1.29`.
:::

> **Reviewer (Contributor):** I wonder if we can remove this altogether - wdyt?
>
> *(on the line: "Async replication support has been added in `v1.26`while")*
>
> **Author (Contributor):** I agree that it needs to be removed at some point, just not sure if now is the time. Do we have some kind of policy for how many older releases we officially support? (TODO: add this info to docs if not already there)
>
> Maybe a good rule of thumb could be to leave these admonitions for five minor versions. So for this release v1.29.0 we would leave all admonitions mentioning version >=v1.25.

Async replication ensures that data replicated across multiple nodes remains eventually consistent. Follow these steps to set up and fine-tune async replication in Weaviate using [environment variables](/developers/weaviate/config-refs/env-vars#multi-node-instances).

<!-- TODO[g-despot]: Add new environment variables -->

#### Step 1: Configure logging

- **Set the frequency of the logger:** `ASYNC_REPLICATION_LOGGING_FREQUENCY`
Define how often the async replication background process will log events.

#### Step 2: Configure periodic data comparison

- **Set the frequency of comparisons:** `ASYNC_REPLICATION_FREQUENCY`
Define how often each node compares its local data with other nodes.
- **Set comparison timeout:** `ASYNC_REPLICATION_DIFF_PER_NODE_TIMEOUT`
Optionally configure how long to wait for a comparison response from an unresponsive node.
- **Monitor node availability:** `ASYNC_REPLICATION_ALIVE_NODES_CHECKING_FREQUENCY`
Define how often to check for changes in node availability; a change triggers a new comparison.
- **Configure hash tree height:** `ASYNC_REPLICATION_HASHTREE_HEIGHT`
Specify the size of the hash tree. This structure helps narrow down data differences by comparing hash digests at multiple levels instead of scanning entire datasets.


#### Step 3: Set up data synchronization

Once differences between nodes are detected, Weaviate propagates outdated or missing data. Configure synchronization as follows:

- **Set the frequency of propagation:** `ASYNC_REPLICATION_FREQUENCY_WHILE_PROPAGATING`
After a node finishes synchronizing, the data comparison frequency temporarily changes to this value.
- **Set propagation timeout:** `ASYNC_REPLICATION_PROPAGATION_TIMEOUT`
Optionally configure how long to wait for a propagation response from an unresponsive node.
- **Batch size for data propagation:** `ASYNC_REPLICATION_PROPAGATION_BATCH_SIZE`
Define the number of objects that are sent in each synchronization batch.
- **Set propagation limits:** `ASYNC_REPLICATION_PROPAGATION_LIMIT`
Enforce an object limit per propagation iteration.
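
Putting the steps together, a hedged Docker Compose sketch could look like the following; the values shown are simply the defaults from the [environment variable reference](/developers/weaviate/config-refs/env-vars#multi-node-instances), not tuning recommendations:

```yaml
# Hedged sketch: a docker-compose service using the defaults from the reference table.
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.29.0
    environment:
      ASYNC_REPLICATION_LOGGING_FREQUENCY: "5"               # log background events every 5 s
      ASYNC_REPLICATION_FREQUENCY: "30"                      # compare nodes every 30 s
      ASYNC_REPLICATION_DIFF_PER_NODE_TIMEOUT: "10"          # per-node comparison timeout (s)
      ASYNC_REPLICATION_ALIVE_NODES_CHECKING_FREQUENCY: "5"  # node availability checks (s)
      ASYNC_REPLICATION_HASHTREE_HEIGHT: "16"                # hash tree height
      ASYNC_REPLICATION_FREQUENCY_WHILE_PROPAGATING: "10"    # comparison frequency while propagating (ms)
      ASYNC_REPLICATION_PROPAGATION_TIMEOUT: "30"            # per-node propagation timeout (s)
      ASYNC_REPLICATION_PROPAGATION_BATCH_SIZE: "100"        # objects per propagation batch
      ASYNC_REPLICATION_PROPAGATION_LIMIT: "10000"           # objects per propagation iteration
```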

:::tip
Tweak these settings based on your cluster size and network latency to achieve optimal performance. Smaller batch sizes and shorter timeouts may be beneficial for high-traffic clusters, while larger clusters might require more conservative settings.
:::

## How to use: Queries

When you add (write) or query (read) data, one or more replica nodes in the cluster will respond to the request. How many nodes need to send a successful response and acknowledgment to the coordinator node depends on the `consistency_level`. Available [consistency levels](../concepts/replication-architecture/consistency.md) are `ONE`, `QUORUM` (replication_factor / 2 + 1) and `ALL`.

The `consistency_level` can be specified at query time:
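
For instance, a minimal Python client v4 sketch, where the collection name and object UUID are hypothetical:

```python
# A hedged sketch, assuming the Python client v4; collection name and UUID are hypothetical.
import weaviate
from weaviate.classes.config import ConsistencyLevel

client = weaviate.connect_to_local()
try:
    articles = client.collections.get("Article")
    # The read returns as soon as ONE replica responds (fastest, least consistent).
    obj = articles.with_consistency_level(ConsistencyLevel.ONE).query.fetch_object_by_id(
        "36ddd591-2dee-4e7e-a3cc-eb86d30a4303"
    )
    print(obj.properties if obj else "object not found")
finally:
    client.close()
```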

**`developers/weaviate/manage-data/collections.mdx`** (1 addition, 1 deletion)

import RaftRFChangeWarning from '/_includes/1-25-replication-factor.mdx';

<RaftRFChangeWarning/>

Configure replication settings, such as [async replication](/developers/weaviate/configuration/replication#async-replication-settings) and [deletion resolution strategy](../concepts/replication-architecture/consistency.md#deletion-resolution-strategies).

<Tabs groupId="languages">
<TabItem value="py" label="Python Client v4">