===========
Replication
===========
Since each replica in swift functions independently, and clients generally require
only a simple majority of nodes responding to consider an operation successful,
transient failures like network partitions can quickly cause replicas to diverge.
These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes.
The replicator processes traverse their local filesystems, concurrently performing operations
in a manner that balances load across physical disks.
Replication uses a push model, with records and files generally only being copied
from local to remote replicas. This is important because data on the node may not
belong there (as in the case of handoffs and ring changes), and a replicator can't
know what data exists elsewhere in the cluster that it should pull in. It's the duty
of any node that contains data to ensure that data gets to where it belongs.
Replica placement is handled by the ring.
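Replica placement can be inspected directly through the ring API. The following is a minimal sketch, assuming a built object ring at /etc/swift/object.ring.gz; the account, container, and object names are purely illustrative::

    from swift.common.ring import Ring

    # Load the object ring and ask where an object's primary replicas live.
    object_ring = Ring('/etc/swift/object.ring.gz')
    partition, nodes = object_ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')

    for node in nodes:
        # Each node dict describes one primary replica location.
        print(node['ip'], node['port'], node['device'], partition)
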
Every deleted record or file in the system is marked by a tombstone, so that deletions can be
replicated alongside creations. These tombstones are cleaned up by the replication process after
a period of time referred to as the consistency window, which is related to replication duration
and how long transient failures can remove a node from the cluster.
Tombstone cleanup must be tied to replication to reach replica convergence.
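As a rough illustration of the tombstone idea (not Swift's actual code), a delete leaves a small marker file behind so the deletion itself can replicate, and markers older than the consistency window are reclaimed. The ".ts" suffix, the one-week window, and the use of mtime below are simplifications::

    import os
    import time

    RECLAIM_AGE = 7 * 24 * 60 * 60   # hypothetical one-week consistency window

    def reclaim_tombstones(suffix_dir, reclaim_age=RECLAIM_AGE):
        """Remove tombstone markers older than the consistency window."""
        now = time.time()
        for root, _dirs, files in os.walk(suffix_dir):
            for name in files:
                if not name.endswith('.ts'):     # '.ts' marks a deletion
                    continue
                path = os.path.join(root, name)
                # Swift encodes the delete timestamp in the filename; the
                # file's mtime is used here purely as a stand-in.
                if now - os.path.getmtime(path) > reclaim_age:
                    os.unlink(path)
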
If a replicator detects that a remote drive has failed, it will use the ring's "get_more_nodes"
interface to choose an alternate node to synchronize with. The replicator can generally maintain
desired levels of replication in the face of hardware failures, though some replicas may not be
in an immediately usable location.
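A hedged sketch of how a replicator might fall back to handoff nodes, again assuming the object ring above; the failed-primary check is a placeholder for whatever error the replicator actually observed::

    from swift.common.ring import Ring

    object_ring = Ring('/etc/swift/object.ring.gz')
    partition, primaries = object_ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')

    failed_ids = {primaries[0]['id']}        # pretend this primary's disk died
    targets = [n for n in primaries if n['id'] not in failed_ids]

    # get_more_nodes(partition) yields alternate (handoff) nodes in order.
    for node in object_ring.get_more_nodes(partition):
        if len(targets) == len(primaries):
            break
        targets.append(node)                 # synchronize with a handoff instead
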
Replication is an area of active development, and likely rife with potential improvements to speed and correctness.
There are two major classes of replicator - the db replicator, which replicates accounts and containers,
and the object replicator, which replicates object data.
--------------
DB Replication
--------------
The first step performed by db replication is a low-cost hash comparison to find out whether or not
two replicas already match. Under normal operation, this check is able to verify that most databases
in the system are already synchronized very quickly. If the hashes differ, the replicator brings the
databases in sync by sharing records added since the last sync point.
This sync point is a high water mark noting the last record at which two databases were known to be in sync,
and is stored in each database as a tuple of the remote database id and record id. Database ids are unique
amongst all replicas of the database, and record ids are monotonically increasing integers. After all new
records have been pushed to the remote database, the entire sync table of the local database is pushed,
so the remote database knows it's now in sync with everyone the local database has previously synchronized with.
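The record/sync-point exchange can be pictured with plain sqlite3 (illustrative only; Swift's database brokers wrap this, and the table names here are approximations)::

    import sqlite3

    def records_since(conn, sync_point):
        # Record ids are monotonically increasing, so everything newer than
        # the stored sync point is exactly what the peer is missing.
        return conn.execute(
            'SELECT ROWID, * FROM object WHERE ROWID > ?',
            (sync_point,)).fetchall()

    def update_sync_table(conn, remote_id, new_sync_point):
        # The sync table is a set of (remote database id, record id) pairs.
        conn.execute(
            'INSERT OR REPLACE INTO incoming_sync (remote_id, sync_point) '
            'VALUES (?, ?)', (remote_id, new_sync_point))
        conn.commit()
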
If a replica is found to be missing entirely, the whole local database file is transmitted to the peer
using rsync(1) and vested with a new unique id.
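Shipping a missing database is essentially a whole-file rsync plus a new id on the receiving side; in the sketch below the destination path is made up, whereas the real replicator derives it from the ring's node dict and its rsync configuration::

    import subprocess
    import uuid

    def push_whole_db(local_db_path, node_ip, remote_path):
        # Copy the entire database file to the peer with rsync(1).
        subprocess.check_call(
            ['rsync', '--whole-file', local_db_path,
             'rsync://%s/%s' % (node_ip, remote_path)])
        # The copy is then vested with a new unique database id, e.g. a uuid4.
        return str(uuid.uuid4())
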
In practice, DB replication can process hundreds of databases per concurrency setting per second
(up to the number of available CPUs or disks) and is bound by the number of DB transactions that must be performed.
------------------
Object Replication
------------------
The initial implementation of object replication simply performed an rsync to push data from a local partition
to all remote servers it was expected to exist on. While this performed adequately at small scale, replication
times skyrocketed once directory structures could no longer be held in RAM. We now use a modification of this
scheme in which a hash of the contents for each suffix directory is saved to a per-partition hashes file.
The hash for a suffix directory is invalidated when the contents of that suffix directory are modified.
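The hashes file can be thought of as a small map from suffix directory to content hash; invalidation simply blanks an entry so the replicator recomputes it later. Swift keeps something similar in a per-partition hashes.pkl, but the details below are simplified::

    import hashlib
    import os
    import pickle

    def hash_suffix(suffix_path):
        # Hash the file names under a suffix directory (contents elided here).
        md5 = hashlib.md5()
        for root, _dirs, files in sorted(os.walk(suffix_path)):
            for name in sorted(files):
                md5.update(name.encode('utf-8'))
        return md5.hexdigest()

    def invalidate_suffix(partition_path, suffix):
        hashes_file = os.path.join(partition_path, 'hashes.pkl')
        with open(hashes_file, 'rb') as fp:
            hashes = pickle.load(fp)
        hashes[suffix] = None        # recalculated lazily by the replicator
        with open(hashes_file, 'wb') as fp:
            pickle.dump(hashes, fp)
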
The object replication process reads in these hash files, calculating any invalidated hashes. It then transmits
the hashes to each remote server that should hold the partition, and only suffix directories with differing hashes
on the remote server are rsynced. After pushing files to the remote server, the replication process notifies
it to recalculate hashes for the rsynced suffix directories.
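One pass over a single partition then reduces to comparing hash maps and syncing only the suffixes that differ. In this sketch, get_remote_hashes, rsync_suffixes, and notify_recalculate are hypothetical callables standing in for the REPLICATE request and rsync plumbing the real object replicator uses::

    def replicate_partition(partition_path, local_hashes, remote_node,
                            get_remote_hashes, rsync_suffixes,
                            notify_recalculate):
        remote_hashes = get_remote_hashes(remote_node, partition_path)
        differing = [suffix for suffix, local_hash in local_hashes.items()
                     if remote_hashes.get(suffix) != local_hash]
        if differing:
            # Only suffix directories whose hashes differ get rsynced.
            rsync_suffixes(remote_node, partition_path, differing)
            # Ask the remote end to recompute hashes for what just changed.
            notify_recalculate(remote_node, partition_path, differing)
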
Performance of object replication is generally bound by the number of uncached directories it has to traverse,
usually as a result of invalidated suffix directory hashes. Using write volume and partition counts from our
running systems, it was designed so that around 2% of the hash space on a normal node will be invalidated per day,
which has experimentally given us acceptable replication speeds.