This article attempts to survey the algorithms and applications related to incremental data synchronization over a network.

Data transfer and synchronization between networked applications over a LAN or WAN is extremely common. In many scenarios, only part of the data to be synchronized actually differs; using a suitable technique to identify and transfer just the differing parts can undoubtedly reduce an application's bandwidth consumption while also speeding up synchronization.

This article first surveys the basic techniques commonly used to identify and synchronize data differences across a network, and then describes how the data-synchronization features of several network applications are implemented. By understanding these techniques and their concrete uses in industry, we can choose the right technique when similar requirements come up in the future.

Differential Encoding


Data differencing:

Data differencing or differential compression is producing a technical description of the difference between two sets of data - a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data.

One of the best-known examples of data differencing is the diff utility, which produces line-by-line differences of text files. Differencing of general binary files goes under the rubric of delta encoding, with a widely used example being the algorithm used in rsync. A standardized generic differencing format is VCDIFF, implemented in such utilities as Xdelta version 3. A high-efficiency (small patch files) differencing program is bsdiff, which is based on bzip2 compression, demonstrating the close connection between differencing and compression.

Delta encoding:

Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; more generally this is known as data differencing. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in revision control software).
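As a toy illustration of the idea (operating on a list of numbers rather than files, and unrelated to any concrete delta format), delta encoding stores the first value verbatim and then only the difference to the previous value:

    def delta_encode(values):
        """Store the first value verbatim, then only successive differences."""
        deltas = [values[0]]
        for prev, cur in zip(values, values[1:]):
            deltas.append(cur - prev)
        return deltas

    def delta_decode(deltas):
        """Rebuild the original sequence by accumulating the differences."""
        values = [deltas[0]]
        for d in deltas[1:]:
            values.append(values[-1] + d)
        return values

    assert delta_decode(delta_encode([100, 101, 103, 103, 107])) == [100, 101, 103, 103, 107]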

Data deduplication:

In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
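A minimal sketch of the dedup loop described above, using fixed-size chunks and SHA-256 digests as the chunk references (real systems typically use content-defined chunking and a persistent chunk store):

    import hashlib

    def deduplicate(data: bytes, chunk_size: int = 4096):
        """Split data into fixed-size chunks and store each unique chunk once."""
        store = {}    # digest -> chunk bytes (each unique chunk stored once)
        recipe = []   # sequence of digests that reconstructs the original data
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)   # a repeated chunk adds nothing new
            recipe.append(digest)             # ...only a small reference to it
        return store, recipe

    def reconstruct(store, recipe):
        return b"".join(store[d] for d in recipe)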

diff and patch

From the earliest diff implementations based on the Hunt–McIlroy algorithm to the newer bsdiff, which operates very efficiently on binary files, the file-differencing tools diff and patch have gone through a long evolution. For the history and algorithmic evolution of the diff tool, see the Wikipedia page.

Generally, the diff tool can display or generate the differences between two versions of a file. We save the differing parts to a patch file, which serves as input to the patch tool; patch can then use it to modify one of the two files so that it matches the other.
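As an illustration, Python's standard difflib module can produce the same unified-diff format that diff -u emits and that the patch tool consumes:

    import difflib

    old = ["line one\n", "line two\n", "line three\n"]
    new = ["line one\n", "line 2\n", "line three\n", "line four\n"]

    # Equivalent in spirit to `diff -u old.txt new.txt`; feeding the output
    # to `patch old.txt` would turn old.txt into new.txt.
    patch = difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
    print("".join(patch))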

Related tools:

  • diff - a file comparison program, which is mainly used for text files.

  • VCDIFF - a format and an algorithm for delta encoding, described in RFC 3284. VCDIFF is used as one of the delta encoding algorithms in "Delta encoding in HTTP" (RFC 3229) and is employed in Google's Shared Dictionary Compression Over HTTP technology used in the Chrome browser. Free software implementations include Xdelta and open-vcdiff.

  • bsdiff - a binary diff program using suffix sorting.

    `bsdiff` and `bspatch` are tools for building and applying patches to binary files. By
    using *suffix sorting* (specifically, Larsson and Sadakane's [qsufsort](http://www.larsson.dogma.net/research.html))
    and taking advantage of how executable files change, bsdiff routinely produces binary
    patches 50-80% smaller than those produced by Xdelta, and 15% smaller than those produced
    by .RTPatch (a commercial patch tool).
    

Rsync

The Rsync algorithm and program were originally designed and implemented by Andrew Tridgell and Paul Mackerras, and first released in 1996. Today it is part of the standard tool set of every major Linux distribution.

The Rsync algorithm and program are a concrete implementation of the delta encoding technique; they can be used to transfer and synchronize data efficiently over a network. Rsync addresses the problem that arises in the following scenario:

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn't need the differences). This is the problem that rsync addresses.

In other words, when the files that need differential synchronization are not on the same machine, rsync comes in handy.


The Rsync website provides a detailed description of the Rsync algorithm. There are also many analyses of the algorithm and translations of the official documentation online; a few of the better ones are listed here:

  • The analysis article on CoolShell (酷壳)

    • In that article, 左耳朵耗子 claims that other articles are wrong to say that "the rolling checksums need to be sorted". From the information in the original paper, however, I think it is his judgment that is mistaken. The official technical report describes this part as follows:

      The first level uses a 16-bit hash of the 32-bit rolling checksum and a 2^16 entry
      hash table. The list of checksum values (i.e., the checksums from the blocks of B) is
      sorted according to the 16-bit hash of the 32-bit rolling checksum. Each entry in the
      hash table points to the first element of the list for that hash value, or contains a
      null value if no element of the list has that hash value.
      
    • As the original text reads, after the sender receives the list of rolling checksums, it first sorts the list; the hash-table entries built afterwards then point directly into this sorted list. Failing that, the only way to confirm how this data structure is actually built is to read the rsync source code.
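      Reading the report literally, the two-level lookup structure could be built
      roughly like this (a sketch only; the concrete 16-bit hash function is an
      assumption, since the report does not specify one):

      def build_lookup(block_checksums):
          """block_checksums: list of (rolling32, md4, block_index) tuples from B."""
          def hash16(c):                       # assumed 16-bit hash of the 32-bit checksum
              return (c ^ (c >> 16)) & 0xFFFF
          # Sort the checksum list by the 16-bit hash, as the report describes.
          entries = sorted(block_checksums, key=lambda e: hash16(e[0]))
          # Each of the 2^16 slots points at the first sorted entry with that
          # hash value, or stays None if no entry has it.
          table = [None] * (1 << 16)
          for pos in range(len(entries) - 1, -1, -1):
              table[hash16(entries[pos][0])] = pos
          return entries, table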

  • A translation of the official documentation

  • How rsync works describes the workflow of the rsync tool.

To ease the discussion below and the descriptions of other tools, here is a brief excerpt of rsync's workflow (suppose two hosts α and β, connected by a relatively slow link, hold files A and B respectively, and we need to synchronize file B so it matches file A):

  1. β splits the file B into a series of non-overlapping fixed-sized blocks of size S bytes. The last block may be shorter than S bytes.
  2. For each of these blocks, β calculates two checksums: a weak rolling 32-bit checksum and a strong 128-bit MD4 checksum.
  3. β sends these checksums to α.
  4. α searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same weak and strong checksum as one of the blocks of B. This can be done in a single pass very quickly using a special property of the rolling checksum described below.
  5. α sends β a sequence of instructions for constructing a copy of A. Each instruction is either a reference to a block of B, or literal data. Literal data is sent only for those sections of A which did not match any of the blocks of B.

Once these steps are complete, host β has a complete copy of file A. For the details of the rolling checksum (based on Mark Adler's adler-32 checksum) and of the algorithm for finding duplicate blocks on the two ends, see the documentation links above.
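The "special property" of the weak checksum is that it can be rolled forward one byte at a time. A minimal sketch following the formulas in the tech report (the 32-bit checksum is s = a + 2^16·b, with a the byte sum and b the position-weighted sum):

    M = 1 << 16

    def weak_checksum(block: bytes):
        """Rsync's weak checksum of one block: a byte sum and a weighted sum."""
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b                       # the 32-bit checksum is a + (b << 16)

    def roll(a, b, out_byte, in_byte, block_len):
        """Slide the window one byte: drop out_byte on the left, take in_byte."""
        a = (a - out_byte + in_byte) % M
        b = (b - block_len * out_byte + a) % M
        return a, b

This is why step 4 can examine every byte offset of A in a single pass instead of recomputing a full checksum at each offset.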

As the workflow shows, all of the difference computation is done by host α. In typical deployments, the role of host α is played by the server.


In practice there are many further optimizations that improve rsync's efficiency, for example rsync-friendly compression formats that work around rsync's inability to handle ordinary compressed archives effectively, and so on. We won't go into them here.

Multiround Rsync presents an analysis of the strengths and weaknesses of the rsync algorithm, along with some optimizations.


There are many libraries and tools based on the rsync algorithm:

  • The rdiff tool uses the rsync algorithm to generate a delta file between two files (similar to the diff tool, but with a different delta format); the delta file can then be consumed by rdiff patch. rdiff also works on binary files.

  • The librsync library is the rsync-algorithm implementation used by rdiff. Tools such as Dropbox, rdiff-backup, and duplicity also use this library.

RDC

Data deduplication involves strategies for splitting data files into chunks. Common strategies include fixed-size chunking, variable-size chunking (CDC, content-defined chunking), and sliding-block chunking.

Fixed-size chunking is easy to understand: source and target files are split into chunks of a fixed length. The sliding-block strategy is similar to the one used by the Rsync algorithm described above, while variable-size chunking uses an algorithm that decides chunk sizes and boundaries based on patterns in the file content. This article introduces how CDC works, along with its pros and cons:

The CDC algorithm is a variable-size chunking strategy that applies a data fingerprint (such as a Rabin fingerprint) to split a file into chunks of unequal length. Unlike fixed-size chunking, it cuts chunks based on the file content, so chunk sizes vary. During execution, CDC computes a fingerprint over the file data using a sliding window of fixed size (for example, 48 bytes). When the fingerprint satisfies some condition, for instance when its value modulo a specific integer equals a preset number, the window position is taken as a chunk boundary. The CDC algorithm can degenerate: the fingerprint condition may never be met, so a chunk boundary is never found and chunks grow too large. Implementations can address this by bounding chunk sizes with upper and lower limits. CDC is insensitive to changes in file content: inserting or deleting data affects only a few chunks, leaving the rest untouched. CDC does have a drawback: choosing the chunk size is hard. Too fine a granularity means too much overhead; too coarse, and deduplication suffers. Finding the right balance between the two is the difficult part.
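A compact sketch of the chunking loop just described; a toy additive rolling hash stands in for the Rabin fingerprint, and the window size, boundary mask, and size limits are merely illustrative:

    def cdc_chunks(data: bytes, window=48, mask=0x1FFF,
                   min_size=2048, max_size=65536):
        """Cut a chunk where the fingerprint of the last `window` bytes
        satisfies the boundary condition; clamp sizes to [min_size, max_size]."""
        chunks, start, h = [], 0, 0
        for i in range(len(data)):
            h += data[i]                       # toy rolling hash over the window
            if i - start >= window:
                h -= data[i - window]
            size = i - start + 1
            if (size >= min_size and (h & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])        # final partial chunk
        return chunks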

Here is a fairly accessible introduction to a CDC implementation.

FastCDC proposes a more efficient CDC scheme.

pcompress designed a CDC implementation based on a rolling hash.


The RDC (Remote Differential Compression) synchronization algorithm, which Microsoft designed and uses in the Windows operating system, applies similar ideas.

Wikipedia RDC:

Remote Differential Compression (RDC) is a client-server synchronization algorithm that allows the contents of two files to be synchronized by communicating only the differences between them.

Its working principle and flow are roughly as follows:

The algorithm used is based on fingerprinting blocks on each file locally at both ends of the replication partners. Since many types of file changes can cause the file to move (for example, a small insertion or deletion at the beginning of a file can cause the rest of the file to become misaligned to the original content) the blocks used for comparison are not based on static arbitrary cut points but on cut points defined by the contents of each file segment. This means that if a part of a file changes in length or blocks of the contents get moved to other parts of the file, the block boundaries for the parts that have not changed remain fixed relative to the contents, and thus the series of fingerprints for those blocks don't change either, they just change position. By comparing all hashes in a file to the hashes for the same file at the other end of the replication pair, RDC is able to identify which blocks of the file have changed and which haven't, even if the contents of the file have been significantly reshuffled.

Since comparing large files could imply making large numbers of signature comparisons, the algorithm is recursively applied to the hash sets to detect which blocks of hashes have changed or moved around, significantly reducing the amount of data that needs to be transmitted for comparing files.

Windows exposes the RDC algorithm as an API for applications to call; MSDN provides reference information for developers.

Below are some RDC-related articles and papers:

Streaming Incremental Sync

The difference-computation algorithms introduced above are all designed for static files. In scenarios where new data is only ever appended, such as log files or the MySQL binlog, we only need to record the position of the last sync; when the data is updated, syncing from that position onward yields the incremental data. I haven't found a formal name for this kind of incremental technique, so for now I'll call it "streaming incremental sync".

In some of these scenarios, every update has a unique position within the overall "update stream" (a monotonically increasing value such as a file offset or a timestamp), and data newer than it always appears after it.

In other scenarios, we can sort the data; for example, for the result set of a MySQL SELECT query, we can fetch more data in batches according to the sort key.

For now, we group both kinds of scenarios under the umbrella of "streaming incremental sync".
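A minimal sketch of the append-only case (a log file plus a persisted offset; the names are illustrative):

    import os

    def read_increment(path: str, last_offset: int):
        """Return the bytes appended since last_offset and the new offset
        to persist for the next sync."""
        with open(path, "rb") as f:
            f.seek(0, os.SEEK_END)
            if f.tell() < last_offset:   # file truncated/rotated: full resync
                last_offset = 0
            f.seek(last_offset)
            delta = f.read()
            return delta, f.tell()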

Application Cases


In this section we survey the incremental-update techniques used in a range of common application scenarios.

Incremental APK Updates

This article proposes a scheme that uses diff/patch-style tools for incremental downloads of Android APK packages: when the client still holds the complete APK of some older version, it can ask the server for the delta file (also called a diff file) between that version and the server's latest version, and then feed the delta file and the old version through a patch tool to construct the latest APK.

The incremental upgrade flow described in the article is excerpted below:

  1. The client obtains the current app version number and sends it to the server;
  2. The server returns update-package information according to its policy:

    2.1 If the old-version APK file is not available on the server, or policy disallows an incremental upgrade for that version, return the download URL of the full package to the client;
    2.2 If the delta file between the latest APK and the client's APK does not exist yet, start an asynchronous task to generate it; meanwhile, to reduce the client's waiting time, return the download URL of the full package;
    2.3 If the delta file between the latest APK and the client's APK already exists, return the download URL of that delta file;

  3. The client completes the upgrade according to the server's response:

    3.1 If the server returned a full-package URL, download and install the full package;
    3.2 If the server returned a delta-file URL, download the delta file and use it together with the local APK to construct the latest package;

The flow above can be extended with MD5 checksums and similar steps to verify the integrity of downloaded files, and resumable downloads can be used to reduce the bandwidth wasted when network conditions are poor.
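The server-side policy in steps 2.1-2.3 might look roughly like this (a sketch; every name here is illustrative, not taken from the article's code):

    def full_package_url(ver):
        return f"https://example.com/apk/{ver}/full.apk"             # placeholder

    def diff_file_url(old_ver, new_ver):
        return f"https://example.com/apk/{old_ver}-{new_ver}.patch"  # placeholder

    def enqueue_diff_generation(old_ver, new_ver):
        pass  # placeholder: schedule an async bsdiff job

    def plan_update(client_ver, latest_ver, old_apk_on_server,
                    diff_exists, delta_allowed=True):
        if not old_apk_on_server or not delta_allowed:                # step 2.1
            return {"type": "full", "url": full_package_url(latest_ver)}
        if not diff_exists:                                           # step 2.2
            enqueue_diff_generation(client_ver, latest_ver)
            return {"type": "full", "url": full_package_url(latest_ver)}
        return {"type": "delta",                                      # step 2.3
                "url": diff_file_url(client_ver, latest_ver)}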

The scheme in the article uses the bsdiff/bspatch tools to generate delta files. It also tries another open-source tool, HDiffPatch, which is said to perform better. A performance comparison of HDiffPatch and BSDiff 4.3 can be found in this article.


The scheme has the following drawbacks:

  • Because of Android channel packages, each release of a new APK forces the server side to maintain a huge and multiplying number of delta files;
  • The bsdiff/bspatch algorithms have high memory overhead:
    bsdiff is quite memory-hungry. It requires max(17n,9n+m)+O(1) bytes of memory, where n is
    the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes.
    bsdiff runs in O((n+m) log n) time; on a 200MHz Pentium Pro, building a binary patch for a
    4MB file takes about 90 seconds. bspatch runs in O(n+m) time; on the same machine, applying
    that patch takes about two seconds.
    

zsync

zsync is a file transfer program. When the local machine already has an older version of a file, zsync can use the data of that older version to download only part of the data from the remote server, and assemble the complete new version of the file from the two.

zsync uses the Rsync algorithm, but its design goal differs from the rsync program's:

However, where rsync is designed for synchronising data from one computer to another within an organization, zsync is designed for file distribution, with one file on a server to be distributed to thousands of downloaders.

For example, Ubuntu's ISO distribution service currently offers zsync downloads. You can download the 64-bit Ubuntu 17.10 desktop ISO with the following command, using a local 16.04 ISO as the seed:

zsync -i ubuntu-16.04-desktop-amd64.iso http://mirrors.aliyun.com/ubuntu-releases/17.10/ubuntu-17.10-desktop-amd64.iso.zsync

zsync is based on the rsync algorithm and provides the following improvements (which are also the reasons the rsync program itself cannot be used for file distribution):

  • Client-side rsync - zsync uses the rsync algorithm, but unlike the rsync program, the difference computation is done by the client. This avoids the load that the rsync program places on the server.

  • Rsync over HTTP - zsync's transfer efficiency is on par with rsync -z. At the same time, zsync uses the HTTP protocol, so any HTTP/1.1-compliant web server can serve zsync clients. As a result, zsync can pass through firewalls, run on shared hosting, and poses fewer security concerns.

  • Handling for compressed files - because of limitations of the algorithm, rsync is not very efficient on ordinary compressed files, whereas zsync has special support for gzip-compressed files and can handle them efficiently.


The official site provides zsync's design document. The main difference from how rsync works is that in zsync the data source (the server) publishes the block checksums, the client runs the rolling-checksum computation to find the differences between the two ends, and the differing data is then downloaded in segments via HTTP Range support.
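The transport step can be pictured like this (a sketch of the HTTP Range download only; computing which ranges are missing is the rolling-checksum comparison described above, and the real zsync client also coalesces adjacent ranges into fewer requests):

    import requests

    def fetch_missing_ranges(url: str, ranges):
        """Download only the byte ranges the local seed file lacks.
        ranges: list of (start, end) pairs, inclusive byte offsets."""
        parts = []
        for start, end in ranges:
            resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
            resp.raise_for_status()          # a capable server answers 206
            parts.append((start, resp.content))
        return parts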

Other algorithmic and design optimizations are also covered in the design document; if you're interested, it is well worth a careful read.

crsync

crsync was designed, developed, and open-sourced by 陈琦. It provides an Android SDK, so it can easily be used in Android apps.

crsync's design is close to zsync's, so we won't analyze it in detail. Its design document and source code are below:

Courgette

The Courgette algorithm was designed by the Google Chrome team to shrink Chrome's incremental update packages. It applies special handling to binary executables and can produce smaller delta files than bsdiff.

http://dev.chromium.org/developers/design-documents/software-updates-courgette#TOC-Source-Code

We tried several binary diff algorithms and have been using bsdiff up until now. We are big fans of bsdiff - it is small and worked better than anything else we tried.

But bsdiff was still producing diffs that were bigger than we felt were necessary. So we wrote a new diff algorithm that knows more about the kind of data we are pushing - large files containing compiled executables.

A brief introduction to how Courgette works can be found here, and the code here.

Redis

Like MySQL master-slave replication, the way a Redis slave synchronizes data with its master is another practical application of the "streaming incremental sync" technique described above.

The exact synchronization flow can be found in the official documentation:

How Redis replication works

Every Redis master has a replication ID: it is a large pseudo random string that marks a given story of the dataset. Each master also takes an offset that increments for every byte of replication stream that is produced to be sent to slaves, in order to update the state of the slaves with the new changes modifying the dataset. The replication offset is incremented even if no slave is actually connected, so basically every given pair of:

  Replication ID, offset

Identifies an exact version of the dataset of a master. When slaves connect to a master, they use the PSYNC command in order to send their old master replication ID and the offsets they processed so far. This way the master can send just the incremental part needed. However if there is not enough backlog in the master buffers, or if the slave is referring to a history (replication ID) which is no longer known, then a full resynchronization happens: in this case the slave will get a full copy of the dataset, from scratch.
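The decision the master makes on receiving PSYNC can be sketched as follows (illustrative pseudologic based on the passage above, not Redis source code):

    def handle_psync(master_id: str, backlog: bytes, backlog_start: int,
                     slave_id: str, slave_offset: int):
        """backlog_start is the replication offset of the first byte still
        held in the master's backlog buffer."""
        same_history = (slave_id == master_id)
        in_backlog = backlog_start <= slave_offset <= backlog_start + len(backlog)
        if same_history and in_backlog:
            # Partial resync: replay only the bytes the slave has not seen.
            return "+CONTINUE", backlog[slave_offset - backlog_start:]
        # Unknown history or backlog too short: full resynchronization.
        return "+FULLRESYNC", None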

Evernote

As a cloud note-taking application, Evernote keeps clients and servers in sync with a synchronization algorithm of its own design, and the algorithm supports incremental sync. Evernote's technical documentation describes the steps of this algorithm in detail. In essence, this algorithm is also a concrete implementation of "streaming incremental sync".

The open-source note application Leanote uses the same synchronization mechanism as Evernote.


Next, we give a brief account of the algorithm based on the description in the official technical documentation.

...that treats the central service as a simple data store and pushes most decisions into the clients. This is intended to be similar to the model used by mail systems such as IMAP or MS Exchange, which have been able to address a similar set of requirements in a robust and scalable manner...

... the Evernote service does not keep track of the current state of individual clients, and it does not keep a fine-grained transaction log to implement "log-based replication". Instead, it stores a set of data elements (notes, tags, etc.) for each user. Each data element has an "Update Sequence Number" (USN) that identifies the order in which it was last modified in the account. The system can use these numbers to determine whether one object in the account was modified more recently than another.

... The USNs within an account start at "1" (for the first object created in the account) and then increase monotonically every time an object is created, modified, or deleted. The server keeps track of the "update count" for each account, which is identical to the highest USN that has been assigned. At any point, the service is capable of ordering the objects in the account using their USN values. To synchronize, a client must only receive the objects that were changed after the last client sync...i.e. only objects that have a USN that is higher than the server's "update count" at the last successful sync.

To keep the Evernote service simple, reliable, and scalable, the server does not track the state of individual clients, and much of the computation is moved into the clients. The server also keeps no transaction log, so Evernote does not support log-based replication mechanisms like those of MySQL or Redis master-slave setups. Instead, Evernote implements incremental sync with a data-update sequence-number mechanism:

  • The Evernote server maintains a counter called the USN for each user account; every update (create, modify, delete) to a data element (Note, Tag, etc.) in that account increments the counter. Each data element also has a USN field that stores the counter's value at the element's last modification;
  • The server can order the data elements in an account by the value of their USN fields;
  • After each sync, the client saves the largest USN it received locally; on the next sync, it uses that value to ask the server for the data elements whose USN field is greater than it;

The smallest unit of Evernote's incremental sync is a data element such as a Note or a Tag; in other words, editing one note causes that entire note to be uploaded to the server.
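The pull loop implied by the description above can be sketched like this (the server API names and the apply_locally callback are illustrative, not Evernote's actual EDAM interface):

    def incremental_sync(server, apply_locally, last_usn: int) -> int:
        """Fetch and apply every element modified since last_usn; return the
        new high-water mark to persist for the next sync."""
        if server.get_update_count() == last_usn:
            return last_usn                       # nothing changed since last sync
        for element in server.get_elements_after(last_usn):   # ordered by USN
            apply_locally(element)
            last_usn = element.usn                # advance the high-water mark
        return last_usn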

Other parts of the algorithm, such as conflict handling, local commits, and local deletions, are beyond the scope of this article.

Dropbox

The incremental-update technique Dropbox uses is called Delta Sync. Below are the scraps of information about Delta Sync that can be found online:

https://www.dropbox.com/help/syncing-uploads/upload-entire-file

Before transferring a file, we compare the new file to the previous version and only send the piece of the file that changed. This is called a "binary diff" and works on any file type. Dropbox compresses files before transferring them as well. This way, you also never have to worry about Dropbox re-uploading a file or wasting bandwidth.

https://serverfault.com/questions/52861/how-does-dropbox-version-upload-large-files

Dropbox uses a binary diff algorithm to break down all files into blocks, and only upload blocks that it doesn't already have in the cloud. All of this is done locally on your computer.

Dropbox doesn't just use your files that you have already uploaded, it aggregates everyone's files into one database of blocks, and checks each local block hash against that database.

For what it's worth, Dropbox claims to create hashes on every 4 MB of each file. That way, if you change a contiguous 2 MB of a 100 MB file, it will likely only need to upload 4 MB (or 8 MB if you cross into a second 4MB block) to re-sync the file.

Understanding and Surpassing Dropbox: Efficient Incremental Synchronization in Cloud Storage Services

When a user edits an existing file in a sync folder, the Dropbox client captures the modification event through inotify and executes Rsync to generate the delta. After interacting with the index server, the client sends the delta to the corresponding data server, on which the new version of the file can be generated... These imply that the Dropbox client stores the Rsync signature of files, instead of receiving it from the server side.

From the information above, Dropbox's Delta Sync actually uses the Rsync algorithm to compute delta files, and then uses those deltas to synchronize local changes to the server incrementally.
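The block-level bookkeeping described in the quotes can be sketched like this (4 MB blocks per the quote; SHA-256 is an assumption, since the actual hash function is not documented):

    import hashlib

    BLOCK = 4 * 1024 * 1024        # Dropbox reportedly hashes files in 4 MB blocks

    def block_hashes(path: str):
        """Hash a file block by block; only blocks whose digest the server
        does not already have need to be uploaded."""
        digests = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                digests.append(hashlib.sha256(block).hexdigest())
        return digests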

犀牛云盘

https://tech.meituan.com/incre-sync-use-rsync.html

犀牛云盘 (Rhino Cloud Disk) is an internal file-collaboration platform at Meituan-Dianping, built on Meituan Cloud; its core is structured cloud storage of files plus an optimized upload and download experience.

The implementation described in the article is essentially a combination of zsync and diff/patch-style delta algorithms. The diagrams in the article illustrate the upload and download flows.

ownCloud

ownCloud is an open-source system for building private clouds; it uses csync as its file-synchronization mechanism.

As for incremental sync, judging from this 2015 article by one of its developers, ownCloud had not yet settled on a technical approach. The article's discussion of the rsync and zsync algorithms, however, is worth a read:

  • Because of the nature of the rsync algorithm, a large number of syncing clients puts excessive load on the server;
  • zsync is basically the rsync algorithm inverted: it moves the checksum computation to the client. But zsync's optimization for compressed files requires cooperation from the compression format, which is not realistic in a cloud-drive scenario;

seafile

Seafile is an open-source private-cloud system developed in China.
