1 Data Replication 简介

作为一个分布式数据系统，Elasticsearch/OpenSearch 也有主副本( Primary Shard )和副本( Replica Shard )的概念。

Replica 是 Primary 的 Clone ( 复制体 )，这样就可以实现系统的高可用，具备容灾能力。

Replica 从 Primary 复制数据的过程被成为数据复制( Data Replication )，Data Replication 的核心考量指标是 Replica 和 Primary 的延迟( Lag )大小，如果 Lag 一直为0，那么就是实时复制，可靠性最高。

Data Replication 的方案有很多，接下来主要介绍基于文档的复制方案( Document Replication ) 和基于文件的复制方案 ( Segment Replication )。

1.1 Document Replication

Elasticsearch/OpenSearch 目前采用的是基于文档的复制方案，整个过程如下图所示：

TODO

Client 发送写请求到 Primary Shard Node
Primary Shard Node 将相关文档先写入本地的 translog，按需进行 refresh
上述步骤执行成功后，Primary Shard Node 转发写请求到 Replica Shard Nodes，此处转发的内容是实际的文档
Replica Shard Node 接收到写请求后，先写入本地的 translog，按需进行 refresh，返回 Primary Shard Node 执行成功
Primary Shard Node 返回 Client 写成功。
后续 Primary Shard Node 和 Replica Shard Node 会按照各自的配置独立进行 refresh 行为，生成各自的 segment 文件。

这里要注意的一点是：Primary Shard 和 Replica Shard 的 refresh 是独立的任务，执行时机和时间会有所差异，这也会导致两边实际生成和使用的 segment 文件有差异。

以上便是 Document Replication 的简易流程，对完整流程感兴趣的，可以通过下面的连接查看更详细的介绍。

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html

1.2 Segment Replication

elasticsearch 数据写入最耗时的部分是生成 segment 文件的过程，因为这里涉及到分词、字典生成等等步骤，需要很多 CPU 和 Memory 资源。

而 Document Replication 方案需要在 Primary Node 和 Replica Nodes 上都执行 segment 文件的生成步骤，但是在 Replica Nodes 上的执行实际是一次浪费，如果可以避免这次运算，将节省不少 CPU 和 Memory 资源。

解决的方法也很简单，等 Primary Node 运行完毕后，直接将生成的 segment 文件复制到 Replica Nodes 就好了。这种方案就是 Segment Replication。

Segment Replication 的大致流程如下图所示：

TODO:

Client 发送写请求到 Primary Shard Node
Primary Shard Node 将相关文档先写入本地的 translog，按需和相关配置进行 refresh，此处不是一定触发 refresh
上述步骤执行成功后，Primary Shard Node 转发写请求到 Replica Shard Nodes，此处转发的内容是实际的文档
Replica Shard Node 接收到写请求后，写入本地的 translog，然后返回 Primary Shard Node 执行成功
Primary Shard Node 返回 Client 写成功。
Primary Shard Node 在触发 refresh 后，会通知 Replica Shard Nodes 同步新的 segment 文件。
Replica Shard Nodes 会对比本地和 Primary Shard Node 上的 segment 文件列表差异，然后请求同步本地缺失和发生变更的 segment 文件。
Primary Shard Node 根据 Replica Shard Nodes 的相关请求完成 segment 文件的发送
Replica Shard Nodes 在完整接收 segment 文件后，刷新 Lucene 的 DirectoryReader 载入最新的文件，使新文档可以被查询

这里和 Document Replication 最大的不同是 Replica Shard Nodes 不会在独立生成 segment 文件，而是直接从 Primary Shard Node 同步，本地的 translog 只是为了实现数据的可靠性，在 segment 文件同步过来后，就可以删除。

以上便是 Segment Replication 的简易流程，对完整流程感兴趣的，可以通过下面的连接查看更详细的介绍。

https://github.com/opensearch-project/OpenSearch/issues/2229

2 Segment Replication 初体验

OpenSearch 在 2.3 版本中发布了实验版本的 Segment Replication 功能，接下来就让我们一起体验一下吧~

2.1 准备 docker 环境和相关文件

本次体验基于 docker-compose 来执行，如下为相关内容(docker-compose.yml)：

version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.3.0
    container_name: os23-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - plugins.security.disabled=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m -Dopensearch.experimental.feature.replication_type.enabled=true" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - ./os23data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:2.3.0
    container_name: os23-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - plugins.security.disabled=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m -Dopensearch.experimental.feature.replication_type.enabled=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - ./os23data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.3.0
    container_name: os23-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
    networks:
      - opensearch-net

networks:
  opensearch-net:

简单说明如下：

为了演示方便，关闭了安全特性
要在 OPENSEARCH_JAVA_OPTS 中添加 -Dopensearch.experimental.feature.replication_type.enabled=true 才能开启segment replication 功能

2.2 运行 OpenSearch 集群

执行如下命令运行 OpenSearch Cluster:

docker-compose -f docker-compose.yml up

运行成功后，可以访问 http://127.0.0.1:5601 打开 Dashboards 界面，进入 Dev Tools 中执行后续的操作

2.3 测试 Segment Replication

测试思路如下：

创建两个 index，一个默认配置，一个启用 segment replication，主分片数为1，副本数为1
向两个 index 中插入若干条数据
比较两个 index 中 segment file 的数量和大小

相关命令如下：

PUT /test-rep-by-doc
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

GET test-rep-by-doc/_settings

POST test-rep-by-doc/_doc
{
  "name": "rep by doc"
}

GET _cat/shards/test-rep-by-doc?v

GET _cat/segments/test-rep-by-doc?v&h=index,shard,prirep,segment,generation,docs.count,docs.deleted,size&s=index,segment,prirep


PUT /test-rep-by-seg
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT",
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}


GET test-rep-by-seg/_settings

POST test-rep-by-seg/_doc
{
  "name": "rep by seg"
}

GET _cat/shards/test-rep-by-seg

GET _cat/segments/test-rep-by-seg?v&h=index,shard,prirep,segment,generation,docs.count,docs.deleted,size&s=index,segment,prirep

插入文档后，通过 _cat/segments 可以得到 segment file 列表，然后通过 size 一列可以对比 segment 文件大小。

如下是默认基于文档复制的结果：

index           shard prirep segment generation docs.count docs.deleted  size
test-rep-by-doc 0     p      _0               0          2            0 3.7kb
test-rep-by-doc 0     r      _0               0          1            0 3.6kb
test-rep-by-doc 0     p      _1               1          2            0 3.7kb
test-rep-by-doc 0     r      _1               1          3            0 3.8kb
test-rep-by-doc 0     p      _2               2          1            0 3.6kb
test-rep-by-doc 0     r      _2               2          3            0 3.8kb
test-rep-by-doc 0     p      _3               3          6            0 3.9kb
test-rep-by-doc 0     r      _3               3          6            0 3.9kb
test-rep-by-doc 0     p      _4               4          5            0 3.9kb
test-rep-by-doc 0     r      _4               4          6            0 3.9kb
test-rep-by-doc 0     p      _5               5          6            0 3.9kb
test-rep-by-doc 0     r      _5               5          6            0 3.9kb
test-rep-by-doc 0     p      _6               6          4            0 3.8kb
test-rep-by-doc 0     r      _6               6          1            0 3.6kb

从中可以看到，虽然 Primary Shard 和 Replica Shard 的 segment 数相同，但是 size 大小是不同的，这也说明其底层的 segment 文件是独立管理的。

如下是基于 Segment 复制的结果：

index           shard prirep segment generation docs.count docs.deleted  size
test-rep-by-seg 0     p      _0               0          2            0 3.7kb
test-rep-by-seg 0     r      _0               0          2            0 3.7kb
test-rep-by-seg 0     p      _1               1          7            0   4kb
test-rep-by-seg 0     r      _1               1          7            0   4kb
test-rep-by-seg 0     p      _2               2          5            0 3.9kb
test-rep-by-seg 0     r      _2               2          5            0 3.9kb

从中可以看到 Primary Shard 和 Replica Shard 的 segment 完全一致。

除此之外也可以从磁盘文件中对比，同样可以得出相同的结论：Segment Replication 是基于文件的数据复制方案，Primary 和 Replica 的 segment 文件列表完全相同。

3 总结

根据 OpenSearch 社区的初步测试，Segment Replication 相较于 Document Replication 的性能结果如下：

在 Replica Nodes 上，CPU 和 Memory 资源减少 40%~50%
写入性能方面，整体**吞吐量提升约 50%**，P99 延迟下降了 20% 左右

这个测试结果还是很诱人的，但 Segment Replication 也有其自身的局限，下面简单列几点( 不一定准确 )：

Segment Replication 对于网络带宽资源要求更高，目前测试中发现有近1倍的增长，需要更合理的分配 Primary Shard 到不同的 Node 上，以分散网络带宽压力
Segment Replication 可能会由于文件传输的延迟而导致 Replica Shard 上可搜索的文档短时间内与 Primary Shard 不一致
Replica Shard 升级为 Primary Shard 的时间可能会因为重放 translog 文件而变长，导致 Cluster 不稳定

友情提示下，由于该特性目前还是实验阶段，还不具备上生产环境的能力，大家可以持续关注~