Elasticsearch Source Code Analysis: Gateway (Part 6)

htsjk.com, 2019-08-18 06:17

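I. ES data files

This part looks at the directories and files that the different parts of ES write to disk. It covers the file layout of a broker (master) node and of a data node in turn, with a short note on each.

Elasticsearch generally produces three kinds of data files: state, index, and translog. The latter two are probably familiar; state files are where the Gateway stores its data. The Gateway module stores the MetaData of the ES cluster. Every change to MetaData (creating or deleting an index, for example) is persisted through the Gateway module. When the cluster starts up, this information is read back from the Gateway module and applied.

In most deployments the Gateway is set to Local, which stores the data on local disk, so this article focuses on that mode.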

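1.1 Directories

ES is configured with several paths:

path.home: the home directory of the ES installation. It defaults to the Java system property user.dir, which is the working directory the process was started from.
path.conf: the directory that contains the configuration files.
path.plugins: the ES plugin directory.
path.logs: the directory where logs are stored.
path.data: the directory, or directories, where ES stores its data.

Almost all of the files described here live under path.data.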

1.2 Broker files

------- /opt/es/broker $ tree data
data
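└── debug_es                               # cluster name
    └── nodes
        └── 0                                  # broker instance 0 on this machine
            ├── _state                         # state file directory
            │   └── global-96.st              # global metadata file: cluster information and the cluster metadata version; corresponds to the ClusterState class
            ├── indices                        # index directory
            │   └── twitter                   # index name
            │       └── _state                # state file directory
            │           └── state-1.st        # index metadata file: unique ID, creation time, settings, mappings; corresponds to the IndexMetaData class
            └── node.lock                      # node lock file; ensures only one ES instance reads and writes this directory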

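The more interesting file is global-96.st. The global- prefix and the .st suffix mark it as a global metadata file. As you may have guessed, the number 96 is the cluster metadata version, a strictly increasing version number that advances as the cluster metadata changes. The file is binary. You could edit it with a hex editor, but we do not recommend doing so, because it quickly leads to data loss.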
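
As a small illustration of the naming scheme just described, the following sketch pulls the version number out of a state file name. The class name StateFileNames and the method versionOf are made up for this example and are not part of Elasticsearch; only the global-/state- prefixes and the .st suffix come from the listing above.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StateFileNames {
    // {prefix}{version}.st, where the prefix is "global-" or "state-" as in the listing above.
    private static final Pattern STATE_FILE =
            Pattern.compile("(global-|state-)(\\d{1,18})\\.st");

    /** Returns the version encoded in a state file name, or empty if the name is not a state file. */
    public static OptionalLong versionOf(String fileName) {
        Matcher m = STATE_FILE.matcher(fileName);
        if (!m.matches()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(m.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(versionOf("global-96.st").getAsLong()); // 96
        System.out.println(versionOf("node.lock").isPresent());    // false
    }
}
```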

1.3 Node files

-------- /opt/es/node $ tree data
data
└── debug_es                                # cluster name
    └── nodes
        └── 0                                    # node instance 0 on this machine
            ├── indices                          # index directory
            │   └── twitter
            │       └── 0                       # directory of shard 0
            │           ├── _state
            │           │   └── state-1.st     # shard state file: version number and primary/replica flag; corresponds to the ShardRouting class
            │           ├── index               # the actual index and data files
            │           │   ├── _0.cfe
            │           │   ├── _0.cfs
            │           │   ├── _0.si
            │           │   ├── segments_7
            │           │   └── write.lock
            │           └── translog            # translog directory
            │               ├── translog-1.tlog
            │               └── translog.ckp
            └── node.lock                        # node lock file; ensures only one ES instance reads and writes this directory
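
The two listings follow the same path pattern, which can be written down as a few path-building helpers. This is only a sketch of the layout shown above: the class DataLayout and its method names are invented for illustration, and real Elasticsearch resolves these paths through its own environment classes.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DataLayout {
    /** {path.data}/{cluster name}/nodes/{ordinal}, as in both listings. */
    public static Path nodeDir(Path dataPath, String clusterName, int nodeOrdinal) {
        return dataPath.resolve(clusterName).resolve("nodes").resolve(Integer.toString(nodeOrdinal));
    }

    /** Directory that holds the index metadata file (state-N.st). */
    public static Path indexStateDir(Path nodeDir, String index) {
        return nodeDir.resolve("indices").resolve(index).resolve("_state");
    }

    /** Directory of one shard, which holds _state, index and translog. */
    public static Path shardDir(Path nodeDir, String index, int shardId) {
        return nodeDir.resolve("indices").resolve(index).resolve(Integer.toString(shardId));
    }

    public static void main(String[] args) {
        Path node = nodeDir(Paths.get("data"), "debug_es", 0);
        System.out.println(indexStateDir(node, "twitter"));
        System.out.println(shardDir(node, "twitter", 0).resolve("translog"));
    }
}
```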

1.4 The ClusterState class

Figure: UML class structure

Figure: class fields

Figure: RESTful request

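II. Gateway writes (writing state files)

Gateway implements ClusterStateListener, so any change to the cluster state triggers a callback. The callback is clusterChanged, and the change itself is delivered as a ClusterChangedEvent.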
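
The callback relationship can be pictured with a stripped-down listener sketch. Event, Listener and Publisher below are hypothetical stand-ins written for this example; they only mimic the shape of ClusterChangedEvent and ClusterStateListener and do not reproduce the real Elasticsearch types.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerSketch {
    /** Hypothetical stand-in for ClusterChangedEvent; it only carries a metadata version. */
    public static final class Event {
        public final long metaDataVersion;

        public Event(long metaDataVersion) {
            this.metaDataVersion = metaDataVersion;
        }
    }

    /** Same shape as the callback named in the text: a single clusterChanged method. */
    public interface Listener {
        void clusterChanged(Event event);
    }

    /** Hypothetical publisher that notifies every registered listener of each change. */
    public static final class Publisher {
        private final List<Listener> listeners = new ArrayList<>();

        public void add(Listener listener) {
            listeners.add(listener);
        }

        public void publish(long newVersion) {
            Event event = new Event(newVersion);
            for (Listener listener : listeners) {
                listener.clusterChanged(event);
            }
        }
    }

    public static void main(String[] args) {
        Publisher publisher = new Publisher();
        // A gateway-like listener would persist the metadata here.
        publisher.add(event -> System.out.println("persist metadata version " + event.metaDataVersion));
        publisher.publish(97);
    }
}
```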

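0. Gateway.clusterChanged simply calls metaState.clusterChanged(event);, which hands control to the clusterChanged method of the GatewayMetaState class.

1. It first checks whether the cluster is currently blocked. If so, it resets the current metadata, which amounts to returning to the initial state.

2. If the cluster is not blocked, processing continues. Writing to the Gateway happens only when the current node is a master or data node. Whether an index's metadata is written on this node is decided by looking at the shard routing: it is written only if a shard of that index is allocated to this node.

3. Closed indices, however, do not appear in the shard routing. If the metadata of a closed index has been updated, that index's metadata is loaded from disk and added to previouslyWrittenIndices.

4. Check whether the global state has changed, and write it if it has.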

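The write logic lives in the MetaDataStateFormat class.

State files are written to the directory STATE_DIR_NAME = "_state" with the extension STATE_FILE_EXTENSION = ".st"; the file is created if it does not exist. The state is first serialized to a temporary file and then atomically moved to the target file, which is named {prefix}{version}.st. The indices that need writing are iterated over and each is written to its own location, which is produced by nodeEnv.indexPaths(new Index(indexMetaData.getIndex())).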
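
The write-to-a-temporary-file-then-move pattern can be sketched with plain java.nio. This is not the MetaDataStateFormat code: the class AtomicStateWrite, the .tmp suffix and the demo payload are assumptions made for the example, and the real format also carries a header and checksum that are left out here. Only the two constants and the {prefix}{version}.st naming come from the text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicStateWrite {
    static final String STATE_DIR_NAME = "_state";
    static final String STATE_FILE_EXTENSION = ".st";

    /** Writes payload to {location}/_state/{prefix}{version}.st via a temporary file. */
    public static Path write(Path location, String prefix, long version, byte[] payload) {
        try {
            // Create the _state directory if it is missing.
            Path stateDir = Files.createDirectories(location.resolve(STATE_DIR_NAME));
            String name = prefix + version + STATE_FILE_EXTENSION;
            // The ".tmp" suffix is an assumption of this sketch.
            Path tmp = stateDir.resolve(name + ".tmp");
            Files.write(tmp, payload);
            // Readers see either no file or the complete file, never a partial one.
            return Files.move(tmp, stateDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a dummy global state file into a fresh temporary directory. */
    public static Path demo() {
        try {
            Path dir = Files.createTempDirectory("gateway-sketch");
            return write(dir, "global-", 96, "metadata".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Moving the finished file into place means a crash in the middle of a write leaves at most a stray temporary file, not a truncated state file.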

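5. Work out which indices this change touches; only then can the data that will finally be written to disk be built, namely writeInfo.

6. Once the write completes, one more thing remains: handling dangling indices against the metadata (a dangling index exists on disk but is absent from the cluster metadata) so that they are brought back into the cluster. This takes three steps:

6.1 If an index in the dangling set already exists in the supplied metadata, remove it from the dangling set.
6.2 Find newly dangling indices and add them.
6.3 Send the indices currently in the dangling set to the master node for allocation.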
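
The three dangling-index steps amount to bookkeeping over two sets of index names. The sketch below assumes a hypothetical DanglingIndices class that is handed the index names found on disk and the index names in the cluster metadata; step three is reduced to returning the set that would be sent to the master.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class DanglingIndices {
    // Index names currently considered dangling on this node.
    private final Set<String> dangling = new TreeSet<>();

    /** Applies the three steps and returns the names that would be sent to the master. */
    public Set<String> update(Set<String> onDisk, Set<String> inMetaData) {
        // Step 1: anything now present in the metadata is no longer dangling.
        dangling.removeAll(inMetaData);
        // Step 2: on disk but not in the metadata means newly dangling.
        for (String index : onDisk) {
            if (!inMetaData.contains(index)) {
                dangling.add(index);
            }
        }
        // Step 3: the real code sends these to the master; the sketch just returns a copy.
        return new TreeSet<>(dangling);
    }

    public static void main(String[] args) {
        DanglingIndices tracker = new DanglingIndices();
        Set<String> onDisk = new HashSet<>(Arrays.asList("twitter", "old_logs"));
        Set<String> metaData = new HashSet<>(Arrays.asList("twitter"));
        System.out.println(tracker.update(onDisk, metaData));
    }
}
```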

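III. Gateway recovery (reading state files)

GatewayService also implements ClusterStateListener, so any change to the cluster state triggers a callback. The callback is clusterChanged, and the change itself is delivered as a ClusterChangedEvent.

0. GatewayService.clusterChanged first checks the cluster to decide whether to recover.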

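// Check the cluster to decide whether to recover
// 1. Check whether the cluster is under a global block because it has no master node
// 2. Check whether the number of nodes in the cluster has reached gateway.recover_after_nodes
// 3. Check whether the number of data nodes in the cluster has reached gateway.recover_after_data_nodes
// 4. Check whether the number of master-eligible nodes has reached gateway.recover_after_master_nodes (which defaults to discovery.zen.minimum_master_nodes)
// 5. If none of gateway.expected_nodes, gateway.expected_data_nodes and gateway.expected_master_nodes is set, start recovery after gateway.recover_after_time (default 5 min)
// If gateway.expected_nodes, gateway.expected_data_nodes or gateway.expected_master_nodes is set and every condition is met, start recovery immediately; otherwise wait for gateway.recover_after_time and then start recovery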
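
The checks listed above can be condensed into a single decision function. The sketch follows the comments as written here and is not the GatewayService code; RecoveryGate, its Settings holder and the convention that -1 means "not set" are assumptions of this example.

```java
public class RecoveryGate {
    public enum Decision { NOT_READY, RECOVER_NOW, RECOVER_AFTER_DELAY }

    /** Hypothetical holder for the gateway.* settings; -1 means "not set". */
    public static final class Settings {
        public int recoverAfterNodes = -1;
        public int recoverAfterDataNodes = -1;
        public int recoverAfterMasterNodes = -1;
        public int expectedNodes = -1;
        public int expectedDataNodes = -1;
        public int expectedMasterNodes = -1;
    }

    public static Decision decide(Settings s, boolean blockedWithoutMaster,
                                  int nodes, int dataNodes, int masterNodes) {
        // Check 1: no master, no recovery.
        if (blockedWithoutMaster) {
            return Decision.NOT_READY;
        }
        // Checks 2 to 4: the recover_after_* thresholds (an unset -1 never blocks).
        if (nodes < s.recoverAfterNodes
                || dataNodes < s.recoverAfterDataNodes
                || masterNodes < s.recoverAfterMasterNodes) {
            return Decision.NOT_READY;
        }
        // Check 5: without any expected_* setting, wait for recover_after_time.
        boolean anyExpected = s.expectedNodes != -1
                || s.expectedDataNodes != -1
                || s.expectedMasterNodes != -1;
        if (!anyExpected) {
            return Decision.RECOVER_AFTER_DELAY;
        }
        // With expected_* settings: recover at once only if all of them are met.
        boolean allMet = met(s.expectedNodes, nodes)
                && met(s.expectedDataNodes, dataNodes)
                && met(s.expectedMasterNodes, masterNodes);
        return allMet ? Decision.RECOVER_NOW : Decision.RECOVER_AFTER_DELAY;
    }

    private static boolean met(int expected, int actual) {
        return expected == -1 || actual >= expected;
    }

    public static void main(String[] args) {
        Settings s = new Settings();
        s.recoverAfterNodes = 3;
        s.expectedNodes = 5;
        System.out.println(decide(s, false, 4, 4, 1)); // RECOVER_AFTER_DELAY
        System.out.println(decide(s, false, 5, 5, 1)); // RECOVER_NOW
    }
}
```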

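1. If recovery is not immediate, a GENERIC-type threadPool.schedule task is started and recovery runs once the delay has elapsed. The actual recovery is implemented in Gateway.performStateRecovery.

2. Gateway.performStateRecovery first collects the state sent by each master node and computes requiredAllocation from the configuration. requiredAllocation must not exceed the number of nodes the cluster has discovered so far.

3. Process the globalState reported by each node.

3.1 From all the globalStates, pick the one with the highest version as electedGlobalState; if the versions are all equal, use the first.
3.2 Collect the index names from all the globalStates.
3.3 Count the nodes that reported a valid globalState; if the count is below requiredAllocation, abort the recovery.

4. Update globalState from the gathered globalStates.

4.1 Build a new globalState from electedGlobalState, but clear its indices information.
4.2 Iterate over the index names collected in step 3.
4.2.1 Among the globalStates gathered from the nodes, find the indexMetaData with the highest version for this index, take it as electedIndexMetaData, and add it to the new globalState.
4.2.2 If the number of gathered states that contain this index is below requiredAllocation, abort the recovery.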
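
Steps 3 and 4 boil down to picking the highest version for the global state and again for each index, with a quorum check against requiredAllocation at both levels. The sketch below follows the steps as described here and uses invented types (StateElection, NodeState, Result) that reduce each reported state to version numbers; it is not the performStateRecovery implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class StateElection {
    /** Hypothetical per-node report: a global version plus a version per index name. */
    public static final class NodeState {
        public final long globalVersion;
        public final Map<String, Long> indexVersions;

        public NodeState(long globalVersion, Map<String, Long> indexVersions) {
            this.globalVersion = globalVersion;
            this.indexVersions = indexVersions;
        }
    }

    public static final class Result {
        public final int electedNode;                 // position of the elected global state
        public final long globalVersion;
        public final Map<String, Long> indexVersions; // elected version per index

        Result(int electedNode, long globalVersion, Map<String, Long> indexVersions) {
            this.electedNode = electedNode;
            this.globalVersion = globalVersion;
            this.indexVersions = indexVersions;
        }
    }

    /** Returns the elected state, or empty if recovery has to be aborted. */
    public static Optional<Result> elect(List<NodeState> states, int requiredAllocation) {
        int electedNode = -1;
        Set<String> indexNames = new TreeSet<>();
        for (int i = 0; i < states.size(); i++) {
            NodeState s = states.get(i);
            // 3.1: strictly greater, so the first state wins when versions tie.
            if (electedNode == -1 || s.globalVersion > states.get(electedNode).globalVersion) {
                electedNode = i;
            }
            indexNames.addAll(s.indexVersions.keySet()); // 3.2
        }
        // 3.3: not enough states reported.
        if (electedNode == -1 || states.size() < requiredAllocation) {
            return Optional.empty();
        }
        // 4.1: start from the elected global state with an empty index map.
        Map<String, Long> elected = new TreeMap<>();
        for (String index : indexNames) { // 4.2
            long best = Long.MIN_VALUE;
            int copies = 0;
            for (NodeState s : states) {
                Long version = s.indexVersions.get(index);
                if (version != null) {
                    copies++;
                    best = Math.max(best, version); // 4.2.1
                }
            }
            if (copies < requiredAllocation) {      // 4.2.2
                return Optional.empty();
            }
            elected.put(index, best);
        }
        return Optional.of(new Result(electedNode, states.get(electedNode).globalVersion, elected));
    }
}
```

For example, with three reports at global versions 5, 7 and 7 and requiredAllocation set to 2, the second report is elected because it is the first to reach the highest version.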

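Note that all operations are blocked while the metadata is being recovered, to avoid conflicts with the cluster's real metadata.

That completes the reading of the state files.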

Appendix

1. Lucene index file formats

Segments File	segments_N	Stores information about a commit point
Lock File	write.lock	The Write lock prevents multiple IndexWriters from writing to the same file.
Segment Info	.si	Stores metadata about a segment
Compound File	.cfs, .cfe	An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles.
Fields	.fnm	Stores information about the fields
Field Index	.fdx	Contains pointers to field data
Field Data	.fdt	The stored fields for documents
Term Dictionary	.tim	The term dictionary, stores term info
Term Index	.tip	The index into the Term Dictionary
Frequencies	.doc	Contains the list of docs which contain each term along with frequency
Positions	.pos	Stores position information about where a term occurs in the index
Payloads	.pay	Stores additional per-position metadata information such as character offsets and user payloads
Norms	.nvd, .nvm	Encodes length and boost factors for docs and fields
Per-Document Values	.dvd, .dvm	Encodes additional scoring factors or other per-document information.
Term Vector Index	.tvx	Stores offset into the document data file
Term Vector Documents	.tvd	Contains information about each document that has term vectors
Term Vector Fields	.tvf	The field level info about term vectors
Live Documents	.liv	Info about what documents are live