HBase Study Notes (2): HBase Architecture
HBase Architectural Components
HBase also follows a master-slave architecture, made up of three parts: HRegionServer, HBase Master, and ZooKeeper.
RegionServers handle data reads and writes and talk to clients directly; operations on regions are handled by the HMaster; and ZooKeeper keeps track of which nodes are alive.
Underneath, HBase stores its data in HDFS files, so the HDFS NameNode and DataNodes are involved as well. RegionServers are paired with HDFS DataNodes, so a RegionServer can keep its data on the DataNode of its own machine. The NameNode maintains the metadata for every physical data block.
For now a rough idea of what each component does is enough; each one is covered in detail below.
Physically, HBase is composed of three types of servers in a master-slave type of architecture. Region servers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, a distributed coordination service used alongside HDFS, maintains a live cluster state.
The Hadoop DataNode stores the data that the Region Server is managing. All HBase data is stored in HDFS files. Region Servers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks that comprise the files.
Regions
As mentioned in the previous post, an HBase table is split into chunks, each called a Region. Each region holds the rows between its start key and its end key. Regions are distributed across the nodes of the cluster for storage; each such node is called a RegionServer, and these nodes handle reads and writes. A single RegionServer can serve roughly 1,000 regions.
HBase Tables are divided horizontally by row key range into “Regions.” A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes. A region server can serve about 1,000 regions.
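To make regions concrete, here is a minimal HBase 2.x client sketch (the table name "t1" is a placeholder) that lists each region's start/end key range and the RegionServer it is assigned to:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class ListRegions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("t1"))) {
            for (HRegionLocation loc : locator.getAllRegionLocations()) {
                // Each region covers [start key, end key) and lives on one RegionServer.
                System.out.printf("region [%s, %s) -> %s%n",
                    Bytes.toStringBinary(loc.getRegion().getStartKey()),
                    Bytes.toStringBinary(loc.getRegion().getEndKey()),
                    loc.getServerName());
            }
        }
    }
}
```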
HBase HMaster
The HMaster's job is to coordinate the regions, which concretely includes the following:
Region assignment, DDL (create, delete tables) operations are handled by the HBase Master.
A master is responsible for:
Coordinating the region servers
- Assigning regions on startup, re-assigning regions for recovery or load balancing
- Monitoring all RegionServer instances in the cluster (listens for notifications from ZooKeeper)
Admin functions
- Interface for creating, deleting, updating tables
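As a small illustration of master-handled DDL, here is a hedged HBase 2.x sketch (table "t1" and column family "cf" are placeholder names) that creates and then deletes a table through the Admin API:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateAndDropTable {
    public static void main(String[] args) throws Exception {
        TableName tn = TableName.valueOf("t1");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Create: handled by the HMaster, which assigns the new regions.
            admin.createTable(TableDescriptorBuilder.newBuilder(tn)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                .build());
            // Delete: the table must be disabled first.
            admin.disableTable(tn);
            admin.deleteTable(tn);
        }
    }
}
```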
ZooKeeper: The Coordinator
ZooKeeper acts as the coordinator: it maintains the running state of the servers, i.e., which ones are alive and which have died. Every server sends it a periodic heartbeat, so it always knows how each node is doing, and it sends out a notification when it detects that a node has failed.
HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster. ZooKeeper maintains which servers are alive and available, and provides server failure notification. ZooKeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.
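On the client side, the ensemble is configured through the `hbase.zookeeper.quorum` property. A minimal sketch, with placeholder host names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkQuorumConfig {
    public static Configuration clientConf() {
        // Point the client at a three-node ZooKeeper ensemble (hypothetical hosts).
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // default port
        return conf;
    }
}
```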
How the Components Work Together
So how do these components work together?
ZooKeeper acts as the coordinator of state shared across the system. The RegionServers and the active HMaster each establish a session with ZooKeeper, and ZooKeeper maintains an ephemeral node for every active session via heartbeats.
Each RegionServer creates an ephemeral node. The HMaster monitors these nodes to discover available RegionServers, and also to detect which of them have died. HMasters compete to create their own ephemeral node; ZooKeeper decides which HMaster becomes the active one, and makes sure there is only one active HMaster at any time. The active HMaster sends heartbeats to ZooKeeper, while the HMasters that lost the race stay inactive; they are waiting to replace the one active HMaster, so they keep listening to ZooKeeper for a notification that the active HMaster has died. It suddenly strikes me how much this system resembles human society: for anyone in a high position, there are countless people below hoping he stumbles so they can take his place.
If a RegionServer or the active HMaster fails to send heartbeats to ZooKeeper, its session is closed, and all ephemeral nodes belonging to the failed server are deleted. Listeners watching those nodes are notified of the deletions. Since the active HMaster listens for the state of the RegionServers, once one of them fails the HMaster sets about recovering from the failure. If it is the HMaster itself that dies, things are simple: the inactive HMasters below are notified right away and start competing for the "post".
To summarize the process:
ZooKeeper is used to coordinate shared state information for members of distributed systems. Region servers and the active HMaster connect with a session to ZooKeeper. ZooKeeper maintains ephemeral nodes for active sessions via heartbeats.
Each Region Server creates an ephemeral node. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures. HMasters vie to create an ephemeral node. ZooKeeper determines the first one and uses it to make sure that only one master is active. The active HMaster sends heartbeats to ZooKeeper, and the inactive HMaster listens for notifications of the active HMaster failure.
If a region server or the active HMaster fails to send a heartbeat, the session is expired and the corresponding ephemeral node is deleted. Listeners for updates will be notified of the deleted nodes. The active HMaster listens for region servers, and will recover region servers on failure. The inactive HMaster listens for active HMaster failure, and if an active HMaster fails, the inactive HMaster becomes active.
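To make the ephemeral-node mechanism concrete, here is a simplified sketch against the plain ZooKeeper Java API. This is not HBase's internal code; the connect string and node path are hypothetical:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeSketch {
    public static void main(String[] args) throws Exception {
        // Session to the ensemble; the node below lives only as long as this session.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30_000, event -> {});
        zk.create("/rs-server1",                 // hypothetical node path
                  new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);         // deleted when the session expires
        // A watch set via exists() fires when the node disappears,
        // which is how failures are detected by the other servers.
        zk.exists("/rs-server1", true);
    }
}
```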
HBase First Read or Write
HBase has a special catalog table called the META table, which records which node in the cluster holds each region; recall that a region is one of the chunks a full table is split into. The location of the META table itself is stored in ZooKeeper: the table actually lives on some RegionServer, but only ZooKeeper knows which one.
When a client wants to read or write data, the unavoidable question is: which node holds the data I want to read, or which node should receive the data I want to write? The answer lies with the META table.
Step 1: the client asks ZooKeeper which RegionServer the META table is stored on.
Step 2: the client queries that RegionServer. How? By the key of the data it wants to access, of course. Having found the answer, the client caches it together with the META table's location, so the next access to the same data skips the earlier steps and is served straight from the cache; on a cache miss, it simply queries the META table again at the cached META address.
Step 3 is simple: step 2 revealed which RegionServer holds the data for that key, so the client just goes there directly.
There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.
This is what happens the first time a client reads or writes to HBase:
- The client gets the RegionServer that hosts the META table from ZooKeeper.
- The client queries the META server to get the RegionServer corresponding to the row key it wants to access, and caches this information together with the META table location.
- The client gets the row from the corresponding RegionServer.
For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
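From application code, all of this is invisible: a plain Get suffices, and the client library performs the ZooKeeper and META lookups and caches the locations. A minimal sketch with placeholder names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t1"))) {
            // The first Get triggers the META lookup; its result is cached,
            // so later Gets for the same region skip ZooKeeper and META.
            Result r = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] v = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(v == null ? "not found" : Bytes.toString(v));
        }
    }
}
```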
HBase Meta Table
So what exactly is this META table? Let's take a look: it holds a list of all the regions and their locations. The data structure behaves like a b-tree, with keys and values as shown below:
This META table is an HBase table that keeps a list of all regions in the system.
The .META. table is like a b-tree.
The .META. table structure is as follows:
- Key: region start key, region id
- Values: RegionServer
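The b-tree-style lookup boils down to "find the greatest region start key that is less than or equal to my row key". A toy sketch of that idea with a sorted map (purely illustrative, not HBase code):

```java
import java.util.TreeMap;

public class MetaLookupSketch {
    public static void main(String[] args) {
        // start key of each region -> hypothetical server that hosts it
        TreeMap<String, String> meta = new TreeMap<>();
        meta.put("", "rs1.example.com");   // first region: start key is empty
        meta.put("h", "rs2.example.com");
        meta.put("p", "rs3.example.com");
        // The region holding a row is the entry with the greatest
        // start key <= row key, i.e. a floor lookup.
        System.out.println(meta.floorEntry("kiwi").getValue()); // -> rs2.example.com
    }
}
```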
Region Server Components
Now let's look at what a RegionServer is made of. A RegionServer consists of a WAL, a BlockCache, MemStores, and HFiles.
WAL: short for Write Ahead Log, it is simply a log file stored on the distributed file system. It holds data that has not yet been written to disk for permanent storage, and it is used for data recovery. In other words, whatever is to be written must first be registered in this log. The reasoning is simple: without a log file, how would the database recover its data if something went wrong mid-write, say a power failure? With the log written first, it doesn't matter that the data never reached disk: the log tells us exactly which operations were performed, comparing the log against the current contents of the database shows where the failure occurred, and replaying the log step by step from there restores the data.
BlockCache: the read cache. It stores frequently read data and uses the LRU algorithm, meaning that when the cache is full, the least recently used data is evicted.
MemStore: the write cache. It holds data that has not yet been written to disk, kept in sorted order. There is one MemStore per column family (per region).
HFile: stores each row of data on disk in KeyValue format.
A Region Server runs on an HDFS data node and has the following components:
- WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn’t yet been persisted to permanent storage; it is used for recovery in the case of failure.
- BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.
- MemStore: is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.
- HFiles store the rows as sorted KeyValues on disk.
HBase Write Steps
When a client issues a Put request, the row key is first used for addressing: the META table is consulted to find which RegionServer the Put must ultimately go to.
Then the client sends the Put to that RegionServer, which first writes the operation to its WAL log file.
Once the log write succeeds, the RegionServer uses the table name and row key from the Put to locate the target region, then uses the column family to find the corresponding MemStore, and writes the data into it.
Finally, once the data is written, the client receives an acknowledgement that the write is done.
HBase Write Steps (1)
When the client issues a Put request, the first step is to write the data to the write-ahead log, the WAL:
- Edits are appended to the end of the WAL file that is stored on disk.
- The WAL is used to recover not-yet-persisted data in case a server crashes.
HBase Write Steps (2)
Once the data is written to the WAL, it is placed in the MemStore. Then, the put request acknowledgement returns to the client.
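In client code, both steps hide behind a single Put; the WAL behavior can even be tuned per mutation. A minimal sketch with placeholder names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t1"))) {
            Put p = new Put(Bytes.toBytes("row-001"));
            p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            p.setDurability(Durability.SYNC_WAL); // sync the WAL edit before acknowledging
            table.put(p); // returns only after WAL append + MemStore insert
        }
    }
}
```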
HBase MemStore
In the previous step we said the data is written to the MemStore, the write buffer. What does that mean? A write puts the data into this buffer first, where each entry is a KeyValue structure, sorted by key.
The MemStore stores updates in memory as sorted KeyValues, the same as it would be stored in an HFile. There is one MemStore per column family. The updates are sorted per column family.
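As a rough illustration of a sorted write buffer (not HBase's internal code, though the default MemStore has historically been backed by a concurrent skip list):

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal sketch of a sorted write buffer: entries stay ordered by key
// on insertion, just as the MemStore keeps KeyValues sorted per column family.
public class MemStoreSketch {
    private final ConcurrentSkipListMap<String, String> buffer = new ConcurrentSkipListMap<>();

    public void put(String rowKey, String value) {
        buffer.put(rowKey, value); // insertion keeps keys sorted
    }

    public Iterable<String> sortedKeys() {
        return buffer.keySet(); // iteration yields keys in ascending order
    }
}
```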
HBase Region Flush
The buffer has a fixed capacity; when it fills up, its contents are flushed to a new HFile for permanent storage. In HBase, each column family can have multiple HFiles, and the data in an HFile has the same format as in the buffer: a structure of keys mapped to values.
Note that when one MemStore fills up, all the MemStores in that region flush their data to HFiles. This is why the official guidance is not to have too many column families in a table: each column family has its own MemStore, so too many column families lead to frequent buffer flushes and other performance problems.
When the buffer is flushed to an HFile, a sequence number is saved along with it. What is that for? It lets the system know how much data has been persisted so far. The sequence number is stored in the HFile as a meta field, so every flush effectively leaves a marker on the HFile.
When the MemStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS. HBase uses multiple HFiles per column family, which contain the actual cells, or KeyValue instances. These files are created over time as KeyValue edits sorted in the MemStores are flushed as files to disk.
Note that this is one reason why there is a limit to the number of column families in HBase. There is one MemStore per CF; when one is full, they all flush. It also saves the last written sequence number so the system knows what was persisted so far.
The highest sequence number is stored as a meta field in each HFile, to reflect where persisting has ended and where to continue. On region startup, the sequence number is read, and the highest is used as the sequence number for new edits.
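The flush threshold is governed by the `hbase.hregion.memstore.flush.size` property, 128 MB by default in recent versions; it normally lives in hbase-site.xml on the servers. A sketch of the equivalent programmatic setting:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushSizeConfig {
    public static Configuration serverConf() {
        Configuration conf = HBaseConfiguration.create();
        // Flush a MemStore to a new HFile once it reaches 128 MB (the default).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        return conf;
    }
}
```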
HBase HFile
The data in the buffer is sorted by key, so flushing to an HFile is just a matter of writing the records out one after another in order. Such a sequential write is very fast, because it avoids moving the disk head.
Data is stored in an HFile which contains sorted key/values. When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the disk drive head.
HBase HFile Structure
The structure of an HFile is comparatively involved, because query performance has to be considered: ideally you never scan an entire file only to find that the data you want isn't in it. So some care must go into how the file is organized. How can you tell whether the data is in the file without scanning all of it? The answer that comes to mind is an index. That is exactly the idea behind the HFile: it uses a multi-level index, similar in shape to a b-tree. One can't help remarking how widely the b-tree is used in databases!
An HFile contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The multi-level index is like a b+tree:
- Key value pairs are stored in increasing order
- Indexes point by row key to the key value data in 64KB “blocks”
- Each block has its own leaf-index
- The last key of each block is put in the intermediate index
- The root index points to the intermediate index
The trailer, located at the very end of the file and written once the data has been persisted, points to the meta blocks. The trailer also holds information such as bloom filters and time range info. Bloom filters help to skip files that do not contain a certain row key. The time range info is useful for skipping the file if it is not in the time range the read is looking for.
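Bloom filters are configured per column family. A hedged HBase 2.x sketch (placeholder names) enabling a row-key bloom filter and the default 64 KB block size:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterTable {
    static void create(Admin admin) throws java.io.IOException {
        admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("t1"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                .setBloomFilterType(BloomType.ROW) // row-key bloom filter (the default)
                .setBlocksize(64 * 1024)           // 64 KB data blocks (the default)
                .build())
            .build());
    }
}
```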
HFile Index
When an HFile is opened, its index is loaded into the BlockCache; remember what we said the BlockCache is? Right, the read cache.
The index, which we just discussed, is loaded when the HFile is opened and kept in memory. This allows lookups to be performed with a single disk seek.
HBase Read Merge
Now that we've covered HBase's storage structures, reads raise a question: in which places might a given row live? It may already be persisted in an HFile; it may not have been written to an HFile yet and still sit in the MemStore buffer; or, if frequently read, it may be in the read cache, the BlockCache. So how does a read operation locate the data? It takes these steps:
First, where does the read cache rank? Its whole purpose is fast reads, so the BlockCache is absolutely the first priority; don't forget the BlockCache uses the LRU algorithm.
Second, HFiles or the write cache, which comes next? There can be many HFiles, so they are the least efficient option, while there is only one MemStore per column family, which is necessarily much faster to check than the HFiles, so the MemStore is the second priority: if the data isn't found in the read cache, look in the MemStore.
Finally, if the two steps above have unluckily both come up empty, there is no choice but to go to the HFiles.
We have seen that the KeyValue cells corresponding to one row can be in multiple places: row cells already persisted are in HFiles, recently updated cells are in the MemStore, and recently read cells are in the BlockCache. So when you read a row, how does the system get the corresponding cells to return? A read merges KeyValues from the block cache, MemStore, and HFiles in the following steps:
As discussed earlier, there may be many HFiles per MemStore, which means for a read, multiple files may have to be examined, which can affect the performance. This is called read amplification.
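Here is a deliberately simplified sketch of the lookup priority described above. Note that a real HBase read must merge cells from all three sources, because the newest version of a cell may still be in the MemStore while older versions sit in HFiles:

```java
import java.util.List;
import java.util.Map;

// Deliberately simplified sketch of the lookup order described above.
public class ReadMergeSketch {
    private final Map<String, String> blockCache;    // recently read cells
    private final Map<String, String> memStore;      // recently written, not yet flushed
    private final List<Map<String, String>> hFiles;  // persisted cells, one map per file

    ReadMergeSketch(Map<String, String> blockCache, Map<String, String> memStore,
                    List<Map<String, String>> hFiles) {
        this.blockCache = blockCache;
        this.memStore = memStore;
        this.hFiles = hFiles;
    }

    String get(String rowKey) {
        if (blockCache.containsKey(rowKey)) return blockCache.get(rowKey); // 1. read cache
        if (memStore.containsKey(rowKey)) return memStore.get(rowKey);     // 2. write cache
        for (Map<String, String> hFile : hFiles) {                         // 3. HFiles (slowest)
            if (hFile.containsKey(rowKey)) return hFile.get(rowKey);
        }
        return null;
    }
}
```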
HBase Minor Compaction
As described above, writes go to the buffer first, and when the buffer fills, its contents are flushed to an HFile. If writing keeps going like this, the number of HFiles keeps growing and they become awkward to manage, which is where the compaction operation comes in; the word literally means pressing things tightly together.
HBase has two kinds of compaction: minor compaction and major compaction.
A minor compaction merges a number of small HFiles into larger ones. It clearly reduces the number of HFiles, and cells that are already Deleted or Expired are not processed during it. The result of a minor compaction is fewer, larger HFiles.
HBase will automatically pick some smaller HFiles and rewrite them into fewer bigger HFiles. This process is called minor compaction. Minor compaction reduces the number of storage files by rewriting smaller files into fewer but larger ones, performing a merge sort.
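The merge sort at the heart of compaction is cheap because every input HFile is already sorted. A minimal k-way merge sketch (illustrative only, not HBase's implementation):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge over already-sorted inputs, the core idea
// behind compaction: several sorted HFiles in, one sorted file out.
public class MergeSortSketch {
    static List<String> merge(List<List<String>> sortedFiles) {
        // Each heap entry is a pair: (current key, iterator it came from).
        PriorityQueue<Object[]> heap =
            new PriorityQueue<>((a, b) -> ((String) a[0]).compareTo((String) b[0]));
        for (List<String> file : sortedFiles) {
            Iterator<String> it = file.iterator();
            if (it.hasNext()) heap.add(new Object[]{it.next(), it});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] smallest = heap.poll();
            merged.add((String) smallest[0]);
            @SuppressWarnings("unchecked")
            Iterator<String> it = (Iterator<String>) smallest[1];
            if (it.hasNext()) heap.add(new Object[]{it.next(), it});
        }
        return merged;
    }
}
```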
HBase Major Compaction
A major compaction merges all the HFiles belonging to a region into one HFile per column family, i.e., the multiple files of one column family are combined. During this process, cells marked Deleted are removed, Expired cells are dropped, and cells exceeding the maximum number of versions are dropped. The catch is that this merge is very time-consuming.
Major compaction merges and rewrites all the HFiles in a region to one HFile per column family, and in the process, drops deleted or expired cells. This improves read performance; however, since major compaction rewrites all of the files, lots of disk I/O and network traffic might occur during the process. This is called write amplification.
Major compactions can be scheduled to run automatically. Due to write amplification, major compactions are usually scheduled for weekends or evenings. A major compaction also makes any data files that were remote, due to server failure or load balancing, local to the region server.
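Besides the automatic schedule (controlled by the `hbase.hregion.majorcompaction` interval property), a major compaction can be triggered manually, e.g. from an off-peak maintenance job. A minimal HBase 2.x sketch with a placeholder table name:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerMajorCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Ask the servers to major-compact every region of the table.
            admin.majorCompact(TableName.valueOf("t1")); // "t1" is hypothetical
        }
    }
}
```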