
Kubernetes core components etcd details

2022-06-25 06:39:00 InfoQ

1. Kubernetes Core Component: etcd in Detail


Kubernetes is a typical master-slave distributed architecture, composed of a centralized management node (Master Node), distributed worker nodes (Worker Node), and auxiliary tools. etcd is the core component of the management node, mainly responsible for centrally storing the cluster state; its functional architecture is similar to ZooKeeper's. This article explains etcd's architecture and core technologies in detail.

1.1. etcd Development and Evolution

etcd is written in Go and uses the Raft consensus algorithm to implement a highly available distributed key-value database. Its core milestones are:

June 2013: released as open source by the CoreOS team.
June 2014: adopted as the core metadata store of Kubernetes; the etcd community has grown rapidly ever since.
February 2015: the first official stable version, 2.0, was released. It refactored the Raft implementation, gave users a tree-structured data view, and supported more than 1,000 writes per second.
January 2017: version 3.1 was released, with a brand-new set of APIs and a gRPC interface; the gRPC proxy greatly expanded and improved etcd's read performance, supporting more than 10,000 writes per second.
November 2018: the project entered the CNCF incubation program.
August 2019: version 3.4 was released, developed jointly by Google, Alibaba, and others, further improving etcd's performance and stability.
July 2021: version 3.5 was released, supporting Go module semantic versioning and modularization, with improved performance and stability and enhanced cluster operations capabilities.

After this long, continuous evolution, etcd has matured, and its performance and stability have improved greatly. Fixes for current instabilities on Kubernetes clusters, such as OOM caused by expensive requests like listing all Pods, as well as features such as Range Stream and QoS, are planned for etcd 3.6 and are worth looking forward to.

1.2. etcd Architecture and Functions

 
A user read request arrives through the HTTP Server and is forwarded to the Store, which carries out the actual operation. If the request modifies state, it is handed to the Raft module, which updates the state, records the change in the log, replicates it to the other nodes in the cluster, and waits for their confirmation; finally the data is committed and synchronized once more. etcd's overall architecture divides into four modules:

HTTP Server: handles API requests from users, plus synchronization and heartbeat requests from other etcd nodes.
Store: implements the various features etcd supports, including data indexing, node state changes, watch and feedback, and event handling and execution; it is the concrete implementation of most of the APIs etcd exposes to users.
Raft: the implementation of the Raft strong-consistency algorithm, the heart of etcd.
WAL: the Write Ahead Log, etcd's storage format. Besides keeping the state of all data and the node index in memory, etcd persists everything through the WAL: all data is written to the WAL before being committed. An Entry is a concrete stored log record; a Snapshot is a state snapshot taken to keep the log from growing too large.
An etcd cluster is usually made up of 3 or 5 nodes that cooperate through the Raft consensus algorithm. One node acts as the Leader, responsible for replicating and distributing data; when the Leader fails, another node is automatically elected as the new Leader.

A key concept in etcd's architecture is the quorum, defined as (n/2)+1: the smallest number of nodes that forms a majority of the cluster. As long as a quorum of nodes is working normally, the whole cluster keeps working and serving requests.
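The quorum arithmetic can be sketched in a few lines of Go (a standalone illustration, not etcd code):

```go
package main

import "fmt"

// quorum is the minimum number of healthy nodes an n-node cluster
// needs to keep serving: a strict majority, (n/2)+1 in integer math.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance is how many node failures the cluster survives.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("n=%d quorum=%d tolerates=%d failure(s)\n", n, quorum(n), faultTolerance(n))
	}
}
```

Note that a 4-node cluster tolerates no more failures than a 3-node one, which is one reason etcd clusters use odd sizes.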
To keep providing service after some nodes fail, a very hard distributed-consistency problem has to be solved. etcd uses the Raft consensus algorithm: through its data-synchronization mechanism, all committed data is resynchronized after a Leader switch, guaranteeing that the data of the whole cluster stays consistent.
A client may pick any one of the nodes to read and write data; the internal state and data coordination are handled entirely by etcd itself. etcd's internals are fairly complex, but the user-facing interface is very simple: clients can connect to the cluster through a client library or over HTTP and operate on the data. The interface falls roughly into 5 groups:
Put(key, value) / Delete(key): Put writes data into the cluster; Delete removes data from it.
Get(key) / Get(keyFrom, keyEnd): two query modes are supported, a point query for a single key and a range query over a specified key interval.
Watch(key) / Watch(keyPrefix): the Watch mechanism provides incremental update notifications; both a single key and a key prefix can be watched, and key prefixes are recommended.
Transactions(if/then/else ops).Commit(): simple transactions are supported; specify a set of conditions, perform one set of operations when they hold and a different set when they do not.
Leases: Grant/Revoke/KeepAlive: the lease mechanism; distributed systems usually need to detect whether a node is still alive, which calls for leases.

1.3. etcd Core Technologies and Principles

1.3.1. From Paxos to Raft: Achieving Distributed Consistency

 
The Paxos algorithm, proposed by Leslie Lamport in 1990, is the classic and complete distributed consensus algorithm. Its goal is for different nodes to reach agreement on the value of a key, i.e. to settle on the same value. One decision round in Paxos (Round, Phase) has two phases:

Prepare phase (phase 1): the Proposer sends a prepare message carrying a proposal number to more than half ((n/2)+1) of the Acceptors. If an Acceptor receives a Prepare request numbered N, and N is greater than the number of every Prepare request it has already answered, it feeds N back to the Proposer as its response and promises never to accept any proposal numbered less than N; otherwise it rejects the request.

Accept phase (phase 2): if more than half of the Acceptors reply with a promise, the Proposer sends an accept message to the Acceptors. An Acceptor checks whether the accept message satisfies the rules: as long as it has not answered any Prepare request numbered greater than N, it accepts the proposal.
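The two phases can be simulated with a minimal single-decree Paxos sketch in Go (a single-process illustration under simplifying assumptions, with no networking or failures):

```go
package main

import "fmt"

// acceptor holds the single-decree Paxos acceptor state described above.
type acceptor struct {
	promised int    // highest prepare number promised (0 = none)
	accN     int    // number of the accepted proposal, if any
	accV     string // value of the accepted proposal
}

// prepare implements phase 1: promise iff n is higher than any promise so far.
func (a *acceptor) prepare(n int) (ok bool, accN int, accV string) {
	if n > a.promised {
		a.promised = n
		return true, a.accN, a.accV
	}
	return false, 0, ""
}

// accept implements phase 2: accept iff no higher-numbered promise was made.
func (a *acceptor) accept(n int, v string) bool {
	if n >= a.promised {
		a.promised, a.accN, a.accV = n, n, v
		return true
	}
	return false
}

// propose runs both phases against all acceptors and returns the chosen value.
func propose(accs []*acceptor, n int, v string) (string, bool) {
	quorum := len(accs)/2 + 1
	promises, maxN := 0, 0
	for _, a := range accs {
		if ok, an, av := a.prepare(n); ok {
			promises++
			if an > maxN { // a value was already accepted: we must adopt it
				maxN, v = an, av
			}
		}
	}
	if promises < quorum {
		return "", false
	}
	accepts := 0
	for _, a := range accs {
		if a.accept(n, v) {
			accepts++
		}
	}
	return v, accepts >= quorum
}

func main() {
	accs := []*acceptor{{}, {}, {}}
	v1, ok1 := propose(accs, 1, "alpha")
	v2, ok2 := propose(accs, 2, "beta") // must discover and keep "alpha"
	fmt.Println(v1, ok1, v2, ok2)
}
```

The second proposal discovers the already-accepted value "alpha" during its prepare phase and must adopt it, which is exactly how Paxos preserves agreement.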
In practice, Paxos has spawned a series of variants: PBFT is a consensus algorithm that solves the Byzantine Generals problem more efficiently; Multi-Paxos optimizes the efficiency of the prepare phase, letting multiple proposals proceed at once; and there are also FastPaxos, EPaxos, and others. Each of these evolutions optimizes for some particular problem, while the core ideas still rest on Paxos.

Since it was proposed, Paxos has been the standard protocol for distributed consensus algorithms, but it is hard to understand and very complex to implement; to this day there is no complete canonical implementation, and many implementations are merely Paxos-like.
The Raft algorithm came out of Stanford University in 2014. Using a divide-and-conquer approach, Raft splits the consensus problem into three subproblems: leader election (Leader election), log replication (Log replication), and safety (Safety).
 
(1) Leader election: Raft defines 4 node states: Leader, Follower, Candidate, and PreCandidate.

Under normal operation, the Leader broadcasts heartbeat messages to the Followers at a regular heartbeat interval to maintain its leadership, and each Follower replies on receipt. The Leader carries a term number (term), used to compare the state of the nodes and to recognize stale Leaders.

When the Leader fails, the Followers stop receiving its heartbeats. Once the silence exceeds the election timeout, a Follower enters the PreCandidate state: it initiates a pre-vote without incrementing its term number, and only after winning the approval of a majority of nodes does it enter the Candidate state. It then waits a random interval and launches the real election: it increments its term number, votes for itself, and sends vote requests to the other nodes. A node receiving a vote request first checks that the candidate's log and term number are at least as current as its own and that it has not already started an election or cast a vote in this term; only then does it grant its vote, otherwise it refuses.
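A drastically simplified model of this election flow, with the PreCandidate pre-vote and the term bump, might look like this (hypothetical sketch; the real logic lives in etcd's raft package and is far more involved):

```go
package main

import "fmt"

type state int

const (
	follower state = iota
	preCandidate
	candidate
	leader
)

// node models the flow above: missed heartbeats -> PreCandidate
// (pre-vote, term unchanged) -> Candidate (term+1, vote for self)
// -> Leader on majority.
type node struct {
	st      state
	term    int
	elapsed int // ticks since the last heartbeat was received
	timeout int // election timeout in ticks
}

// tick advances time; without heartbeats the node starts a pre-vote.
func (n *node) tick() {
	n.elapsed++
	if n.st == follower && n.elapsed >= n.timeout {
		n.st = preCandidate // pre-vote: the term is NOT incremented yet
	}
}

// preVoteWon moves the node to Candidate: only now does the term increase.
func (n *node) preVoteWon() {
	n.st = candidate
	n.term++
}

// electionWon requires votes from a strict majority of the cluster.
func (n *node) electionWon(votes, clusterSize int) bool {
	if votes >= clusterSize/2+1 {
		n.st = leader
		return true
	}
	return false
}

func main() {
	n := &node{st: follower, term: 2, timeout: 10}
	for i := 0; i < 10; i++ {
		n.tick()
	}
	fmt.Println(n.st == preCandidate, n.term) // pre-vote started, term still 2
	n.preVoteWon()
	fmt.Println(n.st == candidate, n.term) // term bumped to 3
	fmt.Println(n.electionWon(2, 3))       // 2 of 3 votes: majority, leader
}
```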
 
(2) Log replication (Log replication): a Raft log consists of entries indexed in order; each entry holds a term number and the proposal content. The Leader tracks each Follower's progress by maintaining two fields: NextIndex, the index of the next log entry the Leader will send to that Follower, and MatchIndex, the highest log entry index known to be replicated on that Follower.

Taking "hello=world" as an example, the whole flow from the Client submitting a proposal to receiving the response shows how Raft processes the log: the Leader synchronizes the entry to each Follower, guaranteeing the data consistency of the etcd cluster:

1. when the Leader receives the proposal from the Client, it generates a log entry, walks the log progress of every Follower, and builds a log-append RPC message for each Follower;
2. the network module broadcasts the log-append RPC messages to all Followers;
3. after a Follower receives and persists the appended entries, it replies to the Leader with the highest log entry index it has replicated, i.e. its MatchIndex;
4. on receiving a Follower's response, the Leader updates that Follower's MatchIndex;
5. from the MatchIndex reported by each Follower, the Leader computes the committed index position of the log, the position up to which entries have been persisted by more than half of the nodes;
6. the Leader announces the committed log index position to every Follower via heartbeats;
7. once the Client's proposal is confirmed as committed, the Leader replies to the Client that the proposal has passed.
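Step 5, computing the committed index from the MatchIndex values, can be sketched as (illustrative only):

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex computes step 5 above: given the MatchIndex of every node
// (including the leader's own last index), the committed position is the
// highest index replicated on a majority of nodes.
func commitIndex(matchIndex []int) int {
	s := append([]int(nil), matchIndex...)
	sort.Ints(s)
	// with n ascending values, the value at position (n-1)/2 is the
	// largest index held by at least a majority (n/2+1) of nodes
	return s[(len(s)-1)/2]
}

func main() {
	// 5-node cluster: leader at 9, followers at 9, 8, 5, 5
	fmt.Println(commitIndex([]int{9, 9, 8, 5, 5})) // 8: replicated on 3 of 5
}
```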

 
(3) Safety (Safety): a series of rules added to election and log replication guarantees the safety of the Raft algorithm:

1. within one term number, at most one Leader can be elected, and election requires the support of more than half of the cluster nodes;
2. when a node receives a vote request, it refuses to vote if the term number of the candidate's latest log entry is smaller than its own; if the term numbers are equal but the candidate's log is shorter than its own, it likewise refuses;
3. Leader completeness: if a log entry has been committed under some term number, that entry must appear in the Leaders of all larger term numbers;
4. a Leader may only append log entries; persisted log entries can never be deleted;
5. log matching: when the Leader sends a log-append message, it includes the index position (call it P) and term number of the preceding log entry; on receiving the message, the Follower checks whether its own term number at index P matches the Leader's, and appends only if they match.
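Rule 2, the "at least as up to date" vote check, can be written as a small predicate (illustrative sketch):

```go
package main

import "fmt"

// logUpToDate implements voting rule 2 above: grant a vote only if the
// candidate's last log entry is at least as up to date as ours --
// a higher last term wins; equal terms compare log length (last index).
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(logUpToDate(3, 5, 2, 9)) // true: higher term beats longer log
	fmt.Println(logUpToDate(2, 4, 2, 9)) // false: same term, shorter log
	fmt.Println(logUpToDate(2, 9, 2, 9)) // true: identical logs are acceptable
}
```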
etcd uses the Raft protocol to keep the state of every node in the cluster consistent. Multiple distributed nodes communicate with one another and serve the outside world as a single whole; each node stores the complete data, and the Raft protocol guarantees that the data each node maintains is consistent. Every etcd node maintains a state machine, and at any moment there is at most one effective Leader. The Leader handles all writes from clients, and the Raft protocol ensures that the changes those writes make to the state machine are reliably replicated to the other nodes.

1.3.2. An Efficient KV Storage Engine Built on boltdb

etcd uses boltdb to persist its core KV metadata. A boltdb database is just a single db file indexed by a B+ tree, so reads are efficient and stable. Read transactions may run concurrently, but write transactions are strictly serial; they carry a high overhead and cannot run in parallel, so only batching operations mitigates the performance cost.
 
(1) page file blocks: boltdb divides its space into fixed-length 4 KB storage units, the page file blocks, and every other object is an abstraction built on top of them.

boltdb's unit of storage is the page; each page consists of a header plus data, and pages come in several types:

meta page: metadata. It holds the location of the root bucket of the stored B+ tree, the freelist of available blocks, the current maximum offset, and so on. There are two copies, metaA and metaB, used to separate the in-flight transaction from the last completed one (only one write transaction runs at a time); the txid inside each meta determines which one is currently effective (the txid increments with every transaction).
bucket page: stores bucket name data; bucket names are themselves stored as a B+ tree. The pageid of the root bucket is assigned from the root field of the meta.
branch page: stores branch node content. Branch nodes hold only key information, no values, together with the pageid of each child node.
leaf page: stores leaf node content.
freelist page: the freelist manages the list of currently available pageids; it is located via the freelist field of the meta metadata.
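The txid-based choice between metaA and metaB can be sketched as follows (a hypothetical model, not boltdb code; real boltdb also validates a checksum before trusting a meta page):

```go
package main

import "fmt"

// meta models the two alternating boltdb meta pages described above:
// the copy with the highest txid is the currently effective one.
type meta struct {
	txid int
	root int // pageid of the root bucket
}

// currentMeta picks the effective meta page, as boltdb does on open.
func currentMeta(a, b meta) meta {
	if a.txid > b.txid {
		return a
	}
	return b
}

func main() {
	metaA := meta{txid: 41, root: 7}
	metaB := meta{txid: 42, root: 9} // written by the last committed txn
	fmt.Println(currentMeta(metaA, metaB).root)
}
```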

 
(2) B+ tree index: boltdb uses a variant of the B+ tree: the branching factor of nodes is not fixed, leaf nodes are not linked to one another, and it is not guaranteed that all leaves sit on the same level; index lookup and data organization, however, follow the standard B+ tree. A Bucket is essentially a namespace, a set of key/value pairs; different Buckets may hold keys with the same name. A node is the abstract in-memory encapsulation of a B+ tree node: a page is a physical on-disk concept, while a node is a logical abstraction. Buckets can be nested in boltdb; there is one built-in root Bucket, generated automatically and pointed to by the meta rather than created by a user, and every Bucket created afterwards is a sub-bucket of it.
boltdb implements transactions in a very simple way: updated data never overwrites the old data in place. The meta itself is backed by two pages that alternate; every write goes to a fresh location, and only at the very end is the path reference switched over. The concrete steps are:

1. find, via the key, the page in the B+ tree;
2. read that page in, build a node, and update the node's content;
3. write the node's content to free pages, layer by layer up the tree;
4. finally, update the meta's index content.
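The never-overwrite update from the steps above can be modeled in a few lines (purely illustrative; real boltdb rewrites B+ tree pages and the freelist as well):

```go
package main

import "fmt"

// db models boltdb's never-overwrite-in-place rule: an update writes
// the new content to a fresh page and only then flips the root
// reference held by the meta.
type db struct {
	pages []string // page id -> content; old pages are never rewritten
	root  int      // meta's root reference: the currently live page
}

func (d *db) update(content string) {
	d.pages = append(d.pages, content) // step 3: write to a free page
	d.root = len(d.pages) - 1          // step 4: update the meta reference
}

func main() {
	d := &db{pages: []string{"v1"}, root: 0}
	d.update("v2")
	fmt.Println(d.pages[d.root]) // the live content: "v2"
	fmt.Println(d.pages[0])      // the old page is intact: "v1"
}
```

Because readers holding the old root still see a consistent tree, this is also what lets read transactions run concurrently with the single writer.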

1.3.3. MVCC Multi-Versioning & the Watch Mechanism

etcd uses term to identify the tenure of the cluster Leader: whenever a Leader switch happens, term increases by 1. It uses revision to identify the version of the global data: whenever the data changes (create, modify, delete), revision increases by 1. Even across many Leader terms in the same cluster, revision keeps increasing globally and monotonically, so that every modification of the cluster corresponds to a unique revision. This is what implements etcd's MVCC multi-versioning and its Watch mechanism.
 
(1) MVCC multi-versioning: as the figure above illustrates, during one Leader's term every modification carries the same term value (always 2 here) while revision keeps increasing monotonically. After the cluster restarts, modifications carry term 3; during the new Leader's term all operations share term 3, which does not change, while the corresponding revision values keep increasing monotonically. Viewed at a larger scale, across the two Leader terms term=2 and term=3, the revision of the data still increases monotonically.
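A toy model of this revision scheme, with one global monotonically increasing revision and retained history, might look like this (a hypothetical sketch, not etcd's mvcc package):

```go
package main

import "fmt"

// version is one historical value of a key, stamped with the global revision.
type version struct {
	rev   int64
	value string
}

// mvccStore models the multi-versioning above: every change, on any key,
// bumps one global monotonically increasing revision, and old versions
// are kept rather than overwritten.
type mvccStore struct {
	rev  int64
	hist map[string][]version
}

func newMVCC() *mvccStore { return &mvccStore{hist: map[string][]version{}} }

func (s *mvccStore) Put(key, value string) int64 {
	s.rev++
	s.hist[key] = append(s.hist[key], version{rev: s.rev, value: value})
	return s.rev
}

// GetAt returns the value of key as of revision rev, i.e. a historical read.
func (s *mvccStore) GetAt(key string, rev int64) (string, bool) {
	var out string
	found := false
	for _, v := range s.hist[key] {
		if v.rev <= rev {
			out, found = v.value, true
		}
	}
	return out, found
}

func main() {
	s := newMVCC()
	s.Put("a", "1") // rev 1
	s.Put("b", "x") // rev 2: a different key still bumps the global revision
	s.Put("a", "2") // rev 3
	v, _ := s.GetAt("a", 2)
	fmt.Println(v, s.rev) // at rev 2, "a" was still "1"; the latest rev is 3
}
```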
 
(2) Watch mechanism: in the clientv3 library, the Watch feature is abstracted into three simple APIs, Watch, Close, and RequestProgress, shielding developers from the complex details of the interaction between the client and the gRPC WatchServer. One client supports multiple gRPC streams, and one gRPC stream supports multiple watchers, which significantly lowers development complexity. In addition, when the node a watch is connected to fails, clientv3 automatically reconnects to a healthy node and creates a new watcher starting from the highest revision received so far, avoiding replay of old events and similar problems.
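The fan-out of change events to prefix watchers can be modeled with Go channels (a toy sketch; the real mechanism runs over gRPC streams as described above):

```go
package main

import (
	"fmt"
	"strings"
)

// event is a minimal watch notification: the changed key, value, revision.
type event struct {
	key, value string
	rev        int64
}

// watchStore models the Watch mechanism above: each Put bumps the
// revision and fans the event out to every watcher whose prefix matches.
type watchStore struct {
	rev      int64
	watchers map[string][]chan event // prefix -> subscriber channels
}

func newWatchStore() *watchStore {
	return &watchStore{watchers: map[string][]chan event{}}
}

// Watch subscribes to all keys under prefix, like Watch(keyPrefix).
func (s *watchStore) Watch(prefix string) <-chan event {
	ch := make(chan event, 16) // buffered so Put never blocks in this toy
	s.watchers[prefix] = append(s.watchers[prefix], ch)
	return ch
}

func (s *watchStore) Put(key, value string) {
	s.rev++
	for prefix, chans := range s.watchers {
		if strings.HasPrefix(key, prefix) {
			for _, ch := range chans {
				ch <- event{key: key, value: value, rev: s.rev}
			}
		}
	}
}

func main() {
	s := newWatchStore()
	ch := s.Watch("app/")
	s.Put("app/config", "v1")
	s.Put("other", "ignored") // no matching prefix: no event delivered
	e := <-ch
	fmt.Println(e.key, e.value, e.rev)
}
```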

1.4. Summary

etcd is the project that came after ZooKeeper and doozer. It draws deeply on the essence of both while also fixing their problems. Its usage scenarios are almost identical, mainly distributed cooperation and coordination: cluster monitoring, leader election, service discovery, message distribution and subscription, load balancing, distributed notification and coordination, distributed locks, and distributed queues.

Meanwhile, driven by the Kubernetes ecosystem, etcd has developed rapidly into a stable project; in overall architecture design, performance, stability, and user experience it now goes far beyond ZooKeeper and doozer. It is integrated and depended on not only by a large number of products in the Kubernetes ecosystem, but also by more and more products outside it.



Copyright notice: this article was created by [InfoQ]; please include a link to the original when reposting.
https://yzsam.com/2022/02/202202201236120625.html