DPVS FullNAT mode with keepalived
2022-06-26 00:40:00 【tinychen777】
This article describes problems encountered, and the ideas used to handle them, when deploying DPVS in FullNAT mode on CentOS 7.9 with a keepalived active/standby high-availability cluster in a production environment.
All IP addresses, hostnames and MAC addresses in this article have been desensitized or modified, and the client IPs are generated by a simulator; this does not affect readability.
1、keepalived architecture
1.1 Single-node architecture diagram
(Architecture diagram: https://resource.tinychen.com/202111161128671.svg)
To make the diagram easier to follow, it can be divided into four parts: the DPVS network stack, the Linux network stack, the RS cluster, and the consumers (SA and users). Physical NICs in the Linux network stack are labeled eth, physical NICs in the DPVS network stack are labeled dpdk, the NICs that DPVS virtualizes into the Linux network stack carry the kni suffix, and bonded NICs in both stacks are labeled BOND.
By default, the traffic of every kni NIC is hijacked by the DPVS process.
1.2 Purpose of each NIC
In the keepalived dual-arm architecture, each DPVS machine needs at least three groups of NICs; whether or not bonding is used does not change the architecture diagram. The diagram above uses bonding (mode 4), so the NIC names are bond0, bond1 and bond2. As long as the role of each group of NICs is clear, the architecture in the diagram is easy to understand.
- bond0: mainly used by the operations staff to manage the machine and by keepalived to health-check the backend RS nodes. It exists only in the Linux network stack: the NICs of the DPVS network stack (including their virtual kni NICs) exist only while the DPVS process is running, so an independent NIC outside of DPVS is needed to manage the machine (monitoring and alerting, ssh login, and so on). keepalived's health checks of the backend RS nodes can only go through the Linux network stack, so in the architecture above bond0 happens to carry them; if there are several intranet NICs in the Linux network stack, the one actually used is decided by the Linux source-address/route selection (with a single NIC it is likewise decided by the route).
- bond1.kni: plays no role during normal operation. bond1.kni is the virtual NIC that the DPVS bond1 NIC exposes in the Linux network stack; its positioning overlaps bond0 almost completely, so it is best to shut it down so that it cannot interfere with bond0. The reason not to delete it entirely is that when the DPVS process misbehaves, or packets on bond1 need to be captured, bond1's traffic can be forwarded to bond1.kni and inspected there.
- bond2.kni: mainly used to refresh the MAC address of the VIP. A kni NIC shares its MAC address with the corresponding DPVS bond NIC, and tools such as ping and arping cannot operate on a DPVS NIC directly, so when GARP or unsolicited NA packets need to be sent to refresh the MAC address recorded on the switch for an IPv4 or IPv6 VIP, the operation can be done through the kni NIC corresponding to that DPVS NIC.
- bond1: service-traffic NIC, mainly used to carry the LIPs, establish connections to the RSes and forward requests. The LIPs configured through the local_address_group field are generally placed on bond1.
- bond2: service-traffic NIC, mainly used to carry the VIPs, establish connections with clients and receive requests. The dpdk_interface field is specific to the DPVS-customized keepalived and is what allows the VIP to be configured onto a dpvs NIC (see the sketch after this list).
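For concreteness, here is a trimmed, illustrative sketch of how these fields typically show up in the DPVS-flavoured keepalived configuration, loosely following the config format in the DPVS tutorial; the interface names, addresses and check parameters are placeholders adapted to this article's architecture, not a copy of a real production config:
# Illustrative sketch only -- names, addresses and parameters are placeholders
local_address_group laddr_g1 {
    192.168.228.1 bond1          # LIPs live on bond1, the RS-facing NIC
}
vrrp_instance VI_EXT {
    state MASTER
    interface bond2.kni          # Linux-stack NIC used for VRRP communication
    dpdk_interface bond2         # DPVS-specific field: put the VIP on the dpvs NIC
    virtual_router_id 100
    priority 100
    virtual_ipaddress {
        10.0.96.216
    }
}
virtual_server 10.0.96.216 443 {
    delay_loop 3
    lb_algo rr
    lb_kind FNAT
    protocol TCP
    laddr_group_name laddr_g1    # bind the LIP group above to this service
    real_server 192.168.229.50 443 {
        weight 100
        tcp_check {              # health check goes out via the Linux stack (bond0)
            connect_timeout 3
            connect_port 443
        }
    }
}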
Note: communication between the keepalived master and backup nodes must go through a NIC in the Linux network stack; in this architecture that can be either bond0 or bond2.kni.
2、dpdk NIC related issues
2.1 Principle analysis
NICs in DPVS are named in the order of their PCIe addresses. With the dpdk-devbind tool we can see each NIC's PCIe address and its name in the Linux network stack.
If the Linux system names its NICs eth* and pins the MAC-address-to-name mapping in /etc/udev/rules.d/70-persistent-net.rules, then we need to pay special attention to the correspondence among four things: PCIe address, DPVS NIC name, MAC address and Linux NIC name.
Especially when the machine has NICs on multiple network segments and uses bonding, eth* in Linux and dpdk* in DPVS do not necessarily match up one to one. In that case it is best to adjust the configuration, or have the data-center staff re-cable the NICs (of course, the NIC order can also be changed directly in the dpvs configuration file).
2.2 Solution
The following case is a combination that is unlikely to cause problems, provided for reference only: the eth NICs are named in ascending order of their PCIe addresses, the dpdk NICs follow the same rule, and eth[0-3] are intranet NICs while eth[4-5] are extranet NICs, matching the architecture diagram of this article, so mistakes are hard to make.
[[email protected] dpvs]# dpdk-devbind --status-dev net
Network devices using DPDK-compatible driver
============================================
0000:04:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=igb_uio unused=ixgbe
0000:04:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=igb_uio unused=ixgbe
0000:82:00.0 'Ethernet 10G 2P X520 Adapter 154d' drv=igb_uio unused=ixgbe
0000:82:00.1 'Ethernet 10G 2P X520 Adapter 154d' drv=igb_uio unused=ixgbe
Network devices using kernel driver
===================================
0000:01:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=eth0 drv=ixgbe unused=igb_uio
0000:01:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=eth1 drv=ixgbe unused=igb_uio
[[email protected] dpvs]# cat /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
address}=="28:6e:45:c4:0e:48", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
address}=="28:6e:45:c4:0e:4a", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
address}=="38:e2:ba:1c:dd:74", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
address}=="38:e2:ba:1c:dd:76", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
address}=="b4:45:99:18:6c:5c", NAME="eth4"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
address}=="b4:45:99:18:6c:5e", NAME="eth5"
[[email protected] dpvs]# dpip link -v show | grep -A4 dpdk
1: dpdk0: socket 0 mtu 1500 rx-queue 16 tx-queue 16
UP 10000 Mbps full-duplex auto-nego promisc
addr 38:E2:BA:1C:DD:74 OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
pci_addr driver_name
0000:04:00:0 net_ixgbe
--
2: dpdk1: socket 0 mtu 1500 rx-queue 16 tx-queue 16
UP 10000 Mbps full-duplex auto-nego promisc
addr 38:E2:BA:1C:DD:76 OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
pci_addr driver_name
0000:04:00:1 net_ixgbe
--
3: dpdk2: socket 0 mtu 1500 rx-queue 16 tx-queue 16
UP 10000 Mbps full-duplex auto-nego promisc
addr B4:45:99:18:6C:5C OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
pci_addr driver_name
0000:82:00:0 net_ixgbe
--
4: dpdk3: socket 0 mtu 1500 rx-queue 16 tx-queue 16
UP 10000 Mbps full-duplex auto-nego promisc
addr B4:45:99:18:6C:5E OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
pci_addr driver_name
0000:82:00:1 net_ixgbe
2.3 Packet capture and troubleshooting
Under normal circumstances, the DPVS network stack hijacks all the traffic of the corresponding DPVS NIC, so running tcpdump against the matching kni NIC captures nothing. The more convenient workaround is to use the dpip command to forward the dpvs NIC's traffic to the corresponding kni NIC and then capture on the kni NIC:
dpip link set <port> forward2kni on # enable forward2kni on <port>
dpip link set <port> forward2kni off # disable forward2kni on <port>
For a dpvs node, the <port> in the command is generally the bond1 or bond2 NIC as seen with the dpip command.
See also the official reference:
https://github.com/iqiyi/dpvs/blob/master/doc/tutorial.md#packet-capture-and-tcpdump
Note: the forward2kni operation has a significant performance impact; do not perform it on a node serving production traffic!
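As a concrete example, a debugging session might look like the sketch below (bond2 and the capture filter are illustrative; remember the warning above and switch forwarding back off as soon as you are done):
dpip link set bond2 forward2kni on          # forward bond2's traffic to bond2.kni
tcpdump -i bond2.kni -nn host 10.0.96.216 and port 443
dpip link set bond2 forward2kni off         # always turn it back off afterwards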
3、kni NIC related issues
This section expands on the role of the kni NICs introduced above and covers some related problems and their solutions.
3.1 The role of the kni NIC
In general, every DPVS NIC gets a corresponding virtual kni NIC in the Linux network stack. **Keep in mind that by default, the traffic of every kni NIC is hijacked by the DPVS process.** In the architecture of this article, the kni NICs mainly help with fault diagnosis and handle a small amount of supplementary work:
- When a dpvs NIC has a problem, its traffic can be forwarded to the kni NIC for debugging; when something goes wrong with a VIP, the kni NIC can be used to refresh the VIP's MAC address.
- The kni NIC is itself an ordinary virtual NIC whose traffic simply gets hijacked by DPVS. A route can be configured in DPVS to release specific traffic to the kni NIC for supplementary work; for example, when a DPVS node occasionally needs Internet access, the extranet IP can be released through bond2.kni and then used to reach the Internet (a sketch follows this list).
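A minimal sketch of that Internet-access case, assuming bond2.kni carries an extranet IP and gateway (both placeholders; the kni_host route itself is explained in section 3.3.2):
dpip route add <extranet_ip>/32 scope kni_host dev bond2   # release the IP to the Linux stack
ip route add default via <extranet_gw> dev bond2.kni       # let the Linux stack route out via bond2.kni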
3.2 kni NIC routing interference
3.2.1 Reproducing the case
In the architecture diagram, bond1.kni and bond0 are both positioned as intranet NICs. If the two NICs are on the same network segment, special attention must be paid to which NIC intranet traffic enters and leaves through. Here a virtual machine is used as an example:
[[email protected] ~]# ip a
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:66:3b:08 brd ff:ff:ff:ff:ff:ff
inet 10.31.100.2/16 brd 10.31.255.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:47:37:3e brd ff:ff:ff:ff:ff:ff
inet 10.31.100.22/16 brd 10.31.255.255 scope global noprefixroute eth1
valid_lft forever preferred_lft forever
...
[[email protected] ~]# ip r
...
10.31.0.0/16 dev eth0 proto kernel scope link src 10.31.100.2 metric 100
10.31.0.0/16 dev eth1 proto kernel scope link src 10.31.100.22 metric 101
...
The virtual machine above has two NICs on the 10.31.0.0/16 segment: eth0 (10.31.100.2) and eth1 (10.31.100.22). The routing table shows two routes for 10.31.0.0/16, one pointing at eth0's IP and one at eth1's IP, differing only in metric. Now run a test:
First, from the machine 10.31.100.1, ping this VM's eth1 (10.31.100.22), then capture packets directly with tcpdump.
# Capturing on eth1 (10.31.100.22): no matching icmp packets are captured
[[email protected] ~]# tcpdump -A -n -vv -i eth1 icmp
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
# Capturing on eth0 (10.31.100.2): the icmp packets sent by the external machine (10.31.100.1) to eth1 (10.31.100.22) show up here
[[email protected] ~]# tcpdump -A -n -vv -i eth0 icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:29:44.789831 IP (tos 0x0, ttl 64, id 1197, offset 0, flags [DF], proto ICMP (1), length 84)
10.31.100.1 > 10.31.100.22: ICMP echo request, id 16846, seq 54, length 64
    ...
16:29:44.789898 IP (tos 0x0, ttl 64, id 16187, offset 0, flags [none], proto ICMP (1), length 84)
    10.31.100.22 > 10.31.100.1: ICMP echo reply, id 16846, seq 54, length 64
    ...
16:29:45.813740 IP (tos 0x0, ttl 64, id 1891, offset 0, flags [DF], proto ICMP (1), length 84)
    10.31.100.1 > 10.31.100.22: ICMP echo request, id 16846, seq 55, length 64
    ...
3.2.2 Principle analysis
Here we can see that even though 10.31.100.22 lives on eth1, the traffic actually flows through eth0; in other words, there is no traffic on eth1 at all. This matches the routing table: 10.31.100.2 has metric 100, smaller than 10.31.100.22's metric 101, and a smaller metric means higher priority.
Apply this to the bond0 and bond1.kni NICs and similar problems appear. On an IPv6 network there is the additional question of whether IPv6 router advertisements will install a default-gateway route on bond1.kni. Either way, it is easy to end up with routed traffic going out through bond0 or through bond1.kni unpredictably. Setting aside the performance gap between a physical NIC and a virtual NIC, the bigger problems are:
- By default the traffic of bond1.kni is hijacked by the DPVS process, so requests routed through bond1.kni behave abnormally;
- keepalived's health checks of the RS nodes go through the Linux network stack, so if the route towards an RS node points at bond1.kni, keepalived will wrongly conclude that the backend RS is unavailable and reduce its weight to 0;
- If this happens for every RS in the cluster, the VIP ends up with no available RS (all weights are 0), requests can no longer be forwarded to any RS, and the service becomes completely unavailable.
3.2.3 Solutions
Therefore the most convenient solution is to shut bond1.kni down outright, disabling it and only enabling it again when debugging requires it; this effectively avoids this class of problems.
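A minimal sketch of this, assuming the kni device is named bond1.kni as in the diagram:
ip addr flush dev bond1.kni     # drop its addresses so it cannot attract intranet routes
ip link set bond1.kni down      # keep it down during normal operation
ip link set bond1.kni up        # bring it back only when debugging requires it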
3.3 kni NIC IP is unreachable
3.3.1 Principle analysis
Because a kni NIC in the Linux network stack and its counterpart in the DPVS network stack correspond to the same physical NIC (or the same group of physical NICs), the traffic flowing through that NIC can only be handled by one of the two stacks. **By default, the traffic of every kni NIC is hijacked by the DPVS process.** That means the IP on bond2.kni not only cannot be pinged, it cannot be used for any other normal access either. However, DPVS can release the traffic of specific IPs to the Linux network stack (implemented through kni_host routes), which makes normal access to those IPs possible.
For example: a group of x520 NICs are bonded and show up as bond2.kni in the Linux network stack and as bond2 in the DPVS network stack, while the other NIC, bond0, is an ordinary bonded NIC that exists only in the Linux network stack and has nothing to do with DPVS. A few simple commands make the contrast clear:
With ethtool, bond0 reports its link speed and other details normally, while the kni NIC returns no valid information at all; meanwhile, the dpip command in the DPVS network stack can show very detailed hardware information for bond2.
With lspci -nnv we can also see that the bond0 NICs use the Linux NIC driver ixgbe, while the bond2 NICs use DPVS's PMD driver igb_uio.
[[email protected] dpvs]# ethtool bond0
Settings for bond0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 20000Mb/s
Duplex: Full
Port: Other
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Link detected: yes
[[email protected] dpvs]# ethtool bond2.kni
Settings for bond2.kni:
No data available
[[email protected] dpvs]# dpip -s -v link show bond2
3: bond2: socket 0 mtu 1500 rx-queue 16 tx-queue 16
UP 20000 Mbps full-duplex auto-nego
addr 00:1C:34:EE:46:E4
ipackets opackets ibytes obytes
15451492 31306 6110603685 4922260
ierrors oerrors imissed rx_nombuf
0 0 0 0
mbuf-avail mbuf-inuse
1012315 36260
pci_addr driver_name
net_bonding
if_index min_rx_bufsize max_rx_pktlen max_mac_addrs
0 0 15872 16
max_rx_queues max_tx_queues max_hash_addrs max_vfs
127 63 0 0
max_vmdq_pools rx_ol_capa tx_ol_capa reta_size
0 0x1AE9F 0x2A03F 128
hash_key_size flowtype_rss_ol vmdq_que_base vmdq_que_num
0 0x38D34 0 0
rx_desc_max rx_desc_min rx_desc_align vmdq_pool_base
4096 0 1 0
tx_desc_max tx_desc_min tx_desc_align speed_capa
4096 0 1 0
Queue Configuration:
rx0-tx0 cpu1-cpu1
rx1-tx1 cpu2-cpu2
rx2-tx2 cpu3-cpu3
rx3-tx3 cpu4-cpu4
rx4-tx4 cpu5-cpu5
rx5-tx5 cpu6-cpu6
rx6-tx6 cpu7-cpu7
rx7-tx7 cpu8-cpu8
rx8-tx8 cpu9-cpu9
rx9-tx9 cpu10-cpu10
rx10-tx10 cpu11-cpu11
rx11-tx11 cpu12-cpu12
rx12-tx12 cpu13-cpu13
rx13-tx13 cpu14-cpu14
rx14-tx14 cpu15-cpu15
rx15-tx15 cpu16-cpu16
HW mcast list:
link 33:33:00:00:00:01
link 33:33:00:00:00:02
link 01:80:c2:00:00:0e
link 01:80:c2:00:00:03
link 01:80:c2:00:00:00
link 01:00:5e:00:00:01
link 33:33:ff:bf:43:e4
[[email protected] dpvs]# lspci -nnv
...
01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
...
Kernel driver in use: ixgbe
Kernel modules: ixgbe
...
81:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
...
Kernel driver in use: igb_uio
Kernel modules: ixgbe
3.3.2 Solutions
If you want IP-level operations on bond2.kni to work normally, add a kni_host route for that IP. The specific operations are:
# <bond2.kni_ip> can be replaced by any IP you want to release
dpip route add <bond2.kni_ip>/32 scope kni_host dev bond2
dpip route del <bond2.kni_ip>/32 scope kni_host dev bond2
# the same applies on IPv6 networks
dpip route -6 add <bond2.kni_ip>/128 scope kni_host dev bond2
dpip route -6 del <bond2.kni_ip>/128 scope kni_host dev bond2
Note: release the IPs one at a time, one route per IP, and the mask must be 32 (or 128 for IPv6); releasing a whole batch of IPs at once severely hurts performance!
3.4 The VIP can be pinged but HTTP requests fail
For DPVS, the processing logic of a ping and of an HTTP request is completely different. The icmp and icmpv6 packets of a ping are handled by the DPVS network stack itself and never involve the backend RS nodes.
If the VIP can be pinged, the DPVS process itself is generally working normally; if HTTP requests fail, the state of the backend RS nodes is usually abnormal, for example communication between the LIP and the RS is broken so packets cannot reach the RS.
There is of course another possibility: communication between the LIP and the RS is fine, but communication between the health-check NIC and the RS is broken, so the keepalived process wrongly decides the RS node has a problem and reduces its weight to 0.
A common case of this is an IPv6 network where the DPVS node and the RS nodes communicate across network segments and the DPVS node is missing the IPv6 cross-segment route.
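In that case the missing route can be added on the DPVS side with something like the command below; the syntax mirrors the kni_host examples above, and the prefix, gateway and device are placeholders for your own RS segment, intranet gateway and LIP-carrying NIC:
# Illustrative: route the IPv6 RS segment via the intranet gateway on bond1
dpip route -6 add 2001:db8:229::/64 via 2001:db8:228::1 dev bond1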
4、keepalived related issues
Problems with keepalived fall into two areas: split brain and master/backup switchover.
4.1 Split brain
Generally speaking, the root cause of a keepalived split brain is that both machines believe they are the master. There are two main reasons: the network between them is broken, or the configuration files are wrong. Both causes and the corresponding troubleshooting ideas are covered in plenty of posts online and are not repeated here. What is relatively rare is a split brain caused by a switch bug, where two different vrrp_instance groups in the same vlan use the same virtual_router_id.
4.1.1 Split brain caused by a switch bug
When multicast communication is used, on some buggy switches different vrrp_instances with the same virtual_router_id can also end up in a split brain. Note that "the same virtual_router_id" here simply means the virtual_router_id parameter is identical: even if different passwords are configured under authentication, each side still receives the other's multicast packets, but normally that only produces the error "received an invalid passwd" and no split brain occurs (because these are different vrrp_instances, not different nodes of the same vrrp_instance).
Generally virtual_router_id ranges from 1 to 255; the variable was clearly designed on the assumption that the IPs inside a vlan would not exceed one /24, so virtual_router_id could simply reuse the last octet of the IP. In practice, with reasonable vlan division and proper planning this problem is rare, but if the data-center vlans are carved too large, or the network quality is poor and the switches are old, extra attention has to be paid to virtual_router_id conflicts.
Here is an excerpt of the relevant description from the keepalived documentation:
arbitrary unique number from 1 to 255 used to differentiate multiple instances of vrrpd running on the same network interface and address family (and hence same socket).
Note: using the same virtual_router_id with the same address family on different interfaces has been known to cause problems with some network switches; if you are experiencing problems with using the same virtual_router_id on different interfaces, but the problems are resolved by not duplicating virtual_router_ids, then your network switches are probably not functioning correctly.
4.1.2 Solutions
Multicast
By default keepalived communicates between the master and backup nodes via multicast; the principle of multicast is not repeated here. By default a multicast address is used for both IPv4 and IPv6. For common protocols such as BGP and VRRP, the RFCs define and reserve the corresponding multicast addresses in advance, and the multicast addresses keepalived uses follow that specification:
Multicast Group to use for IPv4 VRRP adverts. Defaults to the RFC5798 IANA assigned VRRP multicast address 224.0.0.18, which you typically do not want to change.
vrrp_mcast_group4 224.0.0.18
Multicast Group to use for IPv6 VRRP adverts (default: ff02::12)
vrrp_mcast_group6 ff02::12
If we cannot be sure whether another vrrp_instance on the same network is already using our virtual_router_id, we can try changing the multicast address to avoid conflicts.
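For instance, the multicast groups can be overridden in global_defs; vrrp_mcast_group4 and vrrp_mcast_group6 are standard keepalived options, and the addresses below are only examples, so pick ones that are unused in your own network:
global_defs {
    # move VRRP adverts off the default groups to avoid colliding with other
    # vrrp_instances that may reuse the same virtual_router_id
    vrrp_mcast_group4 224.0.0.100
    vrrp_mcast_group6 ff02::100
}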
Unicast
The other solution is to drop multicast and use unicast instead. Unicast not only communicates more reliably than multicast, it also makes virtual_router_id conflicts far less likely. Likewise, if poor multicast quality between the master and backup nodes of a keepalived cluster causes frequent failovers, besides improving the network between the nodes, you can also try switching the communication mode to unicast.
# unicast configuration for an IPv4 network
unicast_src_ip 192.168.229.1
unicast_peer {
192.168.229.2
}
# unicast configuration for an IPv6 network
unicast_src_ip 2000::1
unicast_peer {
2000::2
}
In the configuration above, unicast_src_ip is the local IP and unicast_peer holds the peer IPs; note that unicast_peer may contain more than one IP (corresponding to one-master-one-backup, one-master-multiple-backup, or multi-node preemption setups).
Unicast is more stable, but the amount of configuration grows considerably: the operators have to add the unicast settings to every vrrp_instance, and the contents differ between the master and backup nodes (usually unicast_src_ip and unicast_peer are swapped). Moreover, once the unicast configuration is wrong, a split brain is almost guaranteed, which raises the bar for configuration management, review and distribution.
4.2 Master/backup switchover
The most common problem during a keepalived master/backup switchover is that the IP has already moved to the other machine, but the MAC recorded for the VIP in the switch's MAC address table has not been updated yet. The usual remedies are to refresh the MAC record manually and quickly with arping (IPv4) or ping6 (IPv6), or to configure keepalived to keep refreshing the MAC records automatically.
4.2.1 Refreshing the MAC manually
For DPVS, when refreshing the MAC address of an IPv4 VIP, if the VIP is on the same network segment as the IP of the corresponding kni NIC, arping can be run directly against the kni NIC to refresh the MAC (the kni NIC and the DPVS NIC share the same MAC address). IPv6, however, has no ARP; refreshing the MAC record needs the ping6 command, and that does not work on DPVS's kni NIC. My suggestion is to write a small program in a network-friendly language such as Python or Go that sends unsolicited NA (gna) packets, and have keepalived execute it as a script when the node enters the MASTER state; that solves the problem of updating the VIP's MAC after an IPv6 failover.
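For the IPv4 side, the manual refresh can be as simple as the sketch below, assuming the iputils arping and using the NIC name of this article; the VIP is left as a placeholder:
# send 3 gratuitous ARP packets for the VIP out of the kni NIC, which shares the
# DPVS NIC's MAC address, so the switch updates its MAC address table
arping -c 3 -U -I bond2.kni <VIP>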
4.2.2 Letting keepalived refresh the MAC
The other solution is to configure keepalived to send the garp and gna packets itself. keepalived has a number of vrrp_garp* options that tune how the garp and gna packets are sent:
# interval between rounds of garp/gna packets: one round every 10 seconds
vrrp_garp_master_refresh 10
# send 3 garp/gna packets per round
vrrp_garp_master_refresh_repeat 3
# interval between individual garp packets: 0.001 seconds
vrrp_garp_interval 0.001
# interval between individual gna packets: 0.001 seconds
vrrp_gna_interval 0.001
But the keepalived configuration approach has its own problems:
- keepalived sends the garp/gna packets out of the NIC specified by the interface parameter, i.e. the NIC used for master/backup communication;
- for the garp/gna packets keepalived sends to be effective, they must leave through the DPVS NIC that carries the VIP, or through its corresponding kni NIC.
So if keepalived is to refresh the VIP's MAC address, the interface has to be changed to the bond2.kni NIC, i.e. the extranet-facing NIC in this dual-arm architecture; and if unicast communication is used, the corresponding kni_host routes also have to be added so that unicast still works.
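Put together, the changes might look like the illustrative fragment below (addresses are placeholders): point interface at bond2.kni, keep the VIP on the dpvs NIC via dpdk_interface, and, since the unicast addresses now sit on bond2.kni, release the local one to the Linux network stack with a kni_host route (see 3.3.2) so the peer's adverts can actually be received.
vrrp_instance VI_EXT {
    interface bond2.kni          # garp/gna now leave through the NIC carrying the VIP
    dpdk_interface bond2
    unicast_src_ip 10.0.96.1     # placeholder: this node's IP on bond2.kni
    unicast_peer {
        10.0.96.2                # placeholder: the peer node
    }
    ...
}
# on the DPVS side, release the local unicast IP so the VRRP packets reach keepalived
dpip route add 10.0.96.1/32 scope kni_host dev bond2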
4.2.3 Summary
Each of the two schemes above has its pros and cons; the choice has to weigh the intranet and extranet network quality, unicast versus multicast, network route configuration management, keepalived configuration file management, and other factors.
5、Maximum number of cluster connections
This part analyzes and compares the bottlenecks that limit the maximum number of TCP connections in the traditional LVS-DR mode and in DPVS-FNAT mode.
5.1 LVS-DR mode
First, look at the connection table in traditional LVS-DR mode:
[[email protected]]# ipvsadm -lnc | head | column -t
IPVS connection entries
pro expire state source virtual destination
TCP 00:16 FIN_WAIT 44.73.152.152:54300 10.0.96.104:80 192.168.229.111:80
TCP 00:34 FIN_WAIT 225.155.149.221:55182 10.0.96.104:80 192.168.229.117:80
TCP 00:22 ESTABLISHED 99.251.37.22:53601 10.0.96.104:80 192.168.229.116:80
TCP 01:05 FIN_WAIT 107.111.180.141:15997 10.0.96.104:80 192.168.229.117:80
TCP 00:46 FIN_WAIT 44.108.145.205:57801 10.0.96.104:80 192.168.229.116:80
TCP 12:01 ESTABLISHED 236.231.219.215:36811 10.0.96.104:80 192.168.229.111:80
TCP 01:36 FIN_WAIT 91.90.162.249:52287 10.0.96.104:80 192.168.229.116:80
TCP 01:41 FIN_WAIT 85.35.41.0:44148 10.0.96.104:80 192.168.229.112:80
As we will see below, the DPVS connection table is basically the same as the LVS one; DPVS just adds a CPU core number column and the LIP information, but the principles behind the two differ greatly.
The analysis below assumes that none of the other performance factors is the bottleneck.
First, for LVS-DR mode: we know the client connects directly to the RS, and LVS only forwards packets along the way without establishing any connection itself. So the connection count is determined by the five variables Protocol, CIP:Port and RIP:Port, commonly known as the five-tuple:
Protocol CIP:Port RIP:Port
Considering that Protocol is either TCP or UDP, it can be treated as a constant. That means that for LVS-DR, what really drives the TCP connection count is CIP:Port (RIP:Port is usually fixed). And since CIP:Port is in theory plentiful, the ceiling on the maximum number of TCP connections usually sits on the RS side: the number of RSes and their individual capacity determine the maximum number of TCP connections of the whole LVS-DR cluster.
5.2 DPVS-FNAT mode
Then use the ipvsadm -lnc command to look at the Client<-->DPVS<-->RS connections in FNAT mode:
[[email protected] dpvs]# ipvsadm -lnc | head
[1]tcp 90s TCP_EST 197.194.123.33:56058 10.0.96.216:443 192.168.228.1:41136 192.168.229.80:443
[1]tcp 7s TIME_WAIT 26.251.198.234:21164 10.0.96.216:80 192.168.228.1:44896 192.168.229.89:80
[1]tcp 7s TIME_WAIT 181.112.211.168:46863 10.0.96.216:80 192.168.228.1:62976 192.168.229.50:80
[1]tcp 90s TCP_EST 242.73.154.166:9611 10.0.96.216:443 192.168.228.1:29552 192.168.229.87:443
[1]tcp 3s TCP_CLOSE 173.137.182.178:53264 10.0.96.216:443 192.168.228.1:8512 192.168.229.87:443
[1]tcp 90s TCP_EST 14.53.6.35:23820 10.0.96.216:443 192.168.228.1:44000 192.168.229.50:443
[1]tcp 3s TCP_CLOSE 35.13.251.48:15348 10.0.96.216:443 192.168.228.1:16672 192.168.229.79:443
[1]tcp 90s TCP_EST 249.109.242.104:5566 10.0.96.216:443 192.168.228.1:10112 192.168.229.77:443
[1]tcp 3s TCP_CLOSE 20.145.41.157:6179 10.0.96.216:443 192.168.228.1:15136 192.168.229.86:443
[1]tcp 90s TCP_EST 123.34.92.153:15118 10.0.96.216:443 192.168.228.1:9232 192.168.229.87:443
[[email protected] dpvs]# ipvsadm -lnc | tail
[16]tcp 90s TCP_EST 89.99.59.41:65197 10.0.96.216:443 192.168.228.1:7023 192.168.229.50:443
[16]tcp 3s TCP_CLOSE 185.97.221.45:18862 10.0.96.216:443 192.168.228.1:48159 192.168.229.50:443
[16]tcp 90s TCP_EST 108.240.236.85:64013 10.0.96.216:443 192.168.228.1:49199 192.168.229.50:443
[16]tcp 90s TCP_EST 85.173.18.255:53586 10.0.96.216:443 192.168.228.1:63007 192.168.229.87:443
[16]tcp 90s TCP_EST 182.123.32.10:5912 10.0.96.216:443 192.168.228.1:19263 192.168.229.77:443
[16]tcp 90s TCP_EST 135.35.212.181:51666 10.0.96.216:443 192.168.228.1:22223 192.168.229.88:443
[16]tcp 90s TCP_EST 134.210.227.47:29393 10.0.96.216:443 192.168.228.1:26975 192.168.229.90:443
[16]tcp 7s TIME_WAIT 110.140.221.121:54046 10.0.96.216:443 192.168.228.1:5967 192.168.229.84:443
[16]tcp 3s TCP_CLOSE 123.129.23.120:18550 10.0.96.216:443 192.168.228.1:7567 192.168.229.83:443
[16]tcp 90s TCP_EST 72.250.60.207:33043 10.0.96.216:443 192.168.228.1:53279 192.168.229.86:443
Now let's go through the meaning of these fields one by one:
- [1]: the CPU core number, i.e. the cpu_id of the worker cpu configured in dpvs.conf; this field shows the load of each worker thread of the DPVS process.
- tcp: tcp or udp, the protocol of the connection; no explanation needed.
- 90s, 30s, 7s, 3s: the time associated with this connection.
- CLOSE_WAIT, FIN_WAIT, SYN_RECV, TCP_CLOSE, TCP_EST, TIME_WAIT: the state of the TCP connection.
- The last group of four IP+Port pairs is the Client<-->DPVS<-->RS correspondence: CIP:Port VIP:Port LIP:Port RIP:Port.
So for DPVS-FNAT mode, with the LIP added there are four IP+Port pairs; together with the cpu_id and the Protocol in front, that makes the ten variables that determine the connection count:
cpu_id Protocol CIP:Port VIP:Port LIP:Port RIP:Port
- Before the analysis, note that the four IP+Port pairs actually split into two four-tuples: CIP:Port VIP:Port is one four-tuple and LIP:Port RIP:Port is the other, and the two four-tuples map to each other one to one.
- First, eliminate Protocol, VIP:Port and RIP:Port: these five variables are basically fixed and can be treated as constants.
- Next, CIP:Port is in theory plentiful and does not limit the cluster's TCP connection count.
- Then there is cpu_id. Although a machine can have at most 16 worker cpus, that does not mean the maximum number of connections of the ten-tuple equals the maximum of the nine-tuple without cpu_id times 16: DPVS distributes cpu_id according to the LIP's port number so that load is spread evenly across the CPUs, so cpu_id and the LIP port number are also in a one-to-one relationship.
- Finally, LIP:Port. One IP can use at most 65536 ports, and RIP:Port is a constant, so the maximum number of TCP connections of that four-tuple <= number of LIPs * 65536 * number of RIPs.
And because the two four-tuples map one to one, and cpu_id and the LIP port number also map one to one, for DPVS-FNAT mode the number of LIPs is usually what limits the maximum number of connections of the whole cluster. If the cluster carries a large number of connections, it is recommended to allocate a sufficient number of IPs as LIPs.
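For a rough sense of scale (illustrative numbers only, ignoring every other bottleneck): with 4 LIPs and RSes reachable on 10 distinct RIP:Port combinations, the bound above gives at most 4 * 65536 * 10 ≈ 2.6 million concurrent connections through that four-tuple; adding LIPs raises the ceiling linearly.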
As an aside: combining the official documentation with actual testing, when x520/82599 and x710 NICs use the igb_uio PMD, fdir does not support perfect mode on IPv6 networks and signature mode is recommended instead; note, however, that in signature mode only one LIP can be used, which caps the maximum number of connections of the cluster. The relevant excerpt from the official documentation:
We found there exists some NICs do not (fully) support Flow Control of IPv6 required by IPv6. For example, the rte_flow of 82599 10GE Controller (ixgbe PMD) relies on an old fashion flow type flow director (fdir), which doesn't support IPv6 in its perfect mode, and support only one local IPv4 or IPv6 in its signature mode. DPVS supports the fdir mode config for compatibility.
6、Closing remarks
DPVS really is excellent in both performance and functionality, and it is equally true that you will step into many pits in the early stage of bringing it to production. Read the documentation, look things up and read the source code; once it is really in use it will bring plenty of surprises and rewards. Finally, if you just want to build a small cluster to try it out, an ordinary IPv4 network and common x520 NICs are enough; those with the resources can of course try an ECMP architecture and better NICs (such as Mellanox).