
DPVS FullNAT mode with keepalived

2022-06-26 00:40:00 tinychen777

This article mainly covers the problems encountered, and the ideas used to handle them, when bringing a DPVS FullNAT-mode high-availability cluster with keepalived in active/standby configuration into production on CentOS 7.9.

All IP addresses, hostnames and MAC addresses in this article have been desensitized or modified, and the client IPs were generated with a simulator, but this does not affect readability.

1、keepalived architecture

1.1 Single-node architecture diagram

[Figure: single-node architecture diagram — https://resource.tinychen.com/202111161128671.svg]

To make it easier to understand, the architecture diagram above can be divided into four parts: the DPVS network stack, the Linux network stack, the RS cluster, and the consumers (SA and users). Physical NICs in the Linux network stack are labeled eth, physical NICs in the DPVS network stack are labeled dpdk, the NICs that DPVS virtualizes into the Linux network stack carry the kni suffix, and the bonded NICs in both network stacks are labeled BOND.

By default, the traffic of all kni NICs is hijacked by the DPVS process.

1.2 Purpose of each NIC

In the keepalived dual-arm architecture, each DPVS machine needs at least three groups of NICs; whether or not bonding is used does not affect the architecture diagram. The diagram above uses a bonding mode 4 NIC layout, so the NIC names are bond0, bond1 and bond2. As long as you understand the role of each group of NICs, the architecture in the diagram is easy to follow.

  • bond0: mainly used by operations staff to manage the machine, and by the keepalived program to health-check the backend RS nodes

    This NIC exists only in the Linux network stack. Because the NICs in the DPVS network stack (including the kni NICs virtualized from them) only exist while the DPVS process is running, there has to be a NIC independent of the DPVS process for managing the machine (monitoring and alerting, ssh logins, and so on).

    When keepalived health-checks the backend RS nodes it can only use the Linux network stack, so in the architecture above it happens to be bond0 that carries the health checks. If there are several intranet NICs in the Linux network stack, which one is used is decided by the Linux routing table (the same routing decision applies even when there is only a single NIC).

  • bond1.kni: has no role while the architecture above is running normally

    bond1.kni is the virtual NIC that the bond1 NIC in DPVS exposes in the Linux network stack. Its role almost completely overlaps with bond0, so it is best to shut it down to avoid interfering with bond0.

    The reason it is not removed entirely is that when the DPVS process misbehaves, or when we need to capture packets on bond1, we can forward bond1's traffic to bond1.kni and work on it there.

  • bond2.kni: mainly used to refresh the MAC address of the VIP

    A kni NIC has the same MAC address as the corresponding bond NIC in DPVS. Because we cannot run ping, arping and similar operations against the DPVS NICs directly, when we need to send GARP or GNA packets to refresh the MAC address the switch has recorded for an IPv4 or IPv6 VIP, we can do it through the kni NIC that corresponds to the DPVS NIC.

  • bond1: service-traffic NIC, mainly used to carry the LIPs, establish connections to the RS nodes and forward requests

    The LIPs configured in the local_address_group field are generally placed on the bond1 NIC.

  • bond2: service-traffic NIC, mainly used to carry the VIPs, establish connections with clients and forward requests

    The dpdk_interface field is specific to the DPVS-patched keepalived; it allows the VIP to be configured onto the DPVS NIC.

Note: communication between the keepalived master and backup nodes must use a NIC in the Linux network stack; in this architecture that can be either bond0 or the bond2.kni NIC.
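Putting the fields from this section together, a minimal configuration sketch might look like the following. This is only a sketch loosely modelled on the DPVS keepalived tutorial; the VIP/LIP values are reused from the connection tables later in this article, everything else (instance name, router id, ports) is a placeholder to adapt to your own environment and keepalived version.

local_address_group laddr_g1 {
    192.168.228.1 bond1            # LIPs sit on the bond1 DPVS NIC
}

vrrp_instance VI_EXT {
    state MASTER
    interface bond2.kni            # Linux-stack NIC for VRRP adverts (bond0 also works)
    dpdk_interface bond2           # DPVS-specific field: the VIP lands on the DPVS NIC
    virtual_router_id 100
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.0.96.216
    }
}

virtual_server 10.0.96.216 443 {
    # ... scheduler, real_server blocks, health checks ...
    laddr_group_name laddr_g1      # bind the LIP group above to this service
}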

2、dpdk NICs

2.1 Principle analysis

The NICs in DPVS are named in the order of their PCIe addresses. With the dpdk-devbind tool we can see each NIC's PCIe address and its corresponding name in the Linux network stack.

If the Linux NICs are named eth* and the MAC-address-to-NIC-name mapping has been pinned in /etc/udev/rules.d/70-persistent-net.rules, then we need to pay special attention to the four-way correspondence between PCIe address, DPVS NIC name, MAC address and Linux NIC name.

Especially when the machine has NICs on several network segments and uses bonding, the eth* NICs in Linux and the dpdk* NICs in DPVS do not necessarily line up one-to-one. In that case it is best to adjust the configuration and have the data-center staff re-arrange the NIC cabling (of course, you can also simply change the NIC order in the dpvs configuration file).

2.2 Solution

The case below is a layout that is unlikely to cause problems, given for reference only: the eth NICs are named in ascending order of their PCIe addresses, the dpdk NICs follow the same naming rule, and eth[0-3] are intranet NICs while eth[4-5] are extranet NICs, matching the architecture diagram of this article, so it is hard to get things mixed up.

[[email protected] dpvs]# dpdk-devbind --status-dev net

Network devices using DPDK-compatible driver
============================================
0000:04:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=igb_uio unused=ixgbe
0000:04:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=igb_uio unused=ixgbe
0000:82:00.0 'Ethernet 10G 2P X520 Adapter 154d' drv=igb_uio unused=ixgbe
0000:82:00.1 'Ethernet 10G 2P X520 Adapter 154d' drv=igb_uio unused=ixgbe

Network devices using kernel driver
===================================
0000:01:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=eth0 drv=ixgbe unused=igb_uio
0000:01:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=eth1 drv=ixgbe unused=igb_uio


[[email protected] dpvs]# cat /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
    address}=="28:6e:45:c4:0e:48", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
    address}=="28:6e:45:c4:0e:4a", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
    address}=="38:e2:ba:1c:dd:74", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
    address}=="38:e2:ba:1c:dd:76", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
    address}=="b4:45:99:18:6c:5c", NAME="eth4"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{
    address}=="b4:45:99:18:6c:5e", NAME="eth5"

[[email protected] dpvs]# dpip link -v show | grep -A4 dpdk
1: dpdk0: socket 0 mtu 1500 rx-queue 16 tx-queue 16
    UP 10000 Mbps full-duplex auto-nego promisc
    addr 38:E2:BA:1C:DD:74 OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
    pci_addr                        driver_name
    0000:04:00:0                    net_ixgbe
--
2: dpdk1: socket 0 mtu 1500 rx-queue 16 tx-queue 16
    UP 10000 Mbps full-duplex auto-nego promisc
    addr 38:E2:BA:1C:DD:76 OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
    pci_addr                        driver_name
    0000:04:00:1                    net_ixgbe
--
3: dpdk2: socket 0 mtu 1500 rx-queue 16 tx-queue 16
    UP 10000 Mbps full-duplex auto-nego promisc
    addr B4:45:99:18:6C:5C OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
    pci_addr                        driver_name
    0000:82:00:0                    net_ixgbe
--
4: dpdk3: socket 0 mtu 1500 rx-queue 16 tx-queue 16
    UP 10000 Mbps full-duplex auto-nego promisc
    addr B4:45:99:18:6C:5E OF_RX_IP_CSUM OF_TX_IP_CSUM OF_TX_TCP_CSUM OF_TX_UDP_CSUM
    pci_addr                        driver_name
    0000:82:00:1                    net_ixgbe

2.3 Packet capture and troubleshooting

Normally the DPVS network stack hijacks all of the traffic of a DPVS NIC into the DPVS network stack, so running tcpdump against the corresponding kni NIC captures nothing. The more convenient workaround is to use the dpip command to forward the dpvs NIC's traffic to the corresponding kni NIC and then capture on the kni NIC.

dpip link set <port> forward2kni on      # enable forward2kni on <port>
dpip link set <port> forward2kni off     # disable forward2kni on <port>

For a dpvs node, <port> in the commands is generally the bond1 or bond2 NIC as shown by the dpip command.

You can also refer to the official documentation:

https://github.com/iqiyi/dpvs/blob/master/doc/tutorial.md#packet-capture-and-tcpdump

Note: the forward2kni operation hurts performance badly; do not run it on a node that is serving production traffic!

3、kni NICs

This section picks up the kni NIC functions introduced above and covers some related problems and their solutions.

3.1 What the kni NICs are for

Generally, every DPVS NIC gets a corresponding virtual kni NIC in the Linux network stack. **Keep in mind that, by default, the traffic of all kni NICs is hijacked by the DPVS process.** In the architecture of this article, the main role of the kni NICs is to help locate faults and to handle a small amount of auxiliary work.

  • When a dpvs NIC has a problem, its traffic can be forwarded to the kni NIC for debugging; when a VIP has a problem, the kni NIC can be used to refresh the VIP's MAC address
  • The kni NIC is itself an ordinary virtual NIC; it is just that all of its traffic is hijacked by DPVS. Routes can be configured in DPVS to release specific traffic to the kni NIC for auxiliary work; for example, when a DPVS node occasionally needs Internet access, you can release the extranet IP through bond2.kni and then reach the Internet

3.2 kni NIC routing interference

3.2.1 Reproducing the problem

In the architecture diagram, bond1.kni and bond0 are both positioned as intranet NICs. If the two NICs are on the same network segment, you need to pay special attention to which NIC the intranet traffic actually enters and leaves through. Here we use a virtual machine as an example:

[[email protected] ~]# ip a
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:66:3b:08 brd ff:ff:ff:ff:ff:ff
    inet 10.31.100.2/16 brd 10.31.255.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:47:37:3e brd ff:ff:ff:ff:ff:ff
    inet 10.31.100.22/16 brd 10.31.255.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
...
[[email protected] ~]# ip r
...
10.31.0.0/16 dev eth0 proto kernel scope link src 10.31.100.2 metric 100
10.31.0.0/16 dev eth1 proto kernel scope link src 10.31.100.22 metric 101
...

The virtual machine above has two NICs on the 10.31.0.0/16 segment, eth0 (10.31.100.2) and eth1 (10.31.100.22). Looking at the routing table, the 10.31.0.0/16 segment has two routes, pointing at the IPs of eth0 and eth1 respectively and differing only in their metric. Now let's run a test:

First, from the machine 10.31.100.1 we ping this virtual machine's eth1 (10.31.100.22), then capture packets directly with tcpdump.

# capturing on eth1 (10.31.100.22) does not catch the corresponding icmp packets
[[email protected] ~]# tcpdump -A -n -vv -i eth1 icmp
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

# capturing on eth0 (10.31.100.2), however, catches the icmp packets sent by the external machine (10.31.100.1) to the eth1 (10.31.100.22) NIC
[[email protected] ~]# tcpdump -A -n -vv -i eth0 icmp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:29:44.789831 IP (tos 0x0, ttl 64, id 1197, offset 0, flags [DF], proto ICMP (1), length 84)
    10.31.100.1 > 10.31.100.22: ICMP echo request, id 16846, seq 54, length 64
16:29:44.789898 IP (tos 0x0, ttl 64, id 16187, offset 0, flags [none], proto ICMP (1), length 84)
    10.31.100.22 > 10.31.100.1: ICMP echo reply, id 16846, seq 54, length 64
16:29:45.813740 IP (tos 0x0, ttl 64, id 1891, offset 0, flags [DF], proto ICMP (1), length 84)
    10.31.100.1 > 10.31.100.22: ICMP echo request, id 16846, seq 55, length 64

3.2.2 Principle analysis

Here we can see that even though 10.31.100.22 sits on eth1, the traffic actually flows through eth0; in other words, there is no traffic on eth1 at all. This matches the routing table, where the metric 100 of the 10.31.100.2 route is lower than the metric 101 of the 10.31.100.22 route, following the rule that the smaller the metric, the higher the priority.
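A quick way to confirm which NIC the kernel will actually pick for a given peer is ip route get; with the addresses from the example above, something like the following is expected:

ip route get 10.31.100.1
# expected to show roughly: 10.31.100.1 dev eth0 src 10.31.100.2 ...  (eth0 wins because of the lower metric)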

Apply this to the bond0 and bond1.kni NICs and similar problems appear. On an IPv6 network you also have to consider whether bond1.kni will pick up a default gateway route from IPv6 router advertisements. Either way, it is easy for routed traffic to end up on bond0 when you expect bond1.kni, or the other way round. Setting aside the performance gap between a physical NIC and a virtual NIC, the more serious consequences are:

  • by default the bond1.kni NIC's traffic is hijacked by the DPVS process, so requests routed through bond1.kni misbehave;
  • RS health checks happen to go through the Linux network stack, so if at that moment the route to an RS node points at bond1.kni, keepalived will wrongly conclude that the backend RS node is unavailable and reduce its weight to 0;
  • if this happens to every RS in the cluster, the VIP is left with no usable RS (all weights are 0), and the end result is that requests cannot be forwarded to any RS, making the service completely unavailable.

3.2.3 Solutions

Therefore the most convenient fix is to shut bond1.kni down and keep it disabled, enabling it only when debugging really requires it; this effectively avoids this class of problem.
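Concretely, assuming the interface names used in this article, that just means keeping the interface administratively down and bringing it up only while debugging:

ip link set bond1.kni down    # keep it disabled in normal operation
ip link set bond1.kni up      # only while debugging / capturing packets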

3.3 kni NIC IP unreachable

3.3.1 Principle analysis

Because a kni NIC in the Linux network stack and the corresponding NIC in the DPVS network stack are in fact the same physical NIC (or the same group of physical NICs), the traffic flowing through that NIC can only be handled by one of the two stacks. **By default, the traffic of all kni NICs is hijacked by the DPVS process.** That means the IP on the bond2.kni NIC not only cannot be pinged, it cannot be reached by any other normal means either. However, DPVS can release the traffic of specific IPs to the Linux network stack (implemented via kni_host routes), which makes normal access to those IPs possible.

For example: a group of x520 NICs is bonded and shows up as bond2.kni in the Linux network stack and as bond2 in the DPVS network stack, while bond0 is just a NIC bonded inside the Linux network stack and has nothing to do with DPVS. A few simple commands make the contrast clear:

With ethtool, the bond0 NIC reports its link speed and other information normally, while the kni NIC cannot report any valid information at all; at the same time, the dpip command in the DPVS network stack shows very detailed physical hardware information for the bond2 NIC.

With lspci -nnv we can also see that the bond0 NIC's members use the Linux ixgbe driver, while the bond2 NIC's members use DPVS's PMD driver igb_uio.

[[email protected] dpvs]# ethtool bond0
Settings for bond0:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 20000Mb/s
        Duplex: Full
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Link detected: yes
[[email protected] dpvs]# ethtool bond2.kni
Settings for bond2.kni:
No data available

[[email protected] dpvs]# dpip -s -v link show bond2
3: bond2: socket 0 mtu 1500 rx-queue 16 tx-queue 16
    UP 20000 Mbps full-duplex auto-nego
    addr 00:1C:34:EE:46:E4
    ipackets            opackets            ibytes              obytes
    15451492            31306               6110603685          4922260
    ierrors             oerrors             imissed             rx_nombuf
    0                   0                   0                   0
    mbuf-avail          mbuf-inuse
    1012315             36260
    pci_addr                        driver_name
                                    net_bonding
    if_index        min_rx_bufsize  max_rx_pktlen   max_mac_addrs
    0               0               15872           16
    max_rx_queues   max_tx_queues   max_hash_addrs  max_vfs
    127             63              0               0
    max_vmdq_pools  rx_ol_capa      tx_ol_capa      reta_size
    0               0x1AE9F         0x2A03F         128
    hash_key_size   flowtype_rss_ol vmdq_que_base   vmdq_que_num
    0               0x38D34         0               0
    rx_desc_max     rx_desc_min     rx_desc_align   vmdq_pool_base
    4096            0               1               0
    tx_desc_max     tx_desc_min     tx_desc_align   speed_capa
    4096            0               1               0
    Queue Configuration:
    rx0-tx0     cpu1-cpu1
    rx1-tx1     cpu2-cpu2
    rx2-tx2     cpu3-cpu3
    rx3-tx3     cpu4-cpu4
    rx4-tx4     cpu5-cpu5
    rx5-tx5     cpu6-cpu6
    rx6-tx6     cpu7-cpu7
    rx7-tx7     cpu8-cpu8
    rx8-tx8     cpu9-cpu9
    rx9-tx9     cpu10-cpu10
    rx10-tx10   cpu11-cpu11
    rx11-tx11   cpu12-cpu12
    rx12-tx12   cpu13-cpu13
    rx13-tx13   cpu14-cpu14
    rx14-tx14   cpu15-cpu15
    rx15-tx15   cpu16-cpu16
    HW mcast list:
        link 33:33:00:00:00:01
        link 33:33:00:00:00:02
        link 01:80:c2:00:00:0e
        link 01:80:c2:00:00:03
        link 01:80:c2:00:00:00
        link 01:00:5e:00:00:01
        link 33:33:ff:bf:43:e4
        
        
        
[[email protected] dpvs]# lspci -nnv
...

01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
...
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe

...

81:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
...
        Kernel driver in use: igb_uio
        Kernel modules: ixgbe

3.3.2 Solutions

If you want IP-level operations on the bond2.kni NIC to behave normally, you can add a kni_host route for that IP, as follows:

# <bond2.kni_ip> can be any single IP on the NIC
dpip route add <bond2.kni_ip>/32 scope kni_host dev bond2
dpip route del <bond2.kni_ip>/32 scope kni_host dev bond2
# the same applies on an IPv6 network
dpip route -6 add <bond2.kni_ip>/128 scope kni_host dev bond2
dpip route -6 del <bond2.kni_ip>/128 scope kni_host dev bond2

Note: release IPs one at a time, and the mask must be /32 (or /128 for IPv6); releasing a whole batch of IPs at once severely hurts performance!

3.4 VIP pings fine but HTTP requests fail

For DPVS, the processing logic for ping and for HTTP requests is completely different. The icmp and icmpv6 packets of a ping are answered by the DPVS network stack itself and never involve the backend RS nodes.

So if ping works, the DPVS process is generally fine; if HTTP requests fail, it usually means the backend RS state is abnormal. It may be that communication between the LIP and the RS is broken, so packets cannot get through.

Of course there is another possibility: communication between the LIP and the RS is normal, but communication between the health-check NIC and the RS is broken, so the keepalived process wrongly concludes that the RS node has a problem and reduces its weight to 0.

A common case of this is when, on an IPv6 network, the DPVS node and the RS nodes communicate across network segments but the DPVS node is missing the IPv6 cross-segment route.
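As an illustration of the fix (the subnets and gateways below are placeholders, and the exact dpip route -6 syntax should be checked against your DPVS version), both stacks may need the cross-segment route: the DPVS stack for the LIP-to-RS path, and the Linux stack for the keepalived health-check path:

# DPVS network stack: LIP <-> RS forwarding path
dpip route -6 add <rs_subnet>/64 via <bond1_gateway> dev bond1
# Linux network stack: keepalived health-check path
ip -6 route add <rs_subnet>/64 via <bond0_gateway> dev bond0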

4、keepalived

Problems with keepalived fall into two areas: split brain and master/backup failover.

4.1 Split brain

Generally speaking, the root cause of a keepalived split brain is that both machines believe they are the master, and there are two main reasons for that: the network is unreachable, or the configuration files are wrong. Those two causes and how to troubleshoot them are covered in plenty of posts online and are not repeated here. What is comparatively rare is a split brain caused by a switch BUG, which makes two different vrrp_instance groups in the same vlan that use the same virtual_router_id interfere with each other.

4.1.1 Split brain caused by a switch bug

When multicast communication is used, on some switches that have this BUG, different vrrp_instances can also split-brain if their virtual_router_id is the same. Note that "the same virtual_router_id" here simply means that the virtual_router_id parameter is identical: even with different passwords configured in the authentication block, each side will still receive the other's multicast packets, but normally that only produces the error "received an invalid passwd" and no split brain occurs (because these are different vrrp_instances, not different nodes of the same vrrp_instance).

Generally, virtual_router_id ranges from 1 to 255. This variable was clearly designed on the assumption that the IPs inside a vlan would not exceed one /24, so that virtual_router_id could simply reuse the last octet of the IP. In practice, if the vlans are divided sensibly and planned properly it is hard to run into this problem; but if the data-center vlans are cut too coarsely, or the network quality is poor and the switches are old, you need to pay extra attention to virtual_router_id conflicts.

Here is an excerpt from the keepalived documentation:

arbitrary unique number from 1 to 255 used to differentiate multiple instances of vrrpd running on the same network interface and address family (and hence same socket).

Note: using the same virtual_router_id with the same address family on different interfaces has been known to cause problems with some network switches; if you are experiencing problems with using the same virtual_router_id on different interfaces, but the problems are resolved by not duplicating virtual_router_ids, then your network switches are probably not functioning correctly.

4.1.2 Solutions

Multicast

By default keepalived's master and backup nodes communicate via multicast; the principles of multicast are not repeated here. By default a multicast address is used for both IPv4 and IPv6. For common protocols such as BGP and VRRP, the RFCs define and reserve the corresponding multicast addresses in advance for their use, and the multicast addresses keepalived uses follow that specification, as shown below:

# Multicast Group to use for IPv4 VRRP adverts
# Defaults to the RFC5798 IANA-assigned VRRP multicast address 224.0.0.18,
# which you typically do not want to change.
vrrp_mcast_group4 224.0.0.18

# Multicast Group to use for IPv6 VRRP adverts (default: ff02::12)
vrrp_mcast_group6 ff02::12

If we cannot confirm whether some other vrrp_instance on the network is using the same virtual_router_id, we can try changing the multicast address to avoid conflicts.
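For example, in global_defs (the alternative addresses below are arbitrary illustrations; make sure whatever you pick does not collide with another protocol on your network):

global_defs {
    vrrp_mcast_group4 224.0.0.100    # instead of the default 224.0.0.18
    vrrp_mcast_group6 ff02::112      # instead of the default ff02::12
}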

Unicast

The other solution is to drop multicast altogether and use unicast instead. Unicast is not only better than multicast in communication quality, it also makes virtual_router_id conflicts almost impossible to hit. Likewise, if poor multicast quality between the master and backup nodes of a keepalived cluster causes frequent failovers, besides improving the network quality between the nodes you can also try switching the communication mode to unicast.

    # unicast configuration on an IPv4 network
    unicast_src_ip 192.168.229.1
    unicast_peer {
        192.168.229.2
    }

    # unicast configuration on an IPv6 network
    unicast_src_ip 2000::1
    unicast_peer {
        2000::2
    }

In the configuration above, unicast_src_ip is the local IP and unicast_peer is the peer IP; note that unicast_peer can contain more than one IP (corresponding to one-master-one-backup, one-master-multiple-backup, or multi-backup preemption setups).

Unicast is more stable, but the amount of configuration also grows substantially: operations staff need to add the corresponding unicast configuration to every vrrp_instance, and the configuration differs between the master and backup nodes (usually unicast_src_ip and unicast_peer are swapped). What's more, once the unicast-related configuration is wrong, a split brain is almost guaranteed, which raises the bar for configuration management, review and distribution.

4.2 Master/backup failover

The main problem that tends to occur during a keepalived failover is that the IP has already moved to the other machine, but the MAC address that the switch's MAC address table records for the VIP has not been updated yet. The common fixes are to refresh the MAC record quickly by hand with arping (IPv4) or ping6 (IPv6), or to configure keepalived to keep refreshing the MAC record automatically.

4.2.1 Refreshing the MAC manually

For DPVS, when refreshing the MAC address of an IPv4 VIP, if the VIP is on the same network segment as the corresponding kni NIC's IP, you can simply run arping against the kni NIC to refresh the MAC address (the kni NIC has the same MAC address as the DPVS NIC). IPv6, however, has no ARP: refreshing the MAC record requires the ping6 command, and that does not work on DPVS's kni NICs. I suggest writing a simple program in a language with good networking support, such as Python or Go, to send GNA packets, and then configuring a script in keepalived that runs when the node enters the MASTER state to refresh the MAC address. That solves the problem of updating the MAC address after an IPv6 VIP failover.
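For illustration, here is a minimal Python sketch using scapy that sends unsolicited neighbor advertisements (GNA) out of the kni NIC; the interface name, VIP and MAC below are placeholders following this article's layout, not the author's actual tool:

#!/usr/bin/env python3
# Minimal sketch: send gratuitous/unsolicited Neighbor Advertisements (GNA)
# for an IPv6 VIP out of the kni NIC after a failover.
from scapy.all import Ether, IPv6, ICMPv6ND_NA, ICMPv6NDOptDstLLAddr, sendp

IFACE = "bond2.kni"          # kni NIC, which shares the DPVS NIC's MAC
VIP6  = "2001:db8::100"      # the IPv6 VIP that has just moved to this node
MAC   = "00:1c:34:ee:46:e4"  # MAC of bond2 / bond2.kni

# Unsolicited NA to the all-nodes multicast address so that switches and
# neighbors relearn which port/MAC now owns the VIP (S=0, O=1).
pkt = (Ether(src=MAC, dst="33:33:00:00:00:01")
       / IPv6(src=VIP6, dst="ff02::1")
       / ICMPv6ND_NA(tgt=VIP6, R=1, S=0, O=1)
       / ICMPv6NDOptDstLLAddr(lladdr=MAC))

# Send a small burst, similar in spirit to keepalived's repeat settings.
sendp(pkt, iface=IFACE, count=3, inter=0.001, verbose=False)

A script like this can be wired into keepalived's notify_master hook so it runs as soon as the node becomes MASTER.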

4.2.2 Letting keepalived refresh the MAC

The other solution is to configure keepalived to send the garp and gna packets itself. keepalived has a family of vrrp_garp* options that tune how garp and gna packets are sent:

# interval between rounds of garp/gna packets; here a round is sent every 10 seconds
vrrp_garp_master_refresh 10
# send three garp/gna packets per round
vrrp_garp_master_refresh_repeat 3
# interval between individual garp packets: 0.001 seconds
vrrp_garp_interval 0.001
# interval between individual gna packets: 0.001 seconds
vrrp_gna_interval 0.001

But relying on the keepalived configuration has another catch:

  • keepalived sends garp/gna packets out of the NIC specified by the interface parameter, i.e. the NIC used for master/backup communication
  • for keepalived's garp/gna packets to take effect, they must be sent out of the DPVS NIC carrying the VIP, or its corresponding kni NIC

So if you want keepalived to refresh the VIP's MAC address, interface needs to be changed to the bond2.kni NIC, i.e. the extranet-facing NIC of this dual-arm architecture; and if unicast communication is used, you also need to add kni_host routes for the corresponding node IPs so that unicast can get through.

4.2.3 Summary

Both schemes have their pros and cons; the choice needs to weigh the intranet and extranet network quality, unicast versus multicast, network route configuration management, keepalived configuration management and other factors.

5、Maximum number of cluster connections

This part analyzes and compares the bottlenecks that cap the maximum number of TCP connections in the traditional LVS-DR mode and in the DPVS-FNAT mode.

5.1 LVS-DR mode

First, let's look at the connection table in traditional LVS-DR mode:

[[email protected]]# ipvsadm -lnc | head | column -t
IPVS  connection  entries
pro   expire      state        source                 virtual            destination
TCP   00:16       FIN_WAIT     44.73.152.152:54300    10.0.96.104:80  192.168.229.111:80
TCP   00:34       FIN_WAIT     225.155.149.221:55182  10.0.96.104:80  192.168.229.117:80
TCP   00:22       ESTABLISHED  99.251.37.22:53601     10.0.96.104:80  192.168.229.116:80
TCP   01:05       FIN_WAIT     107.111.180.141:15997  10.0.96.104:80  192.168.229.117:80
TCP   00:46       FIN_WAIT     44.108.145.205:57801   10.0.96.104:80  192.168.229.116:80
TCP   12:01       ESTABLISHED  236.231.219.215:36811  10.0.96.104:80  192.168.229.111:80
TCP   01:36       FIN_WAIT     91.90.162.249:52287    10.0.96.104:80  192.168.229.116:80
TCP   01:41       FIN_WAIT     85.35.41.0:44148       10.0.96.104:80  192.168.229.112:80

Comparing the two tables (the DPVS one is shown in the next section), the DPVS connection table is basically the same as the LVS one, with an extra column for the CPU core number plus the LIP information; the underlying principles, however, differ greatly.

The analysis below assumes that no other performance factor is the bottleneck.

First, for LVS-DR mode, we know the Client connects directly to the RS; LVS only forwards packets in this process and plays no part in establishing the connection. So a connection is identified by five variables, Protocol plus CIP:Port plus RIP:Port, commonly known as the five-tuple:

Protocol CIP:Port RIP:Port

Considering that Protocol is either TCP or UDP, we can treat it as a constant. In other words, for LVS-DR what really drives the number of TCP connections is CIP:Port (RIP:Port is usually fixed); and since there can in theory be plenty of CIP:Port combinations, the ceiling on the number of TCP connections usually sits on the RS side, i.e. the number of RSes and their capacity determine the maximum number of TCP connections of the whole LVS-DR cluster.

5.2 DPVS-FNAT mode

Now let's use the ipvsadm -lnc command to look at the Client<-->DPVS<-->RS connections in FNAT mode:

[[email protected] dpvs]# ipvsadm -lnc | head
[1]tcp  90s  TCP_EST    197.194.123.33:56058   10.0.96.216:443  192.168.228.1:41136  192.168.229.80:443
[1]tcp  7s   TIME_WAIT  26.251.198.234:21164   10.0.96.216:80   192.168.228.1:44896  192.168.229.89:80
[1]tcp  7s   TIME_WAIT  181.112.211.168:46863  10.0.96.216:80   192.168.228.1:62976  192.168.229.50:80
[1]tcp  90s  TCP_EST    242.73.154.166:9611    10.0.96.216:443  192.168.228.1:29552  192.168.229.87:443
[1]tcp  3s   TCP_CLOSE  173.137.182.178:53264  10.0.96.216:443  192.168.228.1:8512   192.168.229.87:443
[1]tcp  90s  TCP_EST    14.53.6.35:23820       10.0.96.216:443  192.168.228.1:44000  192.168.229.50:443
[1]tcp  3s   TCP_CLOSE  35.13.251.48:15348     10.0.96.216:443  192.168.228.1:16672  192.168.229.79:443
[1]tcp  90s  TCP_EST    249.109.242.104:5566   10.0.96.216:443  192.168.228.1:10112  192.168.229.77:443
[1]tcp  3s   TCP_CLOSE  20.145.41.157:6179     10.0.96.216:443  192.168.228.1:15136  192.168.229.86:443
[1]tcp  90s  TCP_EST    123.34.92.153:15118    10.0.96.216:443  192.168.228.1:9232   192.168.229.87:443
[[email protected] dpvs]# ipvsadm -lnc | tail
[16]tcp  90s  TCP_EST    89.99.59.41:65197      10.0.96.216:443  192.168.228.1:7023   192.168.229.50:443
[16]tcp  3s   TCP_CLOSE  185.97.221.45:18862    10.0.96.216:443  192.168.228.1:48159  192.168.229.50:443
[16]tcp  90s  TCP_EST    108.240.236.85:64013   10.0.96.216:443  192.168.228.1:49199  192.168.229.50:443
[16]tcp  90s  TCP_EST    85.173.18.255:53586    10.0.96.216:443  192.168.228.1:63007  192.168.229.87:443
[16]tcp  90s  TCP_EST    182.123.32.10:5912     10.0.96.216:443  192.168.228.1:19263  192.168.229.77:443
[16]tcp  90s  TCP_EST    135.35.212.181:51666   10.0.96.216:443  192.168.228.1:22223  192.168.229.88:443
[16]tcp  90s  TCP_EST    134.210.227.47:29393   10.0.96.216:443  192.168.228.1:26975  192.168.229.90:443
[16]tcp  7s   TIME_WAIT  110.140.221.121:54046  10.0.96.216:443  192.168.228.1:5967   192.168.229.84:443
[16]tcp  3s   TCP_CLOSE  123.129.23.120:18550   10.0.96.216:443  192.168.228.1:7567   192.168.229.83:443
[16]tcp  90s  TCP_EST    72.250.60.207:33043    10.0.96.216:443  192.168.228.1:53279  192.168.229.86:443

Let's go through the meaning of these fields one by one:

  • [1]: this number is the CPU core number, i.e. the cpu_id of the worker cpu configured in dpvs.conf; from this field you can see how the load is spread across the worker threads of the DPVS process

  • tcp: tcp or udp, the protocol of this connection; no explanation needed

  • 90s / 30s / 7s / 3s: the time remaining for this connection

  • CLOSE_WAIT / FIN_WAIT / SYN_RECV / TCP_CLOSE / TCP_EST / TIME_WAIT: the state of this tcp connection

  • the last group of four IP+Port pairs is the Client<-->DPVS<-->RS correspondence:

    CIP:Port VIP:Port LIP:Port RIP:Port
    

So for DPVS-FNAT mode, with the LIP added there are four IP+Port pairs; together with the leading cpu_id and Protocol, that makes a ten-tuple that determines the number of connections.

cpu_id Protocol CIP:Port VIP:Port LIP:Port RIP:Port
  • Before starting the analysis we need to note that the four IP+Port pairs above actually split into two four-tuples: CIP:Port VIP:Port is one four-tuple and LIP:Port RIP:Port is the other, and the two four-tuples correspond one-to-one
  • First we can rule out Protocol, VIP:Port and RIP:Port, because these five variables are basically fixed and can be treated as constants
  • Next, CIP:Port can in theory be plentiful enough that it does not constrain the cluster's TCP connection count
  • Then there is cpu_id. Although a machine can have at most 16 worker cpus, that does not mean the maximum number of connections of the ten-tuple equals the maximum of the nine-tuple without cpu_id times 16: DPVS assigns the cpu_id according to the LIP's port number, spreading the load evenly across all CPUs, so cpu_id and the LIP port number are in a fixed correspondence
  • Finally, LIP:Port. We know a single IP has at most 65536 usable ports, and RIP:Port is a constant, so the maximum number of TCP connections of this four-tuple is <= (number of LIPs) * 65536 * (number of RIPs)

And because the two four-tuples correspond one-to-one and cpu_id is tied to the LIP port number, for DPVS-FNAT mode the number of LIPs is usually the key limit on the maximum number of connections of the whole cluster; if the cluster carries a large number of connections, it is recommended to allocate a sufficient number of IPs as LIPs.
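A rough worked example of the bound above (the counts are illustrative only, not taken from the original setup):

max connections of the LIP:Port <-> RIP:Port four-tuple  <=  (number of LIPs) * 65536 * (number of RIPs)
e.g. 4 LIPs and 20 RSes serving one port:  4 * 65536 * 20 ≈ 5.24 million concurrent connections (upper bound)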

Incidentally, according to the official documentation and actual testing, when x520/82599 and x710 NICs use the igb_uio PMD, fdir does not support the perfect mode on IPv6 networks; the signature mode is recommended instead, but note that in that mode only one LIP can be used, which caps the maximum number of connections of the cluster.

The corresponding passage in the official documentation reads:

We found there exists some NICs do not (fully) support Flow Control of IPv6 required by IPv6. For example, the rte_flow of 82599 10GE Controller (ixgbe PMD) relies on an old fashion flow type flow director (fdir), which doesn’t support IPv6 in its perfect mode, and support only one local IPv4 or IPv6 in its signature mode. DPVS supports the fdir mode config for compatibility.

6、Closing remarks

DPVS really does perform excellently in both performance and functionality, and it is equally true that you will step into plenty of pits early in the rollout. I recommend reading more documentation, looking up more material and reading the source code; once it is genuinely in use it will bring plenty of pleasant surprises and gains. Finally, by the way, if you just want to build a small cluster to try it out, an ordinary IPv4 network and ordinary x520 NICs are enough; of course, if you have the resources you can try an ECMP architecture and some better NICs (such as Mellanox).
