Where should we start with DNS? Part I: starting from the Facebook outage

2022-06-23 20:57:00 hermanzeng

This article was written on November 1, 2021 and shared on the company's internal platform. It was published in the Yunjia community after desensitization.

Introduction: DNS involves many related concepts and terms, and many practitioners can rattle them off: domain name hijacking, carrier hijacking, root mirrors deployed in China, domain name registration, domain name filing, domain name resolution failures, DNS amplification attacks, random subdomain attacks, "DNS is down", "DNS is down again", and so on. When I was deciding how this series should talk about DNS, I had not yet worked out its logic: whether to go from recursive resolvers to root name servers to TLD servers to authoritative servers, or to start from domain names, from what DNS is, or from what happens during a website visit. As it happened, Facebook suffered its six-hour outage, so I decided to start from failures. We will use several failures to understand the hierarchical DNS lookup system, and once that picture is in place, fill in the individual knowledge points one by one.

Example 1: The DNS hierarchical lookup system

The protagonist of this article is the Auth DNS in Example 1.

Fault one: the six-hour Facebook outage of 2021-10-04

On October 4, 2021, during routine maintenance and an assessment of global backbone network capacity, Facebook (FB) inadvertently cut its network connections, and a bug in the built-in audit tool meant that the command was not blocked. FB's Auth DNS servers withdraw their BGP advertisements when they cannot reach the data centers. Once the Auth DNS service became abnormal, many internal tools stopped working and engineers could not repair the problem remotely. The result was about 6 hours of downtime.

Auth DNS is short for authoritative nameserver. We also call it the authoritative DNS, the authoritative domain name resolution server, or simply the authoritative server. If we call com the first-level domain, then the authoritative server stores the information for a second-level domain and its subdomains. For example, the resolution records for the subdomains of qq.com and facebook.com are stored on their authoritative servers. We can use the dig tool that ships with Linux to obtain the authoritative servers for a domain name, or use one of the online tools that do the same. Here, running dnslookup facebook.com shows the following:

name	TTL	record type	value
a.ns.facebook.com.	172800	A	129.134.30.12
a.ns.facebook.com.	172800	AAAA	2a03:2880:f0fc:c:face:b00c:0:35
b.ns.facebook.com.	172800	A	129.134.31.12
b.ns.facebook.com.	172800	AAAA	2a03:2880:f0fd:c:face:b00c:0:35
c.ns.facebook.com.	172800	A	185.89.218.12
c.ns.facebook.com.	172800	AAAA	2a03:2880:f1fc:c:face:b00c:0:35
d.ns.facebook.com.	172800	A	185.89.219.12
d.ns.facebook.com.	172800	AAAA	2a03:2880:f1fd:c:face:b00c:0:35

Example 2: Facebook authoritative servers
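To make the lookup described above more concrete, here is a minimal sketch of the query such a tool sends on the wire when it asks for the NS records of a domain. It follows the message layout in RFC 1035; the function names `encode_name` and `build_ns_query` and the fixed query ID are illustrative assumptions, not part of any tool mentioned in this article. Sending the packet over UDP port 53 and parsing the response are deliberately left out.

```python
import struct

QTYPE_NS = 2   # record type NS
QCLASS_IN = 1  # class IN (Internet)


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels (RFC 1035, section 3.1)."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"invalid label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"  # the empty root label terminates the name


def build_ns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message asking for the NS records of `name`.

    `query_id` is an arbitrary illustrative value; real clients randomize it.
    """
    flags = 0x0100  # standard query with the RD (recursion desired) bit set
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", QTYPE_NS, QCLASS_IN)
    return header + question


packet = build_ns_query("facebook.com")
print(packet.hex())
```

The answer to such a query is the list of name server names; the A and AAAA rows in the table above are the addresses of those name servers.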

The authoritative service for facebook.com is provided by a group of authoritative servers with 4 IPv4 and 4 IPv6 addresses. The IPs shown here are announced through BGP anycast, that is, the same IP is advertised from multiple sites around the world. One benefit of BGP anycast is that when a single site is detected as failed, it can be taken offline by withdrawing its route, which isolates the fault. Besides the second-level domain facebook.com, these servers also host many of Facebook's other second-level domains, such as instagram.com, fb.com, m.me, fb.me, and defensively registered domains such as facbook.com (missing the "e") and intagram.com (missing the "s"), thousands of second-level domains in total. Taking names under facebook.com as an example, about 40,000 domain names have been exposed on the public network, e.g.:

Domain
facebook.com
web.facebook.com
developers.facebook.com
de-de.facebook.com
l.facebook.com
apps.facebook.com
business.facebook.com
en-gb.facebook.com
ja-jp.facebook.com
es-la.facebook.com
fr-fr.facebook.com
it-it.facebook.com
es-es.facebook.com
pt-br.facebook.com
new.facebook.com
zh-tw.facebook.com
id-id.facebook.com
...

Example 3: facebook.com subdomains

The impact of this fault on Facebook's Auth DNS was that none of the Auth DNS IPs could be routed. In other words, none of the DNS servers in Example 2 could return resolution results for the facebook.com subdomains in Example 3, which widened the impact further. As mentioned above, these authoritative IPs are advertised from multiple sites around the world, and after a single-site failure the failed site can be isolated by withdrawing its route. So why did this failure happen?

For scenarios such as fiber repair, capacity expansion, and software updates, Facebook's network administrators need to proactively simulate taking parts of the backbone network offline. During the simulation on October 4, the built-in audit tool hit a bug and failed to block commands that it was expected to block, so the data centers were disconnected from the Internet. This triggered the Auth DNS fault isolation mechanism, and the Auth DNS servers withdrew the routes they advertise.

Example 4: Facebook fault diagram (a guess based on how the fault behaved)
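The interaction between anycast and the fault isolation mechanism can be sketched as a small simulation. This is only an illustration under stated assumptions: the class `AnycastSite`, the site names, and the single boolean health check are hypothetical and do not describe Facebook's actual implementation. It shows why a mechanism that protects against a single-site failure removes the service entirely when every site fails the same check at once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnycastSite:
    """One site announcing the shared anycast prefix (hypothetical model)."""
    name: str
    backbone_reachable: bool = True

    def advertises(self) -> bool:
        # The site keeps announcing the prefix only while its health check
        # (can it reach the data centers over the backbone?) passes.
        return self.backbone_reachable


def reachable_sites(sites: List[AnycastSite]) -> List[str]:
    """Return the sites that still announce the anycast prefix."""
    return [site.name for site in sites if site.advertises()]


sites = [AnycastSite("site-a"), AnycastSite("site-b"), AnycastSite("site-c")]

# Single-site failure: the failed site withdraws its route, the others keep serving.
sites[0].backbone_reachable = False
after_single_failure = reachable_sites(sites)

# Backbone-wide failure: every site fails the same check and withdraws its route.
for site in sites:
    site.backbone_reachable = False
after_backbone_failure = reachable_sites(sites)

print(after_single_failure)    # two sites still announce the prefix
print(after_backbone_failure)  # nobody announces the prefix any more
```

In the second case the prefix disappears from the Internet even though the DNS servers themselves are still running, which matches the behavior described in this fault.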

At this point, resolution of facebook.com-related domain names still worked as long as the cached records in LocalDNS had not reached their TTL. Once the TTL expired and LocalDNS had to obtain the IP again through iterative queries, resolution failed. The tools used to troubleshoot and fix this kind of network problem ultimately depend on the authoritative servers too, so the problem could not be solved quickly with tools. Engineers had to go on site, and on-site physical security checks and confirmation of the relevant permissions both lengthened the outage. The Auth DNS going offline because of the network failure amplified the impact of the problem. The access-layer IPs of the main Instagram and WhatsApp sites are also inside Facebook's network, so they were affected as well.
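The TTL behavior described above can be sketched with a toy caching resolver. This is a simplified model under stated assumptions: the class `CachingResolver`, the `upstream` callback, the name `www.example.com`, the documentation address 192.0.2.10, and the 300-second TTL are all illustrative. Real resolvers are more elaborate, and some can be configured to serve stale answers when the authoritative servers are unreachable (RFC 8767), which this sketch does not model.

```python
from typing import Callable, Dict, Optional, Tuple

# An upstream lookup returns (ip, ttl_seconds), or None when unreachable.
Upstream = Callable[[str], Optional[Tuple[str, int]]]


class CachingResolver:
    """Toy LocalDNS: answers from cache until the TTL expires."""

    def __init__(self, upstream: Upstream) -> None:
        self.upstream = upstream
        self.cache: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def resolve(self, name: str, now: float) -> Optional[str]:
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cached record is still within its TTL
        answer = self.upstream(name)  # must ask the authoritative side again
        if answer is None:
            self.cache.pop(name, None)
            return None  # resolution fails
        ip, ttl = answer
        self.cache[name] = (ip, now + ttl)
        return ip


state = {"up": True}


def authoritative(name: str) -> Optional[Tuple[str, int]]:
    # Hypothetical authoritative server: one fixed answer with a 300 s TTL.
    return ("192.0.2.10", 300) if state["up"] else None


resolver = CachingResolver(authoritative)
first = resolver.resolve("www.example.com", now=0)      # cache miss, upstream answers
state["up"] = False                                     # authoritative servers vanish
cached = resolver.resolve("www.example.com", now=299)   # still served from cache
expired = resolver.resolve("www.example.com", now=300)  # TTL expired, lookup fails
```

Passing `now` explicitly instead of reading the system clock keeps the example deterministic.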

This fault prompted us to reflect on Tencent Auth and on how to design a robust authoritative resolution service.

Example 5: Tencent AuthDNS network diagram

At the network level, the Tencent Auth DNS servers are deployed across three carrier networks, in multiple locations, in a multi-active layout. Tencent has its own ASes (45090 in China, 132203 overseas), plus static deployments inside the three domestic carrier networks. A multi-region, cross-carrier deployment is relatively robust. For an FB-like scenario in which the routes of our own AS disappear, we advertise the authoritative service across carriers and from multiple ASes. Even if the routes under a single AS disappear from the Internet, the impact on Tencent's authoritative service is controllable, and an authoritative failure in one network can be temporarily recovered through the authoritative servers in other networks. At present, the three carrier networks statically summarize routes by city, so in theory there should be no network-level failure caused by all routes converging on a single network entity. At the software level, for now I cannot think of a scenario in which every component in the production network would take itself offline at once.

name	TTL	record type	value
ns1.qq.com.	172800	A	101.89.19.165
ns1.qq.com.	172800	A	157.255.246.101
ns1.qq.com.	172800	A	183.36.112.46
ns1.qq.com.	172800	A	203.205.220.251
ns1.qq.com.	172800	AAAA	2402:4e00:8030::115
ns2.qq.com.	172800	A	121.51.160.100
ns2.qq.com.	172800	A	123.151.66.78
ns2.qq.com.	172800	A	203.205.249.143
ns2.qq.com.	172800	AAAA	2402:4e00:8010:1::11c
ns3.qq.com.	172800	A	112.60.1.69
ns3.qq.com.	172800	A	183.192.164.81
ns3.qq.com.	172800	A	203.205.195.94
ns4.qq.com.	172800	A	125.39.46.125
ns4.qq.com.	172800	A	203.205.195.104
ns4.qq.com.	172800	A	203.205.221.79
ns4.qq.com.	172800	A	58.144.154.100
ns4.qq.com.	172800	A	59.36.132.142

Example 6: qq.com authoritative servers
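The multi-network argument above can be checked mechanically. The sketch below asks whether at least one authoritative address is still announced after any single network disappears. Everything in it is a labeled assumption: the network labels (`own-as`, `carrier-a`, `carrier-b`), the name server names, and the documentation-range IPs are hypothetical and are not the real mapping of the qq.com addresses listed above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# name server -> list of (ip, network label); all values are hypothetical
Deployment = Dict[str, List[Tuple[str, str]]]


def surviving_servers(deployment: Deployment, failed_network: str) -> Dict[str, List[str]]:
    """Return the addresses still announced after one network withdraws its routes."""
    survivors: Dict[str, List[str]] = defaultdict(list)
    for ns_name, addresses in deployment.items():
        for ip, network in addresses:
            if network != failed_network:
                survivors[ns_name].append(ip)
    return dict(survivors)


def survives_any_single_loss(deployment: Deployment) -> bool:
    """True if some authoritative address remains whichever single network fails."""
    networks = {net for addrs in deployment.values() for _, net in addrs}
    return all(surviving_servers(deployment, net) for net in networks)


multi_network: Deployment = {
    "ns1.example.": [("192.0.2.1", "carrier-a"), ("198.51.100.1", "carrier-b")],
    "ns2.example.": [("203.0.113.1", "own-as")],
}
single_network: Deployment = {
    "ns1.example.": [("192.0.2.1", "own-as")],
    "ns2.example.": [("192.0.2.2", "own-as")],
}

print(survives_any_single_loss(multi_network))   # spread over several networks
print(survives_any_single_loss(single_network))  # everything behind one AS
```

This only models route visibility. As the following paragraph notes, it says nothing about failures caused by people or by the authoritative software itself.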

Does that mean Tencent Auth cannot be affected at the network level or the control level? Of course not. The conclusion above only applies to the kind of problem Facebook ran into. Next, looking at the network level and the software level in turn, I will talk about two recent Tencent Auth failures.

Fault two: 2021-04-05, Tencent Auth domain name resolution timeouts on China Unicom (see the internal post)

Fault three: abnormal domain name resolution left Tencent News lists unable to open for 1 hour (see the internal post)

There have been many other authoritative DNS failures. On October 21, 2016, the DNS network of the US domain name provider Dyn suffered a DDoS attack, which paralyzed large parts of the US Internet. On July 16, 2020, a Cloudflare DNS server failure left a large number of websites in China and abroad unable to be resolved and accessed normally. On July 22, 2021, an Akamai DNS failure took down more than 20,000 large websites, including Fnac and Amazon cloud services. From fault one, the Facebook failure, we can see how AuthDNS depends on the network and how the DNS resolution service affects the business. From the analysis of fault two we learn that, even though we have deployed across multiple networks, human factors still have a significant impact on the service, and we too have experienced resolution failures within a single network. From fault three we can see that the authoritative service software itself also has a huge impact on the authoritative service. The importance of AuthDNS within the DNS system is self-evident.

Based on the faults above, we can give a simple definition of Auth DNS: it holds the information for a specific second-level domain and its subdomains (in most cases; some subdomains are delegated separately to their own authoritative servers), and it is the last step in the process by which LocalDNS looks up an IP address (Example 1). The AuthDNS of facebook.com stores facebook.com and its subdomains, along with the other second-level domains owned by Facebook (now Meta). The AuthDNS of qq.com stores qq.com and information about Tencent's other second-level domains. Baidu, Alibaba, and Tencent all manage their authoritative domains with self-developed systems. Besides the self-developed GSLB that hosts Tencent's own external domain names, Tencent's in-house AuthDNS teams also include DNSPod, which provides domain name hosting for external Internet companies such as mi.com and bilibili.com. The good news is that meituan.com, in addition to being hosted on DNSPod, has added its own self-developed authoritative servers, which gives Auth DNS practitioners one more company to choose from. There is also a peer team, TDNS, which handles CDN scheduling requirements: it aggregates carrier LocalDNS information and the resource status of CDN nodes to return the theoretically optimal scheduling result.

mi.com.			172800	IN	NS	ns3.dnsv5.com.
mi.com.			172800	IN	NS	ns4.dnsv5.com.
-----
bilibili.com.		172800	IN	NS	ns3.dnsv5.com.
bilibili.com.		172800	IN	NS	ns4.dnsv5.com.
-----
meituan.com.		172800	IN	NS	ns3.dnsv5.com.
meituan.com.		172800	IN	NS	ns4.dnsv5.com.
meituan.com.		172800	IN	NS	edns1.sankuai.com.
meituan.com.		172800	IN	NS	edns2.sankuai.com.
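The NS records above can be grouped by hosting provider to see which zones depend on a single provider. The sketch below parses the same records; the helper names `provider_of` and `providers_by_zone` are illustrative, and treating the last two labels of a name server's hostname as its provider is a simplifying assumption that happens to work for these names but not for suffixes such as co.uk.

```python
from collections import defaultdict
from typing import Dict, Set

NS_RECORDS = """\
mi.com.       172800 IN NS ns3.dnsv5.com.
mi.com.       172800 IN NS ns4.dnsv5.com.
bilibili.com. 172800 IN NS ns3.dnsv5.com.
bilibili.com. 172800 IN NS ns4.dnsv5.com.
meituan.com.  172800 IN NS ns3.dnsv5.com.
meituan.com.  172800 IN NS ns4.dnsv5.com.
meituan.com.  172800 IN NS edns1.sankuai.com.
meituan.com.  172800 IN NS edns2.sankuai.com.
"""


def provider_of(ns_host: str) -> str:
    """Approximate the provider by the last two labels of the NS hostname."""
    labels = ns_host.rstrip(".").split(".")
    return ".".join(labels[-2:])


def providers_by_zone(text: str) -> Dict[str, Set[str]]:
    """Map each zone to the set of providers its NS records point at."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[3] != "NS":
            continue  # skip separators and anything that is not an NS record
        zone, _ttl, _class, _type, target = fields
        result[zone].add(provider_of(target))
    return dict(result)


providers = providers_by_zone(NS_RECORDS)
for zone, names in sorted(providers.items()):
    print(zone, sorted(names))
```

On these records, meituan.com is the only zone whose name servers span two providers, which is the point made in the paragraph above.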

Finally, a brief closing thought.

On the one hand, DNS is robust. As core Internet infrastructure, it has grown from the original mapping of LAN host names to IPs into today's resource scheduling and service entry point, and it provides resolution for nearly 400 million second-level domains around the world. On the other hand, DNS is fragile. Its tree-like hierarchy opens up boundaries and brings in many different network entities, and as a single entity grows, the impact of a service disaster at that entity becomes huge and uncontrollable. Whether we look at the macro level, where the 14th Five-Year Plan stresses building a Digital China and sets requirements for the layout of critical information infrastructure, or at the micro level, where businesses depend on DNS, the importance of DNS is clear. Through this series of articles, I hope everyone comes to understand DNS and take it seriously. Real prevention means nipping problems in the bud.

Copyright notice
This article was written by [hermanzeng]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2021/12/202112282147408871.html