Where to Start with DNS? (Part 1): Starting from the Facebook Outage
2022-06-23 20:57:00 【hermanzeng】
This article was written on November 1, 2021 and shared on the company's internal platform. It is published in the Yunjia community after sensitive information was removed.
Introduction: DNS comes with many related concepts and terms that technical practitioners can rattle off: domain name hijacking, carrier hijacking, root mirrors deployed in China, domain name registration, domain name filing, domain name resolution errors, DNS amplification attacks, random subdomain attacks, "DNS is down", "DNS is down again", and so on. When deciding how to open this discussion of DNS, I had not yet settled the logic of the series: whether to go from recursive resolvers to root name servers to TLD servers to authoritative servers, or to start from domain names, from what DNS is, or from how a website visit works. Then the six-hour Facebook outage happened, so I decided to start with failures: to understand the hierarchical DNS lookup system through several incidents, and then, with that understanding in place, to fill in the knowledge points bit by bit.
The protagonist of this article is the Auth DNS shown in Figure 1.
Fault one: the six-hour Facebook outage of October 4, 2021
On October 4, 2021, during routine maintenance and an assessment of global backbone capacity, Facebook (FB) inadvertently took down its network connections, and a bug in the built-in audit tool kept it from blocking the command. FB's Auth DNS servers stop their BGP announcements when they cannot reach the data centers. Once the Auth DNS service failed, many internal tools stopped working and engineers could not fix the problem remotely, which ultimately led to six hours of downtime.
Auth DNS, in full the authoritative name server, is also called authoritative DNS, the authoritative domain name resolution server, or simply the authoritative server. If we call com a first-level domain, the authoritative server stores the information for second-level domains and their subdomains. For example, the resolution records for the subdomains of qq.com and facebook.com are stored on their authoritative servers. We can use the dig tool that ships with Linux to find the authoritative servers for a domain name, or use online tools. Here, running dnslookup facebook.com shows the following:
name                TTL     record type  value
a.ns.facebook.com.  172800  A            129.134.30.12
a.ns.facebook.com.  172800  AAAA         2a03:2880:f0fc:c:face:b00c:0:35
b.ns.facebook.com.  172800  A            129.134.31.12
b.ns.facebook.com.  172800  AAAA         2a03:2880:f0fd:c:face:b00c:0:35
c.ns.facebook.com.  172800  A            185.89.218.12
c.ns.facebook.com.  172800  AAAA         2a03:2880:f1fc:c:face:b00c:0:35
d.ns.facebook.com.  172800  A            185.89.219.12
d.ns.facebook.com.  172800  AAAA         2a03:2880:f1fd:c:face:b00c:0:35
Example 2: Facebook authoritative servers
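The lookup above can be pictured as walking the delegation chain from the root down to the zone's authoritative servers. The following is a minimal Python sketch of that walk over a toy in-memory table. The `DELEGATIONS` table and the `walk_delegations` function are illustrative assumptions, not real resolver code; only the facebook.com name server names come from Example 2, and a real resolver learns each step from NS referrals over the network.

```python
# Toy delegation table: zone -> name servers the parent zone delegates it to.
# Illustrative data only; one root and one TLD server stand in for the full sets.
DELEGATIONS = {
    ".": ["a.root-servers.net."],
    "com.": ["a.gtld-servers.net."],
    "facebook.com.": [
        "a.ns.facebook.com.",
        "b.ns.facebook.com.",
        "c.ns.facebook.com.",
        "d.ns.facebook.com.",
    ],
}


def walk_delegations(name):
    """Walk from the root toward `name`; return the chain of (zone, servers)."""
    labels = [label for label in name.split(".") if label]
    chain = [(".", DELEGATIONS["."])]
    zone = ""
    for label in reversed(labels):
        zone = label + "." + zone  # "com.", then "facebook.com.", ...
        if zone in DELEGATIONS:
            chain.append((zone, DELEGATIONS[zone]))
    return chain


chain = walk_delegations("web.facebook.com")
authoritative_zone, authoritative_servers = chain[-1]
```

The last entry of the chain is the deepest delegated zone, which is where the authoritative servers for the name live in this toy model.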
The authoritative service for facebook.com consists of a group of authoritative servers with four IPv4 and four IPv6 addresses. The IPs shown here are announced from multiple sites around the world using the same addresses through BGP anycast. One benefit of BGP anycast is that when a single point of failure is detected, the failed site can be taken offline by withdrawing its route, which isolates the fault. Besides hosting the second-level domain facebook.com, these servers also host many of Facebook's other second-level domains, such as instagram.com, fb.com, m.me and fb.me, defensively registered domains such as facbook.com (missing the "e") and intagram.com (missing the "s"), and thousands of others. Taking names with the suffix facebook.com as an example, about 40,000 domain names are exposed on the public internet, e.g.:
Domain
facebook.com
web.facebook.com
developers.facebook.com
de-de.facebook.com
l.facebook.com
apps.facebook.com
business.facebook.com
en-gb.facebook.com
ja-jp.facebook.com
es-la.facebook.com
fr-fr.facebook.com
it-it.facebook.com
es-es.facebook.com
pt-br.facebook.com
new.facebook.com
zh-tw.facebook.com
id-id.facebook.com
...
Example 3: facebook.com subdomains
The impact of this fault on Facebook's Auth DNS was that none of the Auth DNS IPs could be routed. In other words, none of the DNS servers in Example 2 could return resolution results for the facebook.com subdomains in Example 3, which widened the impact further. As mentioned above, these authoritative IPs are announced from multiple sites around the world, and after a single point of failure the failed site can be isolated by withdrawing its route. So why did the outage happen?
For scenarios such as fiber repair, capacity expansion and software updates, Facebook's network administrators need to proactively simulate taking parts of the backbone offline. During the exercise on October 4, a bug in the built-in audit tool kept it from blocking commands that it was expected to block, and the data centers were disconnected from the internet. The Auth DNS fault isolation mechanism was then triggered, and the servers withdrew the routes they were announcing.
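The failure mode described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Facebook's actual logic: the site names, the `AnycastSite` class and the `backbone_ok` flag are hypothetical, and the health check is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class AnycastSite:
    """One site announcing the shared anycast IP (hypothetical model)."""
    name: str
    backbone_ok: bool = True  # can this site reach the data centers?

    def announces_route(self):
        # Fault isolation rule: withdraw the anycast route when unhealthy.
        return self.backbone_ok


def reachable_sites(sites):
    """Names of the sites that still announce the anycast route."""
    return [site.name for site in sites if site.announces_route()]


sites = [AnycastSite("site-a"), AnycastSite("site-b"), AnycastSite("site-c")]

# Single-site failure: the other sites keep serving the same anycast IP.
sites[0].backbone_ok = False
single_failure = reachable_sites(sites)

# Backbone-wide failure: every site withdraws at once and no route is left.
for site in sites:
    site.backbone_ok = False
global_failure = reachable_sites(sites)
```

The same rule that isolates one unhealthy site removes the whole service when every site judges itself unhealthy at the same moment.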
At this point, resolution of facebook.com and related domain names still worked until the cached records in LocalDNS reached their TTL. Once the TTL expired and LocalDNS had to obtain the IP again through iterative queries, resolution failed. The tools for troubleshooting and fixing this kind of network problem themselves ultimately depended on the authoritative servers, so the problem could not be solved quickly with tools. Engineers had to go on site, and both on-site physical security checks and confirmation of the relevant authorizations lengthened the outage. The Auth DNS going offline because of the network failure amplified the impact of the problem. The access-layer IPs of the main Instagram and WhatsApp sites were also inside Facebook's network, so they were affected as well.
Prompted by this fault, we reflected on Tencent Auth and on how to design a robust authoritative resolution service.
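The TTL behaviour described above can be sketched with a toy cache and a simulated clock. Everything here is an assumption made for illustration: the `CachingResolver` and `ToyAuthority` classes, the 300-second TTL, the example.com name and the documentation address 192.0.2.1 are not taken from any real resolver or from Facebook's records.

```python
class AuthorityDown(Exception):
    """Raised when no authoritative server can be reached."""


class ToyAuthority:
    """Stand-in for the authoritative servers (hypothetical)."""

    def __init__(self):
        self.up = True

    def query(self, name):
        if not self.up:
            raise AuthorityDown(name)
        return "192.0.2.1", 300  # documentation address, assumed 300 s TTL


class CachingResolver:
    """Toy LocalDNS cache driven by a simulated clock in seconds."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable: name -> (address, ttl)
        self.cache = {}           # name -> (address, expires_at)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0]                 # still fresh, no upstream query
        address, ttl = self.upstream(name)  # raises if the authority is down
        self.cache[name] = (address, now + ttl)
        return address


authority = ToyAuthority()
resolver = CachingResolver(authority.query)
first = resolver.resolve("www.example.com", now=0)     # fills the cache
authority.up = False                                   # routes withdrawn
cached = resolver.resolve("www.example.com", now=299)  # served from cache
try:
    resolver.resolve("www.example.com", now=300)       # TTL has expired
    failed_after_expiry = False
except AuthorityDown:
    failed_after_expiry = True
```

In this model the outage is invisible to clients for up to one TTL and then becomes a hard failure, which matches the gradual onset described in the text.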
At the network level, Tencent's Auth DNS servers are deployed active-active across the three carrier networks and multiple locations. Tencent combines its own ASes (45090 in mainland China and 132203 overseas) with static deployment in the three domestic carrier networks. Multi-region, cross-carrier deployment is relatively robust. In a case like FB's, where the routes of its own AS disappeared, we announce the authoritative servers across carriers and from multiple ASes, so even if the routes under a single AS vanish from the internet, the impact on Tencent's authoritative service is controllable, and a failure of the authoritative servers in one network can be temporarily mitigated through the authoritative servers in other networks. At present, the routes in the three carrier networks are statically aggregated by city, so in theory, at the network level, there is no failure mode in which all routes converge on a single network entity. At the software level, we have not so far thought of a scenario in which all components in production would take themselves offline at once.
name         TTL     record type  value
ns1.qq.com.  172800  A            101.89.19.165
ns1.qq.com.  172800  A            157.255.246.101
ns1.qq.com.  172800  A            183.36.112.46
ns1.qq.com.  172800  A            203.205.220.251
ns1.qq.com.  172800  AAAA         2402:4e00:8030::115
ns2.qq.com.  172800  A            121.51.160.100
ns2.qq.com.  172800  A            123.151.66.78
ns2.qq.com.  172800  A            203.205.249.143
ns2.qq.com.  172800  AAAA         2402:4e00:8010:1::11c
ns3.qq.com.  172800  A            112.60.1.69
ns3.qq.com.  172800  A            183.192.164.81
ns3.qq.com.  172800  A            203.205.195.94
ns4.qq.com.  172800  A            125.39.46.125
ns4.qq.com.  172800  A            203.205.195.104
ns4.qq.com.  172800  A            203.205.221.79
ns4.qq.com.  172800  A            58.144.154.100
ns4.qq.com.  172800  A            59.36.132.142
Example 6: qq.com authoritative servers
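The multi-AS argument above can be checked with a toy model: a name server stays reachable as long as at least one network that announces it is still up. The AS labels, the `ANNOUNCEMENTS` table and the `surviving_servers` function below are hypothetical and do not reflect which network actually announces each address in Example 6.

```python
# name server -> set of networks (AS) that announce an address for it.
# Hypothetical labels; the real mapping is not given in the article.
ANNOUNCEMENTS = {
    "ns1": {"AS-A", "AS-B", "AS-C"},
    "ns2": {"AS-A", "AS-B"},
    "ns3": {"AS-C"},
}


def surviving_servers(announcements, failed):
    """Name servers still reachable through at least one non-failed AS."""
    failed = set(failed)
    return sorted(ns for ns, ases in announcements.items() if ases - failed)


# One AS disappears: only the server that depended solely on it is lost.
after_single_as_loss = surviving_servers(ANNOUNCEMENTS, {"AS-C"})

# A deployment behind a single AS loses everything when that AS disappears.
single_as_deployment = {"ns-a": {"AS-X"}, "ns-b": {"AS-X"}}
after_total_loss = surviving_servers(single_as_deployment, {"AS-X"})
```

Under these assumptions, spreading announcements over several ASes turns the loss of one AS into a partial degradation instead of a full outage.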
Does that mean Tencent Auth cannot be affected at the network level or the control level? Of course not. The conclusion above applies only to the kind of problem Facebook ran into. Next, from the network and software perspectives, I will discuss two recent Tencent Auth failures.
Fault two: April 5, 2021, Tencent Auth domain name resolution timeouts on China Unicom (see the internal share)
Fault three: abnormal domain name resolution caused Tencent News lists to fail to open for one hour (see the internal share)
There have been many authoritative DNS failures. On October 21, 2016, the DNS network of the US domain name provider Dyn suffered a DDoS attack, which paralyzed large parts of the US internet. On July 16, 2020, a Cloudflare DNS failure left a large number of websites in China and abroad unable to resolve or be accessed normally. On July 22, 2021, an Akamai DNS failure brought down more than 20,000 large websites, including Fnac and Amazon cloud services. Fault one, the Facebook outage, shows how Auth DNS depends on the network and how DNS resolution affects the business. The analysis of fault two shows that even with multiple cross-network deployments, human factors still have a significant impact on the service, and we have experienced resolution failures within a single network. Fault three shows that the authoritative server software itself also has a huge impact on the authoritative service. The importance of Auth DNS in the DNS system is self-evident.
From the faults above, we can give Auth DNS a simple definition: it holds the information for specific second-level domains and their subdomains (in most cases; some subdomains are also delegated separately to their own authoritative servers), and it is the last step in the process by which LocalDNS looks up an IP address (Figure 1). The Auth DNS for facebook.com stores facebook.com and its subdomains, as well as other second-level domains owned by Facebook (now Meta). The Auth DNS for qq.com stores qq.com and Tencent's other second-level domains. Baidu, Alibaba and Tencent manage their authoritative domains with self-developed systems. Besides the self-developed GSLB that hosts Tencent's own external domain names, Tencent's in-house Auth DNS teams also include DNSPod, which provides domain hosting for external internet companies such as mi.com and bilibili.com. The good news is that meituan.com, in addition to being hosted on DNSPod, has also added self-developed authoritative servers, so the list of companies employing Auth DNS practitioners grows by one. There is also a peer team, TDNS, which serves CDN scheduling needs by aggregating carrier LDNS information and CDN node resource status to return the theoretically optimal scheduling result.
mi.com. 172800 IN NS ns3.dnsv5.com.
mi.com. 172800 IN NS ns4.dnsv5.com.
-----
bilibili.com. 172800 IN NS ns3.dnsv5.com.
bilibili.com. 172800 IN NS ns4.dnsv5.com.
------
meituan.com. 172800 IN NS ns3.dnsv5.com.
meituan.com. 172800 IN NS ns4.dnsv5.com.
meituan.com. 172800 IN NS edns1.sankuai.com.
meituan.com. 172800 IN NS edns2.sankuai.com.
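The NS records above can be grouped by provider to see which zones rely on more than one name server operator. The sketch below is a rough illustration: the `providers_by_zone` function is a hypothetical helper, and treating the last two labels of the NS target as the "provider" is a simplifying assumption that would not hold for suffixes such as co.uk.

```python
from collections import defaultdict

# NS records copied from the listing above.
NS_RECORDS = """\
mi.com. 172800 IN NS ns3.dnsv5.com.
mi.com. 172800 IN NS ns4.dnsv5.com.
bilibili.com. 172800 IN NS ns3.dnsv5.com.
bilibili.com. 172800 IN NS ns4.dnsv5.com.
meituan.com. 172800 IN NS ns3.dnsv5.com.
meituan.com. 172800 IN NS ns4.dnsv5.com.
meituan.com. 172800 IN NS edns1.sankuai.com.
meituan.com. 172800 IN NS edns2.sankuai.com.
"""


def providers_by_zone(text):
    """Map each zone to the set of provider domains serving its NS records."""
    zones = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[3] != "NS":
            continue  # skip blank lines, separators and non-NS records
        zone, target = fields[0], fields[4]
        # Simplification: provider = last two labels of the NS target.
        provider = ".".join(target.rstrip(".").split(".")[-2:])
        zones[zone].add(provider)
    return dict(zones)


providers = providers_by_zone(NS_RECORDS)
multi_provider_zones = sorted(z for z, p in providers.items() if len(p) > 1)
```

With this data, only meituan.com ends up with two providers, which is the redundancy the paragraph above points out.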
Finally, a brief closing reflection.
On the one hand, DNS is robust. As core infrastructure of the internet, it has grown from the original mapping of LAN host names to IPs into today's resource scheduling and service entry point, and it provides resolution for nearly 400 million second-level domains worldwide. On the other hand, DNS is fragile. The tree-shaped hierarchy opens up boundaries and brings in different network entities, and as a single entity grows, the impact of a service disaster at that entity becomes huge and uncontrollable. Both at the macro level, in the 14th Five-Year Plan's emphasis on building a Digital China and its requirements for the layout of critical information infrastructure, and at the micro level, in how much business depends on DNS, the importance of DNS is evident. Through this series of articles, I hope everyone comes to understand DNS and take it seriously. Real prevention means nipping problems in the bud.