Exploration of web application component automatic discovery
2022-06-24 03:51:00 【Tencent Security Emergency Response Center】
Introduction
Most readers will be familiar with Web fingerprinting, and there are many related projects, such as the well-known Wappalyzer and WhatWeb. Operational practice is similar across companies: build a fingerprint database in advance, scan the company's assets against it to get a complete inventory, and, when a high-risk vulnerability is disclosed in some component, use that inventory to shrink the attack surface quickly.
The Web application components discussed in this article, however, are not the small building blocks that fingerprinting usually targets, such as jQuery or Bootstrap. They are the relatively independent application that a site runs. For example, a site built with phpMyAdmin also uses jQuery, but we still classify its Web application component as phpMyAdmin.
Why define it this way? Mainly to address the problem of high-risk Web application components being exposed externally. Over time, our operations found that some businesses deploy Web application components such as phpMyAdmin, Kibana and Elasticsearch and open them directly to the public network. These components may have weak passwords or allow unauthorized access, which has led to security incidents. To move from reactive to proactive, we worked with the application security team to maintain a company-wide blacklist of high-risk Web application components. Businesses are advised to avoid these components where possible, and above all to avoid exposing them on the public network.
As everyone knows, any such measure has to be verified continuously, otherwise it has no effect. Operationally, we detect high-risk Web application components in business systems through active discovery with the company's self-developed scanner Dongxi ("Hole Rhinoceros") and passive discovery with the Aegis traffic analysis system, and this has worked well. That part is not the focus of this article, so we will not go into it here.
Where does the source come from?
When building the Web application component blacklist, we first included components with a notorious history of high-risk issues. In actual operation, however, we ran into a thorny problem: where new blacklist entries come from. There were three sources. The first was external security incidents. The second was new Web application components found in security advisories. The third was the AI-based policy for identifying management backends that we put online (see 《Machine learning based Web Management background identification method exploration》), although that policy mostly identifies management backends with login behavior. In short, these sources solve only part of the problem.
Solving the source problem comprehensively is therefore key to running the Web application component blacklist well. There are obvious approaches, such as looking for components with many stars on GitHub, or surveying each business line to learn which components it commonly uses. Without exception, though, these measures are costly and do not generalize. So how can we find a cheaper, more automated solution?
Putting it into practice
Solving the blacklist source problem can be described as three steps: find new Web components -> determine the component identification rules -> determine whether the component is high risk.
1. Find new Web components:
The first priority is how to find new Web components. Thinking about how components are distributed, we first considered crawling components with high star counts on GitHub and then analyzing them. That approach is clearly not cheap, and worse, GitHub projects are a mixed bag, which makes it hard to know where to start. So we shifted our focus to Web components after they are deployed, that is, once they are serving externally. Under the definition of Web application components given above, we expect some response pages from the same component to be similar: the home pages of different phpMyAdmin versions, for example, look much alike. Similar pages can therefore be clustered with a page similarity algorithm, and pages in the same cluster can generally be treated as the same component. To test this, we downloaded publicly available HTTP response data for Web applications on a number of ports across the Internet. In testing, the biggest problem was cost. Similarity has to be computed for every pair of pages, so the work grows quadratically with the number of pages: 10,000 pages already require roughly 50 million comparisons.
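To make the cost concrete, here is a minimal sketch of the naive pairwise approach, assuming the pages are held in memory as strings. `SequenceMatcher` from Python's standard library stands in for whatever similarity algorithm was actually used, and the 0.9 threshold is an arbitrary illustrative value, not one taken from the article.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Number of unordered page pairs among n pages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def similar_pairs(pages: dict, threshold: float = 0.9) -> list:
    """Compare every pair of page bodies and keep the pairs above the threshold.

    `pages` maps a URL to its response body. The threshold is an
    illustrative assumption.
    """
    matches = []
    for (url_a, body_a), (url_b, body_b) in combinations(pages.items(), 2):
        if SequenceMatcher(None, body_a, body_b).ratio() >= threshold:
            matches.append((url_a, url_b))
    return matches


print(pairwise_comparisons(10_000))  # 49995000
```

The pair count grows with the square of the number of pages, which is why per-page values that can be grouped directly are attractive.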
The idea is sound; the core problem is reducing the cost. We will skip the detailed thought process and go straight to the solution: assign each page a few deterministic values, and then judge how similar two pages are by how many of those values they share. Below we explain what these deterministic values are and how to obtain them. A response usually consists of a header and a body. We start by processing the body as follows:
After these steps we obtain the ordered set of tags shown in the figure below. Briefly, the main idea of the steps above is to handle tags that are often touched by secondary development (customization), by deleting or merging them. The meta information in WordPress, for example, is frequently modified.
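The exact processing steps are not reproduced in the text, so the following is only a sketch of the general idea using Python's standard `html.parser`. The set of dropped tags and the rule of merging consecutive repeats of the same tag are assumptions chosen for illustration; the names `DROPPED_TAGS`, `SkeletonParser` and `web_skeleton` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that customization tends to change.
DROPPED_TAGS = {"meta", "script", "style", "link"}


class SkeletonParser(HTMLParser):
    """Collect the ordered sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            return  # delete tags that are often modified
        if self.tags and self.tags[-1] == tag:
            return  # merge consecutive repeats such as a run of <li>
        self.tags.append(tag)


def web_skeleton(body: str) -> str:
    """Return the tag sequence of an HTML body as one string."""
    parser = SkeletonParser()
    parser.feed(body)
    parser.close()
    return ">".join(parser.tags)
```

With these rules, two pages that differ only in their `meta` tags, their text, or the number of items in a list produce the same skeleton string.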
We call the resulting tag sequence the web skeleton. Our analysis found that a skeleton can be divided into three segments: upper, middle and lower. The reason for this division is that many Web application components of the same kind do not produce exactly identical skeletons. Different versions of an application, for example, often share the upper and lower segments and differ only in the middle. After the division, we compute the MD5 of the upper segment, the lower segment and the whole skeleton, and these MD5 values are the deterministic values. Pages can then be aggregated to discover similar components. There are three possible outcomes: the skeletons are exactly the same, one segment matches, or two segments match. One advantage of the three-segment approach is that it supports tiered handling.
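A minimal sketch of the segment hashing and tiered matching described above, assuming the skeleton is already available as a string. The function names and the default segment lengths are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def skeleton_hashes(skeleton: str, head_len: int = 1000, tail_len: int = 1000) -> dict:
    """MD5 of the upper segment, the lower segment and the whole skeleton.

    head_len and tail_len must be positive; the defaults are assumptions.
    """
    return {
        "upper": md5_hex(skeleton[:head_len]),
        "lower": md5_hex(skeleton[-tail_len:]),
        "full": md5_hex(skeleton),
    }


def match_level(a: dict, b: dict) -> str:
    """Tiered comparison: exact, two segments, one segment, or none."""
    if a["full"] == b["full"]:
        return "exact"
    shared = sum(a[key] == b[key] for key in ("upper", "lower"))
    return {2: "two_segments", 1: "one_segment", 0: "none"}[shared]


def group_by_hash(pages: dict, key: str = "full") -> dict:
    """Group URLs by one hash value in a single pass over the pages."""
    groups = defaultdict(list)
    for url, hashes in pages.items():
        groups[hashes[key]].append(url)
    return dict(groups)
```

Because each page is reduced to a few fixed strings, grouping is a dictionary lookup per page rather than a comparison against every other page.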
Readers may want to know how the three segments are chosen. In short, there is no fixed formula; it is mainly a matter of experience. We decide based on the total character length of the skeleton. For example, if the total length is between 5000 and 6000, the three segments are 0-1000, 1000-4xxx and 4xxx-5xxx. In some cases there is no need to split at all, because the MD5 of the whole skeleton already matches exactly. That is only one case. In another, the response body is plain JSON with no HTML tags; here we usually set every value in the JSON to null and then compare. Responses are messy in practice: besides JSON there are XML, plain strings and so on, each of which needs additional special handling. The overall flow is simplified in the figure below. Combined with information such as the title, this yields N sets, each containing a number of pages.
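For the JSON case, one way to realize "set every value to null, then compare" is sketched below. Sorting keys and hashing the result with MD5 are assumptions added so that the output lines up with the skeleton hashes; the function names are hypothetical.

```python
import hashlib
import json
from typing import Optional


def blank_values(node):
    """Keep the key and nesting structure, replace every leaf value with None."""
    if isinstance(node, dict):
        return {key: blank_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [blank_values(item) for item in node]
    return None


def json_skeleton_md5(body: str) -> Optional[str]:
    """Return an MD5 of the value-blanked JSON, or None if body is not a JSON object/array."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return None
    if not isinstance(parsed, (dict, list)):
        return None
    # sort_keys makes the result independent of key order (an assumption).
    canonical = json.dumps(blank_values(parsed), sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

In this sketch, arrays of different lengths still yield different hashes, so responses that return variable-length lists would need further normalization.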
2. Determine the component identification rules:
With the first problem solved, the answer to the second is close at hand: isn't the skeleton MD5 obtained above exactly the rule for identifying whether pages belong to the same component? It is indeed an answer, but not a perfect one, because traditional fingerprint databases are usually built on regular expressions. What we really need is a way to generate regular expressions automatically (a plain string is also a kind of expression), so that they can be applied in the passive traffic monitoring system. We have tried many methods, but none solves this problem well so far. At first we wanted to extract two kinds of information from pages as string fingerprints (tag + text combinations, and tag attribute + attribute value combinations, as shown in the figure below), judge how distinctive a fingerprint is by whether it matches set A but no other set, and finally convert these strings into regular expressions. This scheme turned out to be very cumbersome, so we set it aside. We also tried matching preset fingerprint features that appear frequently, such as the Server field in the header, but that does not solve the problem well either. Matching by MD5 value, of course, remains feasible.
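The "matches set A but no other set" test for string fingerprints can be illustrated with plain set operations. This is a sketch of that idea only, not the authors' implementation; it assumes each page has already been reduced to a set of feature strings, and the feature format shown in the usage note is hypothetical.

```python
import re


def distinctive_features(clusters: dict, target: str) -> set:
    """Features present on every page of the target cluster and on no page elsewhere.

    `clusters` maps a cluster name to a non-empty list of per-page feature sets.
    """
    common = set.intersection(*clusters[target])
    seen_elsewhere = set()
    for name, pages in clusters.items():
        if name != target:
            for features in pages:
                seen_elsewhere |= features
    return common - seen_elsewhere


def to_patterns(features: set) -> list:
    """Turn literal feature strings into compiled regular expressions."""
    return [re.compile(re.escape(feature)) for feature in sorted(features)]
```

For example, with features such as `title=phpMyAdmin` and `script.src=jquery.js`, a feature shared with another cluster (the jQuery reference) is discarded and only the strings unique to the target cluster remain. Escaping literals is the simplest possible conversion; generalizing them into looser patterns is the cumbersome part.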
3. Determine whether the component is high risk:
The third problem is determining a component's risk. This relies mainly on manual analysis, although some automation can assist. One option is to use the AI policy for identifying management backends to judge risk; another is to capture screenshots with a headless browser to make manual risk assessment more efficient.
Visually, our eyes sometimes deceive us, but a page's code does not. The program classified the two pages in the figures below as the same component. We initially thought this was a false positive, but deeper analysis showed that the two sites do use the same framework.
Conclusion
The data set used above is HTTP response data from public IP addresses. Using it directly to discover new high-risk components for the company is somewhat overkill; this data is better suited to learning about global trends in open-source Web application components, their relative share, and so on. If the goal is to understand the company's current situation and to shrink the exposure of unknown high-risk components early, the company's own HTTP response traffic is enough. One more point worth mentioning: for pages rendered dynamically by JavaScript, we strongly recommend Chrome's headless mode. It is not the most efficient method, but it is adequate at company scale, and it also improves accuracy considerably.
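As a rough sketch of how Chrome's headless mode can be driven from a script, the snippet below builds a command line that either prints the rendered DOM or writes a screenshot. The flags `--headless`, `--dump-dom` and `--screenshot` are Chrome command-line switches; the executable name `google-chrome` is an assumption that varies by platform and install, and the helper names are hypothetical.

```python
import subprocess


def headless_command(url: str, binary: str = "google-chrome", screenshot_path=None) -> list:
    """Build a Chrome command line for headless rendering of one URL."""
    cmd = [binary, "--headless", "--disable-gpu"]
    if screenshot_path is None:
        cmd.append("--dump-dom")  # print the DOM after scripts have run
    else:
        cmd.append(f"--screenshot={screenshot_path}")  # save a PNG instead
    cmd.append(url)
    return cmd


def rendered_dom(url: str, timeout: float = 30.0) -> str:
    """Run Chrome and return the rendered DOM; requires Chrome to be installed."""
    result = subprocess.run(
        headless_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

The rendered DOM can then go through the same skeleton processing as a static response, and the screenshot variant serves the manual risk review.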
This article is only an exploration of automatic discovery of Web application components. Some of the practices are not yet mature, and some problems remain open. We welcome discussion with interested colleagues.
Finally, special thanks to Dong Xi yiiyi for their help in this exploration.
Appendix
1. Threat discovery under automated data analysis
2. Machine learning based Web Management background identification method exploration