
Exploration of web application component automatic discovery

2022-06-24 03:51:00 Tencent Security Emergency Response Center

Introduction

When Web fingerprinting is mentioned, most readers are already familiar with it: there are plenty of related projects, the best known being Wappalyzer, WhatWeb, and so on. Operationally, enterprises all do much the same thing: build a fingerprint database in advance, then scan the company's global assets to produce a complete inventory, so that when some component is hit by a high-risk vulnerability, the attack surface can be converged rapidly at the first opportunity.

But the Web application components this article discusses are not the small components of fine-grained fingerprinting, such as jQuery or Bootstrap; they refer rather to the relatively independent application that the site itself runs. For example, if a site is built with phpMyAdmin, then even though it uses jQuery, we still classify its Web application component as phpMyAdmin.

Why define it this way? Mainly to address the external exposure of high-risk Web application components. For a long time we found in operation that some business teams used components such as phpMyAdmin, Kibana, or Elasticsearch and opened them directly to the public network. These components may carry weak passwords or missing authorization, which has led to security incidents. To turn passivity into initiative, we worked with the application security team to maintain a company-wide blacklist of high-risk Web application components, recommending that businesses avoid such components as much as possible, and in particular avoid exposing them on the public network.

As everyone knows, every new measure must be accompanied by continuous verification, otherwise it has no effect. Operationally, we monitor business use of high-risk Web application components through active discovery by the company's self-developed scanner Dongxi ("hole rhinoceros") and passive discovery by the Aegis traffic analysis system, and the results have been good. This part is not the focus of this article, so I will not elaborate.

Ask the Canal How It Stays So Clear

When building the Web application component blacklist, we first included the components with notorious high-risk histories. Later, in actual operation, we ran into a thorny problem: where do blacklist entries come from? At first they were extracted from external security incidents; second, new Web application components were spotted in security advisories; third, the AI policy for identifying management backends went online (see "Exploration of a machine-learning-based Web management backend identification method"), although that policy mostly identifies management backends with login behavior. To make a long story short, these approaches only solve part of the problem.

So comprehensively solving the source problem is the key to operating the Web application component blacklist well. There are of course some obvious approaches, such as finding components with high star counts on GitHub, or learning each business line's common components through research. But without exception these measures are relatively costly and not universal. So how can we find a cheaper, more automated solution?

Knowledge from Paper Is Always Shallow

Solving the source problem of the blacklist can be described simply as three steps: discover new Web components -> determine the component identification rules -> determine whether the component is high-risk.

1、Discovering new Web components:

The first priority is discovering new Web components. Considering where Web components are distributed, we first thought of crawling components with high star counts on GitHub and then analyzing them, but this method is clearly not cheap, and worse, the mixed quality of projects makes it hard to even get the work started. So we shifted the focus to after a Web component is deployed, that is, after it starts serving externally.

Under the definition of Web application components given above, we reasoned that the response pages of the same Web application component are similar: for example, the home page responses of different versions of phpMyAdmin are in fact alike. So, obviously, similar pages can be clustered with a page-similarity algorithm, and similar pages can basically be judged to be similar components. To this end we downloaded the publicly exposed HTTP response data of Web applications on some ports across the whole Internet for testing. In actual testing, the biggest problem is cost: with pairwise similarity comparison, 10,000 pages already mean on the order of fifty million comparisons, a cost that grows quadratically with the number of pages.
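To make that cost concrete, here is a minimal sketch of the naive pairwise approach described above (our own illustration, not code from the article; the similarity threshold is an arbitrary assumption):

```python
import difflib
from itertools import combinations

def naive_pairs(pages, threshold=0.9):
    # Pairwise comparison is O(n^2): for 10,000 pages this loop runs
    # ~50 million times, which is exactly the cost problem above.
    similar = []
    for (i, a), (j, b) in combinations(enumerate(pages), 2):
        if difflib.SequenceMatcher(None, a, b).ratio() >= threshold:
            similar.append((i, j))
    return similar
```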

The thinking was sound; the core problem now was how to reduce that cost. I will skip the detailed reasoning process and go straight to the solution: assign each page several deterministic values, then judge how similar two pages are by how many of those deterministic values they share. Let me explain exactly what these deterministic values are and how to obtain them. In general a response contains a header and a body; first, we process the body as follows:

After these steps we obtain the set of tags shown in the figure (order included). A brief explanation: the main idea of the steps is to handle tags that are often touched during secondary development, deleting or merging them; the meta information in WordPress, for example, is frequently customized.
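The exact normalization steps live in the article's figure, so the following is only a plausible sketch: it keeps start tags in order, drops tags that are commonly customized (the VOLATILE_TAGS set is our assumption, modeled on the WordPress meta example), and merges adjacent duplicates:

```python
from html.parser import HTMLParser

# Tags frequently modified during secondary development -- an assumed
# set, modeled on the article's WordPress <meta> example.
VOLATILE_TAGS = {"meta", "title", "style", "script"}

class SkeletonParser(HTMLParser):
    """Reduce an HTML body to its ordered tag skeleton."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOLATILE_TAGS:
            self.tags.append(tag)

def skeleton(body: str) -> str:
    p = SkeletonParser()
    p.feed(body)
    # Merge runs of identical tags (e.g. long <li> lists) so that
    # differences in content volume do not change the skeleton.
    merged = [t for i, t in enumerate(p.tags) if i == 0 or t != p.tags[i - 1]]
    return ",".join(merged)
```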

Having obtained these tags, we call the result the page skeleton. Our analysis found it can be divided into three segments, so I simply split it into three. The reason for this division is that many skeletons of the same kind of Web application component are not exactly identical after processing (for example, across different versions of an application), yet the upper and lower segments match while the middle differs. After the division, we can compute the MD5 of the upper segment, of the lower segment, and of the whole skeleton; these MD5 values are the deterministic values. Page aggregation can then be performed to discover similar components, with three possible outcomes: fully identical, one segment matching, or two segments matching. One advantage of the three-segment scheme is that it supports tiered operation.
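A minimal sketch of computing the three deterministic values (the head/tail split ratios here are illustrative assumptions; the article derives the split points from experience with the skeleton's total length):

```python
import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def skeleton_fingerprint(skel: str, head_ratio=0.2, tail_ratio=0.2):
    # Three deterministic values: MD5 of the upper segment, of the
    # lower segment, and of the whole skeleton.
    n = len(skel)
    return {
        "head_md5": md5(skel[: int(n * head_ratio)]),
        "tail_md5": md5(skel[n - int(n * tail_ratio):]),
        "full_md5": md5(skel),
    }
```

Two pages with equal full_md5 are treated as fully identical; equal head_md5 and tail_md5 with a differing full_md5 suggests the same component with a customized middle segment.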

At this point readers may wonder how the skeleton is divided into three segments. In short, there is no fixed formula; it is mainly experience. We set the split points based on the total character length of the skeleton: for example, if the total length is between 5000 and 6000, the three segments might be 0-1000, 1000-4xxx, and 4xxx-5xxx. In some cases there is no need to split at all, because the MD5 of the whole skeleton already matches exactly. That covers only one situation, though. Another is that the response body is plain JSON with no HTML tags; in that case we usually set every value in the JSON to null and then compare. Page responses are in fact quite varied: besides JSON there are XML, pure strings, and so on, each needing additional special handling, which I have simplified into the figure. Combined with information such as the title, we then obtain N sets, each set containing some number of pages.
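The article only says that JSON values are set to null before comparison; here is a minimal sketch of that normalization (sorting keys is our own choice, made so that key order does not affect the result):

```python
import json

def null_values(obj):
    # Recursively replace every value with null, keeping only the key
    # structure, so two responses from the same component compare
    # equal regardless of the data they happen to contain.
    if isinstance(obj, dict):
        return {k: null_values(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [null_values(v) for v in obj]
    return None

def json_skeleton(body: str) -> str:
    return json.dumps(null_values(json.loads(body)), sort_keys=True)
```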

2、Determining the component identification rules:

With the first problem solved, the answer to the second is within reach. From the analysis above, aren't the MD5 values of the page skeleton exactly the rules for identifying similar components? That is indeed an answer, but compared with a traditional fingerprint database, where fingerprints are usually regular expressions, it is not a perfect one. What we really need is a way to automatically generate regular expressions (a plain string is also a kind of expression), so that the rules can be applied to passive traffic-monitoring systems. We have tried many approaches, but so far none solves this perfectly. Initially we wanted to extract two kinds of information from a page (tag + text combinations, and tag attribute + attribute value combinations, as in the figure) as string fingerprints, judge a fingerprint's discriminative power by whether it matches set A but no other set, and finally deconstruct those strings into regular expressions. That scheme proved very cumbersome, so we dropped it. We also tried matching against preset fingerprint features that appear frequently, such as the Server field in the header, but that cannot solve the problem perfectly either. The MD5-value approach, of course, remains workable.
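For illustration, a sketch of the "matches set A but no other set" filter (the candidate-string extraction itself is omitted, and the data layout is our assumption):

```python
def discriminative_fingerprints(clusters):
    # clusters: {cluster_id: [set of candidate strings per page, ...]}
    # A candidate (e.g. a tag+text or attribute+value pair) is usable
    # for cluster A if every page in A contains it and no page in any
    # other cluster does.
    result = {}
    for cid, pages in clusters.items():
        common = set.intersection(*pages) if pages else set()
        others = set().union(*(s for oc, ps in clusters.items()
                               if oc != cid for s in ps))
        result[cid] = common - others
    return result
```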

3、Determining whether a component is high-risk:

The third problem is determining a component's risk. This still relies mainly on manual analysis, though some automation can assist: one approach is to judge risk via the AI policy that identifies management backends; another is to capture screenshots with a headless browser to make manual risk assessment more efficient.
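A minimal sketch of the screenshot step using headless Chrome from the command line (the binary name and timeout are assumptions; --headless, --screenshot, and --disable-gpu are standard Chrome flags):

```python
import subprocess

def snapshot(url: str, out_png: str, chrome: str = "google-chrome"):
    # Render the page and save a screenshot for manual risk review.
    subprocess.run(
        [chrome, "--headless", "--disable-gpu",
         f"--screenshot={out_png}", "--window-size=1280,800", url],
        timeout=30, check=True,
    )
```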

Visually, our eyes sometimes deceive us, but the page's code does not. The program classified the two sites in the figures below as similar components. We initially took this for a false positive, but in-depth analysis showed that the two sites do in fact use the same framework.

Conclusion

The dataset used above is HTTP response data from public-network IPs. Using it directly to discover new high-risk components for the company would be overkill; this data is more useful for understanding the development trends and proportions of open-source Web application components worldwide. If the goal is to survey the company's current state and converge unknown high-risk components early, the company-wide HTTP response traffic is all that is needed. One more point worth mentioning: for JS-rendered dynamic pages in responses, headless Chrome is strongly recommended. It is not the most efficient method, but it is sufficient at company scale, and accuracy improves considerably.
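For completeness, a sketch of fetching the post-JavaScript DOM with headless Chrome, so that the skeleton extraction sketched earlier sees what a browser sees (binary name and timeout are assumptions; --dump-dom is a standard Chrome flag):

```python
import subprocess

def rendered_dom(url: str, chrome: str = "google-chrome") -> str:
    # Return the DOM after JavaScript execution, suitable as input to
    # the skeleton() function sketched earlier.
    out = subprocess.run(
        [chrome, "--headless", "--disable-gpu", "--dump-dom", url],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return out.stdout
```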

This article is only an exploration of the automatic discovery of Web application components. Some of the practices are not yet mature, and some issues remain open. Colleagues with an interest in the problem are welcome to reach out and exchange ideas!

Finally, special thanks to Dongxi and yiiyi for their help in this exploration.

Appendix

1、 Threat discovery under automated data analysis

2、Exploration of a machine-learning-based Web management backend identification method
