
Crawler Framework

2022-06-23 08:03:00 Mysterious world

You go online every day; have you ever wanted to collect the useful data you see there? If so, take a look at the tool for the job: the web crawler.

What is a crawler?

  • A crawler is a program that automatically grabs information from the Internet.
  • Starting from one URL, it can visit every URL reachable from it and extract the valuable data we need from each page.
  • It automatically discovers the target data, then automatically downloads, parses, and stores it.

So what is the value of crawler technology?

  • Its value, stated boldly: "Internet data, all for my use!"
  • Internet data can then be put to better second, third, or even unlimited further use.

Simple crawler framework

  • Crawler scheduler

  • URL manager

  • Web page downloader (for example, urllib2)

  • Web page parser (for example, BeautifulSoup)

  • Data storage

Operation principle

[Figure: how the scheduler, URL manager, downloader, parser, and storage interact]

1. First, the scheduler asks the URL manager: are there any URLs left to crawl?

2. The URL manager replies to the scheduler: yes or no.

3. Next, the scheduler asks the URL manager to hand over one URL to crawl.

4. The scheduler then tells the downloader: please download the content at that URL and return it.

5. The scheduler then passes the downloaded content to the parser: please parse it and return the results.

6. Finally, the scheduler hands the parsed data to the storage component: please save it. (A code sketch of this loop follows.)
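
Expressed as code, the scheduler's loop looks roughly like the sketch below. It is a hedged sketch mirroring steps 1 to 6 above; the component and method names (urls, downloader, parser, outputer, collect_data) are assumptions in the style of this kind of framework, since the article itself only shows fragments:

    def crawl(self, root_url, max_pages=50):
        # Seed the URL manager, then repeat steps 1-6 until no URLs
        # remain or a stop condition (here: a page limit) is reached.
        self.urls.add_new_url(root_url)
        count = 0
        while self.urls.has_new_url() and count < max_pages:
            url = self.urls.get_new_url()                  # steps 1-3
            html = self.downloader.download(url)           # step 4
            new_urls, data = self.parser.parse(url, html)  # step 5
            self.urls.add_new_urls(new_urls)
            self.outputer.collect_data(data)               # step 6
            count += 1
        self.outputer.output_html()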

A simple Python crawler framework program (excerpts)

Crawler scheduler

Principle

  • A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages, and, while crawling, keeps extracting new URLs from each page into a queue until a stop condition of the system is met.
  • The workflow of a focused crawler is more complex: a page-analysis algorithm must filter out links unrelated to the topic, keep the useful links, and put them into the queue of URLs waiting to be crawled. The crawler then selects the next page URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. (A minimal sketch of such a topic filter follows this list.)
  • In addition, all crawled pages are stored by the system, then analyzed, filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis can also feed back into and guide later crawling.
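
A minimal sketch of the focused-crawler filtering step described above, assuming a plain keyword match; the article does not specify a filter, so the function name and the keyword check are illustrative assumptions:

    def is_relevant(url, anchor_text, keywords):
        # Keep a link only if its URL or anchor text mentions the topic;
        # irrelevant links never enter the to-be-crawled queue.
        text = (url + ' ' + anchor_text).lower()
        return any(kw.lower() in text for kw in keywords)

    # Usage: only enqueue links that pass the filter.
    # if is_relevant(link, link_text, ['python', 'crawler']):
    #     url_manager.add_new_url(link)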

URL Manager

  • Add a single URL
  • Check whether the set still contains a new, un-crawled URL
  • Get a new URL to crawl
  • Add a batch of URLs
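
These four operations map naturally onto a class holding two sets, one for pending URLs and one for finished ones. A minimal sketch; the class and method names are assumptions in the common style of such frameworks, since the article only lists the operations:

    class UrlManager(object):
        def __init__(self):
            self.new_urls = set()  # URLs waiting to be crawled
            self.old_urls = set()  # URLs already crawled

        def add_new_url(self, url):
            # Add a single URL, skipping ones we have seen before.
            if url is None:
                return
            if url not in self.new_urls and url not in self.old_urls:
                self.new_urls.add(url)

        def add_new_urls(self, urls):
            # Add a batch of URLs.
            if urls is None:
                return
            for url in urls:
                self.add_new_url(url)

        def has_new_url(self):
            # Is there still an un-crawled URL?
            return len(self.new_urls) != 0

        def get_new_url(self):
            # Hand out one URL and mark it as crawled.
            url = self.new_urls.pop()
            self.old_urls.add(url)
            return url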

Web page downloader (urllib2)
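
Before the fetching strategies, a minimal download sketch. The article names urllib2, which is Python 2 only; the same API lives in urllib.request on Python 3, so this sketch tries both. The function shape and the error handling are assumptions:

    try:
        import urllib2 as request      # Python 2, as named in the article
    except ImportError:
        from urllib import request    # Python 3 equivalent

    def download(url):
        # Fetch the raw HTML; return None on any failure so the
        # scheduler can simply skip the page.
        if url is None:
            return None
        try:
            response = request.urlopen(url, timeout=10)
            if response.getcode() != 200:
                return None
            return response.read()
        except Exception:
            return None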

Fetching strategies

Depth-first traversal strategy

  • The crawler starts from the start page and follows links one after another down a single chain; once that chain is finished, it moves on to the next start page and continues following links.

Width-first (breadth-first) traversal strategy

  • The basic idea is to append the links found in a newly downloaded page directly to the end of the to-be-crawled URL queue. That is, the crawler first grabs all the pages linked from the starting page, then selects one of those linked pages and continues by crawling all the pages linked from it. A sketch contrasting the two traversals follows.
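
The two traversals above differ only in which end of the work list the next URL comes from. A minimal sketch, assuming hypothetical helper functions download and extract_links that are not part of the article's framework:

    from collections import deque

    def traverse(seed_url, download, extract_links, depth_first=False, max_pages=100):
        work = deque([seed_url])
        seen = {seed_url}
        fetched = 0
        while work and fetched < max_pages:
            # pop() takes from the end (depth-first, like a stack);
            # popleft() takes from the front (width-first, like a queue).
            url = work.pop() if depth_first else work.popleft()
            html = download(url)
            fetched += 1
            if html is None:
                continue
            for link in extract_links(html):
                if link not in seen:
                    seen.add(link)
                    work.append(link)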

Backlink-count strategy

  • A page's backlink count is the number of links from other pages that point to it, and it indicates how strongly the page's content is recommended by others. Many search-engine crawling systems therefore use this metric to evaluate the importance of a page and decide the order in which different pages are fetched. A counting sketch follows.
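
Counting backlinks from a known link graph is a simple inversion, sketched below; the dict-of-lists graph format is an assumption:

    from collections import Counter

    def backlink_counts(out_links):
        # out_links maps each page to the pages it links to; the
        # backlink count of a page is how many pages link TO it.
        counts = Counter()
        for page, targets in out_links.items():
            for target in targets:
                counts[target] += 1
        return counts

    # Pages with higher counts would be fetched first under this strategy.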

Partial PageRank strategy

  • This strategy borrows the idea of the PageRank algorithm: the pages already downloaded, together with the URLs in the to-be-crawled queue, form one collection of pages; a PageRank value is computed for every page in the collection; the URLs in the queue are then sorted by PageRank value and fetched in that order. A simplified sketch follows.
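
A deliberately simplified PageRank iteration over that combined set; the damping factor d = 0.85 is the conventional choice, and all names here are illustrative assumptions:

    def partial_pagerank(pages, out_links, iterations=10, d=0.85):
        # pages: downloaded pages plus queued URLs; out_links maps a
        # page to the pages it links to within this collection.
        n = len(pages)
        rank = dict.fromkeys(pages, 1.0 / n)
        for _ in range(iterations):
            new_rank = dict.fromkeys(pages, (1.0 - d) / n)
            for page, targets in out_links.items():
                if page not in rank or not targets:
                    continue
                share = d * rank[page] / len(targets)
                for target in targets:
                    if target in new_rank:
                        new_rank[target] += share
            rank = new_rank
        # Fetch queued URLs in descending PageRank order.
        return sorted(pages, key=rank.get, reverse=True)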

OPIC strategy (Online Page Importance Computation)

  • This algorithm, too, scores pages by importance. Before it starts, every page is given the same amount of initial cash. Whenever a page P is downloaded, P's cash is distributed among all the links parsed out of P, and P's own cash is cleared. The pages in the to-be-crawled URL queue are then sorted by their cash amount. A sketch of the cash update follows.
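
The cash update when one page is downloaded, as a minimal sketch; the dict-based bookkeeping is an assumption:

    def opic_update(cash, page, parsed_links):
        # Split the downloaded page's cash evenly among the links
        # parsed from it, then clear the page's own cash.
        if parsed_links:
            share = cash.get(page, 0.0) / len(parsed_links)
            for link in parsed_links:
                cash[link] = cash.get(link, 0.0) + share
        cash[page] = 0.0

    # The queue is then sorted so the URL with the most cash is fetched next:
    # next_url = max(queue, key=lambda u: cash.get(u, 0.0))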

Site-priority strategy

  • All pages in the to-be-crawled URL queue are grouped by the website they belong to; websites that have many pages waiting to be downloaded are downloaded first. For this reason it is also called the big-site-first strategy. A grouping sketch follows.
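
Grouping the queue by host and picking the busiest site, sketched with the standard urlparse helper (whose module moved from urlparse in Python 2 to urllib.parse in Python 3):

    from collections import defaultdict
    try:
        from urlparse import urlparse      # Python 2
    except ImportError:
        from urllib.parse import urlparse  # Python 3

    def pick_big_site(queue):
        # Group queued URLs by hostname, then return the URLs of the
        # site with the most pages waiting to be downloaded.
        by_site = defaultdict(list)
        for url in queue:
            by_site[urlparse(url).netloc].append(url)
        return max(by_site.values(), key=len)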

This example uses the depth-first traversal strategy.

Web page parser (BeautifulSoup)

A question to consider: how should the web pages, and the data in them, be parsed and filtered?

  • This example locates nodes by their tag names and class attributes and extracts data from them:

        # <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()

        # <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find("div", class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
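
For context, a hedged sketch of the setup around those extraction lines; the function shape and the 'html.parser' argument are assumptions, since the article shows only the two extractions. In the full framework the parser would also collect new links for the URL manager, which is omitted here:

    from bs4 import BeautifulSoup

    def parse_data(page_url, html_content):
        # Build the DOM tree from the downloaded HTML, then run the
        # tag/class extractions shown above.
        soup = BeautifulSoup(html_content, 'html.parser')
        res_data = {'url': page_url}
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()
        summary_node = soup.find("div", class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data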

Data storage

  • Data storage is mainly a container for the data records downloaded from the web, and it provides the source material for building an index. Medium and large database products for this include Oracle, SQL Server, and so on.
  • In this example we simply generate an HTML file:

    def output_html(self):
        # Write the collected records out as a simple UTF-8 HTML table
        # (Python 2 style, matching the urllib2-era code above).
        fout = open('output.html', 'w')

        fout.write("<html><meta charset=\"utf-8\" />")
        fout.write("<body>")
        fout.write("<table>")

        # One table row per record: URL, title, summary.
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))
            fout.write("</tr>")

        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")

        fout.close()

Trial-run summary

  • This is still only a simple crawler. To use it in practice, you also need to solve login, CAPTCHAs, Ajax, server-side anti-crawling, and other problems. Those will be shared next time.

The following figure shows the trial results.

[Figure: screenshot of the generated output.html with the crawled results]
