Crawler Framework
2022-06-23 08:03:00 【Mysterious world】
You go online every day; have you ever wanted to collect useful data from the pages you visit? If so, take a look at the tool for the job: the web crawler.
What is a crawler?
- A crawler is a program that automatically fetches information from the Internet.
- Starting from one URL, it can visit every URL reachable from it, and from each page we can extract the data we need.
- Automatically discover the target data, automatically download it, automatically parse it, automatically store it.
So what is the value of crawler technology?
- Value: to exaggerate a little, "all the data on the Internet is mine to use"!
- Data gathered from across the Internet can then be put to a second, third, or essentially unlimited number of uses.
Simple crawler framework
Crawler scheduler
URL manager
Web page downloader (for example, urllib2)
Web page parser (for example, BeautifulSoup)
Data storage
Operation principle
1. First, the scheduler asks the URL manager: are there URLs left to crawl?
2. The URL manager replies to the scheduler: yes / no.
3. Next, the scheduler asks the URL manager for one URL to crawl.
4. The scheduler then hands that URL to the downloader: please download its content and return it to me.
5. The scheduler then hands the downloaded content to the parser: please parse it and return the result to me.
6. Finally, the scheduler hands the result to the data store: please save this data (see the loop sketched below).
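The six steps above map almost one to one onto a scheduler loop. A minimal sketch of that loop is shown below; the component classes (UrlManager, HtmlDownloader, HtmlParser, HtmlOutputer) and their method names are assumptions used for illustration, not the exact code of this post.

class SpiderMain(object):
    """Crawler scheduler: wires the other components together (sketch)."""

    def __init__(self):
        self.urls = UrlManager()            # URL manager (assumed class, sketched later)
        self.downloader = HtmlDownloader()  # web page downloader (assumed class)
        self.parser = HtmlParser()          # web page parser (assumed class)
        self.outputer = HtmlOutputer()      # data store / HTML generator (assumed class)

    def craw(self, root_url):
        self.urls.add_new_url(root_url)
        # Steps 1-2: keep going while the URL manager still has uncrawled URLs
        while self.urls.has_new_url():
            new_url = self.urls.get_new_url()              # step 3: take one URL
            html_cont = self.downloader.download(new_url)  # step 4: download it
            new_urls, new_data = self.parser.parse(new_url, html_cont)  # step 5: parse it
            self.urls.add_new_urls(new_urls)               # feed newly found URLs back
            self.outputer.collect_data(new_data)           # step 6: store the parsed data
        self.outputer.output_html()                        # write everything out at the end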
A simple Python crawler framework (partial code)
Crawler scheduler
Principle
- A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages, and, while fetching pages, keeps extracting new URLs and putting them into the queue until one of the system's stop conditions is met.
- The workflow of a focused crawler is more complex: according to a page-analysis algorithm it filters out links that are irrelevant to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled. It then selects the next URL to fetch from the queue according to some search strategy, and repeats the process until a stop condition of the system is reached.
- In addition, all pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later queries and retrieval; for a focused crawler, the results of this analysis may also feed back into and guide the later crawling process.
URL Manager
The URL manager needs to support the following operations (a sketch follows the list):
- Add a single URL
- Check whether there are URLs that have not been crawled yet
- Fetch one uncrawled URL
- Add a batch of URLs
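A minimal in-memory implementation of these four operations might look like the following; the class and method names match the ones assumed in the scheduler sketch above and are illustrative rather than the post's exact code.

class UrlManager(object):
    """URL manager: tracks which URLs are waiting to be crawled and which are done."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Add a single URL, skipping duplicates and already-crawled URLs
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Add a batch of URLs
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        # Are there uncrawled URLs left?
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Fetch one uncrawled URL and mark it as crawled
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url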
Web page downloader (urllib2)
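The downloader only has to turn a URL into the raw HTML of the page. A minimal sketch using urllib2 is shown below; note that urllib2 exists only in Python 2, and under Python 3 the same functionality lives in urllib.request.

import urllib2  # Python 2; on Python 3 use: from urllib import request as urllib2


class HtmlDownloader(object):
    """Web page downloader: fetches the raw HTML for a URL (sketch)."""

    def download(self, url):
        if url is None:
            return None
        response = urllib2.urlopen(url)
        if response.getcode() != 200:
            return None  # only accept successful responses
        return response.read()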
Fetching strategies
Depth-first traversal strategy
- The crawler starts from the start page and follows one chain of links, link after link; after finishing that chain it moves to the next start page and continues following links.
Breadth-first traversal strategy
- The basic idea is to append links found in a newly downloaded page directly to the end of the queue of URLs to be crawled. In other words, the crawler first fetches all the pages linked from the start page, then picks one of those linked pages and crawls all the pages linked from it, and so on (a sketch contrasting the two strategies follows).
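Here is a small sketch contrasting the two traversal orders with a single frontier structure: popping from the left of the deque gives breadth-first order, popping from the right gives depth-first order. The get_links function is an assumed placeholder that returns the outgoing links of a page.

from collections import deque


def crawl(start_url, get_links, depth_first=False, max_pages=100):
    # get_links(url) is assumed to return the list of URLs linked from url
    frontier = deque([start_url])
    seen = {start_url}
    order = []
    while frontier and len(order) < max_pages:
        # Depth-first: take the most recently added URL; breadth-first: the oldest one
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order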
Backlink count strategy
- The backlink count of a page is the number of other pages that link to it, and it reflects how strongly a page's content is recommended by others. Search-engine crawling systems therefore often use this metric to estimate the importance of pages and decide the order in which they are fetched (a short sketch follows).
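A sketch of the bookkeeping this implies: keep an inbound-link counter for every URL discovered so far and always fetch the uncrawled URL with the highest count. Again, get_links is an assumed placeholder.

from collections import defaultdict


def backlink_crawl(start_url, get_links, max_pages=100):
    # Fetch pages in order of how many already-seen pages link to them (sketch)
    backlinks = defaultdict(int, {start_url: 0})  # URL -> number of known inbound links
    crawled = set()
    order = []
    while len(order) < max_pages:
        pending = [u for u in backlinks if u not in crawled]
        if not pending:
            break
        url = max(pending, key=lambda u: backlinks[u])  # most-recommended page first
        crawled.add(url)
        order.append(url)
        for link in get_links(url):
            backlinks[link] += 1  # each discovered link raises the target's count
    return order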
Partial PageRank strategy
- This strategy borrows the idea of the PageRank algorithm: the pages already downloaded, together with the URLs in the queue to be crawled, form one set of pages; a PageRank value is computed for every page in the set; the URLs in the queue are then ordered by their PageRank values, and pages are fetched in that order.
OPIC strategy
- This algorithm effectively scores the importance of pages. Before it starts, every page is given the same amount of initial cash. When a page P is downloaded, P's cash is distributed among all the links parsed out of P, and P's own cash is cleared. All pages in the queue of URLs to be crawled are then sorted by their cash amount (a short sketch follows).
Site priority strategy
- All pages in the queue of URLs to be crawled are grouped by the website they belong to, and websites with the largest number of pages waiting to be downloaded are fetched first. This is therefore also called the big-site-first strategy.
This example uses the depth-first traversal strategy.
Web page parser (BeautifulSoup)
The question to consider: how should the pages and the data be analyzed and filtered?
- This example extracts the data by locating nodes by tag and class in the downloaded page; the URL-extraction half of the parser is sketched after the snippet below.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_cont, 'html.parser')  # html_cont holds the downloaded page
res_data = {}
# <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1>
title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
res_data['title'] = title_node.get_text()
# <div class="lemma-summary" label-module="lemmaSummary">
summary_node = soup.find("div", class_="lemma-summary")
res_data['summary'] = summary_node.get_text()
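In this framework the parser also has to hand newly discovered URLs back to the URL manager. A hedged sketch of that half of the parser is shown below; the /item/ link pattern is an assumption about the encyclopedia-style pages these selectors appear to target and would need adjusting for other sites.

import re
try:
    from urlparse import urljoin      # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3


def get_new_urls(page_url, soup):
    # Collect links that look like entry pages (assumed /item/... pattern)
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r'/item/')):
        new_full_url = urljoin(page_url, link['href'])  # make relative links absolute
        new_urls.add(new_full_url)
    return new_urls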
Data storage
- The data store is essentially a container for the records extracted from downloaded pages, and it also provides the source data from which indexes are built. Medium and large database products include Oracle, SQL Server, and so on.
- Here we simply generate an HTML file.
def output_html(self):
    # Write the collected records out as a simple HTML table
    # (Python 2 style I/O; under Python 3, open the file with encoding='utf-8' and drop the .encode() calls)
    fout = open('output.html', 'w')
    fout.write("<html><meta charset=\"utf-8\" />")
    fout.write("<body>")
    fout.write("<table>")
    for data in self.datas:
        fout.write("<tr>")
        fout.write("<td>%s</td>" % data['url'])
        fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
        fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))
        fout.write("</tr>")
    fout.write("</table>")
    fout.write("</body>")
    fout.write("</html>")
    fout.close()
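output_html iterates over a self.datas list that must have been filled while crawling. A minimal sketch of the rest of that class, matching the collect_data call assumed in the scheduler sketch above, could look like this:

class HtmlOutputer(object):
    def __init__(self):
        self.datas = []  # the list that output_html iterates over

    def collect_data(self, data):
        # Collect one parsed record (a dict with 'url', 'title', 'summary')
        if data is None:
            return
        self.datas.append(data)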
Trial run summary
- This is only a simple crawler; to use it in practice you still have to deal with logins, CAPTCHAs, Ajax, server-side restrictions, and other issues, which will be shared next time.
(Screenshot of the trial run results.)