Go crawler framework: colly in practice (I)
2022-06-25 00:17:00 【You're like an ironclad treasure】
Original link: Hzy Blog
1. A bit of grumbling
I have been planning to write crawlers in Go these days. I used to write them in Python, but writing the scheduling in Python had become a real pain, and since I had just picked up Go again, I wanted to experience the fun of writing a crawler in Go.
Earlier, following other people's ideas on GitHub, I wrote a simple concurrent crawler framework and learned a little about Go concurrency along the way. Then I stumbled across colly, and comparing it with what I had written myself, well, sigh...
2. A brief introduction to using colly
github: https://github.com/gocolly/colly
Official website: http://go-colly.org/
2.1 Introduction to colly
colly is a crawler framework. With it we can quickly implement a concurrent crawler, and it is both easy to understand and easy to extend.
The core of colly is the Collector: you use a Collector to gather data from the pages it visits and store it (in a procedural style).
2.1 colly callbacks during a page fetch
- Before the collector sends a request: OnRequest()
- When a fetch fails: OnError()
- After the collector receives a response: OnResponse()
- When the collector receives HTML: OnHTML()
- When the collector receives XML: OnXML()
- The last callback, run after the collector finishes fetching: OnScraped()
With these callbacks we can write a crawler quickly. The official website also has many examples for reference; if those are not enough, read the source code.
2.2 Configuring the Collector in colly
- See the official website for the full configuration details; only a few points are mentioned here.
- Crawl scope: restrict the allowed domains, limit the maximum depth, and choose whether URLs that were already crawled may be visited again, which avoids infinite loops.
- Set asynchronous mode, the number of concurrent requests, a random delay, and so on.
- Control whether HTTP keep-alive connections are used, limit the number of connections, and so on.
- Distributed crawling is also supported.
- Through extensions, we can also set a random User-Agent and the Referer header.
2.3 Storage in colly
- By default, data is stored in memory.
- The official website recommends storing it in redis.
- It can also be stored in sqlite3 or mongo; there are related examples on the official website.
- colly-sqlite3 storage
- colly-mongo storage
3. Conclusion
- If you want to know more, take a look at the article "Source code and software architecture analysis of the Go crawler framework colly", which covers colly's design structure.
- "Colly source code analysis: the underlying implementation, explained with examples" analyzes the main functions in the colly source code.
Tomorrow I will write about using this framework to crawl LeetCode problems.