Go crawler framework colly in practice (Part 1)

2022-06-25 00:17:00 You're like an ironclad treasure

Original link: Hzy Blog

1. Some grumbling

I plan to write crawlers in Go over the next few days. Until now I have always written them in Python, and doing that day in, day out has become tedious. Since I have just picked up Go, I want to see how pleasant writing a crawler in Go can be.

Earlier I followed other people's ideas on GitHub and wrote a simple concurrent crawler framework of my own, which taught me a little about concurrency in Go. Then I stumbled across colly. Comparing it with what I had written, alas…

2. A brief introduction to using colly

GitHub: https://github.com/gocolly/colly

Official website: http://go-colly.org/

2.1 Introduction to colly

colly is a crawler framework. With it we can quickly build a concurrent crawler, and it is both easy to understand and easy to extend.

The core of colly is the Collector: the Collector fetches pages, collects the data from them, and stores it. (The style is procedural.)

2.1 Callbacks colly runs while fetching a page

  • Before the collector sends a request: OnRequest()
  • When a fetch fails: OnError()
  • After the collector receives a response: OnResponse()
  • When the response is HTML: OnHTML()
  • When the response is XML: OnXML()
  • The last callback, run after the collector has finished with a page: OnScraped()

With these callbacks we can write a crawler quickly. The official website also has many examples to refer to; if those are not enough, read the source code.

2.2 Collector configuration in colly

  • The full list of options is on the official website; only a few are mentioned here.
  • Restrict which domains may be crawled, cap the maximum depth, and choose whether already-visited URLs may be crawled again, which avoids infinite loops.
  • Enable asynchronous mode, set the number of concurrent requests, set a random delay, and so on.
  • Control whether HTTP keep-alive connections are used, limit the number of connections, and so on.
  • Distributed crawling is also supported.
  • Through extensions, we can also set a random User-Agent and the Referer header.

2.3 Storage in colly

  • By default, data is stored in memory.
  • The official website recommends storing it in redis.
  • It can also be stored in sqlite3 or mongo; the official website has examples of both.
  • colly-sqlite3 storage
  • colly-mongo storage

3. Wrapping up

Tomorrow I will write about using this framework to crawl the problems on LeetCode.
