Crawler scheduling framework: scrapy + scrapyd + gerapy
2022-06-25 11:07:00 【AJ~】
One 、scrapy
1.1 Overview
Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl web sites and extract structured data from their pages. It has a wide range of uses and can be applied to data mining, monitoring, and automated testing.
It was originally designed for page scraping (more precisely, web crawling), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.
Scrapy's appeal is that it is a framework anyone can easily adapt to their own needs. It also provides base classes for many types of spiders, such as BaseSpider and sitemap spiders, and recent versions add support for crawling web 2.0 sites.
1.2 Components
The Scrapy framework consists mainly of five components: the Scheduler, the Downloader, the Spider, the Item Pipeline, and the Scrapy Engine. The function of each component is introduced below.
(1) Scheduler:
The scheduler is, to put it bluntly, a priority queue of URLs (the addresses or links to be crawled). It decides which URL to fetch next and removes duplicate URLs at the same time (so no work is wasted). Users can customize the scheduler according to their own requirements; a toy sketch of the idea follows.
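To make the idea concrete, here is a toy sketch of such a de-duplicating URL priority queue. It only illustrates the concept and is not Scrapy's actual scheduler code:

import heapq

class ToyScheduler:
    """A toy URL scheduler: a priority queue that skips URLs it has already seen."""

    def __init__(self):
        self._heap = []      # (priority, url) pairs; the smallest priority pops first
        self._seen = set()   # fingerprints used for de-duplication

    def enqueue(self, url, priority=0):
        if url in self._seen:    # duplicate request: do no useless work
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, url))
        return True

    def next_url(self):
        # Return the next URL to crawl, or None when the queue is empty
        return heapq.heappop(self._heap)[1] if self._heap else None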
(2) Downloader:
The downloader is the component that carries the heaviest load; it downloads resources from the network at high speed. Scrapy's downloader code is not very complicated, but it is efficient, mainly because it is built on twisted, an efficient asynchronous networking model (in fact, the whole framework is built on this model).
(3) Spider:
The spider is the part users care about most. Users write their own spiders (using selectors, regular expressions, and similar syntax) to extract the information they need from specific web pages, i.e. the so-called items (Item). Users can also extract links from a page so that Scrapy continues to crawl the next one, as the sketch below shows.
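As an illustration only, a minimal spider might look like this; the site and CSS selectors come from the public practice site quotes.toscrape.com, not from this article:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_demo"   # hypothetical spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link so Scrapy keeps crawling
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)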
(4) Item Pipeline:
The item pipeline processes the items extracted by the spiders. Its main functions are persisting items, validating them, and cleaning out unwanted information. A sketch follows.
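A minimal sketch of an item pipeline, assuming an item with a text field and JSON-lines output (the pipeline still has to be enabled in settings.py via ITEM_PIPELINES):

import json
from scrapy.exceptions import DropItem

class CleanAndSavePipeline:
    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Validate: drop items that are missing the required field
        if not item.get("text"):
            raise DropItem("missing text field")
        # Clean: strip unwanted surrounding whitespace
        item["text"] = item["text"].strip()
        # Persist: write one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item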
(5) Scrapy Engine:
The Scrapy engine is the core of the whole framework. It controls the scheduler, the downloader, and the spiders. In effect, the engine is the equivalent of a computer's CPU: it controls the whole process.
1.3 Installation and use
install
pip install scrapy( or pip3 install scrapy)
Use
Create a new project: scrapy startproject <project name>
Create a new spider: scrapy genspider <spider name> <domain name>
Start the spider: scrapy crawl <spider name>
Two 、scrapyd
2.1 Introduction
scrapyd is a service for deploying and running scrapy crawlers. It lets you deploy crawler projects and control spider runs through a JSON API. scrapyd runs as a daemon that listens for crawl requests and starts processes to execute them, as illustrated below.
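As an illustration of the JSON API, the sketch below uses the requests library to check the daemon, schedule a run, and list jobs. It assumes scrapyd is listening on localhost:6800 and that a project named gerapy_test with a spider baidu_test (the names used later in this article) has already been deployed:

import requests

SCRAPYD = "http://localhost:6800"

# Check that the scrapyd daemon is up
print(requests.get(SCRAPYD + "/daemonstatus.json").json())

# Schedule a spider run; scrapyd answers with a job id
resp = requests.post(SCRAPYD + "/schedule.json",
                     data={"project": "gerapy_test", "spider": "baidu_test"})
print(resp.json())

# List pending / running / finished jobs of the project
print(requests.get(SCRAPYD + "/listjobs.json",
                   params={"project": "gerapy_test"}).json())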
2.2 Installation and use
install
pip install scrapyd( or pip3 install scrapyd)
pip install scrapyd-client( or pip3 install scrapyd-client)
File configuration
vim /usr/local/python3/lib/python3.7/site-packages/scrapyd/default_scrapyd.conf
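The usual reason to edit this file is that recent scrapyd versions bind only to 127.0.0.1 by default, so the web page cannot be reached from other machines. The relevant options look roughly like this (the remaining defaults can stay unchanged):

[scrapyd]
# default is 127.0.0.1; 0.0.0.0 allows access from other machines
bind_address = 0.0.0.0
# the port used in the ip:6800 URL below
http_port = 6800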
Start
scrapyd
Visit ip:6800; if the scrapyd page appears, the startup succeeded.
Three 、gerapy
3.1 Introduction
Gerapy is a distributed crawler management framework. It supports Python 3 and is built on Scrapy, Scrapyd, Scrapyd-Client, Scrapy-Redis, Scrapyd-API, Scrapy-Splash, Jinja2, Django, and Vue.js. Gerapy can help us:
- Easily control the operation of the crawler
- Visually view crawler status
- View crawling results in real time
- Simply implement project deployment
- Unified host management
- Easily write crawler code
3.2 Installation and use
install
pip install gerapy( or pip3 install gerapy)
Set up a soft link after installation
ln -s /usr/local/python3/bin/gerapy /usr/bin/gerapy
initialization
gerapy init
Initialize database
cd gerapy
gerapy migrate
If gerapy migrate reports an error that the sqlite version is too low:
Solution: upgrade sqlite
download
wget https://www.sqlite.org/2019/sqlite-autoconf-3300000.tar.gz --no-check-certificate
tar -zxvf sqlite-autoconf-3300000.tar.gz
install
mkdir /opt/sqlite
cd sqlite-autoconf-3300000
./configure --prefix=/opt/sqlite
make && make install
Create soft links and configure the library path
mv /usr/bin/sqlite3 /usr/bin/sqlite3_old
ln -s /opt/sqlite/bin/sqlite3 /usr/bin/sqlite3
echo "/opt/sqlite/lib" > /etc/ld.so.conf.d/sqlite3.conf
ldconfig
vim ~/.bashrc and add: export LD_LIBRARY_PATH="/opt/sqlite/lib"
source ~/.bashrc
Check the current sqlite3 version
sqlite3 --version
Then re-run gerapy migrate to initialize the gerapy database.
Create the admin account (username and password)
gerapy createsuperuser
Start gerapy
gerapy runserver
gerapy runserver 0.0.0.0:9000 # listen on 0.0.0.0:9000 so the page can be reached externally
Because scrapyd has not been started yet, the number of hosts shown here is 0.
After starting scrapyd, configure the scrapyd host information in gerapy.
Once the configuration succeeds, the host is added to the host list.
Four 、Combined use of scrapy+scrapyd+gerapy
4.1 Create a scrapy project
Enter the gerapy projects directory
cd ~/gerapy/projects/
Then create a new scrapy project and spider
scrapy startproject gerapy_test
scrapy genspider baidu_test www.baidu.com
Modify scrapy.cfg as follows
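The key change is the [deploy] section, which tells scrapyd-deploy which scrapyd instance to upload to. Assuming scrapyd runs locally on port 6800, the file looks roughly like this (the deploy target app and the project name gerapy_test match the commands below):

[settings]
default = gerapy_test.settings

# "app" is the deploy target name used by the scrapyd-deploy command below
[deploy:app]
url = http://localhost:6800/
project = gerapy_test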
Then use scrapyd-deploy to upload the project to scrapyd; create a soft link first, then upload
ln -s /usr/local/python3/bin/scrapyd-deploy /usr/bin/scrapyd-deploy
scrapyd-deploy app -p gerapy_test
4.2 Package and deploy the scrapy project
Back on the gerapy page you can now see the new project; package it there.
Before running, the spider code needs to be modified (the template generated by genspider has an empty parse method); a minimal sketch is shown below.
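A minimal modification, just enough to verify that the crawl works; the title extraction and the two custom settings are only examples, not requirements of gerapy:

import scrapy

class BaiduTestSpider(scrapy.Spider):
    name = "baidu_test"
    allowed_domains = ["www.baidu.com"]
    start_urls = ["https://www.baidu.com/"]

    # Baidu tends to reject the default user agent and its robots.txt is restrictive,
    # so these per-spider settings are often needed for this simple test
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64)",
    }

    def parse(self, response):
        # Yield just the page title to confirm that crawling and parsing work
        yield {"title": response.css("title::text").get()}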
Run it after the modification.
4.3 Run
The run succeeds, so this deployment is OK!
Five 、Pitfalls
5.1 Running the scrapy spider reports an lzma import error
Solution: edit the source of the standard library lzma module (lzma.py) as shown below; the fallback requires the backports.lzma package to be installed (pip install backports.lzma).
try:
    from _lzma import *
    from _lzma import _encode_filter_properties, _decode_filter_properties
except ImportError:
    # Fall back to the backports.lzma package when the C extension _lzma is missing
    from backports.lzma import *
    from backports.lzma import _encode_filter_properties, _decode_filter_properties
5.2 scrapyd reports an error when running scrapy
Solution: downgrade the scrapy version: pip3 install scrapy==2.5.1