An approach to running a Scrapy crawler on a schedule
2022-07-25 07:09:00 【Fan zhidu】
Scrapy crawler scheduling setup
Set up a while loop in the startup script. Two things on disk are involved: a directory that stores the data the crawler needs to resume an interrupted crawl (the JOBDIR state), and a flag file that indicates whether the crawler is currently running.

On each iteration, check whether the flag file (file 2) exists.
If it does not exist, use isExsit = os.path.isdir(file 1) to check for the resume directory (file 1):
- if the directory exists, delete it and all files inside it with shutil.rmtree;
- if it does not exist, print that there is no resume data.
Then start the crawler with cmdline.execute.
If the flag file does exist, print that the crawler is running.
Pause 10 seconds per iteration and accumulate the elapsed wait time in a counter; once the counter exceeds the limit, break out of the while loop.
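For this scheme to work, something on the spider side has to create the flag file when the crawl starts and delete it when the crawl finishes; the original post does not show that part. Below is a minimal, hypothetical sketch of such flag helpers (stdlib only; `mark_running` and `mark_stopped` are names I am assuming, and in a real project they would be wired to Scrapy's spider_opened / spider_closed signals):

```python
import os

def mark_running(check_file):
    # Create the flag file to signal that the crawler is running
    open(check_file, "w").close()

def mark_stopped(check_file):
    # Remove the flag file to signal that the crawler has finished
    if os.path.isfile(check_file):
        os.remove(check_file)
```

Calling these from the spider's open/close handlers keeps the scheduler loop's os.path.isfile(checkFile) check accurate.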
from scrapy import cmdline
import datetime
import time
import shutil
import os

# Crawler task timing settings
# Directory the crawler uses to resume crawling (JOBDIR state)
recoderDir = r"C:/Users/stawind/Desktop/spider/cninfospider1"
# Flag file that indicates whether the crawler is running
checkFile = "C:/Users/stawind/Desktop/spider/isRunning.txt"

startTime = datetime.datetime.now()
print(f"startTime={startTime}")

i = 0
moniter = 0
while True:
    isRunning = os.path.isfile(checkFile)
    if not isRunning:
        # Before the crawler starts, clean up the resume state (JOBDIR = crawls)
        isExsit = os.path.isdir(recoderDir)
        print(f"cninfospider not running, ready to start. isExsit:{isExsit}")
        if isExsit:
            # Delete the resume directory and all files in it
            # (shutil.rmtree returns None)
            removeRes = shutil.rmtree(recoderDir)
            print(f"At time:{datetime.datetime.now()}, delete res:{removeRes}")
        else:
            print(f"At time:{datetime.datetime.now()}, Dir:{recoderDir} does not exist.")
        time.sleep(20)
        clawerTime = datetime.datetime.now()
        waitTime = clawerTime - startTime
        print(f"At time:{clawerTime}, start crawler: mySpider !!!, waitTime:{waitTime}")
        cmdline.execute('scrapy crawl cninfospider1 -s JOBDIR=C:/Users/stawind/Desktop/spider/cninfospider1/storeMyRequest'.split())
        break  # exit the script after the crawl is launched
    else:
        print(f"At time:{datetime.datetime.now()}, mySpider is running, sleep to wait.")
        i += 1
        time.sleep(10)
        moniter += 10  # seconds waited so far
        if moniter >= 1440:
            break
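Note that scrapy.cmdline.execute() does not return normally: when the crawl command finishes it calls sys.exit(), so the script above can only ever launch one crawl per process (hence the break right after it). If the goal is to genuinely re-run the crawler on a timer from one long-lived script, a sketch using subprocess instead is one option. This is my own illustrative alternative, not part of the original post; run_periodically is a hypothetical helper, and the commented-out invocation reuses the spider name and JOBDIR path from the script above:

```python
import subprocess
import time

def run_periodically(cmd, interval_seconds, max_runs):
    # Run the crawl command in a child process max_runs times,
    # sleeping interval_seconds between runs; return the exit codes.
    codes = []
    for _ in range(max_runs):
        result = subprocess.run(cmd)  # blocks until the child process finishes
        codes.append(result.returncode)
        time.sleep(interval_seconds)
    return codes

# Example invocation (assumes the spider from the script above):
# run_periodically(
#     ["scrapy", "crawl", "cninfospider1",
#      "-s", "JOBDIR=C:/Users/stawind/Desktop/spider/cninfospider1/storeMyRequest"],
#     interval_seconds=600, max_runs=10)
```

Because each crawl runs in its own child process, the parent loop survives the crawl's exit and can start the next run on schedule.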