Sorting out common problems after crawler deployment
2022-06-23 06:06:00 【Python Yixi】
Once a crawler passes its local test run, it is tempting to deploy it to a server right away for production use. After running for a while, however, the program often starts throwing errors or even exits. Here are some common problems for reference:
1. Passing local debugging only shows that the pipeline from request to data parsing works. It does not show that the program can collect data stably over a long period. Test the target website automatically before deployment: run a stability test for a fixed number of requests or a fixed duration, and observe how the website responds and which anti-crawling measures it triggers.
2. The program needs exception handling around data processing. If the required data volume is small, a single thread is enough. If it is large, add multithreading to improve the program's processing performance.
3. Configure a suitable crawler proxy based on the data requirements and the behavior of the target website. This reduces the risk of triggering the website's anti-crawling measures. When comparing crawler proxy products, focus on network latency, IP pool size, and request success rate, so that you can quickly pick a suitable one.
Here is a demo program for counting requests and IP distribution. It can also be adapted into a data collection program as required:
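Points 1 and 2 above can be combined into a small stability harness. The following is a minimal sketch, not part of the original program: `timed_call`, `run_stability_test`, the `fetch` callable, and the keys of the returned dictionary are all hypothetical names chosen for illustration. It calls a user-supplied `fetch(url)` a fixed number of times in a thread pool, catches exceptions per request so that one failure does not stop the run, and reports the success rate and average latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fetch, url):
    """Call fetch(url) once; never raise, return (ok, elapsed_seconds, error)."""
    start = time.perf_counter()
    try:
        fetch(url)
        return True, time.perf_counter() - start, None
    except Exception as exc:
        # Exception protection: record the failure and let the run continue.
        return False, time.perf_counter() - start, repr(exc)


def run_stability_test(fetch, url, attempts=100, workers=1):
    """Run `attempts` requests on `workers` threads and summarize the outcome."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: timed_call(fetch, url), range(attempts)))
    successes = [r for r in results if r[0]]
    latencies = [r[1] for r in successes]
    return {
        "attempts": attempts,
        "success_rate": len(successes) / attempts if attempts else 0.0,
        "avg_latency": sum(latencies) / len(latencies) if latencies else None,
        "errors": [r[2] for r in results if not r[0]],
    }
```

With the `requests` library, `fetch` could be something like `lambda u: requests.get(u, proxies=proxies, timeout=10).raise_for_status()`. Running the same test against several proxy products, with `workers=1` first and a higher value afterwards, gives a rough comparison of latency and success rate before deployment.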
# -*- coding: utf-8 -*-
import random

import requests
import requests.adapters

# Target pages to visit
targetUrlList = [
    "https://",
    "https://",
    "https://",
]

# Proxy server (product website: h.shenlongip.com)
proxyHost = "h.shenlongip.com"
# Fill in the port assigned by the proxy provider
proxyPort = " "

# Proxy authentication information
proxyUser = "username"
proxyPass = "password"
proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

# Route both http and https requests through the HTTP proxy
proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}

# IP-switching header: requests with the same tunnel value keep the same exit IP
tunnel = random.randint(1, 10000)
headers = {"Proxy-Tunnel": str(tunnel)}


class HTTPAdapter(requests.adapters.HTTPAdapter):
    # Add the Proxy-Tunnel header to the headers sent to the proxy
    def proxy_headers(self, proxy):
        headers = super(HTTPAdapter, self).proxy_headers(proxy)
        if hasattr(self, 'tunnel'):
            headers['Proxy-Tunnel'] = self.tunnel
        return headers


# Visit the pages three times with the same tunnel value, which keeps the same external IP
for i in range(3):
    s = requests.session()
    a = HTTPAdapter()
    # Set the IP-switching header (header values must be strings)
    a.tunnel = str(tunnel)
    s.mount('https://', a)
    for url in targetUrlList:
        r = s.get(url, proxies=proxies)
        print(r.text)
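
The demo prints each response but does not itself tally anything. As a hedged sketch of the counting step, the function below records how many requests succeeded and which exit IPs were seen. `count_ip_distribution` and `fetch_ip` are hypothetical names, not part of the original program, and `fetch_ip` is assumed to return the exit IP as a string and to raise on any request error.

```python
from collections import Counter


def count_ip_distribution(fetch_ip, attempts):
    """Call fetch_ip() `attempts` times and tally exit IPs and failures."""
    ips = Counter()
    failures = 0
    for _ in range(attempts):
        try:
            ips[fetch_ip()] += 1
        except Exception:
            # Count the failure and keep going instead of exiting.
            failures += 1
    return {
        "requests": attempts,
        "failures": failures,
        "unique_ips": len(ips),
        "distribution": dict(ips),
    }
```

One possible `fetch_ip`, assuming a target that echoes the caller's IP as JSON in an `origin` field (as `https://httpbin.org/ip` does), is `lambda: s.get("https://httpbin.org/ip", proxies=proxies, timeout=10).json()["origin"]`. Drawing a new `tunnel` value for each session should then show up as a wider spread in `distribution`.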