Web crawler basics and a project
2022-06-22 05:42:00 【Amyniez】
1. The robots.txt protocol:
Specifies which data on a website may be crawled and which may not. It is a gentleman's agreement: nothing enforces it technically.
For example, open https://www.sohu.com/robots.txt to view that site's rules.
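A crawler can check these rules in code before fetching a page. The following is a minimal sketch using the standard library's urllib.robotparser; the sample rules and the example.com addresses are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of downloading a real file
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# A path that no rule forbids
print(parser.can_fetch("*", "https://example.com/index.html"))         # True
# A path under the disallowed /private/ prefix
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
```

For a live site, set_url() followed by read() on the same class downloads the file instead of parsing a local string.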
2. The HTTP protocol:
Concept: a form of interaction between a server and a client. It is similar to the coded passphrase exchanged between Yang Zirong and Zuoshandiao ("The Heavenly King covers the earth tiger", "The pagoda suppresses the river demon"): both sides follow a convention agreed in advance.
Common request headers:
- 1. User-Agent: identifies the carrier that sends the request (the user agent)
In Chrome, press F12, open the Network panel, press Ctrl+R to reload, and select a request; its Headers tab shows this information.
- 2. Connection: whether to close the connection or keep it open after the request completes
Common response headers:
- Content-Type: the type of the data the server sends back to the client
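To make these headers concrete, here is a small sketch that builds a request-header dictionary and splits a Content-Type value into its parts. The header values are sample data, and parse_content_type is a hypothetical helper written for this illustration; it relies only on the standard library's email.message module, so it runs without any network access.

```python
from email.message import Message

# Request headers a crawler might send. With the requests module they would be
# passed as requests.get(url, headers=request_headers). Sample values only.
request_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Connection": "close",  # or "keep-alive" to keep the connection open
}


def parse_content_type(value):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=utf-8")
print(media_type, charset)
```

The media type tells the crawler whether it received HTML, JSON, or binary data, and the charset (when present) tells it how to decode the body into text.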
3. The HTTPS protocol (the S stands for secure):
Secure Hypertext Transfer Protocol: HTTP with data encryption.
Encryption methods:
- 1. Symmetric key encryption
Disadvantage: the secret key may be intercepted by a third party, exposing the encrypted information.
- 2. Asymmetric key encryption
Disadvantages: low efficiency, which slows communication; the public key may be replaced by a third party.
- 3. Certificate-based key encryption
It ensures that the public key the client obtains really was generated by the server.
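On the client side, the certificate check is normally done by the TLS library rather than by the crawler's own code. As an illustration that is not part of the original tutorial, the sketch below inspects the default client settings of Python's standard ssl module: the server must present a certificate, and the certificate must match the host name being contacted.

```python
import ssl

# Default client-side TLS settings from the standard library
context = ssl.create_default_context()

# The server's certificate must be presented and validated
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
# The certificate must also match the requested host name
print(context.check_hostname)                    # True
```

These two checks are what stop a third party from silently substituting its own public key during the handshake.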
4. The requests module
A Python module for sending network requests. It is powerful, simple to use, and efficient; knowing requests is half of crawling.
Purpose: to simulate a browser sending a request.
Usage:
- 1. Specify the URL
- 2. Send the request
- 3. Get the response data
- 4. Store the data persistently
Installation:
pip install -i http://pypi.douban.com/simple --trusted-host pypi.douban.com requests

import requests

# Specify the URL
url = 'https://www.baidu.com/'
# Send the request
response = requests.get(url=url)  # get() returns a Response object
# Get the response data
pageText = response.text  # .text returns the response body as a string
# Persistent storage
with open('C:\\Users\\Administrator\\Desktop\\baiduPachong.html', 'w', encoding='utf-8') as fp:
    fp.write(pageText)
print("end")
5. NetEase Cloud Music crawler
import requests
import re  # regular expressions; part of the standard library, no installation required
import os  # file and directory operations

filename = 'music1\\'
if not os.path.exists(filename):
    os.mkdir(filename)
# Change the id to fetch a different chart.
# The browser address bar shows https://music.163.com/#/discover/toplist?id=3778678,
# but the part after '#' is a URL fragment that is never sent to the server,
# so the request uses the path without '#/'.
url = 'https://music.163.com/discover/toplist?id=3778678'
# The request headers disguise the Python code as a browser when it accesses the server.
# After the server receives the request, it returns the response data (response).
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# print(response.text)
# re.findall returns a list; each element is a (song id, title) tuple
html_data = re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text)
for num_id, title in html_data:
    # e.g. https://music.163.com/song/media/outer/url?id=1830419924.mp3
    # The f prefix makes Python substitute the song id for {num_id}
    music_url = f'https://music.163.com/song/media/outer/url?id={num_id}.mp3'
    # Request the playback address and get the binary content
    music_content = requests.get(url=music_url, headers=headers).content
    with open(filename + title + '.mp3', mode='wb') as f:
        f.write(music_content)
print(html_data)
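The extraction step can be tried offline before running the crawler against the live site. The sketch below applies the same regular expression to a hand-written HTML fragment; the fragment, the second song id, and the helpers extract_songs, safe_filename, and outer_url are hypothetical and exist only for this illustration. safe_filename is an added precaution, since a song title may contain characters that are not allowed in Windows file names.

```python
import re

# Hand-written sample that mimics the list markup the regular expression expects
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/song?id=1830419924">Song A</a></li>'
    '<li><a href="/song?id=12345">Song/B?</a></li>'
    '</ul>'
)


def extract_songs(html):
    """Return a list of (song id, title) tuples found in the HTML."""
    return re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', html)


def safe_filename(title):
    """Replace characters that are invalid in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', title)


def outer_url(song_id):
    """Build the playback address for a song id."""
    return f'https://music.163.com/song/media/outer/url?id={song_id}.mp3'


for song_id, title in extract_songs(SAMPLE_HTML):
    print(song_id, safe_filename(title), outer_url(song_id))
```

In the crawler itself, safe_filename(title) could replace the bare title when the output path is built, so that one unusual title does not abort the whole download loop.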