Python crawls Baidu Wenku text and writes it into a Word document
2022-06-27 19:52:00 【Beidao end Lane】
Introduction
The script only supports crawling Word-type documents on Baidu Wenku; the extracted text is written into a Word document or a plain text file (.txt). The main tool is Python's requests library.
requests is a popular, convenient, and practical request library in the Python crawler toolbox; the standard-library urllib package is also widely used. Beyond request libraries, the Python crawler ecosystem also offers the parsing libraries lxml and Beautiful Soup, as well as the crawler framework Scrapy.
Request URL
This section shows how to use request headers and how to crawl the document page by page. In general, the headers only need to contain User-Agent.
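In practice, a request that carries nothing but a User-Agent header often already returns a usable page. The sketch below shows that minimal pattern with a placeholder URL (example.com is not the real target); the get_url method that follows sends the fuller header set copied from a browser, which looks closer to real browser traffic.

import requests

session = requests.Session()  # a Session reuses connections and keeps cookies between requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # often the only header needed
response = session.get('https://example.com', headers=headers)  # placeholder URL for illustration
print(response.status_code)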
def get_url(self):
    url = input("Please enter the Baidu Wenku document URL: ")
    headers = {
        # Content types the client will accept
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Encodings the browser supports
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages the browser accepts
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Browser caching policy
        'Cache-Control': 'max-age=0',
        # Keep the connection open for subsequent requests until one side closes it
        'Connection': 'keep-alive',
        # Domain name of the server
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Client identification (like an ID card)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # Extract the embedded "json" list of page descriptors from the HTML
    json_data = re.findall('"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    # print(json_data)
    for index, page_load_urls in enumerate(json_data):
        # print(page_load_urls)
        page_load_url = page_load_urls['pageLoadUrl']
        # print(index)
        self.get_data(index, page_load_url)
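The re.findall call above pulls the embedded "json" list of page descriptors out of the page HTML; each entry carries a pageLoadUrl pointing at one page of the document. A self-contained sketch of that extraction step, using a made-up sample string in place of response.text (the real page embeds many more fields):

import re
import json

# Made-up stand-in for response.text; only the "json" list matters here.
html = ('... "json":[{"pageIndex":1,"pageLoadUrl":"https://wkbjcloudbos.bdimg.com/doc?pn=1"},'
        '{"pageIndex":2,"pageLoadUrl":"https://wkbjcloudbos.bdimg.com/doc?pn=2"}] ...')

pages = json.loads(re.findall('"json":(.*?}])', html)[0])
for index, page in enumerate(pages):
    print(index, page['pageLoadUrl'])  # these URLs are what get passed to get_data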
Crawl data
Parse the server response for each page and write the extracted document text to the output file. You can also change the .docx in with open('Baidu Wenku.docx', 'a', encoding='utf-8') to .txt to write a plain text file instead; note that the writing step does not add its own line breaks (a sketch of producing a real Word document with python-docx follows the function below).
def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # print(response.content.decode('unicode_escape'))
    data = response.content.decode('unicode_escape')
    # Each page is returned as a JSONP callback named wenku_<page number>
    command = 'wenku_' + str(index + 1)
    json_data = re.findall(command + r"\((.*?}})\)", data)[0]
    # print(json_data)
    json_data = json.loads(json_data)
    result = []
    for i in json_data['body']:
        data = i["c"]
        # print(data)
        result.append(data)
    print(''.join(result).replace(' ', '\n'))
    print("")
    with open('Baidu Wenku.docx', 'a', encoding='utf-8') as f:
        f.write('')
        f.write(''.join(result).replace(' ', '\n'))
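Note that open('Baidu Wenku.docx', 'a', encoding='utf-8') writes plain text that merely carries a .docx extension; Word may still open it, but it is not a genuine Word document. To produce a real .docx you could collect the page texts and pass them to the python-docx package instead. A minimal sketch, assuming python-docx is installed (pip install python-docx); the save_as_docx helper and the placeholder page texts are illustrative, not part of the original script:

from docx import Document  # provided by the python-docx package

def save_as_docx(page_texts, path='Baidu Wenku.docx'):
    # Write each page's text as its own paragraph in a real Word document.
    document = Document()
    for text in page_texts:
        document.add_paragraph(text)
    document.save(path)

# Example usage with placeholder content:
save_as_docx(['Page 1 text', 'Page 2 text'])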
Complete code
import requests
import re
import json


class WenKu():
    def __init__(self):
        self.session = requests.Session()

    def get_url(self):
        url = input("Please enter the Baidu Wenku document URL: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        json_data = re.findall('"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        # print(json_data)
        for index, page_load_urls in enumerate(json_data):
            # print(page_load_urls)
            page_load_url = page_load_urls['pageLoadUrl']
            # print(index)
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # print(response.content.decode('unicode_escape'))
        data = response.content.decode('unicode_escape')
        command = 'wenku_' + str(index + 1)
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        # print(json_data)
        json_data = json.loads(json_data)
        result = []
        for i in json_data['body']:
            data = i["c"]
            # print(data)
            result.append(data)
        print(''.join(result).replace(' ', '\n'))
        print("")
        with open('Baidu Wenku.docx', 'a', encoding='utf-8') as f:
            f.write('')
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()