PPT template crawler case
2022-06-26 06:13:00 【An Muxi】
Crawling PPT templates with Python
This case crawls the PPT templates listed at http://www.ypppt.com/moban/. The site has some anti-crawling measures in place, so the URLs need careful analysis before the templates can be downloaded correctly!
# -*- coding: utf-8 -*-
# @Time: 2020-08-13 16:43
# @Author: Have a bottle of anmuxi
# @File: Free resume crawling.py
# Start a good day
import os
import re

import requests
from lxml import etree

if __name__ == "__main__":
    url = 'http://www.ypppt.com/moban/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    # Create the folder that stores the downloaded PPT templates
    save_dir = './ppt_templates'
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    # Build the etree object
    tree = etree.HTML(page_text)
    # li_list holds the <li> element of every PPT template on the first page
    li_list = tree.xpath('//ul[@class="posts clear"]/li')
    # Parse each <li> and extract the template's detail-page URL and name
    for li in li_list:
        ppt_url = 'http://www.ypppt.com' + li.xpath('./a[1]/@href')[0]
        ppt_name = li.xpath('./a[2]/text()')[0]
        # print(ppt_url)
        # print(ppt_name)
        # Fetch the template's detail page and find the URL of its download page
        ppt_response = requests.get(url=ppt_url, headers=headers)
        ppt_response.encoding = 'utf-8'
        ppt_text = ppt_response.text
        ppt_tree = etree.HTML(ppt_text)
        load_path = 'http://www.ypppt.com' + ppt_tree.xpath('//div[@class="button"]/a/@href')[0]
        # Now on the download page: locate the actual download link
        load_response = requests.get(url=load_path, headers=headers)
        load_response.encoding = 'utf-8'
        final_text = load_response.text
        final_tree = etree.HTML(final_text)
        final_url = final_tree.xpath('//ul[@class="down clear"]/li[1]/a/@href')[0]
        # The site uses a simple anti-crawling measure here: some download links are
        # site-relative, e.g. /uploads/soft/200810/1-200Q0113H8.zip,
        # while others are absolute, e.g. http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip,
        # so use a regular expression to check for the scheme before adding the domain
        if not re.match(r'https?://', final_url):
            final_url = 'http://www.ypppt.com' + final_url
        # Request the file; a zip is binary, so read .content
        final_ppt = requests.get(url=final_url, headers=headers).content
        # Save the downloaded template
        with open(os.path.join(save_dir, ppt_name + '.zip'), 'wb') as fp:
            fp.write(final_ppt)
        print(ppt_name + '---- Download complete')
    print('Have a bottle of anmuxi: crawl finished!')
Crawl finished: the resulting folder is shown in the figure above!
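The script has to cope with download links that appear both as site-relative paths and as absolute URLs. As an alternative to checking for the scheme by hand, here is a minimal sketch that uses the standard library's `urllib.parse.urljoin`, which resolves relative links against a base and leaves absolute ones unchanged. The helper name `normalize_url` and the constant `BASE_URL` are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin

# Site root used by the script; the constant name is an assumption for this sketch.
BASE_URL = 'http://www.ypppt.com'


def normalize_url(href, base=BASE_URL):
    """Return an absolute URL for href, whether href is relative or absolute."""
    # urljoin resolves a relative href against base and
    # returns an already absolute href unchanged.
    return urljoin(base, href)


if __name__ == "__main__":
    # Both link forms mentioned in the script's comments.
    print(normalize_url('/uploads/soft/200810/1-200Q0113H8.zip'))
    print(normalize_url('http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip'))
```

Both calls should yield the same absolute address, so the same helper could also replace the manual string concatenation used for the detail-page and download-page links.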
Note: don't crawl maliciously; use this only to learn web scraping ~
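One way to keep a crawler from hammering the site is to pause between requests. The sketch below is only an illustration under stated assumptions: the function `fetch_politely`, its parameters, and the one-second default delay are all hypothetical and do not come from the original script. The download function is passed in as `fetch`, so the pacing logic can be tried without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function; a real crawler would download the URL here.
    demo = fetch_politely(['page-1', 'page-2'], fetch=lambda u: u.upper(), delay_seconds=0.1)
    print(demo)
```

In the crawler, `fetch` could be something like `lambda u: requests.get(u, headers=headers).content`, with the delay tuned to whatever the site can reasonably tolerate.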