
PPT template crawler case

2022-06-26 06:13:00 An Muxi

Crawling PPT templates with Python

This post crawls the PPT templates listed at http://www.ypppt.com/moban/. The site uses a simple anti-crawling trick, so the download URLs need to be analyzed carefully before they can be fetched correctly!!!

# -*- coding: utf-8 -*-
#@Time: 2020-08-13 16:43
#@Author: An Muxi
#@File: Free resume crawling.py
#@ Start a good day

import requests
import os
from lxml import etree
import re

if __name__ == "__main__":
    url = 'http://www.ypppt.com/moban/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }
    response = requests.get(url=url,headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text

    # Create the folder that will hold the downloaded ppt templates
    if not os.path.exists('./ppt_templates'):
        os.mkdir('./ppt_templates')

    # Build the etree object for XPath parsing
    tree = etree.HTML(page_text)
    # li_list holds the <li> nodes of the templates on the first page
    li_list = tree.xpath('//ul[@class="posts clear"]/li')
    # Parse each li and extract the template's detail-page url and its name
    for li in li_list:
        ppt_url = 'http://www.ypppt.com' + li.xpath('./a[1]/@href')[0]
        ppt_name = li.xpath('./a[2]/text()')[0]
        # print(ppt_url)
        # print(ppt_name)
        # Request each template's detail page and locate the url of its download page
        ppt_response = requests.get(url=ppt_url,headers = headers)
        ppt_response.encoding = 'utf-8'
        ppt_text = ppt_response.text
        ppt_tree = etree.HTML(ppt_text)
        load_path = 'http://www.ypppt.com' + ppt_tree.xpath('//div[@class="button"]/a/@href')[0]
        # On the download page, find where the actual download link is
        load_response = requests.get(url=load_path,headers=headers)
        load_response.encoding = 'utf-8'
        final_text = load_response.text
        final_tree = etree.HTML(final_text)
        final_url = final_tree.xpath('//ul[@class="down clear"]/li[1]/a/@href')[0]
        # The site uses a simple anti-crawl trick: some download links are relative,
        # e.g. /uploads/soft/200810/1-200Q0113H8.zip,
        # while others are already absolute, e.g. http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip
        # So use a regular expression to tell the two cases apart
        if len(re.findall('http:', str(final_url))) == 0:
            final_url = 'http://www.ypppt.com' + final_url
        # Request the download; the zip is binary data, so take .content
        final_ppt = requests.get(url = final_url,headers = headers).content

        # Save the crawled ppt template to disk
        with open('./ppt_templates/' + ppt_name + '.zip', 'wb') as fp:
            fp.write(final_ppt)
        print(ppt_name + ' ---- download complete')

    print('An Muxi: crawl finished!!!')
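
As a side note, the relative-vs-absolute check above can also be handled with urljoin from Python's standard library, which resolves relative links against a base and leaves absolute links untouched. A minimal sketch (not part of the original script; the variable name is reused for illustration):

from urllib.parse import urljoin

# '/uploads/soft/200810/1-200Q0113H8.zip'    -> 'http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip'
# 'http://www.ypppt.com/uploads/...'         -> unchanged
final_url = urljoin('http://www.ypppt.com/', final_url)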
   

After the crawl finishes, the downloaded templates are stored in the output folder, as shown below:

[Screenshot: the ppt_templates folder containing the downloaded zip archives]
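
To sanity-check the results, you can also verify that each saved file really is a zip archive; a small sketch using the standard-library zipfile module (assumes the ./ppt_templates folder created by the script above):

import os
import zipfile

for name in os.listdir('./ppt_templates'):
    path = os.path.join('./ppt_templates', name)
    # is_zipfile returns False for truncated downloads or non-zip responses
    if not zipfile.is_zipfile(path):
        print(name + ' looks broken, consider downloading it again')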

Note: don't crawl maliciously; this is only for learning how to write crawlers ~


Copyright notice
This article was written by [An Muxi]. Please include a link to the original when reposting. Thanks!
https://yzsam.com/2022/02/202202180501384971.html