Targeted crawling of Taobao product names and prices (Song Tian's course)
2022-07-24 11:15:00 【What about Saipan】
Song Tian's code can no longer crawl Taobao, because Taobao has upgraded its anti-crawling measures.
Solution: replace the cookie in headers with the one from your own Taobao session (everyone's cookie value is different).
For the detailed steps, see: "Crawling Taobao products with the requests and re libraries (a modification of the crawler from Song Tian's China University MOOC course)", Omann's blog on CSDN.
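A minimal sketch of the cookie replacement, assuming the cookie string is copied from your own logged-in browser session and stored in an environment variable. The names `build_headers` and `TAOBAO_COOKIE` are illustrative and do not appear in the original code.

```python
import os


def build_headers(cookie=None):
    """Return request headers that carry the caller's own Taobao cookie."""
    # TAOBAO_COOKIE is a hypothetical environment variable name
    cookie = cookie or os.environ.get("TAOBAO_COOKIE", "")
    if not cookie:
        raise ValueError("no cookie supplied; copy one from your own browser session")
    return {
        "user-agent": "Mozilla/5.0",
        "referer": "https://s.taobao.com/",
        "cookie": cookie,
    }
```

The result can be passed as `requests.get(url, headers=build_headers(), timeout=30)`, which keeps the personal cookie out of the source file.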
# -*- coding: utf-8 -*-
"""
Created on Mon Oct 4 00:06:08 2021
@author: saiban
"""
# Song Tian's code can no longer crawl Taobao, because Taobao has upgraded its anti-crawling measures.
# The referer and cookie in the request headers must be replaced with values from your own Taobao session.
import requests
import re


def getHtmlText(url):  # Fetch the page
    try:
        header = {
            'authority': 's.taobao.com',
            'cache-control': 'max-age=0',
            'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'same-origin',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-user': '?1',
            'sec-fetch-dest': 'document',
            'accept-language': 'zh-CN,zh;q=0.9',
            # Replace the cookie below with the one from your own logged-in browser session
'cookie': 'cna=w7y7GYch4kMCAasijBL1tcnw; xlly_s=1; t=d77b430a77fdf76d9c17f2806b57c2ff; hng=CN%7Czh-CN%7CCNY%7C156; thw=cn; _m_h5_tk=5ccf9d80baa24976d1bd97719fb7d377_1633327797887; _m_h5_tk_enc=2d2f907d5f3a86f04dbbf2aa73897ecd; _samesite_flag_=true; cookie2=16bd808d5d4c077a1e5f8e3abc00787e; _tb_token_=e3e5e0fd3333d; sgcookie=E1004cVmZPbUKd%2Bo2Y6ewiI7lmCD2rFerQ9K0Rx3PgQSSoYp%2FWW8LOvfo4oThh7eNLIEFm5uGhkQ9IUsgsWnv4%2BYBvCt6Z2xxPQJp498ChIaTCg%3D; unb=2202345247497; uc3=nk2=F5RBx%2BGr84TAocRa&lg2=URm48syIIVrSKA%3D%3D&vt3=F8dCujaCTG0Yn%2BEGbMY%3D&id2=UUphyItuGYNeDyMxrA%3D%3D; csg=fa72b214; lgc=tb4216148421; cancelledSubSites=empty; cookie17=UUphyItuGYNeDyMxrA%3D%3D; dnk=tb4216148421; skt=a19b5b0102a7b5ee; existShop=MTYzMzM1MjE0MQ%3D%3D; uc4=nk4=0%40FY4KoqHi383HYtpSM0RDmlOwk4iA8Tg%3D&id4=0%40U2grE1hEVww3EVoATgMbl4PMiyTEeIZt; tracknick=tb4216148421; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; sg=17e; _nk_=tb4216148421; cookie1=W8743JHOqTkZp4234GIqb8W2j3pRPRi2%2Ftn1Y16wf2Y%3D; enc=lEu3iTdRRzo2bKJ%2FSRTJ7W3KJmqkZoqJ8qTWcN7Fqxv4oVm4619kntnz84TzJb6SnF8AjFC43wovrgqFDVvISLE2T0wQC8D4h3ZzkSjIpSs%3D; mt=ci=0_1; uc1=existShop=false&pas=0&cookie21=Vq8l%2BKCLjhZM&cookie16=UtASsssmPlP%2Ff1IHDsDaPRu%2BPw%3D%3D&cookie14=Uoe3dP4mSshcOw%3D%3D&cookie15=WqG3DMC9VAQiUQ%3D%3D; JSESSIONID=338CCA02A39CF025F71F1BF78290B3CC; tfstk=c_YCB2ijKJ2I0Vnz8BGaUpf5nnb5aLH1sD6pO3g3IVJs9xRcBsmQutMG732Gz_C1.; l=eBxAgJEmghiX7hpyBO5Cnurza77OFIRbzPVzaNbMiInca6OA9FiEjNCLQg5JWdtjgt5xNFtzh0NBGRE6SuzLRxGjL77kRs5mpI96Re1..; isg=BD09yvnBAAYXt6RpoStY_zgXTJk32nEsV3r_rv-C1hTDNl9owymX_Iuk4Gpws4nk',
        }
        r = requests.get(url, headers=header, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print("Request failed:", e)
        return ""  # Return an empty string so the parser simply finds nothing


def parsePage(ilt, html):  # Parse each fetched page
    # Capture groups return the bare values, so eval() is not needed to strip the quotes
    plt = re.findall(r'"view_price":"([\d.]*)"', html)
    tlt = re.findall(r'"raw_title":"(.*?)"', html)
    for price, title in zip(plt, tlt):
        ilt.append([price, title])


def printGoodList(ilt):  # Print the results
    tplt = '{:4}\t{:8}\t{:16}\t'
    print(tplt.format('No.', 'Price', 'Product name'))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '书包'  # Search keyword ("schoolbag")
    depth = 2  # Number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infolist = []
    for i in range(depth):
        # Each result page holds 44 items; Song Tian's original code passes the item offset in the s parameter
        url = start_url + '&s=' + str(44 * i)
        html = getHtmlText(url)
        parsePage(infolist, html)
    printGoodList(infolist)


main()  # Everything above only defines functions; this call runs the program
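To see what the two regular expressions in parsePage pick out, here is a small offline sketch that runs them on a made-up fragment shaped like the data embedded in a search result page. The sample string and the name `extract_goods` are assumptions for illustration, and this variant uses capture groups rather than eval to drop the quotation marks.

```python
import re

# Made-up fragment; a real page embeds many more fields per item
sample = (
    '{"raw_title":"Canvas schoolbag","view_price":"59.00"},'
    '{"raw_title":"Kids backpack","view_price":"128.50"}'
)


def extract_goods(html):
    """Return (price, title) pairs found in the page source."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    return list(zip(prices, titles))


print(extract_goods(sample))
# [('59.00', 'Canvas schoolbag'), ('128.50', 'Kids backpack')]
```

If the page comes back as a login or verification page instead of search results, both patterns match nothing and the list is empty, which is a quick way to tell that the cookie has expired.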
A note on the site "Convert curl command syntax to Python requests, Ansible URI, browser fetch, MATLAB, Node.js, R, PHP, Strest, Go, Dart, Java, JSON, Elixir, and Rust code": what we copy from the browser's inspect panel is a curl command, and this site converts curl syntax into Python, Node.js, PHP, R, Go, Rust, Elixir, Java, MATLAB, Ansible URI, Strest, Dart, JSON, and other formats.
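The core of that conversion, for the headers at least, can be sketched in a few lines. This is a simplified illustration and not how the site itself is implemented: the function name `curl_headers` and the sample command are made up, and only the `-H`/`--header` options are handled.

```python
import shlex


def curl_headers(curl_command):
    """Collect the -H/--header values of a curl command into a dict."""
    tokens = shlex.split(curl_command)
    headers = {}
    # Pair every token with the one that follows it to find "-H value"
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")
            headers[name.strip().lower()] = content.strip()
    return headers


cmd = (
    "curl 'https://s.taobao.com/search?q=bag' "
    "-H 'accept-language: zh-CN,zh;q=0.9' "
    "-H 'cookie: a=1; b=2'"
)
print(curl_headers(cmd))
# {'accept-language': 'zh-CN,zh;q=0.9', 'cookie': 'a=1; b=2'}
```

The resulting dict has the same shape as the header dict in getHtmlText and can be passed to `requests.get` through the `headers` argument.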

Saipan learns to crawl .... Learning is useless