Crawler Beginner Notes (a supplement to yesterday's post, focusing on parsing the data)
2022-08-04 15:40:00 【Always sweat than talent】
import re
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def main():
    # 1. Crawl the web pages (each page is parsed as soon as it is fetched)
    baseurl = 'https://movie.douban.com/top250?start='
    datalist = getData(baseurl)
    # 2. Save the data (not implemented yet)
    print()


# The HTML tags inside the patterns below are assumed from the markup of
# Douban's Top 250 page; adjust them if the page structure differs.
# Movie link
findLink = re.compile(r'<a href="(.*?)">')
# Movie poster
findImg = re.compile(r'<img.*src="(.*?)"', re.S)
# Movie title (Chinese title first, then an optional foreign title)
findTitle = re.compile(r'<span class="title">(.*?)</span>')
# Movie rating
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*?)</span>')
# Number of reviewers ("人评价" is the page's literal text for "people rated")
findJudge = re.compile(r'<span>(\d*)人评价</span>')
# One-line summary
findInq = re.compile(r'<span class="inq">(.*?)</span>')
# Other details (director, cast, year, genre); re.S lets "." match newlines
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)


# Crawl the web pages
def getData(baseurl):
    # Fetch one page at a time in a loop; datalist collects the rows from every page
    datalist = []
    for i in range(0, 10):
        url = baseurl + str(i * 25)  # each page lists 25 movies
        html = askURL(url)
        # Parse each page inside the loop, one movie at a time
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_='item'):
            data = []  # holds the fields of one movie
            item = str(item)
            link = re.findall(findLink, item)[0]
            data.append(link)
            img = re.findall(findImg, item)[0]
            data.append(img)
            title = re.findall(findTitle, item)
            if len(title) == 2:
                ctitle = title[0]
                data.append(ctitle)
                etitle = title[1].replace("/", "")
                data.append(etitle)
            else:
                data.append(title[0])
                data.append(' ')  # no foreign title: keep the column with a space
            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)
            Inq = re.findall(findInq, item)
            if len(Inq) != 0:
                inq = Inq[0].replace('。', " ")  # drop the Chinese full stop
                data.append(inq)
            else:
                data.append(" ")  # no summary: leave the column blank
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', " ", bd)  # replace line-break tags
            bd = re.sub("/", " ", bd)
            data.append(bd.strip())  # remove leading and trailing whitespace
            datalist.append(data)  # add the processed movie to datalist
    print(datalist)
    return datalist


# Request a web page
def askURL(url):
    header = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Mobile Safari/537.36 Edg/103.0.1264.77"}
    request = urllib.request.Request(url, headers=header)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


# Save the data
def saveData():
    print()


if __name__ == '__main__':
    main()
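The "save data" step is still an empty stub in these notes. As a rough sketch only, here is one way it could be filled in with the standard `csv` module. The function name `save_data`, the `COLUMNS` labels, and the output filename are all my own assumptions, not part of the original code; the column order simply mirrors the order in which `getData` appends fields to each row.

```python
import csv

# Hypothetical column labels; the order follows the appends in getData().
COLUMNS = ["link", "image", "chinese_title", "other_title",
           "rating_count", "summary", "details"]


def save_data(datalist, savepath):
    """Write a header row, then one CSV row per movie."""
    # newline="" prevents the csv module from writing blank lines on Windows;
    # utf-8-sig adds a BOM so spreadsheet programs detect the Chinese text.
    with open(savepath, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(datalist)
```

With this in place, the second step in `main` could call something like `save_data(datalist, "douban_top250.csv")` instead of the placeholder `print()`.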