My First Scrapy Spider
2022-07-25 11:23:00 【托塔天王李】
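The Scrapy project's directory structure contains the files used below: items.py, spiders/move.py, settings.py, and pipelines.py.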

We want to scrape the title, author, and description of each book listed on dushu.com (读书网).
First, define the model for the scraped data in items.py:
import scrapy


class MoveItem(scrapy.Item):
    # Model for the scraped data: one field per value we extract
    title = scrapy.Field()
    auth = scrapy.Field()
    desc = scrapy.Field()

The main work happens in move.py under the spiders directory:
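import scrapy
from douban.items import MoveItem


class MovieSpider(scrapy.Spider):
    # Name of the spider; every spider in a project must have a unique name
    name = 'movie'
    # Domains the spider is allowed to crawl; requests to other domains are filtered out
    allowed_domains = ['dushu.com']
    # URL(s) where crawling starts
    start_urls = ['https://www.dushu.com/book/1188.html']

    def parse(self, response):
        # Each <li> in the book list holds one book
        li_list = response.xpath('/html/body/div[6]/div/div[2]/div[2]/ul/li')
        for li in li_list:
            item = MoveItem()
            item['title'] = li.xpath('div/h3/a/text()').extract_first()
            item['auth'] = li.xpath('div/p[1]/a/text()').extract_first()
            item['desc'] = li.xpath('div/p[2]/text()').extract_first()
            # parse is a generator: each item is handed to Scrapy as soon as it is built
            yield item

        # Pagination links found on the current page
        href_list = response.xpath('/html/body/div[6]/div/div[2]/div[3]/div/a/@href').extract()
        for href in href_list:
            # Turn the (possibly relative) link into an absolute URL
            url = response.urljoin(href)
            # Request each linked page and parse it with this same method, so the
            # crawl proceeds recursively; Scrapy's duplicate filter skips URLs
            # that have already been requested
            yield scrapy.Request(url=url, callback=self.parse)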
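The link-following loop at the end of parse can be illustrated without Scrapy or network access. The following is a minimal sketch using only the standard library: urllib.parse.urljoin stands in for response.urljoin, and a seen set stands in for Scrapy's built-in duplicate filter. The PAGES dictionary and the 1188_2.html / 1188_3.html links are hypothetical values invented for illustration, not data taken from the site.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": page URL -> pagination hrefs found on that page.
PAGES = {
    "https://www.dushu.com/book/1188.html": ["/book/1188_2.html", "/book/1188_3.html"],
    "https://www.dushu.com/book/1188_2.html": ["/book/1188.html", "/book/1188_3.html"],
    "https://www.dushu.com/book/1188_3.html": ["/book/1188_2.html"],
}


def crawl(start_url):
    """Visit every page reachable from start_url exactly once."""
    seen = {start_url}          # plays the role of Scrapy's duplicate filter
    queue = deque([start_url])  # plays the role of the scheduler's request queue
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)     # this is where parse() would yield items
        for href in PAGES.get(url, []):
            absolute = urljoin(url, href)  # same idea as response.urljoin(href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


if __name__ == "__main__":
    for page in crawl("https://www.dushu.com/book/1188.html"):
        print(page)
```

Although every page links back to pages that were already visited, each URL is processed only once, which is why the recursive callback in the spider terminates.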
To persist the scraped data, it has to be saved somewhere. This is configured in settings.py.
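The setting that turns an item pipeline on in Scrapy is ITEM_PIPELINES. The original screenshot of settings.py is not available, so the following is only a sketch of what the entry probably looks like, assuming the project package is named douban (as in the import in the spider) and the pipeline class is DoubanPipeline. The priority 300 is an assumed value; it is the number used in Scrapy's project template, and lower numbers run earlier.

```python
# settings.py (fragment)
# Keys are import paths of pipeline classes, values are priorities (0-1000).
# Assumption: the class lives in douban/pipelines.py; 300 is an arbitrary priority.
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
}
```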
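In pipelines.py:
import pymongo


class DoubanPipeline(object):
    def __init__(self):
        # Connect to the MongoDB server
        self.mongo_client = pymongo.MongoClient('mongodb://39.108.188.19:27017')

    def process_item(self, item, spider):
        # Use the "data" database and its "messages" collection
        db = self.mongo_client.data
        message = db.messages
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        message.insert_one(dict(item))
        # Return the item so that any later pipeline can process it too
        return item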
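One possible refinement of the pipeline is to open the MongoDB connection when the spider starts and close it when the spider finishes, using the open_spider and close_spider hooks that Scrapy calls on item pipelines. The sketch below is an illustration under stated assumptions, not the author's code: MongoStoragePipeline, default_client_factory, the client_factory parameter, and the localhost MONGO_URI are hypothetical names and values introduced here.

```python
# Assumption: a local MongoDB instance; replace with the real server address.
MONGO_URI = "mongodb://localhost:27017"


def default_client_factory():
    # Imported lazily so the class can be exercised without pymongo installed.
    import pymongo
    return pymongo.MongoClient(MONGO_URI)


class MongoStoragePipeline:
    """Sketch of a pipeline with explicit connection open/close hooks."""

    def __init__(self, client_factory=None):
        # client_factory is a hypothetical hook that makes the class testable;
        # Scrapy itself instantiates the pipeline without arguments.
        self.client_factory = client_factory or default_client_factory
        self.client = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider is opened.
        self.client = self.client_factory()

    def close_spider(self, spider):
        # Called by Scrapy once when the spider is closed.
        self.client.close()

    def process_item(self, item, spider):
        # Same database and collection names as in the pipeline above.
        self.client["data"]["messages"].insert_one(dict(item))
        return item
```

Passing in a stand-in client through client_factory makes it possible to check the pipeline logic without a running MongoDB server.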