Using Scrapy to crawl beautiful sentences from the Juzimi website and store them locally (blessed are those who love excerpts!)
2022-06-26 13:06:00 【Rowing in the waves】
- 1. Preface
- 2. Create a new Scrapy project
- 3. For convenience, create a main.py file in the top-level directory
- 4. Write items.py
- 5. Write the main spider code in sentence.py
- 6. Use CsvItemExporter to save locally
- 7. Set up the proxy middleware
- 8. Screenshots of the running project
1. Preface
I have always collected sentences or lines that I stumble upon and find beautiful. When I first started using Sina Weibo, I noticed that some accounts regularly shared very good lines or sentences, and I copied almost all of them down. My handwriting is not great, so nowadays when I see a good sentence I copy, paste, and save it to my phone's memo app, and read through it when I have time. Here is my collection (ps: the sentences below were all gathered in my spare time while browsing various websites, out of love for them; the sources are unknown, so please let me know if any infringe):
- It is not what we hate that destroys us, but precisely what we love.
- These people are not heroes, yet they may not be culprits either. Their social status may be extremely low, or extremely high. It is hard to say whether they are good or bad, but because they exist, many simple things turn chaotic, ambiguous, dirty, and many peaceful relationships turn tense and awkward... They never want to be responsible for anything, and indeed they cannot be held accountable... When you finally grow angry and gather all your mighty thunder ready to strike, they are, unexpectedly, no longer angry with you, and you suddenly lose your target... What is a villain? If they could be clearly defined, they would not be so hateful...
- "Wherever the heart points, walk there in plain shoes; life is a journey against the stream, so sail it with a single reed."
- High towers will, in the end, fail the young; freedom, sooner or later, unsettles the rest of a life.
- The bloody gale howled, and the fragrance of white plum left the world. The beloved drifts away with the snow; the one who lives on keeps the cross-shaped scar alone. ----"Rurouni Kenshin" (ps: I came across this while watching the anime; doesn't it still read thrillingly?)
- A kind word warms three winters; a harsh word wounds even in June.
- All the bustling world comes for profit; all the jostling world goes for profit.
- What is honey to one is arsenic to another.
- When I love you, it is you I call darling; when I do not, what do I care who the darling is.
- I am not sociable by nature. In most cases it is not that I find the other person dull, but that I am afraid they find me dull. Yet I am unwilling to endure dullness, and unwilling to strain to make myself interesting; that would be too tiring. I am most at ease alone, because then I feel no dullness, and even if I am dull, I bear it myself and drag in no one else.
- The green lamp lights the wall as one first falls asleep; cold rain taps the window, the quilt not yet warm.
- Memory can make a person neurotic: one second the corners of the mouth lift slightly, the next the eyes grow moist. A sudden moment tied to your memories, or a scene that merely resembles one, is enough to make you burst into tears...
- On the Lantern Festival night last year, the flower-market lamps shone bright as day. The moon rose over the willow tips; we met after dusk. On the Lantern Festival night this year, the moon and the lamps are as before. But the one from last year is nowhere to be seen, and tears wet the sleeves of my spring shirt.
- At first hearing, I did not grasp the song's meaning; hearing it again, I had already become the one in the song. And since I am now the one in the song, why listen to the song again.
- Since ancient times, confessions have mostly been made in vain, and love letters have never quite carried love.
- Talk of how many years one stays young; it is often paid for with a life, a human life.
2. Create a new Scrapy project
The crawl template is used below because it is well suited to extracting article links from list pages: Scrapy automatically follows URLs that match the given pattern, which avoids hand-parsing too many pages and greatly saves access time.
Create a project named juzimi, with a spider file named sentence.py:
scrapy startproject juzimi
cd juzimi
scrapy genspider -t crawl sentence www.juzimi.com  # uses Scrapy's alternative "crawl" spider template
3. For convenience, create a main.py file in the top-level directory as before. The code is as follows:
from scrapy import cmdline
cmdline.execute('scrapy crawl sentence'.split())
4. Write items.py. The code is as follows:
import scrapy

class JuzimiItem(scrapy.Item):
    title = scrapy.Field()     # source of the sentence (the work it comes from)
    sentence = scrapy.Field()  # the sentence itself
    writer = scrapy.Field()    # author
    love = scrapy.Field()      # number of likes
    url = scrapy.Field()       # article URL, kept so that bad field values can be traced back to their page
5. Write the main spider code in sentence.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags  # w3lib ships with Scrapy; strips HTML tags from extracted content
from ..items import JuzimiItem  # .. is a relative import from the project package

class SentenceSpider(CrawlSpider):
    name = 'sentence'
    allowed_domains = ['www.juzimi.com']
    start_urls = [
        'https://www.juzimi.com/allarticle/jingdiantaici',
        'https://www.juzimi.com/allarticle/sanwen',
    ]
    rules = (
        # sentence-list URLs look like 'https://www.juzimi.com/article/28410'
        Rule(LinkExtractor(allow=r'article/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sentences = [remove_tags(item) for item in response.css('.xlistju').extract()]
        # strip the leading "喜欢" ("like") label so only the count remains
        loves = [item.lstrip('喜欢') for item in
                 response.css('div.view-content:nth-child(2) .flag-action::text').extract()]
        # some sentences have no author: take the value if the tag exists, otherwise ''
        writers = [item.css('.views-field-field-oriwriter-value::text').extract()
                   if item.css('.views-field-field-oriwriter-value::text') else ''
                   for item in response.css('div.view-content:nth-child(2) .xqjulistwafo')]
        titles = [item for item in
                  response.css('div.view-content:nth-child(2) .xqjulistwafo .active::text').extract()]
        for sentence, love, writer, title in zip(sentences, loves, writers, titles):
            item = JuzimiItem()
            item['sentence'] = sentence
            item['love'] = love
            item['writer'] = writer
            item['title'] = title
            item['url'] = response.url
            yield item  # yield, not return, so that all ~10 sentences per page are emitted
I found I have truly fallen in love with list comprehensions. Each sentence-list page holds 10 sentences, and no field values are fetched from the sentence detail pages: the detail page lacks the like count I want, while every field I need is available on the list page, so there is no reason to visit the detail pages at all. Some sentences have no author, so a conditional is written into the list comprehension: take the value if the author tag exists, otherwise use the empty string '' (a minimal standalone sketch of this pattern follows the sample output below). The returned item fields look like this:
{
'love': '(2139)',
'sentence': 'Strangers are like jade; there is no one in the world like you.',
'title': 'Mu Yucheng is about',
'url': 'https://www.juzimi.com/article/45916',
'writer': ['Ye Fan']}
{
'love': '(110)',
'sentence': 'Life always takes more from you than it gives.',
'title': 'Galaxy escort',
'url': 'https://www.juzimi.com/article/76689',
'writer': ''}
{
'love': '(405)',
'sentence': "Secrets have a cost. They're not free. Not now, not ever. \r"
            'Secrets come at a price. There are no free secrets. Not now, not ever.',
'title': 'Super spider man',
'url': 'https://www.juzimi.com/article/27324',
'writer': ''}
{
'love': '(140)',
'sentence': 'Hatred is like poison; slowly it will make your soul ugly.',
'title': 'spider-man 3',
'url': 'https://www.juzimi.com/article/37095',
'writer': ''}
{
'love': '(30)',
'sentence': '"From now on, I will no longer be greedy, merely fond of eating."',
'title': "Garfield's happy life",
'url': 'https://www.juzimi.com/article/362645',
'writer': ['Garfield']}
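As promised above, here is a minimal standalone sketch of the "value if the selector matched, otherwise empty" comprehension pattern used for writers. The data is made up purely for illustration:

# Toy stand-in for the per-sentence author cells scraped above;
# an empty dict plays the role of a sentence with no author tag.
rows = [{'writer': 'Ye Fan'}, {}, {'writer': 'Garfield'}]

# Take the value when present, otherwise fall back to an empty string.
writers = [row['writer'] if 'writer' in row else '' for row in rows]
print(writers)  # ['Ye Fan', '', 'Garfield']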
6. Use CsvItemExporter to save locally
Considering that Scrapy provides exporters out of the box, a look at the exporter source code shows that several export formats are supported, as listed below:
__all__ = ['BaseItemExporter', 'PprintItemExporter', 'PickleItemExporter',
'CsvItemExporter', 'XmlItemExporter', 'JsonLinesItemExporter',
'JsonItemExporter', 'MarshalItemExporter']
Here the CsvItemExporter is used to export the items returned above, because a csv file can be opened with Excel. So next, write pipelines.py; the code is as follows:
from scrapy.exporters import CsvItemExporter

class CsvExporterPipeline(object):
    # use Scrapy's built-in CsvItemExporter to export a csv file
    def __init__(self):
        self.file = open('new_sentence.csv', 'wb')  # open the output file; it is created if it does not exist
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')  # exporter object with an explicit encoding
        self.exporter.start_exporting()  # start exporting

    def close_spider(self, spider):  # close_spider is the hook Scrapy actually calls on pipelines
        self.exporter.finish_exporting()  # finish exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item  # pass the item on
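As an aside (my own sketch, not part of the original project): switching to any other exporter from the list above should only mean swapping the exporter class and the file name, for example one JSON object per line:

from scrapy.exporters import JsonLinesItemExporter

# Same pipeline skeleton, different exporter: writes one JSON object per line.
file = open('new_sentence.jl', 'wb')
exporter = JsonLinesItemExporter(file, encoding='utf-8')
exporter.start_exporting()
# exporter.export_item(item) would then be called once per item in process_item()
exporter.finish_exporting()
file.close()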
Remember to enable ITEM_PIPELINES in settings.py; the code is as follows:
ITEM_PIPELINES = {
    'juzimi.pipelines.CsvExporterPipeline': 300,
}
ROBOTSTXT_OBEY = False  # do not obey robots.txt
DOWNLOAD_DELAY = 2  # wait 2 seconds between requests
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'  # pretend to be desktop Chrome
7. Set up the proxy middleware
I waited happily for the results, only to be met with 403 responses, after which my IP was banned outright: even the browser could no longer reach the site. There was nothing for it but a proxy; long live the proxy. I recommend the Abuyun proxy here: it offers hourly billing, and its various interface documents are written in detail, which makes things convenient. The code below follows Abuyun's documentation:
import base64

# proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# proxy tunnel authentication info (both issued after purchase)
proxyUser = "user"
proxyPass = "password"
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
Next, just configure the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'juzimi.middlewares.ProxyMiddleware': 543,
}
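Before relaunching the full crawl, a quick standalone check (my own sketch; user and password are the placeholders from above, and the requests library is assumed to be installed) confirms that the tunnel proxy answers:

import requests

# Route one request through the Abuyun tunnel and print the exit IP.
proxies = {"http": "http://user:password@http-dyn.abuyun.com:9020"}
resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)  # should show the proxy's IP, not your own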
8. Screenshots of the running project are as follows:
A small problem here: the generated csv file is not garbled when opened in PyCharm, and not garbled in Notepad either, but it is garbled when opened with Excel. Later I re-saved the file through Notepad with the encoding set to ANSI, and it opened normally after that. A screenshot of the csv file follows:
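A cleaner fix, I suspect (my own suggestion, not tested in the original post), is to write the CSV with a UTF-8 byte-order mark so Excel detects the encoding by itself; Python's 'utf-8-sig' codec does exactly that, and only one line of the pipeline changes:

from scrapy.exporters import CsvItemExporter

# In CsvExporterPipeline.__init__: 'utf-8-sig' prepends a BOM to the file,
# which Excel reads as a signal that the content is UTF-8.
file = open('new_sentence.csv', 'wb')
exporter = CsvItemExporter(file, encoding='utf-8-sig')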
Here I sorted by like count and am posting the result. The plan was to post only the top ten, but further down the page are lines from my Yasuo (HASAKI!), so those absolutely had to be posted. It seems a great many people like Yasuo; his lines do sound domineering.