
Using a script to crawl beautiful sentences from the Juzimi website and save them locally (a treat for anyone who loves excerpting!)

2022-06-26 13:06:00 Rowing in the waves

1. Preface

I have always collected sentences or lines that strike me as beautiful. When I first started using Sina Weibo, I saw accounts that regularly shared very good lines or sentences, and I copied almost all of them down. Even though my own writing isn't great, whenever I see a good sentence now I copy, paste and save it to my phone's memo app, and read through it when I have time. Here is my collection (ps: the sentences below were gathered casually after dinner while browsing various websites, purely out of love; the sources are unknown, so please let me know if there is any infringement):

  1. It's not what we hate that destroys us, but precisely what we love.
    These people are not heroes, and may not be culprits either. Their social status may be extremely low, or extremely high. It's hard to say whether they are good or bad, but because of their existence many simple things become chaotic, ambiguous, dirty, and many peaceful relationships become tense and awkward... They never want to be responsible for anything, and they really can't be held accountable...
    ... You finally get angry, gather all your mighty thunder and prepare to bombard them, only to find they are no longer angry with you, and you suddenly lose your target...
    ... What is a villain? If the definition were clear, they would not be so hateful...
  2. "Wherever the heart is set, go there in plain shoes; life is like a journey, sail it on a single reed."
  3. The first floor will eventually miss its youth; freedom will sooner or later bewilder the rest of life.
  4. Amid the reek of blood and the clamor of wind, there is no more white plum fragrance in the world. The beloved drifts away with the snow; the one left living keeps the cross-shaped scar alone. -- Rurouni Kenshin (ps: I saw this while watching the anime, and it still gets me excited when I read it.)
  5. A kind word warms three winters; harsh words wound even in June.
  6. All the world bustles for profit's coming; all the world jostles for profit's going.
  7. One person's honey is another's arsenic.
  8. I love you, and that is between me and qingqing; I don't care who qingqing should be.
  9. I'm not sociable by nature. In most cases it's not that I find others boring; it's that I'm afraid they find me boring. But I don't want to put up with boredom, and I don't want to force myself to be interesting, which is too tiring. I'm most relaxed when I'm alone, because I don't feel bored; even if I am bored, I'll bear it myself and not drag anyone else in.
  10. The dim lamp lights the wall as I first fall asleep; cold rain taps the window and the quilt is not yet warm.
  11. Memories can make a person neurotic: one second the corners of the mouth curl up slightly, the next second the eyes grow moist. It may be a sudden moment tied to your memories, or a similar scene, that is enough to make you suddenly burst into tears...
  12. On the Lantern Festival night last year, the flower-market lamps were bright as day. The moon rose over the willow tips; we met after dusk.
    On the Lantern Festival night this year, the flowers and the lamps are as before. The one from last year is nowhere to be seen, and tears soak the sleeves of the spring shirt.
  13. When I first heard the song I didn't know its meaning; listening again, I have become the person in the song. Now that I have become the person in the song, why would I need to understand the song again.
  14. Since ancient times, confessions have mostly been made in vain; love letters have never managed to write love.
  15. Talk about how few the years of youth are; so often a whole life is staked on them.

2. Create a new Scrapy project

The reason for using the crawl spider template below is that it is well suited to extracting article links from list pages and avoids writing extra parsing code for them; Scrapy automatically follows URLs matching the given pattern, which greatly saves time.

Create a project named juzimi, and generate a crawl spider called sentence (the sentence.py file).

scrapy startproject juzimi
cd juzimi
scrapy genspider -t crawl sentence www.juzimi.com  # use Scrapy's crawl spider template instead of the default basic one
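For reference, after these commands the project typically has the default layout Scrapy generates (only sentence.py is added by the genspider command):

juzimi/
    scrapy.cfg
    juzimi/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            sentence.py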

3. For convenience, create a main.py file in the top-level project directory so the spider can be started (and debugged) from the IDE. The code is as follows:

from scrapy import cmdline
cmdline.execute('scrapy crawl sentence'.split())  # equivalent to running "scrapy crawl sentence" in a terminal
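Alternatively, a minimal sketch using Scrapy's CrawlerProcess API does the same thing, assuming it is run from the project root so that the project's settings.py is found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the juzimi project settings
process.crawl('sentence')                         # schedule the spider by name
process.start()                                   # block until the crawl finishes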

4. Write items.py; the code is as follows:

import scrapy

class JuzimiItem(scrapy.Item):
    title = scrapy.Field()     # the work the sentence comes from
    sentence = scrapy.Field()  # the sentence itself
    writer = scrapy.Field()    # the author
    love = scrapy.Field()      # the number of likes
    url = scrapy.Field()       # article URL; stored so that if any field above looks wrong, it is easy to go back and check

5. Write the main spider code in sentence.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags  # strips HTML tags from extracted markup (w3lib is installed together with Scrapy)
from ..items import JuzimiItem  # ".." is a relative import from the package one level above this file


class SentenceSpider(CrawlSpider):
    name = 'sentence'
    allowed_domains = ['www.juzimi.com']
    start_urls = [
        'https://www.juzimi.com/allarticle/jingdiantaici',
        'https://www.juzimi.com/allarticle/sanwen',
    ]
    rules = (
        # sentence list pages look like 'https://www.juzimi.com/article/28410'
        Rule(LinkExtractor(allow=r'article/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sentences = [remove_tags(item) for item in response.css('.xlistju').extract()]
        # strip the "喜欢" (like) prefix so that only the count, e.g. "(2139)", remains
        loves = [item.lstrip('喜欢') for item in response.css('div.view-content:nth-child(2) .flag-action::text').extract()]
        # some sentences have no author tag, so fall back to an empty string
        writers = [item.css('.views-field-field-oriwriter-value::text').extract() if item.css('.views-field-field-oriwriter-value::text') else ''
                   for item in response.css('div.view-content:nth-child(2) .xqjulistwafo')]
        titles = [item for item in response.css('div.view-content:nth-child(2) .xqjulistwafo .active::text').extract()]
        for sentence, love, writer, title in zip(sentences, loves, writers, titles):
            item = JuzimiItem()
            item['sentence'] = sentence
            item['love'] = love
            item['writer'] = writer
            item['title'] = title
            item['url'] = response.url
            yield item  # yield instead of return, so every sentence on the page is emitted rather than only the first
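Before launching the full crawl, the CSS selectors above can be sanity-checked interactively with scrapy shell; a minimal sketch against the example list page mentioned in the rules comment:

scrapy shell 'https://www.juzimi.com/article/28410'
>>> from w3lib.html import remove_tags
>>> [remove_tags(s) for s in response.css('.xlistju').extract()][:2]    # first two sentences on the page
>>> response.css('div.view-content:nth-child(2) .flag-action::text').extract()[:2]    # raw like-count texts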

I find I've really fallen in love with list comprehensions. Each sentence list page holds 10 sentences, and none of the field values are taken from the sentence detail pages: the detail page does not have the like count I want, and every field is available on the list page anyway, so there is no need to visit the detail pages at all. Some sentences have no author, so a conditional is added inside the list comprehension: if the author tag exists, take its value; otherwise use an empty string ''. The returned item fields look like this:

{'love': '(2139)',
 'sentence': 'Strangers are like jade, there is no one like you.',
 'title': 'Mu Yucheng is about',
 'url': 'https://www.juzimi.com/article/45916',
 'writer': ['Ye Fan']}

{'love': '(110)',
 'sentence': 'Life always takes more than it gives.',
 'title': 'Galaxy escort',
 'url': 'https://www.juzimi.com/article/76689',
 'writer': ''}

{'love': '(405)',
 'sentence': "Secrets have a cost. They're not free. Not now, not ever. \r"
             'Secrets come at a price. There are no free secrets. Not now, not at any time.',
 'title': 'Super spider man',
 'url': 'https://www.juzimi.com/article/27324',
 'writer': ''}

{'love': '(140)',
 'sentence': 'Hatred is like poison; slowly it will make your soul ugly.',
 'title': 'Spider-Man 3',
 'url': 'https://www.juzimi.com/article/37095',
 'writer': ''}

{'love': '(30)',
 'sentence': '"From now on, I will no longer be greedy, but just love to eat."',
 'title': "Garfield's happy life",
 'url': 'https://www.juzimi.com/article/362645',
 'writer': ['Garfield']}

6. Use CsvItemExporter to save the data locally

Scrapy provides item exporters; looking at the exporter source code, you can see that several export formats are supported, as shown below:

__all__ = ['BaseItemExporter', 'PprintItemExporter', 'PickleItemExporter',
           'CsvItemExporter', 'XmlItemExporter', 'JsonLinesItemExporter',
           'JsonItemExporter', 'MarshalItemExporter']

Here we use CsvItemExporter to export the items returned above, because a CSV file can be opened with Excel. So next, write pipelines.py; the code is as follows:

from scrapy.exporters import CsvItemExporter


class CsvExporterPipeline(object):
    # use Scrapy's built-in CsvItemExporter to export a csv file
    def __init__(self):
        self.file = open('new_sentence.csv', 'wb')  # open the output file in binary mode (created if it does not exist yet)
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')  # create the exporter and specify the encoding
        self.exporter.start_exporting()  # start exporting

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # finish exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item  # pass the item on to any later pipelines
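As a side note, for a plain dump to a single file, Scrapy's built-in feed exports can do much the same job without any custom pipeline; a sketch (the output filename is arbitrary):

scrapy crawl sentence -o new_sentence.csv

The custom pipeline is still worthwhile here because it gives direct control over the file object and its encoding.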

Remember to enable the pipeline in settings.py via ITEM_PIPELINES; the code is as follows:

ITEM_PIPELINES = {
    'juzimi.pipelines.CsvExporterPipeline': 300,
}
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
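Beyond the fixed DOWNLOAD_DELAY, Scrapy's standard AutoThrottle extension can adapt the request rate to the server automatically, which may also help avoid the bans described in the next step; a possible addition to settings.py (the values are only illustrative):

AUTOTHROTTLE_ENABLED = True             # adjust the download delay based on server response times
AUTOTHROTTLE_START_DELAY = 2            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # maximum delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests per remote site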

7. Set up the proxy middleware.

I waited happily for the results, only to be met with 403 responses blocking access, and then my IP was banned outright: even the browser couldn't reach the site. There was nothing for it but to use a proxy. Here I recommend the Abuyun proxy: it is billed by the hour, and its interface documentation is very detailed, which makes it convenient to use. The following code comes from Abuyun's documentation.

import base64

# proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"
# proxy tunnel authentication info
proxyUser = "user"      # provided after purchase
proxyPass = "password"  # provided after purchase
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth

Next, just configure the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'juzimi.middlewares.ProxyMiddleware': 543,
}
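Before re-running the spider, the proxy credentials can be sanity-checked outside Scrapy; a quick sketch using the requests library (httpbin.org simply echoes the IP a request comes from, and user/password are placeholders for the purchased credentials):

import requests

proxy = "http://user:password@http-dyn.abuyun.com:9020"  # same tunnel credentials as above
resp = requests.get("http://httpbin.org/ip",
                    proxies={"http": proxy, "https": proxy},
                    timeout=10)
print(resp.json())  # should show the proxy's exit IP rather than your own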

8. A screenshot of the project starting up and running:

[screenshot] There is one small issue here: the generated CSV file is not garbled when opened in PyCharm or in Notepad, but it is garbled when opened with Excel. I worked around it by re-saving the file with ANSI encoding in Notepad, after which it opened normally; a way to fix this at the source is sketched below.
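One way to fix this at the source, instead of re-saving the file by hand, is to write the CSV with a UTF-8 byte-order mark so that Excel detects the encoding; a sketch of the one-line change in the pipeline above (assuming your Scrapy version accepts the encoding keyword, which the pipeline already uses):

# in CsvExporterPipeline.__init__: 'utf-8-sig' prepends a BOM that Excel recognizes
self.exporter = CsvItemExporter(self.file, encoding='utf-8-sig')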
A screenshot of the CSV file: [screenshot] Here I sorted by the number of likes before posting it. The plan was to post only the top ten, but my favourite lines appear further down the page (HASAKI!), so those had to be posted too. It seems a lot of people like Yasuo; his lines do sound domineering.
[screenshot]

Copyright notice

This article was created by [Rowing in the waves]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/177/202206261224271891.html