python_scrapy_crawler

2020-12-13 06:29

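Python: getting started with the Scrapy framework, XPath parsing, and JSON storage.

This example also covers crawling detail pages.

Directory structure:

[image: directory structure]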

kaoshi_bqg.py

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from ..items import BookBQGItem


class KaoshiBqgSpider(scrapy.Spider):
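    name = 'kaoshi_bqg'
    allowed_domains = ['biquge5200.cc']
    start_urls = ['https://www.biquge5200.cc/xuanhuanxiaoshuo/']

    # NOTE: `rules` is only processed by CrawlSpider. This class extends
    # scrapy.Spider and defines its own parse(), so the rules below are not
    # applied (and no parse_item method exists); the crawl is driven by the
    # parse* callbacks instead.
    rules = (
        # Rule matching the article list pages
        Rule(LinkExtractor(allow=r'https://www.biquge5200.cc/xuanhuanxiaoshuo/'), follow=True),
        # Rule matching the article detail pages ({m,n} takes a comma, not a hyphen)
        Rule(LinkExtractor(allow=r'.+/[0-9]{1,3}_[0-9]{2,6}/'), callback='parse_item', follow=False),
    )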

    # Novel titles (listing page)
    def parse(self, response):
        a_list = response.xpath('//*[@id="newscontent"]/div[1]/ul//li//span[1]/a')
        for li in a_list:
            name = li.xpath(".//text()").get()
            detail_url = li.xpath(".//@href").get()
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            yield scrapy.Request(url=response.urljoin(detail_url), callback=self.parse_book, meta={'info': name})

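    # All chapter titles of a single book
    def parse_book(self, response):
        name = response.meta.get('info')
        # Skip the first 20 <dd> entries of the chapter list
        list_a = response.xpath('//*[@id="list"]/dl/dd[position()>20]//a')
        for li in list_a:
            chapter = li.xpath(".//text()").get()
            url = li.xpath(".//@href").get()
            # Pass both the book name and the chapter title on to the next callback
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_content, meta={'info': (name, chapter)})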

    # Content of each chapter
    def parse_content(self, response):
        name, chapter = response.meta.get('info')
        # One string per <p> inside the content container
        content = response.xpath('//*[@id="content"]//p/text()').getall()
        item = BookBQGItem(name=name, chapter=chapter, content=content)
        yield item
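
The detail-page pattern in `rules` depends on regex quantifier syntax: a bounded repeat is written `{m,n}` with a comma, whereas a form like `{1-3}` is not a quantifier and is matched as literal text. A minimal standard-library check, using a made-up URL of the shape the pattern appears to target (the path `/52_52542/` is an assumption for illustration, not taken from the site):

```python
import re

# Hypothetical book URL; the numeric path segment is invented for illustration.
url = "https://www.biquge5200.cc/52_52542/"

# With a hyphen, the braces are treated as literal characters.
hyphen_form = re.compile(r".+/[0-9]{1-3}_[0-9]{2-6}/")
# With a comma: 1 to 3 digits, an underscore, then 2 to 6 digits.
comma_form = re.compile(r".+/[0-9]{1,3}_[0-9]{2,6}/")

hyphen_matches = hyphen_form.match(url) is not None
comma_matches = comma_form.match(url) is not None
print(hyphen_matches, comma_matches)
```

Only the comma form matches the sample URL.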

items.py

pipelines.py

from scrapy.exporters import JsonLinesItemExporter


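class BqgPipeline(object):
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode
        self.fp = open("biquge.json", 'wb')
        # JsonLinesItemExporter writes one JSON object per line
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called by Scrapy when the spider closes
    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished")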


# class XmlyPipeline(object):
#     def __init__(self):
#         self.fp = open("xmly.json", 'wb')
#         # JsonLinesItemExporter writes one JSON object per line
#         self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False)
#
#     def process_item(self, item, spider):
#         self.exporter.export_item(item)
#         return item
#
#     def close_spider(self, spider):
#         self.fp.close()
#         print("Spider finished")
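
A pipeline only runs if it is enabled through the `ITEM_PIPELINES` setting in the project's settings.py. A sketch of that fragment, assuming the project package is called `bqg` (a hypothetical name; substitute the actual package from the directory structure):

```python
# settings.py (fragment)
# "bqg" is a hypothetical package name; replace it with the real project package.
ITEM_PIPELINES = {
    # The integer sets the order; pipelines with lower values run first.
    "bqg.pipelines.BqgPipeline": 300,
}
```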

starts.py

from scrapy import cmdline

cmdline.execute("scrapy crawl kaoshi_bqg".split())
# cmdline.execute("scrapy crawl xmly".split())

Next, the scraped data:

biquge.json

[image: biquge.json contents]

xmly.json

[image: xmly.json contents]
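
`JsonLinesItemExporter` writes one JSON object per line rather than a single JSON array, so the output is read back line by line instead of with one `json.load` call. A small standard-library sketch; the sample records are invented stand-ins for real scraped items:

```python
import io
import json

# Invented sample in JSON Lines form: one object per line.
sample = io.StringIO(
    '{"name": "Book A", "chapter": "Chapter 1", "content": ["p1", "p2"]}\n'
    '{"name": "Book A", "chapter": "Chapter 2", "content": ["p3"]}\n'
)


def read_json_lines(fp):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_json_lines(sample))
chapters = [record["chapter"] for record in records]
print(chapters)
```

For the real output, the same function can be given `open("biquge.json", encoding="utf-8")` in place of the in-memory sample.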

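A note on a small problem I ran into while crawling:
- When crawling detail pages, I did not know at first how to get the detail page's URL or how to carry over fields obtained on the previous page. In the spider above, the URL comes from each link's href attribute, and the fields travel with the request in its meta dict and are read back from response.meta in the next callback.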


Original source: https://www.cnblogs.com/longpy/p/11180956.html
