A scrapy-redis distributed crawler for new and second-hand housing listings in every mainland Chinese city on Fang.com (房天下)

2021-02-14 02:16


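First build a single-machine crawler, then convert it into a distributed crawler.

Crawl strategy

1. Open https://www.fang.com/SoufunFamily.htm, parse all provinces and cities, and collect each city's home page link.
2. Inspection shows that each city's new-house URL is its home page link with newhouse inserted into the host name and house/s/ appended to the path, while the second-hand-house URL is the home page link with esf inserted into the host name.
Taking Shanghai as an example:
Home page: https://sh.fang.com/
New houses: https://sh.newhouse.fang.com/house/s/
Second-hand houses: https://sh.esf.fang.com
With these two rules, the new and second-hand listings of every city can be crawled.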
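
The URL rule from step 2 can be sketched as a small standalone helper. This is only a minimal sketch, assuming every city home page follows the https://<city>.fang.com/ pattern shown for Shanghai; the function name derive_listing_urls is hypothetical and not part of the original project.

```python
def derive_listing_urls(city_url):
    """Return (new_house_url, esf_url) for a city home page URL.

    Hypothetical helper; assumes the home page looks like https://<city>.fang.com/.
    """
    # Make sure the URL ends with a slash so "house/s/" can be appended safely.
    if not city_url.endswith("/"):
        city_url += "/"
    # Insert the channel name after the first dot, i.e. right after the city prefix.
    new_house_url = city_url.replace(".", ".newhouse.", 1) + "house/s/"
    esf_url = city_url.replace(".", ".esf.", 1)
    return new_house_url, esf_url


print(derive_listing_urls("https://sh.fang.com/"))
```

The spider in step 3 applies the same two replace calls directly to each city link it finds.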

1. Create the project

scrapy startproject fang
cd fang
scrapy genspider fangtianxia "fang.com"

2. Define the item fields to scrape (items.py)

import scrapy


class FangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    province = scrapy.Field()
    city_name = scrapy.Field()
    house_name = scrapy.Field()
    size = scrapy.Field()
    address = scrapy.Field()
    tel = scrapy.Field()
    price = scrapy.Field()
    type = scrapy.Field()

3. Write the spider: parse the data and schedule follow-up requests

# -*- coding: utf-8 -*-
import scrapy

# Import path relative to the "fang" project created in step 1
from fang.items import FangItem


class FangtianxiaSpider(scrapy.Spider):
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        tr_id = None
        province = None
        trs = response.xpath("//div[@class='outCont']//tr")
        # Collect the new-house and second-hand-house links for every city in every province
        for tr in trs:
            new_tr_id = tr.xpath("@id").get()
            # A changed row id marks the start of a new province
            if tr_id != new_tr_id:
                tr_id = new_tr_id
                province = tr.xpath("./td[2]//text()").get()
            citys = tr.xpath("./td[3]/a")
            for city in citys:
                city_name = city.xpath("text()").get()
                city_url = city.xpath("@href").get()
                city_newhouse_url = city_url.replace(".", ".newhouse.", 1) + "house/s/"
                city_esf_url = city_url.replace(".", ".esf.", 1)
                yield scrapy.Request(city_newhouse_url, callback=self.parse_newhouse,
                                     meta={"info": (province, city_name)})
                yield scrapy.Request(city_esf_url, callback=self.parse_esf, meta={"info": (province, city_name)})

    def parse_newhouse(self, response):
        province, city_name = response.meta["info"]
        house_type = "新房"  # "new house"; named house_type to avoid shadowing the built-in type()
        houses = response.xpath("//div[@id='newhouse_loupai_list']/ul/li[@id]")
        for house in houses:
            # default="" avoids an AttributeError when a listing has no name node
            house_name = house.xpath(".//div[@class='nlcd_name']/a/text()").get(default="").strip()
            size = house.xpath(".//div[@class='house_type clearfix']/a/text()").getall()
            size = ",".join(size)
            address = house.xpath(".//div[@class='address']/a/@title").get()
            tel = house.xpath(".//div[@class='tel']/p//text()").getall()
            tel = "".join(tel)
            price = house.xpath(".//div[@class='nhouse_price']/*/text()").getall()
            price = " ".join(price)
            item = FangItem(province=province, city_name=city_name, house_name=house_name, size=size, address=address,
                            tel=tel, price=price, type=house_type)
            yield item
        # Follow the next page, if there is one
        next_url = response.xpath("//a[@class='active']/following-sibling::a[1]/@href").get()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse_newhouse, meta={"info": (province, city_name)})

    def parse_esf(self, response):
        # Second-hand houses are scraped the same way as new houses in parse_newhouse
        pass
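
The province tracking inside parse can be easier to follow in isolation. The sketch below reproduces that logic on plain tuples instead of table rows. It assumes, as the spider code implies, that consecutive rows sharing an id belong to the same province and that the province name is taken from the first row of each group. The function name attach_provinces and the sample rows are hypothetical.

```python
def attach_provinces(rows):
    """Yield (province, city) pairs from (row_id, province_cell, cities) tuples."""
    current_id = None
    province = None
    for row_id, province_cell, cities in rows:
        # A new row id marks the start of a new province group.
        if row_id != current_id:
            current_id = row_id
            province = province_cell
        # Rows that continue a group reuse the province remembered above.
        for city in cities:
            yield province, city


# Hypothetical rows: the continuation row repeats the id and has an empty province cell.
rows = [
    ("row_01", "Jiangsu", ["Nanjing", "Suzhou"]),
    ("row_01", "", ["Wuxi"]),
    ("row_02", "Zhejiang", ["Hangzhou"]),
]
print(list(attach_provinces(rows)))
```

Carrying the province forward this way means the spider never has to look back at an earlier row while iterating over the table.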

4. Save the scraped data to a JSON file (pipelines.py)

from scrapy.exporters import JsonLinesItemExporter


class FangPipeline:
    # Called when the spider is opened
    def open_spider(self, spider):
        print("Spider started...")
        file_name = "fang.json"
        self.fp = open(file_name, "wb")  # the exporter writes bytes, so open in binary mode
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    # Called for every item the spider yields
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.fp.close()  # flush and release the output file
        print("Spider finished")
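
Because JsonLinesItemExporter writes one JSON object per line, the resulting fang.json is not a single JSON document and cannot be loaded with one json.load call. A minimal sketch of reading it back, assuming the file name used in the pipeline above; the helper name load_items is hypothetical.

```python
import json


def load_items(path="fang.json"):
    """Read a JSON Lines file and return a list of dicts, one per item."""
    items = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

After a crawl, calling load_items() from the project directory should return the scraped listings as a list of dictionaries.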

5. Configure settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'fang.pipelines.FangPipeline': 300,
}

6. Start the crawler

scrapy crawl fangtianxia

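Extension: converting the single-machine crawler into a distributed crawler

1. Install scrapy-redis

# Install scrapy-redis (here via the Aliyun PyPI mirror):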
pip3 install scrapy-redis -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

2. Change the spider's base class from scrapy.Spider to scrapy_redis.spiders.RedisSpider

3. Remove start_urls = ['https://www.fang.com/SoufunFamily.htm'] and add a redis_key instead

    # start_urls = ['https://www.fang.com/SoufunFamily.htm']
    # The key must hold a Redis list, so add the start URL with LPUSH:
    # LPUSH sfw:start_url https://www.fang.com/SoufunFamily.htm
    redis_key = "sfw:start_url"

4. Add the following to settings.py

ITEM_PIPELINES = {
    # 'fang.pipelines.FangPipeline': 300,
    "scrapy_redis.pipelines.RedisPipeline": 300
}
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
REDIS_HOST = "39.97.234.52"
REDIS_PORT = 6479

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'fang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

5. Add the start URL to Redis

    # The key must hold a Redis list, so use LPUSH
    LPUSH sfw:start_url https://www.fang.com/SoufunFamily.htm

6. Start the crawler; the scraped data then shows up in Redis
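
To inspect the results, the stored entries can be decoded back into dictionaries. This is a minimal sketch, assuming scrapy-redis's default behaviour of pushing each item as a JSON string onto a Redis list named <spider name>:items (here fangtianxia:items). The helper name decode_items is hypothetical and the sample payload is made up for illustration.

```python
import json


def decode_items(raw_values):
    """Decode raw list entries (bytes or str), as returned by LRANGE, into dicts."""
    items = []
    for raw in raw_values:
        # Redis clients usually return bytes unless response decoding is enabled.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        items.append(json.loads(raw))
    return items


# Made-up payload for illustration; real values would come from a command such as
# LRANGE fangtianxia:items 0 -1 (assumed default key).
sample = [b'{"province": "Jiangsu", "city_name": "Nanjing", "type": "new"}']
print(decode_items(sample))
```

Keeping the decoding separate from the Redis call makes it possible to check the helper without a running Redis server.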


Original article: https://www.cnblogs.com/yloved/p/12996463.html

