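Scrapy Crawler Learning Series, Part 7: Solutions to Common Scrapy Problems

1 Common errors

1.1 Error: ImportError: No module named win32api

Official reference: https://doc.scrapy.org/en/latest/faq.html#scrapy-crashes-with-importerror-no-module-named-win32api

The official reference links to the win32 package (pywin32); download and install it and the error goes away.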

1.2 DEBUG: Forbidden by robots.txt: <GET https://www.baidu.com>

Official reference: https://doc.scrapy.org/en/latest/topics/settings.html#robotstxt-obey

Set ROBOTSTXT_OBEY = False in settings.py.

1.3 XPath returns no results when scraping an XML document

Official reference: https://doc.scrapy.org/en/latest/faq.html#i-m-scraping-a-xml-document-and-my-xpath-selector-doesn-t-return-any-items

response.selector.remove_namespaces()

response.xpath("//link")

Normally you do not need to call remove_namespaces; try it only when your XPath query returns no data.
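The usual cause of the empty result is an XML namespace. The sketch below reproduces the effect with the standard library's xml.etree.ElementTree rather than Scrapy's selectors, so it only illustrates the cause; the feed content and URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical Atom-style feed; the default namespace applies to every element.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://www.example.com/a"/>
  <link href="http://www.example.com/b"/>
</feed>"""

root = ET.fromstring(FEED)

# A plain name matches nothing, because each element is really
# {http://www.w3.org/2005/Atom}link.
plain = root.findall(".//link")

# Qualifying the name with the namespace finds both elements.
ns = {"atom": "http://www.w3.org/2005/Atom"}
qualified = root.findall(".//atom:link", ns)

print(len(plain), len(qualified))
```

Removing the namespaces, as remove_namespaces does, is what lets the plain //link query match again.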

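1.4 Garbled response text (encoding problems)

Official reference: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.encoding

1 Set the encoding in the constructor (the encoding argument)

2 Set it in the HTTP header

3 Declare the encoding in the response body

4 Re-decode the received response yourself; this is the last resort.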

def parse(self, response):
    # which encoding to use depends on the page's actual encoding

    response=response.replace(encoding="gbk")

    # TODO: extract an item
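To see why the choice of encoding matters, here is a small standard-library sketch, independent of Scrapy. It assumes a page whose bytes are GBK-encoded and compares decoding them with the wrong and the right codec; the sample text is invented for the example.

```python
text = "中文网页"
raw = text.encode("gbk")  # bytes as a GBK-encoded server would send them

# Decoding with the wrong codec produces replacement characters (garbled text).
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the page really uses restores the text.
right = raw.decode("gbk")

print(right == text)
print(wrong == text)
```

This is the same idea as response.replace(encoding="gbk"): the bytes stay the same, only the codec used to read them changes.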

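2 Common solutions

2.1 Emailing the number of scraped items to a given user

The official documentation covers email: https://doc.scrapy.org/en/latest/topics/email.html

An article by a fellow blogger that uses Scrapy's mail module: http://blog.csdn.net/you_are_my_dream/article/details/60868329

I tried sending mail with Scrapy's mail module. The log reported success, but the message never arrived and I could not find out why, so I switched to smtplib. Modify pipelines.py as follows: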

class MailPipeline(object):

    def __init__(self):

        self.count = 0

    def open_spider(self,spider):

        pass

    def process_item(self, item, spider):

        self.count=self.count + 1

        return item                         # required: without this return, later pipelines receive no data

    def close_spider(self, spider):

        import smtplib

        from email.mime.text import MIMEText

        _user = "1072892917@qq.com"

        _pwd = "xxxxxxxx"                   # not the login password but the SMTP authorization code; see http://blog.csdn.net/you_are_my_dream/article/details/60868329

        _to = "1072892917@qq.com"

        msg = MIMEText("Test")

        msg["Subject"] = str(self.count)   # send the number of scraped items as the subject

        msg["From"] = _user

        msg["To"] = _to

        try:

            s = smtplib.SMTP_SSL("smtp.qq.com", 465)       # see http://service.mail.qq.com/cgi-bin/help?subtype=1&no=167&id=28

            s.login(_user, _pwd)

            s.sendmail(_user, _to, msg.as_string())

            s.quit()

            print("Success!")

        except smtplib.SMTPException as e:

            print("Failed, %s" % e)

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):

        self.file = open('items.jl', 'w')

    def close_spider(self, spider):

        self.file.close()

    def process_item(self, item, spider):

        line = json.dumps(dict(item)) + "\n"

        self.file.write(line)

        return item

Modify settings.py as follows:

ITEM_PIPELINES = {
    # 302 and 303 are priorities: the lower the number, the earlier the pipeline runs.

    'quotesbot.pipelines.MailPipeline': 302,

    'quotesbot.pipelines.JsonWriterPipeline': 303

}

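With this setup, items first pass through MailPipeline, which counts them and sends the email when the spider closes, and then through JsonWriterPipeline, which persists them. You can also change the pipeline order and send the persisted file as an attachment to the email.

Note: Scrapy's mail module is built on Twisted's mail module and supports asynchronous sending.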
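Sending the persisted file as an attachment could look roughly like the sketch below. It only builds the message with the standard library's email.message.EmailMessage and does not send anything; the build_report helper, the addresses, and the sample bytes are assumptions made for the example, not part of the pipeline above.

```python
from email.message import EmailMessage


def build_report(count, attachment_bytes, filename="items.jl"):
    """Build a mail whose subject carries the item count and whose
    attachment carries the exported items (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "Scraped %d items" % count
    msg["From"] = "sender@example.com"      # placeholder address
    msg["To"] = "receiver@example.com"      # placeholder address
    msg.set_content("Crawl finished; the exported items are attached.")
    # Attach the raw bytes of the export file as a generic binary part.
    msg.add_attachment(attachment_bytes, maintype="application",
                       subtype="octet-stream", filename=filename)
    return msg


msg = build_report(2, b'{"text": "a"}\n{"text": "b"}\n')
names = [part.get_filename() for part in msg.iter_attachments()]
print(msg["Subject"], names)
```

In close_spider you would read items.jl after JsonWriterPipeline has closed it and pass the message to smtplib's send_message; that ordering is why the pipeline priorities would need to be swapped.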

2.2 Using BeautifulSoup in Scrapy

Official Scrapy reference: https://doc.scrapy.org/en/latest/faq.html#can-i-use-scrapy-with-beautifulsoup

Official bs4 reference (English): https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Official bs4 reference (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

from bs4 import BeautifulSoup

import scrapy

class ExampleSpider(scrapy.Spider):

    name = "example"

    allowed_domains = ["example.com"]

    start_urls = (

        'http://www.example.com/',

    )

    def parse(self, response):

        # use lxml to get decent HTML parsing speed

        soup = BeautifulSoup(response.text, 'lxml')

        yield {

            "url": response.url,

            "title": soup.h1.string

        }

2.3 An item whose attributes come from several pages rather than a single page

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-scrape-an-item-with-attributes-in-different-pages

def parse_page1(self, response):

    item = MyItem()

    item['main_url'] = response.url

    request = scrapy.Request("http://www.example.com/some_page.html",

                             callback=self.parse_page2)

    request.meta['item'] = item

    yield request

def parse_page2(self, response):

    item = response.meta['item']

    item['other_url'] = response.url

    yield item

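The item is passed to the follow-up request through the request's meta, and the callback of the last request yields the finished item.

2.4 Scraping a page that requires login

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-simulate-a-user-login-in-my-spider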

import scrapy

class LoginSpider(scrapy.Spider):

    name = 'example.com'

    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):

        return scrapy.FormRequest.from_response(

            response,

            formdata={'username': 'john', 'password': 'secret'},

            callback=self.after_login

        )

    def after_login(self, response):

        # check that login succeeded before going on

        if "authentication failed" in response.text:  # response.body is bytes on Python 3, so compare against response.text

            self.logger.error("Login failed")

            return

        # continue scraping with authenticated session...

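This uses FormRequest to submit the username and password and receive the server's response. Check that response: continue scraping if the login succeeded, stop if it failed.

2.5 Running a spider without creating a project

Official reference: https://doc.scrapy.org/en/latest/faq.html#can-i-run-a-spider-without-creating-a-project

Official reference: https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script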

import scrapy

from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):

    # Your spider definition

    ...

process = CrawlerProcess({

    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'

})

process.crawl(MySpider)

process.start() # the script will block here until the crawling is finished

2.6 The simplest way to store scraped data

Official reference: https://doc.scrapy.org/en/latest/faq.html#simplest-way-to-dump-all-my-scraped-items-into-a-json-csv-xml-file

scrapy crawl myspider -o items.json

scrapy crawl myspider -o items.csv

scrapy crawl myspider -o items.xml

scrapy crawl myspider -o items.jl

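This is the quickest method, but it has one problem: by default the JSON output escapes non-ASCII characters as \uXXXX sequences, so Chinese text is not readable in the file. We want UTF-8 output instead.

1. Set FEED_EXPORT_ENCODING = 'utf-8' in the settings.

2. See my post on exporting items in various formats: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html

Both methods work. I recommend the second because it is easier to extend.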
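The escaping effect is easy to reproduce with the standard library's json module. This sketch does not use Scrapy's exporter; it only shows the difference between ASCII-escaped and UTF-8-friendly JSON output for a made-up item.

```python
import json

item = {"title": "中文"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the readability of the file differs.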

2.7 Stopping the spider when a condition is met

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself

def parse_page(self, response):

    if 'Bandwidth exceeded' in response.text:  # response.body is bytes on Python 3, so compare against response.text

        raise CloseSpider('bandwidth_exceeded')

To stop once a given number of items has been scraped, you can use the following approach:

# -*- coding: utf-8 -*-

import scrapy

from scrapy.exceptions import CloseSpider

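class ToScrapeSpiderXPath(scrapy.Spider):

    name = 'toscrape-xpath'

    start_urls = [

        'http://quotes.toscrape.com/',

    ]

    def __init__(self, *args, **kwargs):

        # call the base initializer so the spider is set up normally

        super(ToScrapeSpiderXPath, self).__init__(*args, **kwargs)

        self.count = 0        # number of items seen so far

        self.max_count = 100  # maximum number of items to scrape

    def parse(self, response):

        for quote in response.xpath('//div[@class="quote"]'):

            self.count = self.count + 1

            if self.count > self.max_count:

                raise CloseSpider('max_count_reached')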

            yield {

                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),

                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),

                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()

            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()

        if next_page_url is not None:

            yield scrapy.Request(response.urljoin(next_page_url))

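Alternatively, you can call self.crawler.stop().

Scrapy also ships with a built-in extension that closes the spider when certain conditions are met; see https://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider

A brief summary of this extension's settings:

CLOSESPIDER_TIMEOUT: close the spider once it has been open longer than the given number of seconds.
CLOSESPIDER_ITEMCOUNT: close the spider once the given number of items has passed through the item pipeline. Requests already in the downloader queue (up to CONCURRENT_REQUESTS) are still processed, so the final count can exceed the limit slightly.
CLOSESPIDER_PAGECOUNT: close the spider once the given number of pages has been crawled.
CLOSESPIDER_ERRORCOUNT: close the spider once the given number of errors has occurred.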
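As a sketch, the four settings could be placed in settings.py like this. The numbers are placeholder values chosen for the example, not recommendations; a value of 0, the default, leaves the corresponding condition disabled.

```python
# settings.py (fragment); all numbers below are assumed example values
CLOSESPIDER_TIMEOUT = 3600     # close after the spider has been open for an hour
CLOSESPIDER_ITEMCOUNT = 100    # close after 100 items have passed the pipelines
CLOSESPIDER_PAGECOUNT = 500    # close after 500 responses have been crawled
CLOSESPIDER_ERRORCOUNT = 10    # close after 10 errors
```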

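If the built-in conditions do not fit your needs, you can write your own extension; see https://doc.scrapy.org/en/latest/topics/extensions.html#writing-your-own-extension

2.7 Avoiding getting the spider banned

Official reference: https://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

1 Keep a list of user agents and pick one from the list for each request.

2 Disable cookies; some sites use cookies to identify bots.

3 Use a download delay (DOWNLOAD_DELAY); some sites limit the number of requests per unit of time and ban clients that exceed it.

4 If possible, use the Google cache instead of requesting the site directly.

5 Use a pool of IPs, for example ProxyMesh or scrapoxy.

6 Use a highly distributed downloader, for example Crawlera.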
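The first tip can be sketched as a small downloader middleware. The class name, the user agent strings other than the MSIE one, and the stand-in request object are all assumptions made so the example runs without Scrapy installed; in a real project Scrapy passes its own Request object to process_request.

```python
import random
from types import SimpleNamespace

# Placeholder user agents; replace them with real browser strings.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy keep processing the request


# Stand-in for a scrapy Request, used only to exercise the middleware here.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

To use it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py (the module path and priority are up to your project). Tips 2 and 3 map to the settings COOKIES_ENABLED = False and DOWNLOAD_DELAY, for example DOWNLOAD_DELAY = 2.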

2.8 Passing arguments when starting a spider

Official reference: https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):

        super(MySpider, self).__init__(*args, **kwargs)

        self.start_urls = ['http://www.example.com/categories/%s' % category]

With this in place, run the spider as follows:

scrapy crawl myspider -a category=electronics

2.9 Modifying the pipeline to support export in multiple formats

Official reference: https://doc.scrapy.org/en/latest/topics/exporters.html#using-item-exporters

Blog reference (my own): http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html

A concrete project for reference: https://github.com/zhaojiedi1992/ScrapyCnblogs

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import signals

from scrapy.exporters import *

import logging

logger=logging.getLogger(__name__)

class BaseExportPipeLine(object):

    def __init__(self,**kwargs):

        self.files = {}

        self.exporter=kwargs.pop("exporter",None)

        self.dst=kwargs.pop("dst",None)

        self.option=kwargs

    @classmethod

    def from_crawler(cls, crawler):

        pipeline = cls()

        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)

        return pipeline

    def spider_opened(self, spider):

        file = open(self.dst, 'wb')

        self.files[spider] = file

        self.exporter = self.exporter(file,**self.option)

        self.exporter.start_exporting()

    def spider_closed(self, spider):

        self.exporter.finish_exporting()

        file = self.files.pop(spider)

        file.close()

    def process_item(self, item, spider):

        self.exporter.export_item(item)

        return item

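#

# 'fields_to_export':["url","edit_url","title"] exports only the listed fields; supported by all the pipelines below

# 'export_empty_fields':False controls whether empty fields are exported; supported by all the pipelines below

# 'encoding':'utf-8' sets the output encoding; supported by all the pipelines below

# 'indent' sets the indentation; mainly used by JsonExportPipeline

# "item_element":"item" sets the name of each item's XML element; XmlExportPipeline only, giving <item></item>

# "root_element":"items" sets the name of the XML root element; XmlExportPipeline only, giving <items> with many item elements inside </items>

# "include_headers_line":True controls whether the header row of field names is written; CsvExportPipeline only

# "join_multivalued":"," sets the string used to join the values of a multi-valued field (it is not the column delimiter); CsvExportPipeline only

# 'protocol':2 sets the pickle protocol; PickleExportPipeline only

# "dst":"items.json" sets the output file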

class JsonExportPipeline(BaseExportPipeLine):

    def __init__(self):

        # the indent value was lost in the source; 4 is an assumed placeholder

        option={"exporter":JsonItemExporter,"dst":"items.json","encoding":"utf-8","indent":4}

        super(JsonExportPipeline, self).__init__(**option)

class JsonLinesExportPipeline(BaseExportPipeLine):

    def __init__(self):

        option={"exporter":JsonLinesItemExporter,"dst":"items.jl","encoding":"utf-8"}

        super(JsonLinesExportPipeline, self).__init__(**option)

class XmlExportPipeline(BaseExportPipeLine):

    def __init__(self):

        option={"exporter":XmlItemExporter,"dst":"items.xml","item_element":"item","root_element":"items","encoding":'utf-8'}

        super(XmlExportPipeline, self).__init__(**option)

class CsvExportPipeline(BaseExportPipeLine):

    def __init__(self):

        # join_multivalued did not change the column separator in my tests; it only joins the values of a multi-valued field

        option={"exporter":CsvItemExporter,"dst":"items.csv","encoding":"utf-8","include_headers_line":True, "join_multivalued":","}

        super(CsvExportPipeline, self).__init__(**option)

class  PickleExportPipeline(BaseExportPipeLine):

    def __init__(self):

        # protocol 2 matches the option notes above

        option={"exporter":PickleItemExporter,"dst":"items.pickle",'protocol':2}

        super(PickleExportPipeline, self).__init__(**option)

class  MarshalExportPipeline(BaseExportPipeLine):

    def __init__(self):

        option={"exporter":MarshalItemExporter,"dst":"items.marshal"}

        super(MarshalExportPipeline, self).__init__(**option)

class  PprintExportPipeline(BaseExportPipeLine):

    def __init__(self):

        option={"exporter":PprintItemExporter,"dst":"items.pprint.jl"}

        super(PprintExportPipeline, self).__init__(**option)
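The difference between the column delimiter and join_multivalued, which the CSV pipeline above touches on, can be shown with the standard library's csv module alone. This is only an illustration with a made-up item, not Scrapy's CsvItemExporter itself.

```python
import csv
import io

item = {"text": "hello", "tags": ["a", "b", "c"]}

buf = io.StringIO()
writer = csv.writer(buf, delimiter=";")  # the column delimiter
writer.writerow(["text", "tags"])
# Joining the list is what join_multivalued controls: one cell, values joined by ",".
writer.writerow([item["text"], ",".join(item["tags"])])

print(buf.getvalue())
```

As far as I know from the exporter documentation, keyword arguments that CsvItemExporter does not consume itself are passed on to csv.writer, so adding a delimiter key to option may change the column separator; treat that as an assumption to verify against your Scrapy version.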