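Scrapy Notes (2)

Notes on a Scrapy project that crawls Zhihu.

1. Create the project from the command line

scrapy startproject zhihu
cd zhihu
scrapy genspider zhihu www.zhihu.com  (a throwaway project, not a production one)

Starting the spider directly with scrapy crawl zhihu produces server errors such as 500 or 501.

Fix: add a User-Agent for the spider in settings.py: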

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    # added in the next step, after the 401 error
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}

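Retrying now returns a 401 error.

Fix: inspecting the browser's requests shows that one more header is needed: 'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
With it added, Zhihu can be accessed successfully.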
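To check the header fix outside Scrapy, the same two headers can be attached to a plain standard-library request. This is a minimal sketch and not part of the project: `build_request` is a hypothetical helper, the token is the one quoted above and may well have expired, and nothing is sent until `urllib.request.urlopen(req)` is called.

```python
import urllib.request

# Headers copied from the settings above; the oauth token is only the one
# quoted in this post and is assumed to be valid purely for illustration.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
    ),
    "authorization": "oauth c3cef7c66a1843f8b3a9e6a1e3160e20",
}


def build_request(url, headers=None):
    """Hypothetical helper: build (but do not send) a request with the headers."""
    return urllib.request.Request(url, headers=headers or HEADERS)


req = build_request("https://www.zhihu.com/api/v4/members/undo-59-76")
# urllib stores header names via str.capitalize(), so look them up that way.
print(req.get_header("User-agent"))
print(req.get_header("Authorization"))
```

Calling `urllib.request.urlopen(req)` would then show whether the server still answers with an error status for this header set.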

2. Define the item holding the data the project needs

import scrapy


class UserItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    headline = scrapy.Field()
    url_token = scrapy.Field()
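The three fields act as a whitelist: the spider in section 4 copies only these keys from each API response into the item. A rough sketch of that filtering with plain dictionaries, so it runs without Scrapy; `ITEM_FIELDS`, `pick_fields`, and the sample response are made up for illustration.

```python
# Field names mirror UserItem; the sample response below is invented.
ITEM_FIELDS = ("name", "headline", "url_token")


def pick_fields(response_json, fields=ITEM_FIELDS):
    """Keep only the keys that the item declares."""
    return {key: response_json[key] for key in fields if key in response_json}


sample = {
    "name": "Example User",
    "headline": "An example headline",
    "url_token": "example-user",
    "follower_count": 42,  # not declared on the item, so it is dropped
}
print(pick_fields(sample))
```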

3. middlewares.py needs no changes for now

def spider_opened(self, spider):
    spider.logger.info('Spider opened: %s' % spider.name)

4. Main code: writing the zhihu spider

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Spider, Request

from zhihu.items import UserItem


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['http://www.zhihu.com/']

    # start_user stores the account of the high-profile user ("big V") the crawl starts from
    #start_user = "-LKs-"
    start_user = 'undo-59-76'

    # The query parameters are kept separately in user_query; user_url is the URL for fetching a user's profile
    user_url = "https://www.zhihu.com/api/v4/members/{user}?include={include}"
    user_query = "locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,avatar_hue,answer_count,articles_count,pins_count,question_count,columns_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_bind_phone,is_force_renamed,is_bind_sina,is_privacy_protected,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics"

    # follows_url is the URL of the followee list and follows_query holds its query parameters.
    # offset and limit control paging; offset=0, limit=20 is the first page
    follows_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}"
    follows_query = "data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics"

    # followers_url is the URL of the follower list; followers_query holds its query parameters
    followers_url = "https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}"
    followers_query = "data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics"

    def start_requests(self):
        '''
        Request the start user's profile, followee list, and follower list.
        '''
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), callback=self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20), callback=self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20), callback=self.parse_followers)

    def parse_user(self, response):
        '''
        The response is JSON, so load it directly with json.loads.
        '''
        result = json.loads(response.text)
        item = UserItem()
        # Copy every field declared on the item that is present in the response
        for field in item.fields:
            if field in result:
                item[field] = result.get(field)
                print(item[field])
        # Yield the item together with requests for this user's followee and
        # follower lists, so the crawl continues recursively
        yield item
        yield Request(self.follows_url.format(user=result.get("url_token"), include=self.follows_query, offset=0, limit=20), callback=self.parse_follows)
        yield Request(self.followers_url.format(user=result.get("url_token"), include=self.followers_query, offset=0, limit=20), callback=self.parse_followers)

    def parse_follows(self, response):
        '''
        Callback for the followee list.
        '''
        results = json.loads(response.text)
        if 'data' in results:
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get("url_token"), include=self.user_query), callback=self.parse_user)
        # is_end lives inside paging: if paging exists and paging.is_end is False,
        # this is not the last page
        if 'paging' in results and results.get('paging').get('is_end') is False:
            next_page = results.get('paging').get("next")
            # Request the next page and parse it with this same callback
            yield Request(next_page, callback=self.parse_follows)

    def parse_followers(self, response):
        '''
        Callback for the follower list; handled the same way as the followee list.
        The response is also JSON with two keys, data and paging, where paging
        holds the pagination info.
        '''
        results = json.loads(response.text)
        if 'data' in results:
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get("url_token"), include=self.user_query), callback=self.parse_user)
        # If paging exists and paging.is_end is False, this is not the last page
        if 'paging' in results and results.get('paging').get('is_end') is False:
            next_page = results.get('paging').get("next")
            # Request the next page and parse it with this same callback
            yield Request(next_page, callback=self.parse_followers)
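The URL construction and the paging check can be exercised offline, without Scrapy. The following is only a sketch under assumptions: `first_page_url` and `next_page_url` are hypothetical helpers, and the sample payloads are invented to match the `data`/`paging` layout described in the comments above; a real API response may contain more fields.

```python
from urllib.parse import unquote

FOLLOWS_URL = (
    "https://www.zhihu.com/api/v4/members/{user}/followees"
    "?include={include}&offset={offset}&limit={limit}"
)
# Same percent-encoded value as follows_query in the spider.
FOLLOWS_QUERY = (
    "data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count"
    "%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics"
)


def first_page_url(user, limit=20):
    """Hypothetical helper: URL of the first page of a user's followee list."""
    return FOLLOWS_URL.format(user=user, include=FOLLOWS_QUERY, offset=0, limit=limit)


def next_page_url(results):
    """Return paging.next while paging.is_end is False, otherwise None."""
    paging = results.get("paging") or {}
    if paging.get("is_end") is False:
        return paging.get("next")
    return None


# Invented payloads that only mimic the data/paging layout.
second_page = FOLLOWS_URL.format(
    user="undo-59-76", include=FOLLOWS_QUERY, offset=20, limit=20
)
middle_page = {
    "data": [{"url_token": "example-user"}],
    "paging": {"is_end": False, "next": second_page},
}
last_page = {"data": [], "paging": {"is_end": True}}

print(unquote(FOLLOWS_QUERY))      # the include parameter in readable form
print(next_page_url(middle_page))  # the URL with offset=20
print(next_page_url(last_page))    # None
```

Reading `is_end` from inside `paging` is the important detail: looking it up on the top-level object returns `None`, so the comparison with `False` never succeeds and the crawl stops after the first page.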

5. Store the crawled data

Process the crawled items in pipelines.py:

import json

import pymysql
from scrapy.exceptions import DropItem


# Pipeline that writes items to a JSON Lines file
class JsonFilePipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        print(line)
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the file so buffered lines are flushed to disk
        self.file.close()


# Pipeline that drops duplicates, keyed on url_token
class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['url_token'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['url_token'])
            return item

# Pipeline that stores items in a MySQL database
class MysqlPipeline(object):
    MysqlHandle = None
    Cusor = None

    # Runs when the spider is opened
    def open_spider(self, spider):
        print('spider opened')
        print('connecting to MySQL')
        self.MysqlHandle = self.dbHandle()
        self.Cusor = self.MysqlHandle.cursor()

    # Runs when the spider is closed
    def close_spider(self, spider):
        print('spider closed')
        print('closing MySQL connection')
        # Close the cursor
        self.Cusor.close()
        # Close the connection
        self.MysqlHandle.close()

    def dbHandle(self):
        conn = pymysql.connect(
            host='localhost',
            user='root',
            passwd='passwd',
            db="databases",
            charset='utf8',
            use_unicode=False
        )
        return conn

    def process_item(self, item, spider):
        dbObject = self.MysqlHandle
        cursor = self.Cusor
        # insert ignore skips rows whose unique key already exists
        sql = 'insert ignore into zhihu(url_token,name,headline) values (%s,%s,%s)'
        try:
            cursor.execute(sql, (item['url_token'], item['name'], item['headline']))
            dbObject.commit()
        except Exception as e:
            print(e)
            dbObject.rollback()
        return item
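`insert ignore` only skips duplicates if the `zhihu` table has a unique key, and the post does not show the table definition. The sketch below assumes `url_token` is the primary key and swaps in the standard library's in-memory SQLite (whose equivalent statement is `INSERT OR IGNORE`, with `?` placeholders) purely so that it runs without a MySQL server; the column types are guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the original post does not show how the zhihu table was created.
conn.execute(
    "CREATE TABLE zhihu ("
    " url_token TEXT PRIMARY KEY,"
    " name TEXT,"
    " headline TEXT)"
)

sql = "INSERT OR IGNORE INTO zhihu (url_token, name, headline) VALUES (?, ?, ?)"
rows = [
    ("example-user", "Example User", "first headline"),
    ("example-user", "Example User", "duplicate, silently skipped"),
    ("another-user", "Another User", "second headline"),
]
with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany(sql, rows)

count = conn.execute("SELECT COUNT(*) FROM zhihu").fetchone()[0]
stored = conn.execute(
    "SELECT headline FROM zhihu WHERE url_token = ?", ("example-user",)
).fetchone()[0]
print(count)   # 2
print(stored)  # first headline
conn.close()
```

In MySQL the corresponding table would declare `url_token` as a `PRIMARY KEY` or `UNIQUE` column; without one, `insert ignore` inserts every row.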

The pipelines then need to be enabled in settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # save to a file
    #'zhihu.pipelines.JsonFilePipeline': 300,
    # save to the MySQL database
    'zhihu.pipelines.MysqlPipeline': 800,
    # de-duplicate
    'zhihu.pipelines.DuplicatesPipeline': 300
}
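The numbers decide the order: Scrapy runs item pipelines from the lowest value to the highest, so here the duplicates filter (300) sees each item before the MySQL pipeline (800), and an item dropped by the filter never reaches the database. A tiny illustration of that ordering rule, using plain sorting rather than Scrapy itself:

```python
ITEM_PIPELINES = {
    "zhihu.pipelines.MysqlPipeline": 800,
    "zhihu.pipelines.DuplicatesPipeline": 300,
}

# Sort by the configured number to see the order items pass through.
run_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(run_order)
# ['zhihu.pipelines.DuplicatesPipeline', 'zhihu.pipelines.MysqlPipeline']
```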

6. Options that can be set in settings.py (more to be added)

# log change
LOG_ENABLED = False
#LOG_LEVEL = 'ERROR'
#LOG_FILE = 'log.txt'

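7. Incremental crawling in Scrapy (pausing and resuming a crawl)

How to use it

To enable persistence for a spider, run the following command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

You can then stop the spider safely at any time (by pressing Ctrl-C or sending a signal). To resume it, run the same command again:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1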
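For this project the spider name is `zhihu`, so the command would presumably be `scrapy crawl zhihu -s JOBDIR=crawls/zhihu-1`. `JOBDIR` is an ordinary Scrapy setting, so it can also be placed in settings.py instead of on the command line; the directory name below is an arbitrary choice.

```python
# settings.py (fragment)
# Directory where Scrapy keeps the pending-request queue and the seen-request
# fingerprints between runs. Use a separate directory for each spider or job.
JOBDIR = "crawls/zhihu-1"
```

Deleting that directory discards the saved state, so the next run starts the crawl from the beginning.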
