scrapy爬取cnblogs文章列表

- scrapy爬取cnblogs文章

目标任务

爬取 https://www.cnblogs.com 首页文章，爬取的内容包括：标题、推荐数、链接、内容预览、作者、作者blogs链接、评论数、查看数。

安装爬虫

pip install scrapy

python 版本 3.7， scrapy 版本 1.6.0

创建爬虫

#  创建工程

scrapy startproject CnblogsSpider

# 创建爬虫

cd CnblogsSpider

scrapy genspider -t crawl cnblogs cnblogs.com

爬虫名称 cnblogs , 作用域 cnblogs.com，爬虫类型 crawl

编写 `items.py`

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CnblogsspiderItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    # 文章标题

    title = scrapy.Field()

    # 推荐数

    diggnum = scrapy.Field()

    # 文章链接

    link = scrapy.Field()

    # 文章内容预览

    post_item_summary = scrapy.Field()

    # 作者

    author = scrapy.Field()

    # 作者blogs链接

    author_link = scrapy.Field()

    # 文章评论数

    article_comment = scrapy.Field()

    # 文章查看数

    article_view = scrapy.Field()

定义需求爬取的 item 项

编写 `spiders/cnblogs.py`

# -*- coding: utf-8 -*-

import scrapy

# 导入CrawlSpider类和Rule

from scrapy.spiders import CrawlSpider, Rule

# 导入链接规则匹配类，用来提取符合规则的连接

from scrapy.linkextractors import LinkExtractor

from CnblogsSpider.items import CnblogsspiderItem

class CnblogsSpider(CrawlSpider):

    # 爬虫名称

    name = 'cnblogs'

    allowed_domains = ['cnblogs.com']

    start_urls = ['https://www.cnblogs.com/sitehome/p/1']

    # Response里链接的提取规则，返回的符合匹配规则的链接匹配对象的列表

    pagelink = LinkExtractor(allow=("/sitehome/p/\d+"))

    rules = [

        # 获取这个列表里的链接，依次发送请求，并且继续跟进，调用指定回调函数处理

        Rule(pagelink, callback="parse_item", follow=True)

    ]

    # CrawlSpider的rules属性是直接从response对象的文本中提取url，然后自动创建新的请求。

    # 与Spider不同的是，CrawlSpider已经重写了parse函数

    # scrapy crawl spidername开始运行，程序自动使用start_urls构造Request并发送请求，

    # 然后调用parse函数对其进行解析，在这个解析过程中使用rules中的规则从html（或xml）文本中提取匹配的链接，

    # 通过这个链接再次生成Request，如此不断循环，直到返回的文本中再也没有匹配的链接，或调度器中的Request对象用尽，程序才停止。

    # 如果起始的url解析方式有所不同，那么可以重写CrawlSpider中的另一个函数parse_start_url(self, response)用来解析第一个url返回的Response，但这不是必须的。

    def parse_item(self, response):

        for each in response.xpath("//div[@class='post_item']"):

            item = CnblogsspiderItem()

            item['diggnum'] = each.xpath("./div[1]/div[1]/span[1]/text()").extract()[0].strip()

            item['title'] = each.xpath("./div[2]/h3/a/text()").extract()[0].strip()

            item['link'] = each.xpath("./div[2]/h3/a/@href").extract()[0].strip()

            item['post_item_summary'] = ''.join(each.xpath("./div[2]/p[@class='post_item_summary']/text()").extract()).strip()

            item['author'] = each.xpath("./div[2]/div/a/text()").extract()[0].strip()

            item['author_link'] = each.xpath("./div[2]/div/a/@href").extract()[0].strip()

            item['article_comment'] = each.xpath("./div[2]/div/span[1]/a/text()").extract()[0].strip()

            item['article_view'] = each.xpath("./div[2]/div/span[2]/a/text()").extract()[0].strip()

            yield item

爬虫的主逻辑

编写 `pipelines.py`

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class CnblogsspiderPipeline(object):

    def __init__(self):

        self.filename = open("cnblogs.json", "w", encoding='utf-8')

    def process_item(self, item, spider):

        try:

            text = json.dumps(dict(item), ensure_ascii=False) + "\n"

            self.filename.write(text)

        except BaseException as e:

            print(e)

        return item

    def close_spider(self, spider):

        self.filename.close()

处理每个页面爬取得到的 item 项

编写 `settings.py`

# -*- coding: utf-8 -*-

# Scrapy settings for CnblogsSpider project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'CnblogsSpider'

SPIDER_MODULES = ['CnblogsSpider.spiders']

NEWSPIDER_MODULE = 'CnblogsSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'CnblogsSpider (+http://www.yourdomain.com)'

# 自定义user_agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) ' \

             'Chrome/73.0.3683.86 Safari/537.36'

# Obey robots.txt rules

# 如果启用,Scrapy将会采用 robots.txt策略

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

# 设置最大请求数

CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

# 禁用Cookie（默认情况下启用）

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

# 禁用Telnet控制台（默认启用）

TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#   'Accept-Language': 'en',

#}

# 设置请求头

DEFAULT_REQUEST_HEADERS = {

    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',

    'Accept-Encoding': 'gzip,deflate,br',

    'accept-language': 'zh-CN,zh;q=0.9',

    'cache-control': 'no-cache',

    'pragma': 'no-cache',

    'upgrade-insecure-requests': '1',

    'host': 'www.cnblogs.com'

}

# 启用或禁用蜘蛛中间件

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'CnblogsSpider.middlewares.CnblogsspiderSpiderMiddleware': 543,

#}

# 启用或禁用下载器中间件

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'CnblogsSpider.middlewares.CnblogsspiderDownloaderMiddleware': 543,

#}

# 启用或禁用扩展程序

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# 配置项目管道

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

    'CnblogsSpider.pipelines.CnblogsspiderPipeline': 300,

}

# 启用和配置AutoThrottle扩展（默认情况下禁用）

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# 开始下载时限速并延迟时间

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# 高并发请求时最大延迟时间

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# 启用和配置HTTP缓存（默认情况下禁用）, 如果开启会优先读取本地缓存，从而加快爬取速度，视情况而定

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# 设置日志等级

# LOG_LEVEL = 'DEGUG'

运行爬虫

cd CnblogsSpider

scrapy crawl cnblogs

CnblogsSpider 为项目文件夹， cnblogs 为爬虫名

结果文件 cnblogs.json 在项目文件夹根目录，内容如下：

{"diggnum": "2", "title": "在嵌入式设备中使用 JavaScript 的前景", "link": "https://www.cnblogs.com/conmajia/p/javascript-in-embedded-devices.html", "post_item_summary": "这几年嵌入式/可穿戴设备处理器从 51、AVR 单片机瞬间起飞到 ARM 4 核、8 核，主频 3GHz 起步。硬件能力的提升让硬件集成 JavaScript 引擎成为现实。我大概了解了一下这方面的进展，觉得前景非常广阔。 ...", "author": "Conmajia", "author_link": "https://www.cnblogs.com/conmajia/", "article_comment": "评论(0)", "article_view": "阅读(283)"}

{"diggnum": "1", "title": "Python学习笔记", "link": "https://www.cnblogs.com/xjtu-blacksmith/p/10347247.html", "post_item_summary": "[TOC] 学了N年 的我，现在得为着学业需要而学一下 ，这简明轻快的语言风格真让我有些不习惯。最主要的是，各种各样的命令我总是忘记。以此笔记摘录一下主要内容，以免遗忘。 入门的读本是：《 \"Python语言及其应用\" 》，Bill Lubanovic著，丁嘉瑞、梁杰、禹常隆译。 笔记中若有提到“其 ...", "author": "黑山雁", "author_link": "https://www.cnblogs.com/xjtu-blacksmith/", "article_comment": "评论(0)", "article_view": "阅读(176)"}

{"diggnum": "0", "title": "补习系列(17)-springboot mongodb 内嵌数据库", "link": "https://www.cnblogs.com/littleatp/p/10462395.html", "post_item_summary": "[TOC] 简介 前面的文章中，我们介绍了如何在SpringBoot 中使用MongoDB的一些常用技巧。 那么，与使用其他数据库如 MySQL 一样，我们应该怎么来做MongoDB的单元测试呢？ 使用内嵌数据库的好处是不需要依赖于一个外部环境，如果每一次跑单元测试都需要依赖一个稳定的外部环境，那么 ...", "author": "美码师", "author_link": "https://www.cnblogs.com/littleatp/", "article_comment": "评论(0)", "article_view": "阅读(84)"}

......

......

......

cnblogs 首页一般显示200页，每页20条数据，如果爬取不出错结果应该有4000条
爬虫不保证时效性，源站页面调整就可能导致失效

scrapy爬取cnblogs文章列表的更多相关文章

一文搞定scrapy爬取众多知名技术博客文章保存到本地数据库，包含：cnblog、csdn、51cto、itpub、jobbole、oschina等
本文旨在通过爬取一系列博客网站技术文章的实践,介绍一下scrapy这个python语言中强大的整站爬虫框架的使用.各位童鞋可不要用来干坏事哦,这些技术博客平台也是为了让我们大家更方便的交流.学习.提高 ...
爬虫实战——Scrapy爬取伯乐在线所有文章
Scrapy简单介绍及爬取伯乐在线所有文章一.简说安装相关环境及依赖包 1.安装Python(2或3都行,我这里用的是3) 2.虚拟环境搭建: 依赖包:virtualenv,virtualenvwr ...
Scrapy分布式爬虫打造搜索引擎- (二)伯乐在线爬取所有文章
二.伯乐在线爬取所有文章 1. 初始化文件目录基础环境 python 3.6.5 JetBrains PyCharm 2018.1 mysql+navicat 为了便于日后的部署:我们开发使用了虚拟 ...
scrapy架构与目录介绍、scrapy解析数据、配置相关、全站爬取cnblogs数据、存储数据、爬虫中间件、加代理、加header、集成selenium
今日内容概要 scrapy架构和目录介绍 scrapy解析数据 setting中相关配置全站爬取cnblgos文章存储数据爬虫中间件和下载中间件加代理,加header,集成selenium 内 ...
Scrapy爬取美女图片续集 (原创)
上一篇咱们讲解了Scrapy的工作机制和如何使用Scrapy爬取美女图片,而今天接着讲解Scrapy爬取美女图片,不过采取了不同的方式和代码实现,对Scrapy的功能进行更深入的运用.(我的新书< ...
Scrapy爬取自己的博客内容
python中常用的写爬虫的库有urllib2.requests,对于大多数比较简单的场景或者以学习为目的,可以用这两个库实现.这里有一篇我之前写过的用urllib2+BeautifulSoup做的一 ...
Scrapy爬取美女图片 (原创)
有半个月没有更新了,最近确实有点忙.先是华为的比赛,接着实验室又有项目,然后又学习了一些新的知识,所以没有更新文章.为了表达我的歉意,我给大家来一波福利... 今天咱们说的是爬虫框架.之前我使用pyt ...
【转载】教你分分钟学会用python爬虫框架Scrapy爬取心目中的女神
原文:教你分分钟学会用python爬虫框架Scrapy爬取心目中的女神本博文将带领你从入门到精通爬虫框架Scrapy,最终具备爬取任何网页的数据的能力.本文以校花网为例进行爬取,校花网:http:/ ...
scrapy爬取全部知乎用户信息
# -*- coding: utf-8 -*- # scrapy爬取全部知乎用户信息 # 1:是否遵守robbots_txt协议改为False # 2: 加入爬取所需的headers: user-ag ...

随机推荐

js监听input输入字符变化
<p class="text-input"> <input type="text" id="username" autoC ...
Delphi实现程序只运行一次并激活已打开的程序
我们的程序有时候只允许运行一次,并且最好的情况是,如果程序第二次运行,就激活原来的程序.网上有很多的方法实现程序只运行一次,但对于激活原来的窗口却都不怎么好.关键就在于激活原来的程序,一般的做法是在工 ...
php 实现的功能
1.php写日志函数 (如:前端请求日志记录) : https://www.cnblogs.com/lvchenfeng/p/6794822.html 2.php中(服务器)使用CURL实现GET和P ...
入门级_Tensorflow网络搭建
Tensorflow如何搭建神经网络 1.基本概念基于Tensorflow的神经网络:用张量表示数据,用计算图搭建神经网络,用会话执行计算图,优化线上的权重(参数),得到模型张量:张量就是多维数据 ...
cgroup & oom-killer 简介
cgroup内存限制 memory.failcnt memory.limit_in_bytes memory.usage_in_bytes memory.max_usage_in_bytes memo ...
java发带图片正文和附件的邮件mail
package com.mail; import java.io.UnsupportedEncodingException; import java.util.Date; import java.ut ...
python之lambda,random,timeit,collections,
python之lambda,random,timeit,collections,一. python之lambda函数lambda函数又称为匿名函数,匿名函数就是没有函数名的函数.>>> ...
用 Flask 来写个轻博客 (2) — Hello World！
Blog 项目源码:https://github.com/JmilkFan/JmilkFan-s-Blog 目录目录前文列表扩展阅读实现最简单的 Flask 应用创建 config.py 文 ...
用 Flask 来写个轻博客 (1) — 创建项目
目录目录前言扩展阅读部署开发环境创建 Github 项目前言一步一步的实现一个 Flask 轻博客项目启动,最新的代码会上传到 Github. 扩展阅读欢迎使用 Flask - vir ...
inode缓存与dentry缓存
1. inode缓存 1: struct inode { 2: /* RCU path lookup touches following: */ 3: umode_t i_mode; 4: uid_t ...

scrapy爬取cnblogs文章列表

目标任务

安装爬虫

创建爬虫

编写 items.py

编写 spiders/cnblogs.py

编写 pipelines.py

编写 settings.py

运行爬虫

scrapy爬取cnblogs文章列表的更多相关文章

随机推荐

热门专题

编写 `items.py`

编写 `spiders/cnblogs.py`

编写 `pipelines.py`

编写 `settings.py`