使用scrapy爬取jian shu文章
settings.py中一些东西的含义可以看一下这里
python的scrapy框架的使用 和xpath的使用 && scrapy中request和response的函数参数 && parse()函数运行机制
目录结构
创建一个scrapy项目(最后那个js是你创建项目的名字)
scrapy startprojects js
创建以crawl为模板的爬虫
scrapy genspider -t crawl jianshu wwww.jianshu.com
一、使用scrapy的crawl模板来创建一个爬虫jianshu.py文件
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from js.items import JsItem class JianshuSpider(CrawlSpider):
# 运行scrapy时候的名字
name = 'jianshu'
allowed_domains = ['jianshu.com']
start_urls = ['https://www.jianshu.com']
rules = (
# 要爬取网页上的推荐文章,进而可以使用crawl爬取我们规定的链接
# 网页上的推荐文章链接是“/p/(12个【字母/数字】)”
Rule(LinkExtractor(allow=r'.*?p/.*?'), callback='parse_detail', follow=True),
) def parse_detail(self, response):
# print(response.text())
# print('-'*30)
passage_id = response.url
title = response.xpath('//h1[@class="_1RuRku"]/text()').get()
time = response.xpath('//div[@class="s-dsoj"]/time/text()').get()
author = response.xpath('//span[@class="FxYr8x"]/a/text()').get()
body = response.xpath('//article[@class="_2rhmJa"]').get()
type = response.xpath('//div[@class="_2Nttfz"]/a/img/text()').getall()
type = ','.join(type)
# 比如文章链接为https://www.jianshu.com/p/ef7bb28258c8
# 那么文章的id,passage_id就是ef7bb28258c8,因为有的链接后面还有跟上“?参数”,所以我们先按照“?”
# 切割链接,然后再取出文章id
passage_id = passage_id.split('?')[0]
passage_id = passage_id.split('/')[-1]
# 有些文章违规在简书上不能看,但是链接还存在,所以我们需要判断一下
if author == None:
pass
else: # 只要返回的对象是item类型,无论在那个函数里面返回都是返回到pipelines.py的item参数里面
item = JsItem(
passage_id = passage_id,
title = title,
time = time,
author = author,
body = body,
type = type
)
yield item
二、item.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html import scrapy # 定义一下Item的对象的一些变量
class JsItem(scrapy.Item):
passage_id = scrapy.Field()
title = scrapy.Field()
time = scrapy.Field()
author = scrapy.Field()
body = scrapy.Field()
type = scrapy.Field()
三、start.py (启动程序)
from scrapy import cmdline
# jianshu是你的爬虫程序文件,注意要把这个命令用split拆开
cmdline.execute("scrapy crawl jianshu".split())
四、pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html # useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from twisted.enterprise import adbapi
import pymysql # class JsPipeline:
# def process_item(self, item, spider):
# return item class JianShuSpiderPipeline(object):
def __init__(self):
# 连接数据库的一些参数
dpparams = {
'host':'127.0.0.1',
'port': 3306,
'user':'root',
'password':'your_password',
'database':'jianshu',
'charset':'utf8'
}
# 连接数据库
self.conn = pymysql.connect(**dpparams)
# 定义游标
self.cursor = self.conn.cursor()
self._sql = None @property # 定义属性
def sql(self):
if self._sql == None:
self._sql = '''
insert into jsw(passage_id,title,time,author,body,type) values(%s,%s,%s,%s,%s,%s)
'''
return self._sql
else:
return self._sql def process_item(self,item,spider):
# 执行sql语句,并传参
self.cursor.execute(self.sql,(item['passage_id'],item['title'],item['time'],item['author'],item['body'],item['type']))
# 记得执行之后要提交
self.conn.commit()
# 必须返回,不返回就不会处理下一个item对象
return item # 异步向mysql中插入数据
# class JianShuTwistedPipeline:
# def __init__(self):
# dpparams = {
# 'host':'127.0.0.1',
# 'port': 3306,
# 'user':'root',
# 'password':'qu513712qu',
# 'database':'jianshu',
# 'charset':'utf-8'
# }
# self.dbpool = adbapi.Connection('pymysql',**dpparams)
# self._sql = None
#
# @property
# def sql(self):
# if self._sql == None:
# self._sql = '''
# insert into jsw values(%s,%s,%s,%s,%s,%s)
# '''
# return self._sql
# else:
# return self._sql
#
# def process_item(self,item,spider):
# # 把插入Insert函数放到runInteraction里面,插入数据就会变成异步的,如果直接执行函数就是异步执行了
# # runInteraction会返回一个游标给Insert函数
# defer = self.dbpool.runInteraction(self.Insert,item)
# defer.addErrback(self.error,item,spider)
#
# def Insert(self,cursor,item):
# cursor.execute(self._sql,(item['passage_id'],item['title'],item['time'],item['author'],item['body'],item['type']))
#
# def error(self,error,item,spider):
# print('-'*30)
# print('error')
# print('-'*30)
五、setting.py
# Scrapy settings for js project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'js' SPIDER_MODULES = ['js.spiders']
NEWSPIDER_MODULE = 'js.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'js (+http://www.yourdomain.com)' # Obey robots.txt rules
ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36' } # Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
# 'js.middlewares.JsSpiderMiddleware': 543,
'js.middlewares.UserAgentMiddleWare': 543,
} # Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'js.middlewares.JsDownloaderMiddleware': 543,
'js.middlewares.SeleniumDownloadMiddleware': 543,
} # Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 管道里面使用的存储数据的类
'js.pipelines.JianShuSpiderPipeline': 300,
# 'js.pipelines.JianShuTwistedPipeline': 300,
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
六、middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
import random
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse class JsSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects. @classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider. # Should return None or raise an exception.
return None def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response. # Must return an iterable of Request, or item objects.
for i in result:
yield i def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception. # Should return either None or an iterable of Request or item objects.
pass def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated. # Must return only requests (not items).
for r in start_requests:
yield r def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name) class JsDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects. @classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware. # Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None def process_response(self, request, response, spider):
# Called with the response returned from the downloader. # Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception. # Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name) class UserAgentMiddleWare:
User_Agent = [
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" ]
def process_request(self,request,spider):
user_agent = random.choice(self.User_Agent)
request.headers['User-Agent'] = user_agent class SeleniumDownloadMiddleware(object):
# 通过selenium+chrome来爬取网页的一些动态加载的数据
def __init__(self): # 将你的chrome.exe文件位置传给它
self.driver = webdriver.Chrome(executable_path=r'D:\python-senium\chromedriver.exe') def process_request(self,request,spider):
self.driver.get(request.url)
time.sleep(2)
try:
while 1: # 文章下面有文章分类,我们需要点击加载加载更多才可以找出来这篇文章的所有归属类型
showmore = self.driver.find_element_by_class_name('anticon anticon-down')
showmore.click()
time.sleep(1)
if not showmore:
break
except:
pass
# 网页源码
sourse = self.driver.page_source
# 构造一个response对象,并返回给jianshu.py
response = HtmlResponse(url=self.driver.current_url,body=sourse,request=request,encoding='utf-8')
return response
使用scrapy爬取jian shu文章的更多相关文章
- Python3.6+Scrapy爬取知名技术文章网站
爬取分析 伯乐在线已经提供了所有文章的接口,还有下一页的接口,所有我们可以直接爬取一页,再翻页爬. 环境搭建 Windows下安装Python: http://www.cnblogs.com/0bug ...
- 第4章 scrapy爬取知名技术文章网站(2)
4-8~9 编写spider爬取jobbole的所有文章 # -*- coding: utf-8 -*- import re import scrapy import datetime from sc ...
- Scrapy爬取伯乐在线文章
首先搭建虚拟环境,创建工程 scrapy startproject ArticleSpider cd ArticleSpider scrapy genspider jobbole blog.jobbo ...
- 第4章 scrapy爬取知名技术文章网站(1)
4-1 scrapy安装以及目录结构介绍 安装scrapy可以看我另外一篇博文:Scrapy的安装--------Windows.linux.mac等操作平台,现在是在虚拟环境中安装可能有不同. 1. ...
- scrapy爬取伯乐在线文章数据
创建项目 切换到ArticleSpider目录下创建爬虫文件 设置settings.py爬虫协议为False 编写启动爬虫文件main.py
- scrapy爬取cnblogs文章列表
scrapy爬取cnblogs文章 目标任务 安装爬虫 创建爬虫 编写 items.py 编写 spiders/cnblogs.py 编写 pipelines.py 编写 settings.py 运行 ...
- Scrapy爬取美女图片 (原创)
有半个月没有更新了,最近确实有点忙.先是华为的比赛,接着实验室又有项目,然后又学习了一些新的知识,所以没有更新文章.为了表达我的歉意,我给大家来一波福利... 今天咱们说的是爬虫框架.之前我使用pyt ...
- 【转载】教你分分钟学会用python爬虫框架Scrapy爬取心目中的女神
原文:教你分分钟学会用python爬虫框架Scrapy爬取心目中的女神 本博文将带领你从入门到精通爬虫框架Scrapy,最终具备爬取任何网页的数据的能力.本文以校花网为例进行爬取,校花网:http:/ ...
- scrapy爬取全部知乎用户信息
# -*- coding: utf-8 -*- # scrapy爬取全部知乎用户信息 # 1:是否遵守robbots_txt协议改为False # 2: 加入爬取所需的headers: user-ag ...
随机推荐
- Approach for Unsupervised Bug Report Summarization 无监督bug报告汇总方法
AUSUM: approach for unsupervised bug report summarization 1. Abstract 解决的bug被归类以便未来参考 缺点是还是需要手动的去细读很 ...
- 【JavaWeb】Servlet 程序
Servlet 程序 Servlet Servlet 是在 Web 服务器中运行的小型 Java 程序.Servlet 通常通过 HTTP(超文本传输协议)接收和响应来自 Web 客户端的请求. ...
- 【C++】《Effective C++》第九章
杂项讨论 条款53:不要轻忽编译器的警告 请记住 严肃对待编译器发出的警告信息.努力在你的编译器的最高(最严苛)警告级别下争取"无任何警告"的容易. 不要过度依赖编译器的报警能力, ...
- LeetCode707 设计链表
设计链表的实现.您可以选择使用单链表或双链表.单链表中的节点应该具有两个属性:val 和 next.val 是当前节点的值,next 是指向下一个节点的指针/引用.如果要使用双向链表,则还需要一个属性 ...
- Oracle 索引原理分析
索引是一种允许直接访问数据表中某一数据行的树型结构,为了提高查询效率而引入,是一个独立于表的对象,可以存放在与表不同的表空间中.索引记录中存有索引关键字和指向表中数据的指针(地址).对索引进行的I/O ...
- [工作札记]03: 微软Winform窗体中ListView、DataGridView等控件的Bug,会导致程序编译失败,影响范围:到最新的.net4.7.2都有
工作中,我们发现了微软.net WinForm的一个Bug,会导致窗体设计器自动生成的代码失效,这个Bug从.net4.5到最新的.net4.7.2都存在,一直没有解决.最初是我在教学工作中发现的,后 ...
- 超精讲-逐例分析CS:LAB2-Bomb!(上)
0. 环境要求 关于环境已经在lab1里配置过了这里要记得安装gdb 安装命令 sudo yum install gdb 实验的下载地址 http://csapp.cs.cmu.edu/3e/labs ...
- 为什么[] == false 为true
首先要讲一下js的数据类型分为: 1.基本数据类型(原始数据类型):String.Boolean.Number.null.undefined.Symbol 2.引用数据类型:Object.Array. ...
- Pytorch 中张量的理解
张量是一棵树 长久以来,张量和其中维度的概念把我搞的晕头转向. 一维的张量是数组,二维的张量是矩阵,这也很有道理. 但是给一个二维张量,让我算出它每一行的和,应该用 sum(dim=0) 还是 sum ...
- 低功耗降线性稳压器,24V转5V降压芯片
PW2330开发了一种高效率的同步降压DC-DC变换器3A输出电流.PW2330在4.5V到30V的宽输入电压范围内工作集成主开关和同步开关,具有非常低的RDS(ON)以最小化传导 损失.PW2330 ...