Scrapy入门操作
一、安装Scrapy:
如果您还未安装,请参考https://www.cnblogs.com/dalyday/p/9277212.html
二、Scrapy基本配置
1.创建Scrapy程序
cd D:\daly\PycharmProjects\day19_spider # 根目录自己定
scrapy startprojcet sql # 创建程序
cd sql
scrapy genspider chouti chouti.com # 创建爬虫
scrapy crawl chouti --nolog # 启动爬虫
2.程序目录
三、Scrapy程序操作
1.自定制起始url
a.打开刚才创建chouti.py文件
class ChoutiSpider(scrapy.Spider):
name = 'chouti'
allowed_domains = ['chouti.com']
start_urls = ['http://chouti.com/'] def parse(self, response):
# pass
print('已经下载完成',response.text)
b.定制起始url
import scrapy
from scrapy.http import Request
class ChoutiSpider(scrapy.Spider):
name = 'chouti'
allowed_domains = ['chouti.com'] def start_requests(self):
yield Request(url='http://dig.chouti.com/all/hot/recent/8', callback=self.parse1111)
yield Request(url='http://dig.chouti.com/', callback=self.parse222) # return [
# Request(url='http://dig.chouti.com/all/hot/recent/8', callback=self.parse1111),
# Request(url='http://dig.chouti.com/', callback=self.parse222)
# ] def parse1111(self, response):
print('dig.chouti.com/all/hot/recent/8已经下载完成', response) def parse222(self, response):
print('http://dig.chouti.com/已经下载完成', response)
2.数据持久化(pipeline使用)
a.parse函数中必须yield一个 Item对象
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy.selector import HtmlXPathSelector
from ..items import SqlItem class ChoutiSpider(scrapy.Spider):
name = 'chouti'
allowed_domains = ['chouti.com']
start_urls = ['http://chouti.com/'] def parse(self, response):
# print('已经下载完成',response.text) # 获取页面上所有的新闻,将标题和URL写入文件
"""
方法一:
soup = BeautifulSoup(response.text,'lxml')
tag = soup.find(name='div',id='content-list')
# tag = soup.find(name='div', attrs={'id':'content_list})
"""
# 方法二:
"""
// 表示从跟开始向下找标签
//div[@id="content-list"]
//div[@id="content-list"]/div[@class="item"]
.// 相对当前对象
"""
hxs = HtmlXPathSelector(response)
tag_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
for item in tag_list:
text1 = (item.select('.//div[@class="part1"]/a/text()')).extract_first().strip()
url1 = (item.select('.//div[@class="part1"]/a/@href')).extract_first() yield SqlItem(text=text1,url=url1)
chouti.py
b.定义item对象
import scrapy class SqlItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
text = scrapy.Field()
url = scrapy.Field()
item.py
c.编写pipeline
class DbPipeline(object):
def process_item(self, item, spider):
# print('数据DbPipeline',item)
return item class FilePipeline(object):
def open_spider(self, spider):
"""
爬虫开始执行时,调用
:param spider:
:return:
"""
print('开始>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>')
self.f = open('123.txt','a+',encoding='utf-8') def process_item(self, item, spider):
self.f.write(item['text']+'\n')
self.f.write(item['url']+'\n')
self.f.flush()
return item def close_spider(self, spider):
"""
爬虫关闭时,被调用
:param spider:
:return:
"""
self.f.close()
print('结束>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>')
pipelines.py
d.settings中进行注册
#建议300-1000,谁小谁先执行
ITEM_PIPELINES = {
'sql.pipelines.DbPipeline': 300,
'sql.pipelines.FilePipeline': 400,
}
settings.py
对于pipeline可以做更多,如下
from scrapy.exceptions import DropItem class CustomPipeline(object):
def __init__(self,v):
self.value = v def process_item(self, item, spider):
# 操作并进行持久化 # return表示会被后续的pipeline继续处理
return item # 表示将item丢弃,不会被后续pipeline处理
# raise DropItem() @classmethod
def from_crawler(cls, crawler):
"""
初始化时候,用于创建pipeline对象
:param crawler:
:return:
"""
val = crawler.settings.getint('MMMM')
return cls(val) def open_spider(self,spider):
"""
爬虫开始执行时,调用
:param spider:
:return:
"""
print('') def close_spider(self,spider):
"""
爬虫关闭时,被调用
:param spider:
:return:
"""
print('')
自定义pipeline
##################################### 注意事项 #############################################
①. pipeline中最多可以写5个方法,各函数执行顺序:def from_crawler --> def __init__ --> def open_spide r --> def prpcess_item --> def close_spider
②. 如果前面类中的Pipeline的process_item方法,跑出了DropItem异常,则后续类中pipeline的process_item就不再执行
class DBPipeline(object):
def process_item(self, item, spider):
#print('DBPipeline',item)
raise DropItem()
DropItem异常
3.“递归执行”
上面的操作步骤只是将当页的chouti.com所有新闻标题和URL写入文件,接下来考虑将所有的页码新闻标题和URL写入文件
a.yield Request对象
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy.selector import HtmlXPathSelector
from ..items import SqlItem
from scrapy.http import Request class ChoutiSpider(scrapy.Spider):
name = 'chouti'
allowed_domains = ['chouti.com']
start_urls = ['http://chouti.com/'] def parse(self, response):
# print('已经下载完成',response.text) # 1.获取页面上所有的新闻,将标题和URL写入文件
"""
方法一:
soup = BeautifulSoup(response.text,'lxml')
tag = soup.find(name='div',id='content-list')
# tag = soup.find(name='div', attrs={'id':'content_list})
"""
# 方法二:
"""
// 表示从跟开始向下找标签
//div[@id="content-list"]
//div[@id="content-list"]/div[@class="item"]
.// 相对当前对象
"""
hxs = HtmlXPathSelector(response)
tag_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
for item in tag_list:
text1 = (item.select('.//div[@class="part1"]/a/text()')).extract_first().strip()
url1 = (item.select('.//div[@class="part1"]/a/@href')).extract_first() yield SqlItem(text=text1,url=url1) # 2.找到所有页码,访问页码,页码下载完成后,继续执行持久化的逻辑+继续找页码
page_list = hxs.select('//div[@id="dig_lcpage"]//a/@href').extract()
# page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href,"/all/hot/recent/\d+")]/@href').extract()
base_url = "https://dig.chouti.com/{0}"
for page in page_list:
url = base_url.format(page)
yield Request(url=url,callback=self.parse)
chouti.py
##################################### 注意事项 #############################################
①. settings.py 可以设置DEPTH_LIMIT = 2, 表示循环完第二层结束。
4.自定义过滤规则(set( )集合去重性)
为防止爬一样的url数据,需要自定义去重规则,去掉已经爬过的URL,新的URL继续下载
a. 编写类
class RepeatUrl:
def __init__(self):
self.visited_url = set() @classmethod
def from_settings(cls, settings):
"""
初始化时,调用
:param settings:
:return:
"""
return cls() def request_seen(self, request):
"""
检测当前请求是否已经被访问过
:param request:
:return: True表示已经访问过;False表示未访问过
"""
if request.url in self.visited_url:
return True
self.visited_url.add(request.url)
return False def open(self):
"""
开始爬去请求时,调用
:return:
"""
print('open replication') def close(self, reason):
"""
结束爬虫爬取时,调用
:param reason:
:return:
"""
print('close replication') def log(self, request, spider):
"""
记录日志
:param request:
:param spider:
:return:
"""
pass
新建new.py文件
b.配置文件
# 自定义过滤规则
DUPEFILTER_CLASS = 'spl.new.RepeatUrl'
settings.py
5.点赞与取消赞
a.获取cookie
cookie_dict = {}
has_request_set = {} def start_requests(self):
url = 'http://dig.chouti.com/'
yield Request(url=url, callback=self.login) def login(self, response):
# 去响应头中获取cookie
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)
for k, v in cookie_jar._cookies.items():
for i, j in v.items():
for m, n in j.items():
self.cookie_dict[m] = n.value
b.scrapy发送POST请求
req = Request(
url='http://dig.chouti.com/login',
method='POST',
headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
body='phone=86138xxxxxxxx&password=xxxxxxxxx&oneMonth=1',
cookies=self.cookie_dict,# 未认证cookie带过去
callback=self.check_login
)
yield req #需要登录帐号
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest class ChouTiSpider(scrapy.Spider):
# 爬虫应用的名称,通过此名称启动爬虫命令
name = "chouti"
# 允许的域名
allowed_domains = ["chouti.com"] cookie_dict = {}
has_request_set = {} def start_requests(self):
url = 'http://dig.chouti.com/'
yield Request(url=url, callback=self.login) def login(self, response):
# 去响应头中获取cookie
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)
for k, v in cookie_jar._cookies.items():
for i, j in v.items():
for m, n in j.items():
self.cookie_dict[m] = n.value
# 未认证cookie
req = Request(
url='http://dig.chouti.com/login',
method='POST',
headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
body='phone=86138xxxxxxxx&password=xxxxxx&oneMonth=1',
cookies=self.cookie_dict,# 未认证cookie带过去
callback=self.check_login
)
yield req def check_login(self, response):
print('登录成功,返回code:9999',response.text)
req = Request(
url='http://dig.chouti.com/',
method='GET',
callback=self.show,
cookies=self.cookie_dict,
dont_filter=True
)
yield req def show(self, response):
# print(response)
hxs = HtmlXPathSelector(response)
news_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
for new in news_list:
# temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
# 赞
# yield Request(
# url='http://dig.chouti.com/link/vote?linksId=%s' %(link_id,),
# method='POST',
# cookies=self.cookie_dict,
# callback=self.result
# )
# 取消赞
yield Request(
url='https://dig.chouti.com/vote/cancel/vote.do',
method='POST',
headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
body='linksId=%s' %(link_id),
cookies=self.cookie_dict,
callback=self.result
) page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
for page in page_list: page_url = 'http://dig.chouti.com%s' % page
"""
手动去重
import hashlib
hash = hashlib.md5()
hash.update(bytes(page_url,encoding='utf-8'))
key = hash.hexdigest()
if key in self.has_request_set:
pass
else:
self.has_request_set[key] = page_url
"""
yield Request(
url=page_url,
method='GET',
callback=self.show
) def result(self, response):
print(response.text)
# chouti.py总代码
##################################### 备注 #############################################
①.构造请求体结构数据
from urllib.parse import urlencode dic = {
'name':'daly',
'age': 'xxx',
'gender':'xxxxx'
} data = urlencode(dic)
print(data)
# 打印结果
name=daly&age=xxx&gender=xxxxx
6.下载中间件
a.编写中间件类
class SqlDownloaderMiddleware(object):
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware. # Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
print('中间件>>>>>>>>>>>>>>>>',request)
# 对所有请求做统一操作
# 1. 请求头处理
request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
# 2. 添加代理
# request.headers['asdfasdf'] = "asdfadfasdf" def process_response(self, request, response, spider):
# Called with the response returned from the downloader. # Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception. # Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
middlewares.py
b.配置文件
DOWNLOADER_MIDDLEWARES = {
'sql.middlewares.SqlDownloaderMiddleware': 543,
}
settings.py
c.中间件对所有请求做统一操作:
1.请求头处理
request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
2.代理
2.1 内置依赖于环境变量(不推荐)
import os
os.environ['http_proxy'] = "http://root:woshiniba@192.168.11.11:9999/"
os.environ['https_proxy'] = "http://192.168.11.11:9999/" PS: 请求刚开始时前需设置,也就是chouti.py中函数start_requests开始后
2.2 自定义下载中间件(建议)
2.2.1 创建下载中间件
import random
import base64
import six def to_bytes(text, encoding=None, errors='strict'):
if isinstance(text, bytes):
return text
if not isinstance(text, six.string_types):
raise TypeError('to_bytes must receive a unicode, str or bytes '
'object, got %s' % type(text).__name__)
if encoding is None:
encoding = 'utf-8'
return text.encode(encoding, errors) class ProxyMiddleware(object):
def process_request(self, request, spider):
PROXIES = [
{'ip_port': '111.11.228.75:80', 'user_pass': ''},
{'ip_port': '120.198.243.22:80', 'user_pass': ''},
{'ip_port': '111.8.60.9:8123', 'user_pass': ''},
{'ip_port': '101.71.27.120:80', 'user_pass': ''},
{'ip_port': '122.96.59.104:80', 'user_pass': ''},
{'ip_port': '122.224.249.122:8088', 'user_pass': ''},
] proxy = random.choice(PROXIES)
if proxy['user_pass'] is not None:
request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass) else:
request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
#新建 proxies.py
2.2.2 应用下载中间件
DOWNLOADER_MIDDLEWARES = {
'spl.pronxies.ProxyMiddleware':600,
}
settings.py
7.自定制扩展(利用信号在指定位置注册制定操作)
a.编写类
b.settings.py里注册:
EXTENSIONS = {
'spl.extends.MyExtensione': 600,
}
更多扩展
engine_started = object()
engine_stopped = object()
spider_opened = object()
spider_idle = object()
spider_closed = object()
spider_error = object()
request_scheduled = object()
request_dropped = object()
response_received = object()
response_downloaded = object()
item_scraped = object()
item_dropped = object()
8.自定制命令(同时运行多个爬虫)
1.在spiders同级创建任意目录,如:commands,在其中创建 dalyl.py 文件 (此处文件名就是自定义的命令)
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings class Command(ScrapyCommand):
requires_project = True def syntax(self):
return '[options]' def short_desc(self):
return 'Runs all of the spiders' def run(self, args, opts):
spider_list = self.crawler_process.spiders.list()
for name in spider_list:
self.crawler_process.crawl(name, **opts.__dict__)
self.crawler_process.start()
commands/daly.py
2.settings.py注册:
# 自定制命令
COMMANDS_MODULE = 'sql.commands'
-在项目目录执行命令:scrapy daly --nolog
运行结果:
更多文档参见
scrapy知识点:https://www.cnblogs.com/wupeiqi/articles/6229292.html
scrapy分布式:http://www.cnblogs.com/wupeiqi/articles/6912807.html
Scrapy入门操作的更多相关文章
- [转]Scrapy入门教程
关键字:scrapy 入门教程 爬虫 Spider 作者:http://www.cnblogs.com/txw1958/ 出处:http://www.cnblogs.com/txw1958/archi ...
- Scrapy入门教程
关键字:scrapy 入门教程 爬虫 Spider作者:http://www.cnblogs.com/txw1958/出处:http://www.cnblogs.com/txw1958/archive ...
- scrapy入门使用
scrapy入门 创建一个scrapy项目 scrapy startporject mySpider 生产一个爬虫 scrapy genspider itcast "itcast.cn&qu ...
- Scrapy入门教程(转)
关键字:scrapy 入门教程 爬虫 Spider作者:http://www.cnblogs.com/txw1958/出处:http://www.cnblogs.com/txw1958/archive ...
- 0.Python 爬虫之Scrapy入门实践指南(Scrapy基础知识)
目录 0.0.Scrapy基础 0.1.Scrapy 框架图 0.2.Scrapy主要包括了以下组件: 0.3.Scrapy简单示例如下: 0.4.Scrapy运行流程如下: 0.5.还有什么? 0. ...
- 2019-03-22 Python Scrapy 入门教程 笔记
Python Scrapy 入门教程 入门教程笔记: # 创建mySpider scrapy startproject mySpider # 创建itcast.py cd C:\Users\theDa ...
- 小白学 Python 爬虫(34):爬虫框架 Scrapy 入门基础(二)
人生苦短,我用 Python 前文传送门: 小白学 Python 爬虫(1):开篇 小白学 Python 爬虫(2):前置准备(一)基本类库的安装 小白学 Python 爬虫(3):前置准备(二)Li ...
- 小白学 Python 爬虫(35):爬虫框架 Scrapy 入门基础(三) Selector 选择器
人生苦短,我用 Python 前文传送门: 小白学 Python 爬虫(1):开篇 小白学 Python 爬虫(2):前置准备(一)基本类库的安装 小白学 Python 爬虫(3):前置准备(二)Li ...
- 小白学 Python 爬虫(36):爬虫框架 Scrapy 入门基础(四) Downloader Middleware
人生苦短,我用 Python 前文传送门: 小白学 Python 爬虫(1):开篇 小白学 Python 爬虫(2):前置准备(一)基本类库的安装 小白学 Python 爬虫(3):前置准备(二)Li ...
随机推荐
- Java 最全异常讲解
1. 导引问题 实际工作中,遇到的情况不可能是非常完美的.比如:你写的某个模块,用户输入不一定符合你的要求.你的程序要打开某个文件,这个文件可能不存在或者文件格式不对,你要读取数据库的数据,数据可能是 ...
- Python爬虫爬取全书网小说,程序源码+程序详细分析
Python爬虫爬取全书网小说教程 第一步:打开谷歌浏览器,搜索全书网,然后再点击你想下载的小说,进入图一页面后点击F12选择Network,如果没有内容按F5刷新一下 点击Network之后出现如下 ...
- MongoDB的介绍安装与基本使用
MongoDB的介绍安装 关于MongoDB的介绍于安装可参考:https://www.cnblogs.com/DragonFire/p/9135630.html 除了官网下载,可以下载他人下载好分享 ...
- TypeScript进阶开发——ThreeJs基础实例,从入坑到入门
前言 我们前面使用的是自己编写的ts,以及自己手动引入的jquery,由于第三方库采用的是直接引入js,没有d.ts声明文件,开发起来很累,所以一般情况下我们使用npm引入第三方的库,本文记录使用np ...
- 不修改的主席(HJT)树-HDU2665,POJ-2104;
参考:优秀的B站视频: 和 https://blog.csdn.net/creatorx/article/details/75446472 感觉主席树这个思路是真的优秀,每次在前一次的线段树的基础 ...
- HDU-3507Print Article 斜率优化DP
学习:https://blog.csdn.net/bill_yang_2016/article/details/54667902 HDU-3507 题意:有若干个单词,每个单词有一个费用,连续的单词组 ...
- 爬虫基本知识之C/S交互
概念 爬虫就是对网页的获取. 一般获取的网页中又有通向其他网页的通路,我们叫做超链接,那么就可以通过这样的通路获取更多其他的网页,就像一只在网路上爬行的蜘蛛,所以俗称爬虫. 爬虫的工作原理和浏览器浏览 ...
- 【LeetCode】75-颜色分类
题目描述 给定一个包含红色.白色和蓝色,一共 n 个元素的数组,原地对它们进行排序,使得相同颜色的元素相邻,并按照红色.白色.蓝色顺序排列. 此题中,我们使用整数 0. 1 和 2 分别表示红色.白色 ...
- EditPlus5.0破解激活
永久激活用户名激活码: 用户名:Vovan注册码:3AG46-JJ48E-CEACC-8E6EW-ECUAW 然后重启软件即可
- BigDecimal转String
代码: public static void main(String[] args) { // 浮点数的打印 System.out.println(new BigDecimal("10000 ...