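I. Scraping data from mobile apps

This relies on a packet-capture tool such as Fiddler, Charles (nicknamed 青花瓷), or mitmproxy.
Basic Fiddler configuration: Tools -> Options -> Connections -> Allow remote ...
http://<IP of the PC running Fiddler>:58083/ serves a page that offers the certificate for download.
With the phone and the Fiddler PC on the same network segment, open http://<IP of the PC running Fiddler>:58083/ in the phone's browser.
Follow the link on that page to download and install the certificate, then mark it as trusted.
Configure the phone's proxy: set the proxy host to the IP of the Fiddler PC and the proxy port to Fiddler's listening port.
Fiddler can then capture the HTTP and HTTPS requests sent from the phone.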
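
The same proxy settings can also be described from Python, which may help when checking that the Fiddler PC accepts remote connections. The sketch below only builds the proxy mapping. The helper name `fiddler_proxies` and the address 192.168.1.10 are placeholders, and 58083 is simply the port used in these notes.

```python
def fiddler_proxies(pc_ip, port=58083):
    """Build a proxies mapping that points HTTP and HTTPS traffic at Fiddler.

    pc_ip is the address of the PC running Fiddler (hypothetical value below).
    """
    proxy = 'http://%s:%d' % (pc_ip, port)
    return {'http': proxy, 'https': proxy}


proxies = fiddler_proxies('192.168.1.10')  # placeholder address
print(proxies)
```

A mapping of this shape is what HTTP client libraries typically accept as a proxy setting. HTTPS traffic is only readable in Fiddler once its certificate is trusted on the client.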

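II. scrapy, pyspider

What is a framework, and how should you learn one?
A framework is a project template that bundles many features and is general enough to be applied to a wide range of requirements.
Learning a framework mostly means learning how to use the features it already packages.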

1. What Scrapy bundles:

High-performance data parsing, persistent storage, high-performance downloading, and more.

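2. Installing the environment (Windows):

pip3 install wheel
Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (pick the wheel that matches your Python version and architecture)
In the download directory, run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
pip3 install pywin32
pip3 install scrapy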

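3. Basic Scrapy usage

Create a project: scrapy startproject firstBlood
A spider file must be created inside the spiders directory:
cd proName (the project directory, firstBlood in the command above)
scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName
In settings.py:
Do not obey robots.txt (ROBOTSTXT_OBEY = False)
Spoof the User-Agent (USER_AGENT)
Set the log level: LOG_LEVEL = 'ERROR'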
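
As a sketch, the three settings.py changes listed above might look like the fragment below. The setting names are Scrapy's own. The User-Agent string is just one example browser string and can be replaced with any other.

```python
# settings.py (fragment)

# Do not fetch or honor robots.txt
ROBOTSTXT_OBEY = False

# Present the crawler as a regular browser (example string, swap in your own)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36')

# Only print log messages at ERROR level or above
LOG_LEVEL = 'ERROR'
```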

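4. Persistent storage:

Via a terminal command:
- Limitation: only the return value of the parse method can be written to a local file
- Command: scrapy crawl spiderName -o filePath (the extension of filePath selects the export format, for example .json, .csv, or .xml)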
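
To make the terminal-based approach concrete: parse returns a list of dicts, and the export writes one record per dict. The sketch below is a stdlib stand-in for a CSV export, not Scrapy's exporter. The function name `export_csv` and the sample rows are made up for illustration.

```python
import csv
import io


def export_csv(rows, fileobj):
    """Write a list of {'author', 'content'} dicts as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=['author', 'content'])
    writer.writeheader()
    writer.writerows(rows)


# What a parse method might return (sample data, not scraped)
all_data = [
    {'author': 'alice', 'content': 'first joke'},
    {'author': 'bob', 'content': 'second joke'},
]

buf = io.StringIO()
export_csv(all_data, buf)
csv_text = buf.getvalue()
print(csv_text)
```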
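Via item pipelines, the workflow is:

    1. Parse the data
    2. Define the corresponding fields in the item class
    3. Store the parsed data in an item object (an instance of the class defined in items.py)
    4. Submit the item to the pipeline (yield item)
    5. Receive the item in the process_item method of the pipelines file and persist it
    6. Enable the pipeline in the settings file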

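4.1 How can the same data be persisted to several backends?

Analysis:

    1. Each pipeline class in the pipelines file handles one form of persistent storage
    2. An item yielded by the spider is delivered only to the pipeline class with the highest priority (the lowest number in ITEM_PIPELINES)
    3. return item in process_item passes the item received by the current pipeline on to the next pipeline class to run, so every pipeline must return it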
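
The hand-off between pipeline classes can be imitated in plain Python. The sketch below is a toy model, not Scrapy's implementation. The class names, the `seen_by` field, and `run_pipelines` are invented for illustration. Only the idea that lower numbers run first and that `return item` feeds the next class is taken from the notes above.

```python
class FilePipeline:
    def process_item(self, item, spider):
        item['seen_by'].append('file')
        return item  # hand the item to the next pipeline


class MysqlPipeline:
    def process_item(self, item, spider):
        item['seen_by'].append('mysql')
        return item


# Same shape as ITEM_PIPELINES: a lower number means a higher priority
ITEM_PIPELINES = {MysqlPipeline: 301, FilePipeline: 300}


def run_pipelines(item, pipelines):
    """Pass the item through every pipeline in priority order."""
    for cls, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
        item = cls().process_item(item, spider=None)
    return item


result = run_pipelines({'author': 'a', 'content': 'c', 'seen_by': []}, ITEM_PIPELINES)
print(result['seen_by'])  # ['file', 'mysql']
```

If FilePipeline forgot its `return item`, MysqlPipeline would receive None in this model, which is why each class returns the item.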

5. Sending requests manually in Scrapy (GET)

Use case: scraping the page source for multiple page numbers
yield scrapy.Request(url, callback=self.parse)
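
For the multi-page use case, each follow-up URL is usually produced from a template with a page-number slot. A minimal sketch of that step alone, with no Scrapy involved, is shown below. The helper `page_urls` is hypothetical, and the template is the one used in the spider later in these notes.

```python
URL_TEMPLATE = 'https://www.qiushibaike.com/text/page/%d/'


def page_urls(first, last):
    """Return the URLs for pages first..last (inclusive)."""
    return [URL_TEMPLATE % page for page in range(first, last + 1)]


urls = page_urls(2, 4)
for url in urls:
    print(url)
```

Inside a spider, each of these URLs would be wrapped in a scrapy.Request and yielded with the parse callback.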
Sending requests manually in Scrapy (POST)

data = {  # form parameters of the POST request
    'kw': 'aaa'
}
yield scrapy.FormRequest(url, formdata=data, callback=self.parse)
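
It can help to see what the form data looks like once it is placed in a POST body. The stdlib sketch below URL-encodes the same dict. It illustrates the form encoding only and does not call Scrapy.

```python
from urllib.parse import urlencode

data = {'kw': 'aaa'}  # same form parameters as above

# application/x-www-form-urlencoded body for the POST request
body = urlencode(data)
print(body)  # kw=aaa
```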

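6. How Scrapy's five core components work together:

Engine (Scrapy Engine)
- Controls the data flow across the whole system and triggers events (the core of the framework)
Scheduler
- Accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is fetched next and filters out duplicate URLs
Downloader
- Downloads page content and returns it, through the engine, to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework)
Spiders
- Spiders do the main work: they extract the information you need, the items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page
Item Pipeline
- Processes the items that spiders extract from pages; its main jobs are persisting items, validating them, and dropping unwanted information. Once a page has been parsed by a spider, its items are sent to the item pipeline and processed by several components in a fixed order.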
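
The data flow between these components can be sketched as a toy loop in plain Python. This is a deliberately simplified, synchronous model and not how Scrapy is implemented. `FAKE_SITE`, `download`, `parse`, and `run_engine` are invented names, and the "site" is an in-memory dict.

```python
from collections import deque


class Scheduler:
    """Queue of pending URLs that silently drops duplicates."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


# Stand-in for the network: each page holds items and links
FAKE_SITE = {
    'page/1': {'items': ['joke-1'], 'links': ['page/2']},
    'page/2': {'items': ['joke-2'], 'links': ['page/1']},  # links back to page/1
}


def download(url):
    """Downloader: fetch the page content for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Spider: yield extracted items and follow-up requests."""
    for item in response['items']:
        yield ('item', item)
    for link in response['links']:
        yield ('request', link)


def run_engine(start_url):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = Scheduler()
    stored = []  # stands in for the item pipeline's storage
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_request()
        if url is None:
            break
        for kind, value in parse(download(url)):
            if kind == 'item':
                stored.append(value)
            else:
                scheduler.enqueue(value)
    return stored


stored = run_engine('page/1')
print(stored)  # ['joke-1', 'joke-2']
```

The link from page/2 back to page/1 is dropped by the scheduler, which is the deduplication step described above.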

# -*- coding: utf-8 -*-
import scrapy
from qiubaiPro.items import QiubaiproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    # Persistent storage via the terminal command
    # def parse(self, response):
    #     div_list = response.xpath('//*[@id="content-left"]/div')
    #     all_data = []
    #     for div in div_list:
    #         # xpath() in Scrapy returns a list whose elements are Selector objects;
    #         # the parsed data we want is stored inside those objects
    #         # extract() takes the value of the Selector's data attribute
    #         # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
    #         author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
    #         # calling extract() on the list applies it to every element
    #         content = div.xpath('./a[1]/div/span//text()').extract()
    #         content = ''.join(content)
    #         dic = {
    #             'author': author,
    #             'content': content
    #         }
    #         all_data.append(dic)
    #     return all_data

    # Persistent storage via the pipeline
    # def parse(self, response):
    #     div_list = response.xpath('//*[@id="content-left"]/div')
    #     all_data = []
    #     for div in div_list:
    #         # xpath() in Scrapy returns a list whose elements are Selector objects;
    #         # the parsed data we want is stored inside those objects
    #         # extract() takes the value of the Selector's data attribute
    #         # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
    #         author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
    #         # calling extract() on the list applies it to every element
    #         content = div.xpath('./a[1]/div/span//text()').extract()
    #         content = ''.join(content)
    #
    #         # store the parsed data in an item object
    #         item = QiubaiproItem()
    #         item['author'] = author
    #         item['content'] = content
    #
    #         # submit the item to the pipeline
    #         yield item  # the item always goes to the highest-priority pipeline class

    # Crawl and parse the pages for several page numbers
    url = 'https://www.qiushibaike.com/text/page/%d/'  # generic URL template
    pageNum = 1

    # The first call to parse handles the jokes and authors on page 1
    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list:
            # xpath() in Scrapy returns a list whose elements are Selector objects;
            # the parsed data we want is stored inside those objects
            # extract() takes the value of the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            # calling extract() on the list applies it to every element
            content = div.xpath('./a[1]/div/span//text()').extract()
            content = ''.join(content)

            # store the parsed data in an item object
            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content

            # submit the item to the pipeline
            yield item  # the item always goes to the highest-priority pipeline class

        # As written, this requests pages 2 through 6
        if self.pageNum <= 5:
            self.pageNum += 1
            new_url = self.url % self.pageNum
            # send the next GET request manually
            yield scrapy.Request(new_url, callback=self.parse)

qiubai.py

# -*- coding: utf-8 -*-

# Scrapy settings for qiubaiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qiubaiPro'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'

SPIDER_MODULES = ['qiubaiPro.spiders']
NEWSPIDER_MODULE = 'qiubaiPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qiubaiPro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'qiubaiPro.pipelines.QiubaiproPipeline': 300,  # 300 is the priority; lower numbers run first
   #  'qiubaiPro.pipelines.MysqlPL': 301,
   #  'qiubaiPro.pipelines.RedisPL': 302,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()  # think of Field as a catch-all data type
    content = scrapy.Field()

items.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json

import pymysql
from redis import Redis


class QiubaiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('qiushibaike.txt', 'w', encoding='utf-8')

    # Receives each item submitted by the spider and persists it in any form
    # item: the item object that was received
    # Called once for every item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item  # the item is passed on to the next pipeline class to run

    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()


# Stores the data in MySQL
class MysqlPL(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123', db='spider', charset='utf8')
        # Create the cursor once, so close_spider can close it even if no item arrives
        self.cursor = self.conn.cursor()
        print(self.conn)

    def process_item(self, item, spider):
        # Pass the values as parameters so the driver escapes quotes in the text
        sql = 'insert into qiubai values (%s, %s)'
        try:
            self.cursor.execute(sql, (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Stores the data in Redis
class RedisPL(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        # redis-py 3.x only accepts bytes, str, int or float values, so the item
        # is serialized to JSON first. The original workaround for pushing the
        # dict-like item directly was: pip install -U redis==2.10.6
        self.conn.lpush('all_data', json.dumps(dict(item), ensure_ascii=False))
        return item

pipelines.py
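
Scraped text often contains quotes, which is where SQL built by string formatting tends to break. The sketch below shows the parameterized alternative using the stdlib sqlite3 module as a stand-in for MySQL (an assumption made so the example runs without a server). sqlite3 uses ? placeholders, whereas pymysql uses %s. The table layout and sample values are invented.

```python
import sqlite3

# In-memory database standing in for the MySQL 'spider' database
conn = sqlite3.connect(':memory:')
conn.execute('create table qiubai (author text, content text)')

# Values with embedded double quotes, as scraped text might contain
author = 'O"Brien'
content = 'a "quoted" line'

# The driver binds the values, so no manual escaping is needed
conn.execute('insert into qiubai values (?, ?)', (author, content))
conn.commit()

rows = conn.execute('select author, content from qiubai').fetchall()
print(rows)
conn.close()
```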

Crawler Essentials (5): Scrapy Framework Basics