Project Overview

Crawl Qutoutiao news (http://home.qutoutiao.net/pages/home.html), specifically:
1. List page (JSON): title, introduction, cover image, source, publish time (a sample record shape is sketched below)
2. Detail page (HTML): full content and images
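
For orientation, one record returned by the list API has roughly the shape below; the field names are the ones the spider reads later in this post, while the concrete values are placeholders for illustration only.

# Rough shape of one list-API record (field names taken from the spider code below;
# all values here are placeholders, not real data). The spider reads the record list
# from response_json['data']['data'].
example_record = {
    "source_name": "some source",                      # source
    "title": "an article title",                       # title
    "url": "http://example.com/detail/xxx.html",       # detail page URL (placeholder)
    "introduction": "a short summary ...",             # introduction
    "cover": ["http://img.example.com/cover.jpg"],     # cover image URLs (placeholder)
    "publish_time": "2018-09-28 10:00:00",             # publish time (placeholder format)
}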

Directory Structure

Generated Data File - Single Record

Main Code Overview

Spider:
# Crawls the Qutoutiao list pages and detail pages
qutoutiao.spiders.qutoutiaos.QutoutiaosSpider

Pipelines:
# Cover image handling
qutoutiao.imagepipelines.CoverImagePipeline
# Content image handling
qutoutiao.imagepipelines.ContentImagePipeline
# Data handling
qutoutiao.pipelines.QutoutiaoPipeline

Middlewares:
# Request header setup - only the User-Agent is set here
qutoutiao.middlewares.RandomUserAgent
# Proxy setup
qutoutiao.middlewares.RandomProxy

Custom modules:
# Configuration file
qutoutiao.qttconfig.py
# Utility class
qutoutiao.qttutils.QttUtils

Create the Project

cd /home/chaoge/mypython/crawler/

scrapy startproject qutoutiao

Create the spider class, i.e. qutoutiao.spiders.qutoutiaos.QutoutiaosSpider:

cd qutoutiao/qutoutiao/spiders
scrapy genspider qutoutiaos "api.1sapp.com"
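
For reference, scrapy genspider produces a minimal skeleton roughly like the one below (using the default template of Scrapy at the time; the parsing logic shown later in this post replaces the empty parse() method).

# Approximate skeleton generated by `scrapy genspider qutoutiaos "api.1sapp.com"`.
# -*- coding: utf-8 -*-
import scrapy


class QutoutiaosSpider(scrapy.Spider):
    name = 'qutoutiaos'
    allowed_domains = ['api.1sapp.com']
    start_urls = ['http://api.1sapp.com/']

    def parse(self, response):
        pass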

Run

scrapy crawl qutoutiaos

# scrapy crawl qutoutiaos --nolog                    # run without log output
# scrapy crawl qutoutiaos -o qutoutiaos_log.json     # the -o option exports the scraped items to qutoutiaos_log.json

Code Implementation

qutoutiao.qttconfig.py  

# Crawl domain (Qutoutiao)
DOMAIN = 'http://home.qutoutiao.net/pages/home.html'
# Data storage path
DATA_STORE = '/home/chaoge/mypython/crawler/qutoutiao/data'
# List example: http://api.1sapp.com/content/outList?cid=255&tn=1&page=1&limit=10
# List API
LIST_API = 'http://api.1sapp.com/content/outList?'
# Number of records per list page
LIST_LIMIT = 10
# Categories
CATEGORY_INFO = [
    {"cid": 255, "name": "推荐"},   # Recommended
    {"cid": 1,   "name": "热点"},   # Hot
    {"cid": 6,   "name": "娱乐"},   # Entertainment
    {"cid": 5,   "name": "养生"},   # Health
    {"cid": 2,   "name": "搞笑"},   # Funny
    {"cid": 7,   "name": "科技"},   # Technology
    {"cid": 8,   "name": "生活"},   # Lifestyle
    {"cid": 10,  "name": "财经"},   # Finance
    {"cid": 9,   "name": "汽车"},   # Cars
]
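
As a quick sanity check, the list URL for one category can be assembled from these constants like this (mirroring what the spider does further down):

# Minimal sketch: build the list URL for cid=1, page 1, from the constants above.
from qutoutiao import qttconfig as QttConfig

url = QttConfig.LIST_API + "cid=%s&tn=1&page=1&limit=%s" % (1, QttConfig.LIST_LIMIT)
print(url)
# -> http://api.1sapp.com/content/outList?cid=1&tn=1&page=1&limit=10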

qutoutiao.qttutils.py  

# -*- coding: utf-8 -*-
# Qutoutiao utility class
import time
import os
import shutil
from qutoutiao import qttconfig as QttConfig


class QttUtils:

    # Get the storage path
    #
    # @param  [string] action [remove: delete the directory; default: create]
    # @return [string] path/year/month/day/*
    @staticmethod
    def getStorePath(action='create'):
        localtimes = time.localtime()
        year = time.strftime("%Y", localtimes)
        month = time.strftime("%m", localtimes)
        day = time.strftime("%d", localtimes)
        store_path = QttConfig.DATA_STORE + "/%s/%s/%s" % (year, month, day)

        # Remove the directory
        if os.path.exists(store_path) and action == 'remove':
            # os.rmdir(store_path) only removes empty directories
            shutil.rmtree(store_path)

        # Create the nested directories
        if not os.path.exists(store_path) and action == 'create':
            os.makedirs(store_path)

        return store_path
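
A typical call looks like this; the date in the returned path depends on when it runs.

# Minimal usage sketch (assuming DATA_STORE from qttconfig.py above).
from qutoutiao.qttutils import QttUtils

store_path = QttUtils.getStorePath()      # creates the directory if it is missing
print(store_path)
# e.g. /home/chaoge/mypython/crawler/qutoutiao/data/2018/09/28

QttUtils.getStorePath(action='remove')    # deletes today's directory instead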

qutoutiao.settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for qutoutiao project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qutoutiao'

SPIDER_MODULES = ['qutoutiao.spiders']
NEWSPIDER_MODULE = 'qutoutiao.spiders'

# Log file
#LOG_FILE = "qutoutiao.log"
# Log level
#LOG_LEVEL = "DEBUG"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qutoutiao (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'qutoutiao.middlewares.QutoutiaoSpiderMiddleware': 543,
    # Disable the offsite middleware so that allowed_domains does not restrict requests
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'qutoutiao.middlewares.RandomUserAgent': 100,
    'qutoutiao.middlewares.RandomProxy': 200,
}

# User-Agent pool used by the RandomUserAgent middleware
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
]

# Proxy pool used by the RandomProxy middleware
PROXIES = [
    {'ip_port': '121.42.140.113:16816', 'user_password': 'username-xxxx:password-xxxx'},
    {'ip_port': '117.90.137.181:9000'},
    {'ip_port': '117.90.2.151:9000'},
    {'ip_port': '114.235.23.147:9000'},
]

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'qutoutiao.imagepipelines.CoverImagePipeline': 301,     # cover image download
    'qutoutiao.imagepipelines.ContentImagePipeline': 302,   # content image download
    'qutoutiao.pipelines.QutoutiaoPipeline': 400,           # data handling
}

# Image storage path
IMAGES_STORE = "/home/chaoge/mypython/crawler/qutoutiao/tmp/images"

# Thumbnail settings
#IMAGES_THUMBS = {
#    'small': (50, 50),
#    'big': (270, 270),
#}

# Ignore images smaller than 110x110
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
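
A quick way to confirm these project settings are being picked up is to load them the same way the image pipelines do (run from the project directory, so scrapy.cfg can be found):

# Minimal check that the custom settings are loaded.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))        # qutoutiao
print(settings.get('IMAGES_STORE'))    # /home/chaoge/mypython/crawler/qutoutiao/tmp/images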

qutoutiao.items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QutoutiaoItem(scrapy.Item):
    # define the fields for your item here like:
    # Article id
    aid = scrapy.Field()
    # Source
    source_name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Detail page URL
    url = scrapy.Field()
    # Introduction
    introduction = scrapy.Field()
    # Cover images
    cover = scrapy.Field()
    # Publish time
    publish_time = scrapy.Field()
    # Category id
    cid = scrapy.Field()
    # Content
    content = scrapy.Field()
    # Images inside the content
    content_images = scrapy.Field()
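
An item behaves like a dict, so populating and reading it looks like this (illustrative values only):

# Minimal sketch: QutoutiaoItem is used like a dict (placeholder values).
from qutoutiao.items import QutoutiaoItem

item = QutoutiaoItem()
item['title'] = 'an article title'
item['cover'] = ['http://img.example.com/cover.jpg']
print(dict(item))   # {'title': 'an article title', 'cover': ['http://img.example.com/cover.jpg']}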

qutoutiao.middlewares.py  

# -*- coding: utf-8 -*-

import random
import base64
from qutoutiao.settings import USER_AGENTS
from qutoutiao.settings import PROXIES


# Random User-Agent
class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault('User-Agent', useragent)
        # request.headers.setdefault('Host', 'html2.qktoutiao.com')
        # request.headers.setdefault('Referer', 'http://home.qutoutiao.net/pages/home.html')


# Random proxy
class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = 'http://' + proxy['ip_port']
        # Proxy that requires a username and password
        if 'user_password' in proxy and proxy['user_password']:
            base64_user_password = base64.b64encode(
                proxy['user_password'].encode('utf-8')).decode('utf-8')
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_user_password
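
The Proxy-Authorization value is simply the base64 encoding of "username:password"; a quick check of the encoding used above (placeholder credentials, same format as the PROXIES entry):

# Sanity check of the Basic proxy-auth encoding.
import base64

user_password = 'username-xxxx:password-xxxx'
token = base64.b64encode(user_password.encode('utf-8')).decode('utf-8')
print('Basic ' + token)
assert base64.b64decode(token).decode('utf-8') == user_password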

qutoutiao.imagepipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os
import scrapy
from scrapy.utils.project import get_project_settings
from scrapy.pipelines.images import ImagesPipeline
from qutoutiao.qttutils import QttUtils


# Cover image download
class CoverImagePipeline(ImagesPipeline):
    # Read the constant from settings
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    # Download the images
    def get_media_requests(self, item, info):
        cover_images = item['cover']
        if cover_images:
            for image_url in cover_images:
                yield scrapy.Request(url=image_url)

    # Download finished
    def item_completed(self, results, item, info):
        # print('*' * 20, results, item, info)
        image_path = [x['path'] for ok, x in results if ok]
        # Get the custom storage path
        store_path = QttUtils.getStorePath()
        coverImages = []
        # Move the images to the new path
        if image_path:
            for image_url in image_path:
                file_name = os.path.split(str(image_url))
                new_image = store_path + "/" + file_name[1]
                coverImages.append(new_image)
                os.rename(self.IMAGES_STORE + "/" + image_url, new_image)
        item['cover'] = coverImages
        return item


# Content image download
class ContentImagePipeline(ImagesPipeline):
    # Read the constant from settings
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    # Download the images
    def get_media_requests(self, item, info):
        content_images = item['content_images']
        if content_images:
            for image_url in content_images:
                yield scrapy.Request(image_url)

    # Download finished
    def item_completed(self, results, item, info):
        image_info = [(x['path'], x['url']) for ok, x in results if ok]
        # Get the custom storage path
        store_path = QttUtils.getStorePath()
        contentImages = []
        # Move the images to the new path
        if image_info:
            for value in image_info:
                image_url = value[0]
                image_source = value[1]
                file_name = os.path.split(str(image_url))
                new_image = store_path + "/" + file_name[1]
                contentImages.append((new_image, image_source))
                os.rename(self.IMAGES_STORE + "/" + image_url, new_image)
        item['content_images'] = contentImages
        return item
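
For context, the results argument that ImagesPipeline passes to item_completed is a list of (success, info) tuples, roughly as sketched below; that is why the code above reads x['path'] and prepends IMAGES_STORE before moving the file (values are illustrative):

# Shape of the `results` argument received by item_completed (placeholder values).
# On success, info is a dict whose 'path' is relative to IMAGES_STORE;
# on failure, info is a Twisted Failure instead of a dict.
results_example = [
    (True, {
        'url': 'http://img.example.com/cover.jpg',                   # original image URL (placeholder)
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',  # path under IMAGES_STORE
        'checksum': '2b00042f7481c7b056c4b410d28f33cf',               # placeholder checksum
    }),
]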

qutoutiao.pipelines.py 

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
from qutoutiao.qttutils import QttUtils


class QutoutiaoPipeline(object):
    def __init__(self):
        # Get the custom storage path
        store_path = QttUtils.getStorePath()
        json_path = store_path + "/" + "qutoutiao.json"
        self.filename = open(json_path, "wb")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
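
The resulting qutoutiao.json holds one JSON object per line (JSON Lines), so it can be read back like this:

# Minimal sketch: read back the JSON-lines output written by QutoutiaoPipeline.
# The path assumes today's directory created by QttUtils.getStorePath().
import json
from qutoutiao.qttutils import QttUtils

json_path = QttUtils.getStorePath() + "/qutoutiao.json"
with open(json_path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("title"), record.get("publish_time"))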

qutoutiao.spiders.qutoutiaos.py

# -*- coding: utf-8 -*-
# Web site: http://home.qutoutiao.net/pages/home.html

import scrapy
# Alternative: crawl with the CrawlSpider and Rule classes
# from scrapy.spiders import CrawlSpider, Rule
# from scrapy.linkextractors import LinkExtractor
from qutoutiao.items import QutoutiaoItem
import json
import re
from qutoutiao import qttconfig as QttConfig


# class QutoutiaosSpider(CrawlSpider):
class QutoutiaosSpider(scrapy.Spider):
    name = 'qutoutiaos'
    allowed_domains = ['api.1sapp.com']

    # Crawl start URLs: one list-API URL per category
    start_urls = []
    categoryInfo = QttConfig.CATEGORY_INFO
    limit = QttConfig.LIST_LIMIT
    for value in categoryInfo:
        url = QttConfig.LIST_API + "cid=%s&tn=1&page=1&limit=%s" % (str(value['cid']), str(limit))
        start_urls.append(url)

    # Link-extraction rules for responses (CrawlSpider alternative)
    # pageLink = LinkExtractor(allow=("start=\d+"))
    # rules = [
    #     # Follow links extracted by pageLink and parse them with parseQtt
    #     Rule(pageLink, callback="parseQtt", follow=True)
    # ]

    def parse(self, response):
        response_url = response.url
        # Recover the category id from the request URL
        searchObj = re.search(r'(.*)cid=(\d+)', response_url)
        cid = searchObj and searchObj.group(2) or 0

        data = json.loads(response.text)['data']['data']
        for value in data:
            # Initialise the item
            item = QutoutiaoItem()
            # Source
            item['source_name'] = value['source_name']
            # Title
            item['title'] = value['title']
            # Detail page URL
            url = item['url'] = value['url']
            # url = url[0:url.find('?')]
            # Introduction
            item['introduction'] = value['introduction']
            # Cover images
            item['cover'] = value['cover']
            # Publish time
            item['publish_time'] = value['publish_time']
            # Category id
            item['cid'] = cid

            # Crawl the detail page
            yield scrapy.Request(url=item['url'], meta={'meta_item': item}, callback=self.detail_parse)

    # Detail page
    def detail_parse(self, response):
        # Get the item carried over in this response's meta
        meta_item = response.meta['meta_item']
        # Extract the content
        content_selector = response.xpath('//div[@class="content"]')
        meta_item['content_images'] = content_selector.xpath('.//img/@src|.//img/@data-src').extract()
        meta_item['content'] = content_selector.extract()[0]
        yield meta_item
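
As a small check, the regular expression used in parse() pulls the category id out of a list URL like this:

# Quick check of the cid extraction used in parse()
# (URL assembled from the qttconfig constants shown earlier).
import re

response_url = 'http://api.1sapp.com/content/outList?cid=255&tn=1&page=1&limit=10'
searchObj = re.search(r'(.*)cid=(\d+)', response_url)
cid = searchObj and searchObj.group(2) or 0
print(cid)  # -> '255'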
