Scraping Toutiao (今日头条) Street-Snap Photos by Analyzing Ajax Requests

spider.py

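 # -*- coding:utf-8 -*-
 import json
 import os
 import re
 from hashlib import md5
 from multiprocessing import Pool

 import pymongo
 import requests
 from bs4 import BeautifulSoup
 from requests.exceptions import RequestException

 try:
     from urllib.parse import urlencode  # Python 3
 except ImportError:
     from urllib import urlencode  # Python 2

 from config import *

 # connect=False defers connecting until the first operation, which matters
 # because the client is created before the worker processes are started.
 client = pymongo.MongoClient(MONGO_URL, connect=False)
 db = client[MONGO_DB]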

 def get_page_index(offset, keyword):
     """Request one page of search results; return the JSON text, or None on failure."""
     data = {
         'offset': offset,
         'format': 'json',
         'keyword': keyword,
         'autoload': 'true',
         'count': '',
         'cur_tab': 3
     }
     url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
     try:
         response = requests.get(url)
         if response.status_code == 200:
             return response.text
         return None
     except RequestException:
         print('Failed to request index page: %s' % url)
         return None

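 def parse_page_index(html):
     # get_page_index returns None on failure, and json.loads(None) would raise.
     if not html:
         return
     data = json.loads(html)
     if data and 'data' in data:
         for item in data.get('data') or []:
             # Skip items without an article_url; otherwise None reaches requests.get.
             if item.get('article_url'):
                 yield item.get('article_url')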

 def get_page_detail(url):
     """Fetch an article page; return its HTML, or None on failure."""
     try:
         response = requests.get(url)
         if response.status_code == 200:
             return response.text
         return None
     except RequestException:
         print('Failed to request detail page: %s' % url)
         return None

 def parse_page_detail(html, url):
     soup = BeautifulSoup(html, 'lxml')
     title = soup.select('title')[0].get_text()
     print(title)
     # The image list is embedded in the page's JavaScript as "gallery: {...},".
     images_pattern = re.compile('gallery: (.*?),\n', re.S)
     result = re.search(images_pattern, html)
     if result:
         data = json.loads(result.group(1))
         if data and 'sub_images' in data:
             sub_images = data.get('sub_images')
             images = [item.get('url') for item in sub_images]
             for image in images:
                 download_image(image)
             return {
                 'title': title,
                 'url': url,
                 'images': images
             }

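 def save_to_mongo(result):
     # insert_one replaces Collection.insert, which is deprecated in PyMongo 3
     # and removed in PyMongo 4.
     if db[MONGO_TABLE].insert_one(result).acknowledged:
         print('Saved to MongoDB: {0}'.format(result))
         return True
     return False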

 def download_image(url):
     print('Downloading: %s' % url)
     try:
         response = requests.get(url)
         if response.status_code == 200:
             save_image(response.content)
         return None
     except RequestException:
         print('Failed to request image: %s' % url)
         return None

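 def save_image(content):
     """Write the image to the working directory, named by the MD5 of its bytes."""
     file_path = os.path.join(os.getcwd(), md5(content).hexdigest() + '.jpg')
     # Identical images hash to the same file name, so existing files are skipped.
     if not os.path.exists(file_path):
         with open(file_path, 'wb') as f:
             f.write(content)  # the with block closes the file; no close() needed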

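 def main(offset):
     """Crawl one index page: fetch each linked article and store its image data."""
     index_html = get_page_index(offset, KEYWORD)
     for url in parse_page_index(index_html):
         detail_html = get_page_detail(url)
         if detail_html:
             result = parse_page_detail(detail_html, url)
             if result:
                 save_to_mongo(result)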

 if __name__ == '__main__':
     # One offset per result page: 0, 20, 40, ...
     groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
     pool = Pool()
     pool.map(main, groups)
     pool.close()
     pool.join()
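
The regex step in parse_page_detail can be tried on its own, without any network access. The following is a minimal sketch, assuming the page embeds the gallery as a JSON object followed by a comma and a newline. The SAMPLE_HTML fragment, its example.com URLs, and the helper name extract_image_urls are made up for illustration and are not real Toutiao output.

```python
import json
import re

# Hypothetical page fragment for illustration only; the real page layout may differ.
SAMPLE_HTML = (
    "<script>var BASE_DATA = {\n"
    "    gallery: {\"count\": 2, \"sub_images\": ["
    "{\"url\": \"http://example.com/a.jpg\"}, "
    "{\"url\": \"http://example.com/b.jpg\"}]},\n"
    "    siblingList: []\n"
    "};</script>"
)


def extract_image_urls(html):
    """Return the image URLs in the embedded gallery JSON, or [] if none is found."""
    # Non-greedy match from "gallery: " up to the first comma followed by a newline.
    match = re.search(r'gallery: (.*?),\n', html, re.S)
    if not match:
        return []
    data = json.loads(match.group(1))
    return [item.get('url') for item in data.get('sub_images', [])]


print(extract_image_urls(SAMPLE_HTML))
```

Because the pattern stops at the first comma-plus-newline, it only works while the embedded JSON sits on a single line; a page that pretty-prints the gallery object would need a different extraction approach.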

config.py

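 # -*- coding:utf-8 -*-
 # MongoDB connection settings
 MONGO_URL = 'localhost'
 MONGO_DB = 'toutiao'
 MONGO_TABLE = 'toutiao'
 # Range of result pages to crawl; each page index is multiplied by 20 to get the offset
 GROUP_START = 0
 GROUP_END = 20
 # Search keyword ("street snap")
 KEYWORD = '街拍'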
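
To check what these settings produce before crawling, the offsets and index URLs can be computed without sending any request. This is a small sketch that repeats the config values and the query parameters from get_page_index; build_index_url is a helper name introduced here for illustration and is not part of spider.py.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Values copied from config.py.
GROUP_START = 0
GROUP_END = 20
KEYWORD = '街拍'


def build_index_url(offset, keyword):
    """Build the search URL the same way get_page_index does; no request is sent."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '',
        'cur_tab': 3
    }
    return 'http://www.toutiao.com/search_content/?' + urlencode(params)


# Same expression as in the __main__ block of spider.py.
offsets = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]

print('%d pages, first offsets %s, last offset %d' % (len(offsets), offsets[:3], offsets[-1]))
print(build_index_url(offsets[1], KEYWORD))
```

Since range(GROUP_START, GROUP_END + 1) includes both ends, the default settings yield 21 index pages, so narrowing GROUP_END is the simplest way to run a short test crawl.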
