Python Crawler Series: Analyzing Ajax Requests and Scraping Toutiao Street-Snap Images

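1. Fetch the index page

Use requests to query the target site's Ajax search endpoint and return the index response (JSON text).

2. Fetch the detail pages

Parse the index response to get the detail-page links, then fetch each detail page.

3. Download images and save to the database

Download the images to local disk, and save the page information and image URLs to MongoDB.

4. Loop and run in parallel

Iterate over multiple result pages and use a process pool (multiprocessing.Pool) to speed up the crawl.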

1. Fetching the index page

from urllib.parse import urlencode
from requests.exceptions import RequestException
import requests

def get_page_index(offset, keyword):
    # Send a browser-like User-Agent header
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    # Query parameters of the Ajax search request
    data = {
        'format': 'json',
        'offset': offset,
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        # The request itself must sit inside the try block,
        # otherwise a RequestException is never caught
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

def main():
    # '街拍' ("street snap") is the search keyword
    html = get_page_index(0, '街拍')
    print(html)

if __name__ == '__main__':
    main()
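
To see which URL get_page_index ends up requesting, the query string can be built and inspected without touching the network. This is a minimal sketch: the helper name build_index_url is hypothetical, while the endpoint and parameters are copied from the code above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_index_url(offset, keyword):
    # Hypothetical helper: same parameters as get_page_index, but no request is sent
    data = {
        'format': 'json',
        'offset': offset,
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    return 'https://www.toutiao.com/search_content/?' + urlencode(data)

url = build_index_url(20, '街拍')
# urlencode percent-encodes the UTF-8 bytes of the keyword
print(url)

# parse_qs decodes the query string back into a dict of lists
params = parse_qs(urlsplit(url).query)
print(params['keyword'], params['offset'])
```

Only offset changes from one result page to the next, which is what the loop in step 4 relies on.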

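2. Fetching the detail pages

Getting the detail-page URLs:

import json

def parse_page_index(html):
    # The index response is JSON; the articles are listed under the 'data' key
    data = json.loads(html)
    if data and 'data' in data:
        for item in data.get('data'):
            yield item.get('article_url')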
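
The parsing step can be tried on a made-up payload. In this sketch the sample JSON and the function name extract_article_urls are hypothetical; the function is a variant of parse_page_index that also skips items lacking an article_url, since item.get('article_url') returns None for them.

```python
import json

def extract_article_urls(html):
    # Hypothetical variant of parse_page_index that drops items without a link
    data = json.loads(html)
    if data and 'data' in data:
        for item in data.get('data'):
            url = item.get('article_url')
            if url:
                yield url

# Made-up sample imitating the shape of the index response
sample = json.dumps({
    'data': [
        {'article_url': 'https://example.com/a/1'},
        {'title': 'an item without a link'},
        {'article_url': 'https://example.com/a/2'},
    ]
})

urls = list(extract_article_urls(sample))
print(urls)
```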

Fetching a single detail page:

def get_page_detail(url):
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page')
        return None

Extracting the image URLs:

import json
import re
from bs4 import BeautifulSoup

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string, so the escaped parentheses and dot reach the regex engine unchanged
    images_pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
    result = re.search(images_pattern, html)
    if result:
        # The captured text is a JSON string literal that itself contains JSON,
        # so it is decoded twice: first to a str, then to a dict
        data = json.loads(result.group(1))
        data = json.loads(data)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'images': images,
                'url': url
            }
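
Why json.loads is called twice can be checked on a small fragment. The HTML below is made up to imitate a page whose script passes a quoted JSON string to JSON.parse; only the regular expression and the sub_images/url keys come from the code above.

```python
import json
import re

# Made-up data: the inner JSON is embedded as a *string literal* argument of JSON.parse
inner = json.dumps({'sub_images': [{'url': 'http://example.com/1.jpg'},
                                   {'url': 'http://example.com/2.jpg'}]})
html = 'var BASE_DATA = { gallery: JSON.parse(' + json.dumps(inner) + '), };'

pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
match = pattern.search(html)

first = json.loads(match.group(1))   # first pass: JSON string literal -> Python str
second = json.loads(first)           # second pass: JSON text -> Python dict
images = [item.get('url') for item in second['sub_images']]

print(type(first).__name__, type(second).__name__)
print(images)
```

Note that the non-greedy group stops at the first closing parenthesis, so this pattern assumes the embedded JSON contains no ')' character.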

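3. Downloading images and saving to the database

from hashlib import md5
import os

# Save to the database (db and MONGO_TABLE are defined in the full code and config.py)
def save_to_mongo(result):
    # insert_one is used because Collection.insert was removed in PyMongo 4
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False

# Download an image
def download_image(url):
    print('Downloading', url)
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # response.content holds the raw bytes of the image
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to request the image', url)
        return None

def save_image(content):
    # Name the file after the MD5 of its content, so identical images are stored once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)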
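
The effect of naming files by content hash can be seen in a temporary directory. This is a sketch under assumptions: save_image_to is a hypothetical variant of save_image that takes the target directory and returns the path, and the byte strings stand in for real image data.

```python
import os
import tempfile
from hashlib import md5

def save_image_to(content, directory):
    # Hypothetical variant of save_image: explicit directory, returns the file path
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)
    return file_path

with tempfile.TemporaryDirectory() as tmp:
    a = save_image_to(b'fake image bytes', tmp)
    b = save_image_to(b'fake image bytes', tmp)   # same content, same file name
    c = save_image_to(b'other image bytes', tmp)  # different content, new file
    names = sorted(os.listdir(tmp))

print(a == b, a != c, len(names))
```

Identical bytes always map to the same name, so downloading the same image from two articles does not create a duplicate file.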

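4. Looping and running in parallel

# Offsets of the result pages to crawl: GROUP_START*20 ... GROUP_END*20
groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
# multiprocessing.Pool runs main(offset) in worker processes
pool = Pool()
pool.map(main, groups)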
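
The article's code uses a process pool. Because the work is mostly waiting on network responses, a thread pool is a possible alternative; the sketch below swaps in concurrent.futures.ThreadPoolExecutor from the standard library, which is not what the original code uses. The function crawl is a hypothetical stand-in for main(offset), and the GROUP_* values are assumed for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed values for illustration; the real ones live in config.py
GROUP_START = 1
GROUP_END = 3

def crawl(offset):
    # Hypothetical stand-in for main(offset); performs no network access
    return 'offset=%d' % offset

groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]

# executor.map returns results in the same order as the inputs
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(crawl, groups))

print(groups)
print(results)
```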

Full code: spider.py

from urllib.parse import urlencode
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from hashlib import md5
from multiprocessing import Pool
from config import *
import pymongo
import requests
import json
import re
import os

# connect=False defers connecting until first use, so each worker process
# opens its own connection instead of sharing one across the fork
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]

# Browser-like User-Agent shared by all requests
HEADERS = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }

def get_page_index(offset, keyword):
    # Query parameters of the Ajax search request
    data = { 'format': 'json','offset': offset,'keyword': keyword,'autoload': 'true','count': 20,'cur_tab': 1,'from': 'search_tab','pd': 'synthesis' }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url, headers=HEADERS)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

def parse_page_index(html):
    # The index response is JSON; the articles are listed under the 'data' key
    data = json.loads(html)
    if data and 'data' in data:
        for item in data.get('data'):
            yield item.get('article_url')

def get_page_detail(url):
    try:
        response = requests.get(url, headers=HEADERS)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page')
        return None

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string, so the escaped parentheses and dot reach the regex engine unchanged
    images_pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
    result = re.search(images_pattern, html)
    if result:
        # The captured text is a JSON string literal that itself contains JSON,
        # so it is decoded twice: first to a str, then to a dict
        data = json.loads(result.group(1))
        data = json.loads(data)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'images': images,
                'url': url
            }

def save_to_mongo(result):
    # insert_one is used because Collection.insert was removed in PyMongo 4
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False

def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url, headers=HEADERS)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to request the image', url)
        return None

def save_image(content):
    # Name the file after the MD5 of its content, so identical images are stored once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)

def main(offset):
    html = get_page_index(offset, KEYWORD)
    if not html:
        # The index request failed; nothing to parse for this offset
        return
    for url in parse_page_index(html):
        if not url:
            # Some items carry no article_url
            continue
        detail = get_page_detail(url)
        if detail:
            result = parse_page_detail(detail, url)
            if isinstance(result, dict):
                save_to_mongo(result)

if __name__ == '__main__':
    # One offset per result page: GROUP_START*20 ... GROUP_END*20
    groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
    with Pool() as pool:
        pool.map(main, groups)

config.py

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'jiepai'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'