Getting the URL

Open the page of a Zhihu question, press F12 to open the developer tools, and switch to the Network panel.

The Network panel shows the resources the page requests from the server, their sizes, how long each took to load, and which ones failed. It also lets you inspect the HTTP request headers and the returned content.

Take the question “你有哪些可爱的猫猫照片?” (“What cute cat photos do you have?”) as an example and watch the requests appear in the Network panel.

Press Ctrl + F and search for a snippet of text that appears in one of the answers; this locates the target URL and its response.

Install the required packages. Most of them are straightforward; the one worth noting is cv2, the Python image-processing package, which is installed with:

    pip install opencv-python
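
The later snippets also lean on requests for HTTP, an HTML parser behind the pq(...) calls (the usual pyquery import alias) and a logger whose "{}" formatting style matches loguru — those two library choices are assumptions here, not something stated in the original. Under that assumption the rest of the environment can be set up with:

    pip install requests pyquery loguru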

URL Analysis

1. Parameter analysis

The URL we just captured is:

    https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop

The parameters it carries are:

  • limit: number of answers returned per page
  • offset: paging offset into the answer list
  • sort_by: how answers are ordered; supports the default ranking or ordering by time

2. Parsing the response

Let's send a request and capture the HTTP response:

    # python3
    import requests
    import json

    if __name__ == '__main__':
        target_url = "https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop"
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
        }
        response = requests.get(url=target_url, headers=headers)
        html = response.text
        print(html)

From the response we need to find the link for every image. Formatting the body with a JSON viewer shows where the images sit inside the returned JSON; the crawler then only has to pull those download addresses out.
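
A rough sketch of that inspection step, assuming the request above succeeded and response is still in scope: pretty-printing the JSON makes it easy to see that each element of data carries the answer's HTML in its content field, which is where the <img> tags live.

    import json

    content_dict = json.loads(response.text)
    # pretty-print so the structure is easy to scan in a terminal
    print(json.dumps(content_dict, indent=2, ensure_ascii=False))
    # each answer's HTML (containing the <img> tags) sits under data[i]['content']
    for answer in content_dict.get('data', []):
        print(answer['id'], len(answer.get('content', '')))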

Tip: a site's response format changes frequently, and different sites organize their responses differently, so don't copy this blindly.

3. Collecting the URLs of all answers

Using the same trick of searching for answer keywords in the developer tools, we can collect the URLs behind several answer pages and look for a pattern:

    https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=3&platform=desktop&sort_by=default
    https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default

Although the URLs are not identical, they all follow the same basic format; the only thing that needs to change is the offset parameter:

    https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
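
A minimal sketch of that observation: keep everything else fixed and step offset by the page size (limit = 3, matching the captured URLs) to enumerate the pages. BASE_URL below is a placeholder standing in for the long URL above with offset={offset} left as a format slot.

    # BASE_URL stands in for the full answers URL above, with offset left as a format slot
    BASE_URL = ("https://www.zhihu.com/api/v4/questions/356541789/answers"
                "?include=...&limit=3&offset={offset}&platform=desktop&sort_by=default")

    limit = 3
    for page in range(5):
        offset = page * limit
        print(BASE_URL.format(offset=offset))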

Code

1. Simulating the request

Adding headers is enough. Zhihu's checks are not as strict as some other sites', but it will block access for a while if you request too frequently, so I simply use random request headers and proxy IPs here:

    def get_http_content(number, offset):
        """Fetch the answer list of a Zhihu question and return the parsed JSON.
        Args:
            number: unique id of the Zhihu question
            offset: paging offset
        """
        target_url = "https://www.zhihu.com/api/v4/questions/{number}/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2" \
            "Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2" \
            "Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2" \
            "Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2" \
            "Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5" \
            "D.author.follower_count%2Cbadge%5B*%5D.topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop".format(
                number=number, offset=offset, limit=limit)
        logger.info("target_url:{}", target_url)
        headers = {
            'User-Agent': fake_useragent.get_random_useragent(),
        }
        ip = IPPool().get_random_key()
        proxies = {"http": "http://" + ip}
        response = requests.get(target_url, headers=headers, proxies=proxies)
        if (response is None) or (response.status_code != 200):
            logger.warning("http response is None, number={}, offset={}, status_code={}".format(
                number, offset, response.status_code))
            return None
        html = response.text
        return json.loads(html)
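
The function above relies on two project helpers that are not shown here: fake_useragent.get_random_useragent() and IPPool (plus a module-level limit and a logger). If you only want to exercise the function, hypothetical stand-ins along these lines would do; the names mirror the calls above, but the bodies are placeholders rather than the project's real implementations.

    import random

    class _FakeUserAgent:
        """Hypothetical stand-in for the project's fake_useragent helper."""
        _agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36",
        ]

        def get_random_useragent(self):
            return random.choice(self._agents)

    fake_useragent = _FakeUserAgent()

    class IPPool:
        """Hypothetical stand-in: returns one proxy address as 'host:port'."""
        _proxies = ["127.0.0.1:8888"]  # replace with real proxy addresses

        def get_random_key(self):
            return random.choice(self._proxies)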

2. Extracting the image URLs

    def start_crawl():
        """Crawl the question and download the pictures in its answers."""
        for i in range(0, max_pages):
            offset = limit * i
            logger.info("download pictures with offset {}".format(offset))
            # fetch the answer list as JSON
            content_dict = get_http_content(number, offset)
            if content_dict is None:
                logger.error(
                    "get http resp fail, number={} offset={}", number, offset)
                continue
            # content_dict['data'] holds the list of answers
            if 'data' not in content_dict:
                logger.error("parse data from http resp fail, dict={}", content_dict)
                continue
            for answer_text in content_dict['data']:
                logger.info(
                    "get pictures from answer: https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
                if 'content' not in answer_text:
                    logger.error(
                        "parse content from answer text fail, text={}", answer_text)
                    continue
                answer_content = pq(answer_text['content'])
                img_urls = answer_content.find('noscript').find('img')
                # log answers that contain no pictures, to make debugging easier
                if len(list(img_urls)) <= 0:
                    logger.warning(
                        "this answer has no pictures, url:https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
                    continue
                for img_url in img_urls.items():
                    # example src: https://pic2.zhimg.com/50/v2-c970108cd260ea095383627362c1d04f_720w.jpg?source=1940ef5c
                    src = img_url.attr("src")
                    # strip the "?source=..." query string, then extract the file suffix (.jpeg, .gif, ...)
                    source_index = src.rfind('?source')
                    if source_index == -1:
                        logger.error("find source index fail, src:{} source_index:{}",
                                     src, source_index)
                        continue
                    suffix = src[0:source_index]
                    suffix_index = suffix.rfind('.')
                    if suffix_index == -1:
                        logger.error("find suffix fail, src:{} suffix_index:{}".format(
                            src, suffix_index))
                        continue
                    suffix = suffix[suffix_index:]
                    logger.info("get picture url, src:{} suffix:{}", src, suffix)
                    store_picture(src, suffix)
            time.sleep(1)
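
The suffix parsing above depends on the "?source=..." query string being present. An alternative (just a sketch, not what the repository does) is to let urllib.parse split the URL and take the extension from the path, which works whether or not a query string is attached:

    from urllib.parse import urlparse
    import os.path

    def picture_suffix(src):
        """Return the file extension of an image URL, e.g. '.jpg' or '.gif'."""
        path = urlparse(src).path          # drops any "?source=..." query string
        return os.path.splitext(path)[1]   # '' if the path has no extension

    # picture_suffix("https://pic2.zhimg.com/50/v2-c970108cd260ea095383627362c1d04f_720w.jpg?source=1940ef5c") -> ".jpg"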

3. Saving the images locally

    def store_picture(img_url, suffix):
        """Download one picture and save it into the target folder.
        Args:
            img_url: picture URL
            suffix: file suffix, e.g. '.jpg', '.gif'
        """
        headers = {
            'User-Agent': fake_useragent.get_random_useragent(),
        }
        ip = IPPool().get_random_key()
        proxies = {"http": "http://" + ip}
        http_resp = requests.get(img_url, headers=headers, proxies=proxies)
        if (http_resp is None) or (http_resp.status_code != 200):
            logger.warning("get http resp fail, url={} http_resp={}",
                           img_url, http_resp)
            return
        content = http_resp.content
        with open(f"{picture_path}/{uuid.uuid4()}{suffix}", 'wb') as f:
            f.write(content)
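
The snippets reference several module-level settings — number, limit, max_pages, picture_path — and a handful of imports. A sketch of how they might be wired together (the question id and paths are examples, and the pyquery/loguru imports are the assumed libraries mentioned earlier):

    import json
    import os
    import time
    import uuid

    import cv2
    import requests
    from loguru import logger          # assumed logging library, matching the logger "{}" style above
    from pyquery import PyQuery as pq  # assumed HTML parser behind pq(...)

    number = 356541789          # question id, taken from the example URLs above
    limit = 3                   # answers per request
    max_pages = 10              # how many pages to fetch
    picture_path = "./pictures" # where downloaded images are stored

    if __name__ == '__main__':
        os.makedirs(picture_path, exist_ok=True)
        start_crawl()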

4. Removing the watermark

I originally planned to remove the watermark with image recognition and matting (Zhihu's watermark is fairly simple and uniform in style), but I have had too much on my plate lately, so for now I just crop it away with the OpenCV package:

    def crop_watermark(ori_dir, adjusted_dir):
        """Remove the watermark by cropping the picture; note this cannot handle gif files.
        Args:
            ori_dir: folder containing the original pictures
            adjusted_dir: folder for the cropped pictures
        """
        img_path_list = os.listdir(ori_dir)  # list every file in the folder
        total = len(img_path_list)
        cnt = 1
        for img_path in img_path_list:
            logger.info(
                "overall progress: {}/{}, now handling picture: {}", cnt, total, img_path)
            img_abs_path = ori_dir + '/' + img_path
            img = cv2.imread(img_abs_path)
            if img is None:
                logger.error("cv2.imread fail, picture:{}", img_path)
                continue
            height, width = img.shape[0:2]
            # crop off the bottom 40 pixels, where the watermark sits
            cropped = img[0:height-40, 0:width]
            adjusted_img_abs_path = adjusted_dir + '/' + img_path
            cv2.imwrite(adjusted_img_abs_path, cropped)
            cnt += 1
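
Called on the download folder, with a separate output folder so the originals are kept (paths are placeholders):

    # crop every downloaded picture, writing results to a separate folder
    adjusted_path = "./pictures_cropped"  # placeholder output folder
    os.makedirs(adjusted_path, exist_ok=True)
    crop_watermark(picture_path, adjusted_path)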

Closing Remarks

I wrote this program mainly to learn HTML parsing and to sharpen my Python. Looking back after finishing, there is honestly nothing remarkable about it, but I am putting the code here for anyone who wants to refer to it:

https://gitee.com/tomocat/zhi-hu-picture-crawler

Also, the program's only purpose is to automate my own process of collecting pictures and removing watermarks; once again, please don't let your crawler put pressure on other people's servers.

