Learning Python: Scraping Images with the BeautifulSoup Module

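Scraping images with the BeautifulSoup module is a way to practice parsing HTML text and locating tags.
Most online tutorials scrape mzitu, but that site now has many anti-scraping restrictions. I picked another site at random; parsing it is somewhat slow.
Script flow: get the total page count from the home page --> build each page URL --> collect every theme URL on each page --> iterate over the image source URLs, then download and save.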
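Before the full script, here is a minimal offline sketch of the tag-locating calls it relies on (`find`, `find_all`, `class_`, `.get('href')`). The HTML fragment is hypothetical: it only imitates the class names the script searches for (`pg`, `kind_show`, `photo_thumb kind_left`, `title`) and is not copied from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, modeled on the class names the script looks for.
SAMPLE_HTML = """
<div class="pg">
  <a href="forum.php?page=1">1</a>
  <a href="forum.php?page=2">2</a>
  <a href="forum.php?page=12">... 12</a>
  <a href="forum.php?page=2">Next</a>
</div>
<div class="kind_show">
  <div class="photo_thumb kind_left">
    <div class="title"><a href="http://example.com/thread-1.html">First theme</a></div>
  </div>
  <div class="photo_thumb kind_left">
    <div class="title"><a href="http://example.com/thread-2.html">Second theme</a></div>
  </div>
</div>
"""

doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find() returns the first matching tag; find_all() returns a list of all matches.
page_links = doc.find('div', class_='pg').find_all('a')

# Assumption: the second-to-last pager link holds the last page number, e.g. "... 12".
max_page = int(page_links[-2].text.split(' ')[-1])

# A multi-word class_ string matches the exact value of the class attribute.
themes = doc.find('div', class_='kind_show').find_all('div', class_='photo_thumb kind_left')

# Collect (title, URL) pairs from the <a> tag inside each theme's title block.
theme_links = []
for theme_block in themes:
    link = theme_block.find('div', class_='title').find('a')
    theme_links.append((link.string, link.get('href')))

print(max_page)
print(theme_links)
```

If any `find` call returns `None` (the tag is missing), the chained call raises `AttributeError`, which is why the script wraps its parsing in `try`/`except`.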

 # python3
 # coding: utf-8
 # _author: Jack
 # _date: 2020/3/28

 from bs4 import BeautifulSoup
 import requests, os, sys, time

 DIR_PATH = os.path.dirname(os.path.abspath(__file__))
 sys.path.append(DIR_PATH)

 HEADER = {
     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:74.0) Gecko/20100101 Firefox/74.0',
 }

 def create_dir(file_path):
     '''
     Create the image directory if it does not exist, then switch into it.
     :param file_path: images_directory
     :return:
     '''
     if not os.path.exists(file_path):
         os.makedirs(file_path)
         print('Create directory:', file_path)
     os.chdir(file_path)  # cd into the image directory

 def save_data(src, dir_name, file_name):
     '''
     Download one image and save it; skip files that already exist.
     :param src: image URL
     :param dir_name: directory name
     :param file_name: image file name
     :return:
     '''
     file_path = os.path.join(DIR_PATH, 'images', str(dir_name))  # directory path
     image_path = os.path.join(file_path, file_name)  # image path
     create_dir(file_path)
     if not os.path.isfile(image_path):
         req = requests.get(src, headers=HEADER)
         req.raise_for_status()  # do not save an HTTP error page as an image
         with open(image_path, 'wb') as f_save:
             f_save.write(req.content)
         print('Download successful:', file_name)
     else:
         print('File already exists! Pass')

 def request_to_url(url, header):
     '''
     :param url: page_url
     :param header: request headers
     :return: response text
     '''
     res = requests.get(url, headers=header)
     return res.text

 def soup(url, header):
     '''
     :param url:
     :param header:
     :return: HTML_Tag
     '''
     return BeautifulSoup(request_to_url(url, header), 'html.parser')

 def action(url):
     '''
     Download every image found, starting a new folder after each 100 images.
     :param url: URL
     :return:
     '''
     download_count = 0
     try:
         page_tag = soup(url, HEADER).find('div', class_='pg').find_all('a')
         max_page = int(page_tag[-2].text.split(' ')[-1])
         for page in range(1, max_page + 1):   # find page
             # Plain string concatenation; os.path.join is meant for filesystem paths, not URLs
             page_url = url.rstrip('/') + '/forum.php?order=&fid=0&page=%d' % page
             #time.sleep(1)
             page_all_theme_list = soup(page_url, HEADER).find('div', class_='kind_show')
             theme_list = page_all_theme_list.find_all('div', class_='photo_thumb kind_left')
             for theme_item in theme_list:    # find theme
                 theme = theme_item.find('div', class_='title').find('a')
                 #title = theme.string
                 img_url = theme.get('href')
                 print('Ready download: %s' % theme.string, img_url)
                 # time.sleep(1)
                 img_page_tag = soup(img_url, HEADER).find('td', class_='t_f').find_all('img')
                 for img_tag in img_page_tag:  # find image
                     try:
                         img_src = img_tag.get('src')
                         # Folder name derived from the count (100, 200, ...),
                         # so a failed download cannot skip a folder
                         dir_name = (download_count // 100 + 1) * 100
                         save_data(img_src, dir_name, img_src.split('/')[-1])
                         download_count += 1
                         print('Download count: %d' % download_count)
                     except Exception as e:
                         print('Img_tag & Save_data Error:', e)
                         continue
     except Exception as e:
         print('The trunk Error:', e)

 if __name__ == '__main__':
     print('Run.....')
     URL = 'http://www.lesb.cc/'
     action(URL)
     print('Finished!')
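Two steps in the script are plain string handling and arithmetic, so they can be checked without touching the network: building the per-page URL and choosing the folder for each batch of 100 images. The following is a minimal sketch using only the standard library. The helper names `page_url` and `batch_dir` are hypothetical and not part of the script above, and the folder numbering (100, 200, ...) is one reading of the docstring "Download a count of 100 images and create a new folder".

```python
import os
from urllib.parse import urljoin


def page_url(base_url, page):
    """Build the listing URL for one page (query string taken from the script)."""
    # urljoin resolves a relative reference against the base URL and
    # works whether or not base_url ends with a slash.
    return urljoin(base_url, 'forum.php?order=&fid=0&page=%d' % page)


def batch_dir(root, download_count, batch_size=100):
    """Return the folder for the next image: a new folder every batch_size files."""
    # Integer division maps counts 0..99 to batch 1, 100..199 to batch 2, and so on.
    batch_no = download_count // batch_size + 1
    return os.path.join(root, 'images', str(batch_no * batch_size))


if __name__ == '__main__':
    # example.com stands in for the real site root.
    print(page_url('http://example.com/', 3))
    for count in (0, 99, 100, 250):
        print(count, batch_dir('.', count))
```

Deriving the folder from the running count, instead of incrementing a separate counter, keeps the folder name stable when a download fails and the count does not advance.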
