Web Scraping in Practice: Downloading Images from 7160美图网 with requests and BeautifulSoup

import requests

import os

from bs4 import BeautifulSoup

import re

# Start URL

all_url = 'http://www.7160.com/xiaohua/'

# Save directory

path = 'H:/school_girl/'

# Request headers

header = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'

}

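################################# Request the index page (list of listing pages) #################################

html = requests.get(all_url,headers = header)

start_html = html.text.encode('iso-8859-1').decode('gbk')  # undo requests' ISO-8859-1 fallback: recover the raw bytes, then decode them as GBK (a superset of GB2312)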

################################# Parse #################################

soup = BeautifulSoup(start_html,'lxml')

# Maximum page number (hard-coded here, not read from the page)

page = 255

# Common URL prefix for the listing pages

same_url = 'http://www.7160.com/xiaohua/'

for n in range(1,int(page)+1):

    ul = same_url + 'list_6_' + str(n) + '.html'

    #################### Request one listing page (many galleries) ###############

    html = requests.get(ul,headers = header)

    start_html = html.text.encode('iso-8859-1').decode('gbk')

    ######################## Parse ##########################

    soup = BeautifulSoup(start_html,'lxml')

    all_a = soup.find('div',class_='news_bom-left').find_all('a',target = '_blank')

    for a in all_a:

        title = a.get_text()

        if title != '':

            ######################## Create the directory ##########################

            # Windows cannot create a directory whose name contains '?'

            folder = path + title.strip().replace('?', '')

            if os.path.exists(folder):

                flag = 1  # directory already exists

            else:

                os.makedirs(folder)

                flag = 0

            os.chdir(folder)

            ######################### END ###########################

            ################### Request one gallery ###############

            print('准备爬取:' + title)

            hrefs = a['href']

            in_url = 'http://www.7160.com'

            href = in_url + hrefs

            htmls = requests.get(href,headers = header)

            html = htmls.text.encode('iso-8859-1').decode('gbk')

            ####################### Parse ######################

            mess = BeautifulSoup(html,'lxml')

            titles = mess.find('h1').text

            pic_max = mess.find('div',class_ = 'itempage').find_all('a')[-2].text # page count: text of the second-to-last pager link

            if (flag == 1 and len(os.listdir(path + title.strip().replace('?', ''))) >= int(pic_max)):

                print('已经保存完毕，跳过')

                continue

            for num in range(1,int(pic_max)+1):

                # Strip the trailing '.html' so a page suffix can be appended below

                href = re.sub(r'\.html$', '', a['href'])

                if num == 1:

                    html = in_url + href + '.html'

                else:

                    html = in_url + href + '_' + str(num) + ".html"

                ################### Request one page within the gallery ###############

                htmls = requests.get(html,headers = header)

                html = htmls.text.encode('iso-8859-1').decode('gbk')

                ####################### Parse ######################

                mess = BeautifulSoup(html,'lxml')

                pic_url = mess.find('img',alt = titles)

                if pic_url is None:  # no <img> with a matching alt text on this page; skip it

                    continue

                print(pic_url['src'])

                ######################### Download #####################

                html = requests.get(pic_url['src'],headers = header)

                filename = pic_url['src'].split(r'/')[-1]

                # The file is written to the current directory set by os.chdir above

                with open(filename,'wb') as f:

                    f.write(html.content)

            print('完成')

    print('第',n,'页完成')

The printed output looks like this:

准备爬取:
阳光下校花美女迷人桃花眼嘴
http://img.7160.com/uploads/allimg/180913/13-1P913102541.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102541-50.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102541-51.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102542.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102542-50.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102542-51.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102542-52.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102542-53.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102542-54.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102543.jpg
http://img.7160.com/uploads/allimg/180913/13-1P913102543-50.jpg
完成
准备爬取:
黑长直发美女学生日系风制服
http://img.7160.com/uploads/allimg/180912/13-1P912102159.jpg
http://img.7160.com/uploads/allimg/180912/13-1P912102159-50.jpg
http://img.7160.com/uploads/allimg/180912/13-1P912102159-51.jpg
http://img.7160.com/uploads/allimg/180912/13-1P912102159-52.jpg
http://img.7160.com/uploads/allimg/180912/13-1P912102200.jpg
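
The string handling in the script (the encoding round trip, the folder-name cleanup, and the per-page URL) can be pulled out into small pure functions and checked without touching the network. The following is only a sketch under stated assumptions: the function names `fix_encoding`, `safe_folder_name` and `page_url` are hypothetical, the href `/xiaohua/12345.html` is a made-up example, and `safe_folder_name` drops every character Windows forbids in file names, whereas the script above only drops `?`.

```python
import re

BASE_URL = 'http://www.7160.com'  # same site root as in_url in the script


def fix_encoding(mis_decoded: str) -> str:
    """Undo an ISO-8859-1 mis-decode of GBK bytes (the encode/decode trick in the script)."""
    return mis_decoded.encode('iso-8859-1').decode('gbk')


def safe_folder_name(title: str) -> str:
    """Strip surrounding whitespace and drop characters Windows forbids in file names.

    Assumption: this is broader than the script, which only removes '?'.
    """
    return re.sub(r'[\\/:*?"<>|]', '', title.strip())


def page_url(href: str, num: int) -> str:
    """Build the URL of page `num` of a gallery from its first-page href."""
    stem = re.sub(r'\.html$', '', href)          # drop the trailing '.html'
    suffix = '' if num == 1 else '_' + str(num)  # page 1 has no suffix
    return BASE_URL + stem + suffix + '.html'


if __name__ == '__main__':
    raw = '美图'.encode('gbk')        # bytes as a GBK server would send them
    wrong = raw.decode('iso-8859-1')  # the text seen when the body is decoded as ISO-8859-1
    print(fix_encoding(wrong))
    print(safe_folder_name('  what?/name  '))
    print(page_url('/xiaohua/12345.html', 1))  # hypothetical href
    print(page_url('/xiaohua/12345.html', 3))
```

Keeping these steps free of I/O makes them easy to verify on their own; the request and download code can then call them instead of repeating the expressions inline.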
