Python crawlers with urllib, urllib2, lxml, and Selenium + PhantomJS

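I. I have recently been learning about web crawlers. To be honest, I have written very few crawlers, and I have hardly used the crawler libraries available in Java either. While learning Python I came to appreciate how powerful Python crawlers are and how concise the code is, so this post starts from the very beginning and focuses on the basic development approach. The Scrapy framework will be covered later; this post is only an introduction to crawling with Python.

II. urllib and urllib2. Both modules (Python 2) handle URL requests, so we start with them. Here is an example: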

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib
import urllib2

# URL to crawl
url = "http://www.baidu.com"

# Headers that imitate a browser
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

# Form data
form_data = {
    "start": "",
    "end": ""
}

# URL-encode the form data
data = urllib.urlencode(form_data)

# Build the request (passing data makes it a POST)
request = urllib2.Request(url, data=data, headers=headers)

# Send the request and read the response body
html = urllib2.urlopen(request).read()

print html

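Explanation: the result is obtained by having the code imitate a browser when it visits the page. form_data only simulates the parameters of an AJAX request and serves no real purpose in this example.

Note: if urllib2.Request is given data, the request is sent as POST; otherwise it is sent as GET.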
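The GET/POST rule can be checked without touching the network. The sketch below is an assumption-laden illustration rather than the post's own code: it uses Python 3, where the functionality of urllib2 lives in urllib.request and urlencode lives in urllib.parse, and it only builds the request objects and asks each one which method it would use.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.baidu.com"

# No data argument: the request would be sent as GET.
get_request = Request(url)

# In Python 3 the encoded form must be bytes, hence the .encode() call.
form_data = urlencode({"start": "", "end": ""}).encode("utf-8")
post_request = Request(url, data=form_data)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Nothing is sent here; get_method() only reports the method that urlopen would use for each request.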

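III. With the response in hand, the next step is parsing it. lxml is a commonly used parser; there are other options as well, such as re (regular expressions), which are not covered in detail here:

1) Before getting to parsing, a word about urllib2 handlers:

a. Kinds of handler: there are many, for example cookie, proxy, auth, and HTTP handlers.

b. Why use handlers: the cookie handler keeps session state; the proxy handler sends requests through a proxy IP, which helps when your own IP has been blocked; the auth handler deals with authentication; the HTTP handler deals with the HTTP request itself.

c. Handlers are very common and are especially useful for sessions and proxies.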

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib
import urllib2
import cookielib

# Build a CookieJar object to store the cookie values
cookie = cookielib.CookieJar()

# Build a cookie handler with HTTPCookieProcessor;
# its argument is the CookieJar object created above
cookie_handler = urllib2.HTTPCookieProcessor(cookie)

# Proxy handler
# httpproxy_handler = urllib2.ProxyHandler({"http" : "ip:port"})

# Proxy handler with credentials
# authproxy_handler = urllib2.ProxyHandler({"http" : "username:password@ip:port"})

# Auth handlers
# passwordMgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# passwordMgr.add_password(None, ip, username, password)
# httpauth_handler = urllib2.HTTPBasicAuthHandler(passwordMgr)
# proxyauth_handler = urllib2.ProxyBasicAuthHandler(passwordMgr)
# opener = urllib2.build_opener(httpauth_handler, proxyauth_handler)

# Build a custom opener
opener = urllib2.build_opener(cookie_handler)

# HTTP headers can be added through the opener's addheaders attribute
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")]

# Login endpoint of renren.com
url = "http://www.renren.com/PLogin.do"

# Account and password to log in with (placeholders)
data = {"email": "email address", "password": "password"}

# URL-encode the form data
data = urllib.urlencode(data)

# The first request is a POST that sends the login parameters and obtains the cookie
request = urllib2.Request(url, data=data)

# Send the POST; the cookie jar now holds the login cookie (if the login succeeded)
response = opener.open(request)
# print response.read()

# The second request can be a GET; the stored cookie is sent along with it
# and the server validates it
response_deng = opener.open("http://www.renren.com/410043129/profile")

# Read the page that is only accessible after logging in
html = response_deng.read()

print html
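The two-step flow in the example above (POST to log in, then GET with the stored cookie) can be tried without a real account. The following is a minimal sketch under several assumptions: it uses Python 3 (http.cookiejar and urllib.request in place of cookielib and urllib2), and the local server, its /login and /profile paths, and the session=abc123 cookie are all hypothetical stand-ins for a real site.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, ProxyHandler, Request, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site with a login endpoint."""

    def do_POST(self):
        # Consume the form body, then hand out a session cookie.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Report whether the session cookie came back with the request.
        cookie = self.headers.get("Cookie", "")
        body = b"logged in" if "session=abc123" in cookie else b"anonymous"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# ProxyHandler({}) disables any proxy configured in the environment,
# so the requests go straight to the local demo server.
cookie_jar = CookieJar()
opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(cookie_jar))

# First request: POST the (fake) credentials; the jar stores the cookie.
form = urlencode({"email": "user@example.com", "password": "secret"}).encode("utf-8")
opener.open(Request(base + "/login", data=form)).read()

# Second request: the same opener sends the stored cookie automatically.
with_cookie = opener.open(base + "/profile").read()

# An opener without the cookie handler is treated as a new visitor.
without_cookie = build_opener(ProxyHandler({})).open(base + "/profile").read()

print(with_cookie, without_cookie)

server.shutdown()
server.server_close()
```

The point of the design is that the cookie never appears in the calling code: the HTTPCookieProcessor handler captures it from the first response and attaches it to every later request made through the same opener.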

2) Parsing (lxml):

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import urlparse
from lxml import etree

if __name__ == '__main__':
    # url = raw_input("Enter the URL of the page to crawl images from: ")
    url = "https://www.baidu.com"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}

    # Fetch the page, sending the browser-like headers
    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()

    # Parse with lxml's etree
    content = etree.HTML(html)

    # XPath expression: the src attribute of every <img>
    img_list = content.xpath("//img/@src")

    # Download each image
    for link in img_list:
        try:
            # Resolve protocol-relative ("//host/a.png") and relative links
            link = urlparse.urljoin(url, link)
            img = urllib2.urlopen(link).read()
            # Use the last path segment as the file name
            file_name = link.split("/")[-1]
            with open(file_name, "wb") as f:
                f.write(img)
        except Exception as err:
            print err

XPath syntax reference: http://www.w3school.com.cn/xpath/xpath_syntax.asp
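The image-link handling in the lxml example can be tried offline on a small inline page. This is only a sketch with several assumptions: it runs on Python 3, it swaps lxml for the standard library's xml.etree.ElementTree so that no third-party package is needed (ElementTree supports only a subset of XPath and requires well-formed markup, so real pages still call for lxml), and the sample HTML and example.com URLs are made up for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit

# Hypothetical page URL and well-formed sample markup.
page_url = "https://www.example.com/index.html"
sample_html = """
<html><body>
  <img src="//www.example.com/a.png"/>
  <img src="/static/b.jpg"/>
  <img alt="no source"/>
  <img src="https://cdn.example.com/img/c.gif?x=1"/>
</body></html>
"""

root = ET.fromstring(sample_html)

# ".//img[@src]" selects every <img> that has a src attribute.
raw_links = [img.get("src") for img in root.findall(".//img[@src]")]

# urljoin resolves protocol-relative and relative links against the page URL.
full_links = [urljoin(page_url, link) for link in raw_links]

# Take the file name from the URL path only, ignoring any query string.
file_names = [posixpath.basename(urlsplit(link).path) for link in full_links]

for link, name in zip(full_links, file_names):
    print(link, name)
```

Compared with prefixing "https:" by hand, urljoin also copes with site-relative links such as /static/b.jpg, and splitting on the URL path keeps query strings out of the file name.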

3) An alternative approach (BeautifulSoup):

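#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import urlparse
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = "https://tieba.baidu.com/index.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}

    # Fetch the page, sending the browser-like headers
    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()

    # Parse with BeautifulSoup, using lxml as the underlying parser
    bs = BeautifulSoup(html, "lxml")

    # Find the <img> tags (set the class filter to suit the target page)
    img_list = bs.find_all("img", attrs={"class": ""})

    # Download each image
    for img in img_list:
        link = img.get("src")
        if not link:
            # Skip <img> tags without a src attribute
            continue
        try:
            # Resolve protocol-relative ("//host/a.png") and relative links
            link = urlparse.urljoin(url, link)
            img_data = urllib2.urlopen(link).read()
            # Use the last path segment as the file name
            file_name = link.split("/")[-1]
            with open(file_name, "wb") as f:
                f.write(img_data)
        except Exception as err:
            print err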

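IV. Selenium + PhantomJS. Everything above crawls static HTML, but most sites today load their data dynamically through AJAX. That creates a problem: the JavaScript has to finish loading before the crawler can move on to the next step. Browser-automation frameworks exist to handle exactly this part of crawling.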

# -*- coding: UTF-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Drive a Chrome browser
    driver = webdriver.Chrome()

    # Load the page
    driver.get("https://www.douyu.com/directory/all")

    while True:
        # Parse the rendered page source
        bs = BeautifulSoup(driver.page_source, "lxml")

        # Names of the live rooms
        title_list = bs.find_all("h3", attrs={"class": "DyListCover-intro"})

        # Popularity of the live rooms
        hot_list = bs.find_all("span", attrs={"class": "DyListCover-hot"})

        # Walk both lists in a single loop
        for title, hot in zip(title_list, hot_list):
            print title.text, hot.text

        # Stop when there is no "next page" button
        if driver.page_source.find("dy-Pagination-next") == -1:
            break

        # Go to the next page (Selenium 3 locator API; Selenium 4 uses
        # driver.find_element(By.CLASS_NAME, ...) instead)
        driver.find_element_by_class_name("dy-Pagination-next").click()

    # Quit the browser
    driver.quit()

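Remark: I use the Chrome browser here. PhantomJS has been abandoned, so it is not recommended.

chromedriver.exe has to be downloaded from Google's site (which requires getting past the firewall); a mirror that includes version 75 is given here.

Taobao mirror download: https://npm.taobao.org/mirrors/chromedriver/

The following shows one way to handle images that are loaded lazily: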

# -*- coding: UTF-8 -*-
import os
import time
import urllib2
from selenium import webdriver

if __name__ == '__main__':
    # Drive Chrome; this needs chromedriver.exe (placed in the same directory as this file)
    # Remember to keep the browser window full screen while the script runs
    driver = webdriver.Chrome()

    # Page to crawl
    driver.get("https://www.douyu.com/directory/all")

    # Make sure the output directory exists
    if not os.path.isdir("img"):
        os.makedirs("img")

    while True:
        # Wait for the page to load after moving to the next page
        time.sleep(2)

        # Scroll down step by step (tune these values for the specific page)
        for i in xrange(1, 11):
            js = "document.documentElement.scrollTop=%d" % (i * 1000)
            driver.execute_script(js)
            # Wait for the newly visible content to load
            time.sleep(1)

        # Collect the images on the page (via XPath)
        img_list = driver.find_elements_by_xpath("//div[@class='LazyLoad is-visible DyImg DyListCover-pic']/img")

        # Download each image
        for img in img_list:
            try:
                link = img.get_attribute("src")
                # The second-to-last URL segment is used as the file name
                file_name = link.split("/")[-2]
                print file_name
                # Read the image and write it to disk
                img_data = urllib2.urlopen(link).read()
                with open("img/" + file_name, "wb") as f:
                    f.write(img_data)
            except Exception as err:
                print err

        # Check whether there is a next page
        if driver.page_source.find("dy-Pagination-next") == -1:
            break

        # If there is, go to it
        driver.find_element_by_class_name("dy-Pagination-next").click()

    # Quit the browser
    driver.quit()

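Explanation: this crawler only targets the images on Douyu; other pages need further adjustments. That is basically all there is to handling JavaScript and the images you want to crawl from a page.

This code is for learning and experimentation only and must not be used for any other purpose.

V. That more or less counts as an introduction. Crawling something complete still takes a lot of work. Here I have only covered some commonly used libraries, the code I used in my own tests, and the parts I do not fully understand yet. It is meant for learning only; if you spot any mistakes, please point them out so I can correct them promptly.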
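The fixed time.sleep calls in the lazy-loading example wait the same amount of time whether or not the content has already appeared. One alternative is to poll for a condition with a timeout; Selenium provides WebDriverWait for that purpose. The sketch below is a dependency-free illustration of the same idea in Python 3, not Selenium's API: wait_until and FakePage are hypothetical names, and FakePage merely imitates a page whose images show up after a few polls.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakePage(object):
    """Stand-in for a page whose images appear only after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def images(self):
        self.polls += 1
        if self.polls < self.ready_after:
            return []
        return ["a.jpg", "b.jpg"]


page = FakePage(ready_after=3)
images = wait_until(page.images, timeout=2.0, interval=0.01)
print(images, page.polls)
```

With a real driver, the condition would be a function that looks up the lazily loaded elements and returns them once the list is non-empty, so the crawler continues as soon as the content is there and fails with a clear error if it never arrives.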
