[Python Crawler] Part 3: Scraping Data with Selenium and IEDriverServer

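Continuing from the previous post: while scraping data with Selenium + PhantomJS, I found that the data sometimes could not be fetched, so I also tested Selenium with a real browser driver. The code is as follows: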

#coding=utf-8
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
import IniFile
class IEDriverCrawler:

    def __init__(self):
        # Read the path to IEDriverServer.exe from the config file
        configfile = os.path.join(os.getcwd(),'config.conf')
        cf = IniFile.ConfigFile(configfile)
        IEDriverServer = cf.GetValue("section", "IEDriverServer")
        # Delay after scraping each page, in seconds (default: 5)
        self.pageDelay = 5
        pageInteralDelay = cf.GetValue("section", "pageInteralDelay")
        if pageInteralDelay:
            self.pageDelay = int(pageInteralDelay)

        os.environ["webdriver.ie.driver"] = IEDriverServer
        self.driver = webdriver.Ie(IEDriverServer)

    def CatchData(self,id,firstUrl,nextUrl,restUrl):
        '''
        Scrape data by building each page URL directly.
        :param id: ID of the element whose text is scraped
        :param firstUrl: URL of the first page
        :param nextUrl: URL prefix for subsequent pages (ends with "page=")
        :param restUrl: URL suffix appended after the page number
        :return:
        '''
        # Load the first page
        self.driver.get(firstUrl)
        # Print the page title
        print self.driver.title
        # id = "J_albumFlowCon"
        element = self.driver.find_element_by_id(id)
        txt = element.text.encode('utf8')
        # Print the scraped text
        print txt
        print ' '
        time.sleep(20)  # wait 20 seconds
        # There are many pages; for testing, fetch only pages 2 and 3
        for i in range(2, 4):
            print ' '
            time.sleep(20)  # wait 20 seconds
            url = nextUrl + str(i) + restUrl
            self.driver.get(url)
            element = self.driver.find_element_by_id(id)
            txt = element.text.encode('utf8')
            print txt
        self.driver.close()
        self.driver.quit()

    def CatchDatabyClickNextButton(self,id,firstUrl):
        '''
        Scrape data by following the "next page" link.
        :param id: ID of the element whose text is scraped
        :param firstUrl: URL of the first page
        :return:
        '''
        start = time.time()  # wall-clock start time
        # Load the first page
        self.driver.get(firstUrl)
        # Print the page title
        print self.driver.title
        # id = "J_ItemList"
        firstPage = self.driver.find_element_by_id(id)
        txt = firstPage.text.encode('utf8')
        self.printTxt(1,txt)

        # Get the total number of pages from the pager form
        name = 'filterPageForm'
        totalPageElement = self.driver.find_element_by_name(name)
        txt = totalPageElement.text.encode('utf8')
        pattern = re.compile(r'\d+')
        flist  = re.findall(pattern, txt)
        pageCount = 1
        if flist:
            pageCount = int(flist[0])
        if pageCount > 1:
            pageCount = min(pageCount, 10)  # for testing, scrape at most 10 pages
            for index in range(2,pageCount + 1):
                time.sleep(self.pageDelay)  # wait pageDelay seconds between pages
                nextElement = self.driver.find_element_by_xpath("//a[@class='ui-page-next']")
                nextUrl = nextElement.get_attribute('href')
                self.driver.get(nextUrl)
                # ActionChains(self.driver).click(element)
                dataElement = self.driver.find_element_by_id(id)
                txt = dataElement.text.encode('utf8')
                print ' '
                self.printTxt(index, txt)

        self.driver.close()
        self.driver.quit()
        end = time.time()
        print ' '
        print "Delay after each page: %d seconds" % self.pageDelay
        print "Pages scraped: %d" % pageCount
        print "Total time: %f seconds" % (end - start)

    def printTxt(self,pageIndex,stringTxt):
        '''
        Print the data scraped from one page.
        :param pageIndex: page number
        :param stringTxt: text scraped from the page
        :return:
        '''
        if stringTxt.find('¥') > -1:
            itemList = stringTxt.split('¥')
            print 'Page ' + str(pageIndex) + ' data'
            print ' '
            for item in itemList:
                if len(item) > 0:
                    its = item.split('\n')
                    if len(its)>=4:
                        print 'Price:        ¥%s' % its[0]
                        print 'Brand:        %s' % its[1]
                        print 'Shop:         %s' % its[2]
                        print 'Sales volume: %s' % its[3]
                        print ' '

# Test: scrape Taobao data
# obj = IEDriverCrawler()
# firstUrl = "https://ai.taobao.com/search/index.htm?pid=mm_26632323_6762370_25910879&unid=&source_id=search&key=%E6%89%8B%E6%9C%BA&b=sousuo_ssk&clk1=&prepvid=200_11.251.246.148_396_1490081427029&spm=a231o.7712113%2Fa.a3342.1"
# nextUrl='https://ai.taobao.com/search/index.htm?pid=mm_26632323_6762370_25910879&unid=&source_id=search&key=%E6%89%8B%E6%9C%BA&b=sousuo_ssk&clk1=&prepvid=200_11.251.246.157_19825_1490081412211&spm=a231o.7076277.1998559105.1&page='
# # url='https://ai.taobao.com/search/index.htm?pid=mm_26632323_6762370_25910879&unid=&source_id=search&key=%E6%89%8B%E6%9C%BA&b=sousuo_ssk&clk1=&prepvid=200_11.251.246.148_396_1490081427029&spm=a231o.7712113%2Fa.a3342.1&page=2&pagesize=120'
# # url='https://ai.taobao.com/search/index.htm?pid=mm_26632323_6762370_25910879&unid=&source_id=search&key=%E6%89%8B%E6%9C%BA&b=sousuo_ssk&clk1=&prepvid=200_11.251.246.148_396_1490081427029&spm=a231o.7712113%2Fa.a3342.1&page=3&pagesize=120'
# restUrl = '&pagesize=120'
# obj.CatchData("J_albumFlowCon",firstUrl,nextUrl,restUrl)

# Test: scrape Tmall data
obj = IEDriverCrawler()
firstUrl = "https://list.tmall.com/search_product.htm?q=%CA%D6%BB%FA&type=p&vmarket=&spm=875.7931836%2FB.a2227oh.d100&from=mallfp..pc_1_searchbutton"
obj.CatchDatabyClickNextButton("J_ItemList",firstUrl)
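
The two text-processing steps in the listing (reading the page count from the pager text, and splitting the item text on `¥` in `printTxt`) can be tried without a browser. The following is a minimal sketch in Python 3, unlike the Python 2 listing above. The function names `parse_page_count` and `parse_items` and the sample string are made up for illustration, and the text Tmall actually returns may be laid out differently.

```python
import re


def parse_page_count(pager_text, default=1):
    """Return the first integer found in the pager text, or `default` if none."""
    match = re.search(r'\d+', pager_text)
    return int(match.group()) if match else default


def parse_items(page_text, sep='¥'):
    """Split page text on the price marker and return one dict per item.

    Mirrors printTxt: each chunk after the separator is expected to hold at
    least four lines (price, brand, shop name, sales volume); shorter chunks
    are skipped.
    """
    items = []
    for chunk in page_text.split(sep):
        lines = chunk.split('\n')
        if len(lines) >= 4:
            items.append({
                'price': lines[0],
                'brand': lines[1],
                'shop': lines[2],
                'sales': lines[3],
            })
    return items


# Made-up sample text, shaped like the input printTxt expects
sample = ('¥1999.00\nBrandA phone\nShop One\n1200 sold\n'
          '¥2999.00\nBrandB phone\nShop Two\n300 sold\n')
records = parse_items(sample)
```

Returning a list of dicts instead of printing keeps the parsing separate from the output, so the records could also be written to a file or a database.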

This article is for discussion purposes only.
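
The `IniFile` module imported by the crawler is not shown in this post. For readers who want to run the listing, here is a rough sketch of a compatible `ConfigFile` class. It assumes that `config.conf` is a standard INI file with a `[section]` block holding the `IEDriverServer` and `pageInteralDelay` keys. This is a hypothetical stand-in written in Python 3 on top of the standard `configparser` module, not the author's original module.

```python
import configparser


class ConfigFile:
    """Hypothetical stand-in for IniFile.ConfigFile (not the author's code)."""

    def __init__(self, filename):
        # interpolation=None keeps '%' characters in values literal
        self.parser = configparser.ConfigParser(interpolation=None)
        # read() silently skips a missing file, so GetValue then returns ''
        self.parser.read(filename, encoding='utf-8')

    def GetValue(self, section, key):
        """Return the value as a string, or '' if the section or key is missing."""
        return self.parser.get(section, key, fallback='')
```

Returning an empty string for a missing key matches how the crawler uses the result: `if pageInteralDelay:` is false for `''`, so the default delay of 5 seconds stays in effect.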
