[Python Crawler] Part 1: Dynamically Scraping Website Data with Selenium + PhantomJS

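I have only just started learning web scraping. After reading up online, I wrote an example that uses Selenium + PhantomJS to fetch data from a website dynamically. Selenium and PhantomJS must be installed first; for details see

http://www.cnblogs.com/shaosks/p/6526817.html Selenium download: https://pypi.python.org/pypi/selenium/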

PhantomJS usage reference: http://javascript.ruanyifeng.com/tool/phantomjs.html and the official site: http://phantomjs.org/quick-start.html
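Before running the crawler it can help to confirm that both pieces are actually installed. The following is a minimal sketch, not part of the original example. It is written for Python 3 (it relies on `shutil.which`), and both the helper name `check_environment` and the assumption that the PhantomJS executable is called `phantomjs` and sits on the `PATH` are mine.

```python
import importlib.util
import shutil


def check_environment(binary_name="phantomjs"):
    """Report whether the selenium package and a PhantomJS binary can be found.

    binary_name is an assumption: the executable may have a different name
    or live outside PATH on your machine.
    """
    return {
        # find_spec returns None when the top-level package is not installed
        "selenium": importlib.util.find_spec("selenium") is not None,
        # shutil.which returns None when no such executable is on PATH
        binary_name: shutil.which(binary_name) is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print("%-10s %s" % (name, "found" if found else "MISSING"))
```

If either entry is reported as missing, install it by following the links above before continuing.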

The source code (Python 2 syntax) is as follows:

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains
import time
import re
import os

class Crawler:
    def __init__(self, firstUrl = "https://list.jd.com/list.html?cat=9987,653,655",
                 nextUrl = "https://list.jd.com/list.html?cat=9987,653,655&page=%d&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main"):
        self.firstUrl = firstUrl
        self.nextUrl = nextUrl

    def getDetails(self,pageIndex,id = "plist"):
        '''
        Print the details of every item on the currently loaded page
        :param pageIndex: page index
        :param id: id of the element that holds the item list
        :return:
        '''
        element = self.driver.find_element_by_id(id)
        txt = element.text.encode('utf8')
        # Every item starts with its price, so the currency sign separates items
        items = txt.split('¥')

        for item in items:
            if len(item) > 0:
                details = item.split('\n')
                print '¥' + item
                # print 'Unit price: ¥' + details[0]
                # print 'Brand: ' + details[1]
                # print 'Reviews: ' + details[2]
                # print 'Shop: ' + details[3]
        print ' '
        print 'Page ' + str(pageIndex)

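    def CatchData(self,id = "plist",totalpageCountLable = "//span[@class='p-skip']/em/b"):
        '''
        Scrape the data
        :param id: id of the element that holds the data
        :param totalpageCountLable: XPath of the element that shows the total page count
        :return:
        '''
        # Wall-clock time; on Unix time.clock() measures CPU time and ignores the sleeps
        start = time.time()
        self.driver = webdriver.PhantomJS()
        wait = ui.WebDriverWait(self.driver, 10)
        self.driver.get(self.firstUrl)
        # Wait until the page-count element has loaded before continuing
        wait.until(lambda driver: driver.find_element_by_xpath(totalpageCountLable))
        # Read the total number of pages
        pcount = self.driver.find_element_by_xpath(totalpageCountLable)
        txt = pcount.text.encode('utf8')
        print 'Total pages: ' + txt
        print ' '
        pageNum = int(txt)
        pageNum = min(pageNum, 3)  # only scrape the first three pages
        i = 1
        while (i <= pageNum):
            # Print the items of the page that is currently loaded (page i)
            self.getDetails(i,id)
            print ' '
            if i < pageNum:
                time.sleep(5)  # wait 5 seconds so that fast requests do not get the IP banned
                self.driver.get(self.nextUrl % (i + 1))
                # Wait for the item list of the next page before reading it
                wait.until(lambda driver: driver.find_element_by_id(id))
                # driver.find_element_by_id("submit").click()
            i = i + 1
        else:
            print 'Load Over'
        self.driver.quit()  # shut down the PhantomJS process
        end = time.time()
        print "Time: %f s" % (end - start)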

def main():
    # URL of the first page
    firstUrl = "https://list.jd.com/list.html?cat=9987,653,655"
    # URL template for the following pages
    nextUrl = "https://list.jd.com/list.html?cat=9987,653,655&page=%d&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main"
    cw = Crawler(firstUrl, nextUrl)
    # XPath of the element that shows the total page count
    totalpageCountLable = "//span[@class='p-skip']/em/b"
    # id of the element that holds the data
    id = "plist"
    cw.CatchData(id,totalpageCountLable)

# Test run
if __name__ == '__main__':
    main()
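The commented-out print lines in getDetails suggest that each item's text holds the price, brand, review count and shop on successive lines. The sketch below shows one way to turn that text into records instead of printing it raw. It is an illustration only: it is written for Python 3, the function name `parse_items`, the field names and the sample text are all made up, and the field order is an assumption that should be checked against the real page.

```python
def parse_items(text, separator="¥"):
    """Split the text of the item list into one dict per item.

    Assumption: every item starts with the price marker, and the lines
    that follow are brand, review count and shop, in that order.
    """
    fields = ("price", "brand", "reviews", "shop")
    records = []
    for chunk in text.split(separator):
        # Drop blank lines and surrounding whitespace
        lines = [line.strip() for line in chunk.split("\n") if line.strip()]
        if not lines:
            continue  # skip empty chunks, e.g. the empty string before the first marker
        # zip stops at the shorter sequence, so missing trailing fields are simply absent
        record = dict(zip(fields, lines))
        record["price"] = separator + record["price"]
        records.append(record)
    return records


# Hypothetical sample text, invented for illustration only
sample = (
    "¥1999.00\nPhone A\n1200 reviews\nShop One\n"
    "¥899.00\nPhone B\n35 reviews\nShop Two\n"
)

for record in parse_items(sample):
    print(record)
```

Returning a list of dicts keeps the parsing separate from the printing, so the same records could later be written to a file or a database.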

Reference: http://blog.csdn.net/eastmount/article/details/47907341
