Scraping a Baidu Wenku PPT with Python and Selenium and generating a PDF

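I wrote the detailed walkthrough at another address: https://www.yuque.com/docs/share/aacfa45c-22c5-4ef6-be97-cd6849002274

That is a bit awkward, so...

Here I will just post the code for another example (《数学模型答案》). It saves each page image as a numbered PNG file.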

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import time
import re
import requests


class downloader:
    def __init__(self):
        self.browser = webdriver.Chrome()
        # Wait up to 3 seconds for an element to appear
        self.wait = WebDriverWait(self.browser, 3)
        # Counter used to name the saved image files
        self.i = 0
        # Extracts the image address from an inline style such as url("...")
        self.pattern = re.compile(r'.*?url\("(.*?)"\)', re.S)

    def __call__(self, url):
        try:
            self.download(url)
            while True:
                for link in self.parse_link():
                    self.save(link)
                try:
                    sub = self.browser.find_element(By.ID, 'next-pageList-1')
                except NoSuchElementException:
                    # Assumes the "next page" button is absent once all pages are loaded
                    break
                self.browser.execute_script("arguments[0].scrollIntoViewIfNeeded(true);", sub)
                sub.click()
        finally:
            # Always close the browser, even if a step above fails
            self.browser.quit()

    def download(self, url):
        self.browser.get(url)
        # Click the "continue reading" control so the remaining pages are rendered
        submit = self.wait.until(EC.presence_of_element_located(
            (By.XPATH, '//*[@id="html-reader-go-more"]/div[2]/div[1]/span/span[1]')))
        self.browser.execute_script("arguments[0].scrollIntoViewIfNeeded(true);", submit)
        submit.click()

    def parse_link(self):
        self.elem = self.wait.until(EC.presence_of_element_located((By.ID, 'reader-container-inner-1')))
        for page in self.elem.find_elements(By.CLASS_NAME, 'bd'):
            try:
                # Scroll each page into view and pause so its lazily loaded image appears
                self.browser.execute_script("arguments[0].scrollIntoViewIfNeeded(true);", page)
                time.sleep(0.6)
                pic = page.find_element(By.CLASS_NAME, 'reader-pic-item')
                style = pic.get_attribute('style')
                href = self.pattern.findall(style)
                if href:
                    yield href[0]
            except NoSuchElementException:
                continue

    def save(self, link):
        # Download the image bytes and write them to 0.png, 1.png, ...
        content = requests.get(link, timeout=10).content
        with open('{}.png'.format(self.i), 'wb') as f:
            f.write(content)
        self.i += 1


if __name__ == '__main__':
    D = downloader()
    D('https://wenku.baidu.com/view/d86fe3436c175f0e7dd13731')
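
The regular expression compiled in `__init__` is what pulls the image address out of each page element's inline `style` attribute. Here is a minimal sketch of that step on its own, assuming the style string contains a CSS `url("...")` value. The sample style string, the example.com address, and the helper name `extract_image_url` are made up for illustration and are not taken from Baidu Wenku.

```python
import re

# Same pattern as in the downloader class: capture whatever sits inside url("...").
PATTERN = re.compile(r'.*?url\("(.*?)"\)', re.S)


def extract_image_url(style):
    """Return the first url("...") target in an inline style string, or None."""
    matches = PATTERN.findall(style)
    return matches[0] if matches else None


# Hypothetical style value; a real page's attribute may be formatted differently.
sample_style = (
    'width: 100%; '
    'background-image: url("https://example.com/page-1.png"); '
    'height: 600px;'
)

print(extract_image_url(sample_style))    # the address inside url("...")
print(extract_image_url('width: 100%;'))  # None, since there is no url(...) value
```

Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `findall` result would raise. The pattern only matches a double-quoted address, so a style written with single quotes or no quotes would also yield `None`.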
