pyspider example

Data directory:

C:\Users\Administrator\data
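To inspect what the crawler saved, the results can be read back with Python's built-in sqlite3 module. The sketch below is a minimal illustration, not pyspider's own API. It assumes the default SQLite backend, a table named resultdb_<project> with columns (taskid, url, result, updatetime), and a result column holding JSON text. The helper name load_results is hypothetical, and the demo runs against an in-memory database that only mimics that assumed layout.

```python
import json
import sqlite3


def load_results(conn, project):
    """Return the decoded result dicts stored for one project.

    Assumption: results live in a table named resultdb_<project> with
    columns (taskid, url, result, updatetime), and result is JSON text.
    """
    if not project.isidentifier():
        raise ValueError("unexpected project name: %r" % project)
    table = "resultdb_" + project
    rows = conn.execute('SELECT url, result FROM "%s"' % table)
    return [json.loads(result) for _url, result in rows]


# Demo against an in-memory database that mimics the assumed layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resultdb_qunaer (taskid PRIMARY KEY, url, result, updatetime)"
)
conn.execute(
    "INSERT INTO resultdb_qunaer VALUES (?, ?, ?, ?)",
    ("t1", "https://example.com/a",
     json.dumps({"title": "demo", "img": "a.jpg,b.jpg"}), 0),
)
results = load_results(conn, "qunaer")
print(results[0]["img"].split(","))
```

For a real run, the connection would point at the database file inside the data directory instead of ":memory:". The exact file name depends on how pyspider was started, so check the directory first.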

Upgraded version (collects every img tag in the article, including deeply nested ones)

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2019-04-08 14:24:34
# Project: qunaer

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://travel.qunar.com/travelbook/list.htm', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('li > .tit>a').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False, fetch_type='js', js_viewport_height='')
        # Follow the "next page" link; the last page has none, so guard against None.
        next1 = response.doc('.next').attr.href
        if next1:
            self.crawl(next1, callback=self.index_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        # Select every img tag under .js_memo_node, including deeply nested ones.
        imgs = response.doc('.js_memo_node').find('img')
        # Must be initialized first; otherwise using img_list below raises
        # UnboundLocalError (local variable referenced before assignment).
        img_list = ''
        for img in imgs.items():
            if img.attr.src:  # skip img tags that have no src attribute
                img_list += img.attr.src + ','  # join all image URLs with commas
        # [Review] str.join does the same job: img_list = ', '.join(['cats', 'rats', 'bats'])
        return {
            "url": response.url,
            "title": response.doc('title').text(),
            "date": response.doc('li.f_item.when > p > span.data').text(),
            "day": response.doc('li.f_item.howlong > p > span.data').text(),
            "text": response.doc('#b_panel_schedule').text(),
            "img": img_list
        }

#ele-3076663-2 > div.bottom > div.e_img_schedule > div > dl:nth-child(2) > dt > img
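The image field above is built by appending each URL plus a comma, which leaves a trailing comma on the stored string. The str.join idiom mentioned in the review comment avoids that. The sketch below is a minimal illustration in plain Python: join_image_urls and split_image_urls are hypothetical helper names, and the src values are made up. It assumes no image URL itself contains a comma.

```python
def join_image_urls(srcs, sep=","):
    """Join image URLs with sep, skipping missing (None) or empty values."""
    return sep.join(src for src in srcs if src)


def split_image_urls(joined, sep=","):
    """Recover the list of URLs; tolerates a trailing separator."""
    return [part for part in joined.split(sep) if part]


# Made-up src values; None stands for an img tag without a src attribute.
srcs = ["https://example.com/1.jpg", None, "https://example.com/2.jpg"]
joined = join_image_urls(srcs)
print(joined)
print(split_image_urls(joined))
```

Inside detail_page, the same idea would be passing the src of each item from imgs.items() to such a helper. split_image_urls also accepts strings produced by the loop version, because empty parts are dropped.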

Example A

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2019-04-08 14:24:34
# Project: qunaer

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://travel.qunar.com/travelbook/list.htm', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('li > .tit>a').items():
            # Render with JS and set the viewport height so that lazy-loaded
            # images are not cut off halfway.
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False, fetch_type='js', js_viewport_height='')
        # Follow the "next page" link; the last page has none, so guard against None.
        next1 = response.doc('.next').attr.href
        if next1:
            self.crawl(next1, callback=self.index_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        #imgs=response.doc('#js_mainleft').find('img')
        #for img in imgs.items():
        #    img_list=img_list+img+','
        return {
            "url": response.url,
            "title": response.doc('title').text(),
            "date": response.doc('li.f_item.when > p > span.data').text(),
            "day": response.doc('li.f_item.howlong > p > span.data').text(),
            "text": response.doc('#b_panel_schedule').text(),
            # .attr.src on a multi-element selection returns only the first img's src.
            "img": response.doc('#js_mainleft').find('img').attr.src
        }

#ele-3076663-2 > div.bottom > div.e_img_schedule > div > dl:nth-child(2) > dt > img
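Example A stores a single image URL because reading .attr.src from a PyQuery selection that matches several elements returns the attribute of the first match only. The sketch below illustrates the difference between "first image" and "all images, however deeply nested". It uses the standard library's html.parser as a stand-in for PyQuery so that it runs without extra packages. The ImgSrcCollector class and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every img tag, however deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Made-up page fragment with images at different nesting depths.
html = """
<div id="js_mainleft">
  <div><dl><dt><img src="a.jpg"></dt></dl></div>
  <div><div><p><img src="b.jpg"></p></div></div>
</div>
"""
collector = ImgSrcCollector()
collector.feed(html)
first_only = collector.srcs[0]  # what a "first match" lookup would give
all_srcs = collector.srcs       # what iterating over every match gives
print(first_only)
print(all_srcs)
```

In pyspider itself the equivalent is to iterate over response.doc(...).find('img').items() and read attr.src from each item, as the upgraded version does.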
