pyspider Example Code 4: Search Engine Crawling

Search Engine Crawling

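#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on -- ::
# Project: __git_lab_fix

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # Assumption: the interval is missing from the source; 24 * 60 (once a day)
    # is the value used by pyspider's default project template.
    @every(minutes=24 * 60)
    def on_start(self):
        # Keywords to search for in GitLab issues.
        keywords = ['bigsec', 'password', 'email', 'tongdun', 'vpn', 'address', 'pop3',
                    'smtp', 'imap', 'zhengxin', 'jdbc', 'mysql', 'credit', 'access_token', 'client_secret',
                    'privatekey', 'secret_key', 'xiecheng', 'ctrip', 'tongcheng']
        for keyword in keywords:
            url = 'https://gitlab.com/search?group_id=&scope=issues&search=' + keyword
            self.crawl(url, callback=self.index_page)

    # Assumption: the age is missing from the source; 10 days is the value
    # used by pyspider's default project template.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow the link to the next page of results, if there is one.
        next_url = response.doc('.next > a').attr.href
        if next_url:
            self.crawl(next_url, callback=self.index_page)
        # Queue every absolute link found in a result heading.
        for each in response.doc('h4 > a[href^="http"]').items():
            # print(each.text())
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(etag=True)
    def detail_page(self, response):
        # Return the text of the first matching description element.
        for each in response.doc('.detail-page-description').items():
            return {
                "app": "githack",
                "origin": "gitlab.net",
                "code": each.text(),
            }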
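
The on_start method above builds each search URL by plain string concatenation. That is fine for the ASCII keywords in the list, but a keyword containing a space or `&` would produce a malformed query string. A minimal sketch of encoding the query with the standard library, assuming Python 3; `build_search_url` is a hypothetical helper and not part of pyspider:

```python
from urllib.parse import urlencode


def build_search_url(keyword, base="https://gitlab.com/search"):
    """Return an issue-search URL for one keyword (hypothetical helper)."""
    # Same parameters, in the same order, as the URL built in on_start.
    params = [("group_id", ""), ("scope", "issues"), ("search", keyword)]
    # urlencode escapes spaces, '&', and other reserved characters.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    for kw in ["bigsec", "secret key"]:
        print(build_search_url(kw))
```

The returned string could be passed to self.crawl in place of the concatenated URL.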

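#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on -- ::
# Project: __git_zhibo

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # Assumption: the interval is missing from the source; 24 * 60 (once a day)
    # is the value used by pyspider's default project template.
    @every(minutes=24 * 60)
    def on_start(self):
        # Keywords to search for in GitHub code search.
        keywords = ['douyu', 'panda', 'zhanqi', 'longzhu', 'huya', 'yy', 'momo', 'tv']
        for keyword in keywords:
            url = 'https://github.com/search?p=1&q=' + keyword + '&type=Code&utf8'
            self.crawl(url, callback=self.index_page)

    # Assumption: the age is missing from the source; 10 days is the value
    # used by pyspider's default project template.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow the link to the next page of results, if there is one.
        next_url = response.doc('.next_page').attr.href
        if next_url:
            self.crawl(next_url, callback=self.index_page)
        # Assumption: the counter's start value, step, and modulus are missing
        # from the source; starting at 0, stepping by 1, and keeping every
        # second link is a guess.
        flag = 0
        for each in response.doc('.title > a').items():
            flag += 1
            if flag % 2 == 0:
                self.crawl(each.attr.href, callback=self.into_page)

    # Assumption: the age is missing from the source; same default as above.
    @config(age=10 * 24 * 60 * 60, etag=True)
    def into_page(self, response):
        # Return the text of the first table on the page.
        for each in response.doc('table').items():
            return {
                "app": "githack",
                "origin": "github.net",
                "code": each.text(),
            }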
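
The index_page method above uses a manual counter (`flag`) so that only some of the `.title > a` links are crawled. The same selection can be written with `enumerate`. This is a minimal sketch: `every_nth` is a hypothetical helper, and the default `n=2` is an assumption because the modulus used in the script is not known.

```python
def every_nth(items, n=2):
    """Yield the n-th, 2n-th, 3n-th, ... item of an iterable (1-based)."""
    for count, item in enumerate(items, start=1):
        # Equivalent to incrementing a counter and testing count % n == 0.
        if count % n == 0:
            yield item


if __name__ == "__main__":
    links = ["repo-1", "file-1", "repo-2", "file-2"]
    print(list(every_nth(links)))
```

Inside a handler, the loop body would call self.crawl on each yielded link rather than printing it.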
