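Web Scraping: Crawling the Status of Selected LOL Streamers on Douyu

A small crawler that scrapes Douyu's LOL category page and ranks the listed streamers by viewer count.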

import re

import requests

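class Spider():
    # Douyu's League of Legends category page.
    url = 'https://www.douyu.com/g_lol'
    # Each streamer entry is expected inside a <p>...</p> block;
    # [\s\S]*? matches lazily across newlines.
    root_pattern = r'<p>([\s\S]*?)</p>'
    name_pattern = r'<span class="dy-name ellipsis fl">([\s\S]*?)</span>'
    number_pattern = r'<span class="dy-num fr"  >([\s\S]*?)</span>'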

    def __fetch_content(self):
        # Download the category page and return its HTML as text.
        r = requests.get(Spider.url, timeout=10)
        r.raise_for_status()
        return r.text

    def __analysis(self, htmls):
        # Collect the raw name and viewer-count matches from each <p> block.
        root_htmls = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_htmls:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

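    def __refine(self, anchors):
        # Keep the first match of each field, stripped of whitespace, and
        # skip blocks where either field is missing.
        return [
            {
                'name': anchor['name'][0].strip(),
                'number': anchor['number'][0].strip()
            }
            for anchor in anchors
            if anchor['name'] and anchor['number']
        ]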

    def __sort(self, anchors):
        # Sort by viewer count, highest first.
        return sorted(anchors, key=self.__sort_seed, reverse=True)

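    def __sort_seed(self, anchor):
        # Take the first number in the text, decimals included (e.g. '12.3').
        r = re.search(r'\d+(?:\.\d+)?', anchor['number'])
        number = float(r.group()) if r else 0.0
        # '万' means ten thousand.
        if '万' in anchor['number']:
            number *= 10000
        return number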

    def __show(self, anchors):
        # Print one line per streamer: rank, name, viewer count.
        for rank, anchor in enumerate(anchors, start=1):
            print(
                'Viewer rank ' + str(rank)
                + ' : ' + anchor['name']
                + '~~~~~~' + anchor['number']
            )

    def go(self):
        # Pipeline: fetch, extract, clean, sort, print.
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()
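
The extraction step can be tried without touching the network. The sketch below runs the same three patterns against a small HTML string. The sample markup and the streamer names are made up to fit the patterns and are not taken from the real page, and `extract_anchors` is a hypothetical helper name, so treat this as an illustration rather than the crawler itself.

```python
import re

# Hypothetical markup written to match the patterns used by Spider;
# the real Douyu page may look different.
SAMPLE_HTML = """
<p>
  <span class="dy-name ellipsis fl">streamer_a</span>
  <span class="dy-num fr"  >1.5万</span>
</p>
<p>
  <span class="dy-name ellipsis fl">streamer_b</span>
  <span class="dy-num fr"  >8000</span>
</p>
<p>no streamer fields here</p>
"""

ROOT_PATTERN = r'<p>([\s\S]*?)</p>'
NAME_PATTERN = r'<span class="dy-name ellipsis fl">([\s\S]*?)</span>'
NUMBER_PATTERN = r'<span class="dy-num fr"  >([\s\S]*?)</span>'


def extract_anchors(html):
    """Return {'name', 'number'} dicts, skipping blocks missing either field."""
    anchors = []
    for block in re.findall(ROOT_PATTERN, html):
        names = re.findall(NAME_PATTERN, block)
        numbers = re.findall(NUMBER_PATTERN, block)
        if names and numbers:
            anchors.append({
                'name': names[0].strip(),
                'number': numbers[0].strip(),
            })
    return anchors


for anchor in extract_anchors(SAMPLE_HTML):
    print(anchor['name'], anchor['number'])
```

Because the lazy `[\s\S]*?` stops at the first closing tag, each `<p>` block is matched separately, and the third block is dropped since it contains neither field.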

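Running the script prints the streamers ranked by viewer count, one per line.

If you are interested, go check where your favorite streamers rank.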
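
The ranking depends on turning count strings such as `8000` or `1.5万` into comparable numbers. Below is a minimal sketch of one way to do that conversion, assuming counts are either plain digits or a decimal followed by `万` (ten thousand). `parse_viewer_count` is a hypothetical helper name, and the sample counts are invented for illustration.

```python
import re


def parse_viewer_count(text):
    """Convert a count such as '1.5万' or '8000' to a float; 0.0 if no number."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0
    value = float(match.group())
    # '万' is the Chinese unit for ten thousand.
    if '万' in text:
        value *= 10000
    return value


counts = ['8000', '1.5万', '12万']  # made-up sample values
ranked = sorted(counts, key=parse_viewer_count, reverse=True)
print(ranked)
```

The pattern `\d+(?:\.\d+)?` keeps the fractional part, so `1.5万` is read as 15000 and sorts above `8000`.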
