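02_A Simple Python Crawler (Who Is the Strongest LOL Streamer on Panda TV?)
Disclaimer:
This article is only a Python practice exercise; no malicious attack of any kind is intended.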
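# Import the request module
from urllib import request
# Import the re module
import re


class Spider():
    # The URL must start with http or https
    # Page to crawl: the LOL category on the Panda TV live-streaming platform
    # (we extract each streamer's name and viewer count)
    url_to_run = r'https://www.panda.tv/cate/lol'
    htmls = None  # Holds the fetched HTML content
    # Non-greedy match up to the nearest </div>; this is the parent tag that
    # contains both the streamer-name tag and the viewer-count tag
    root_pattern = '<div class="video-info">(.*?)</div>'
    # Non-greedy match from </i> to the nearest </span>; captures the streamer name
    name_pattern = '</i>(.*?)</span>'
    # Non-greedy match up to the nearest </span>; captures the viewer count
    number_pattern = '<span class="video-number">(.*?)</span>'
    # Final results; each element is {'name': streamer name, 'number': viewer count}
    result_list = []

    @classmethod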
    def fetch_content(cls):
        """
        Send a request for the target page to the server, as a browser would.
        Store the returned HTML page as a string in Spider.htmls.
        :return: None
        """
        # urlopen() in the request module wraps the server's reply in a
        # file-like object (an HTTPResponse instance)
        result = request.urlopen(cls.url_to_run)
        # print(result.getcode())  # HTTP status code; 200 means the page was fetched normally
        # print(result.geturl())   # URL actually retrieved; shows whether a redirect occurred
        cls.htmls = result.read()  # Raw HTML content, of type bytes
        cls.htmls = str(cls.htmls, encoding='utf-8')  # Decode the bytes into a str

    @classmethod
    def analysis(cls):
        """
        Analyze the HTML page stored in Spider.htmls and extract
        1) the streamer name
        2) the viewer count
        Each streamer name and viewer count are combined into a dict and
        appended to cls.result_list.
        :return: None
        """
        # root_pattern uses a capture group, so the results no longer include
        # the outer video-info tag
        video_info_lst = re.findall(cls.root_pattern, cls.htmls, flags=re.S)
        for video in video_info_lst:
            up_host = re.findall(cls.name_pattern, video, flags=re.S)
            video_number = re.findall(cls.number_pattern, video, flags=re.S)
            # Skip blocks in which either field is missing
            if not up_host or not video_number:
                continue
            # Clean up up_host: take the first match and strip the surrounding
            # whitespace, including the leading and trailing newlines
            up_host = up_host[0].strip()
            # Clean up video_number: take the value out of the list
            video_number = video_number[0]
            # Combine the streamer name and viewer count into a dict and
            # append it to the result list
            dic = {'name': up_host, 'number': video_number}
            cls.result_list.append(dic)

    @classmethod
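    def sort_seed(cls, item):
        """
        The elements of result_list are dicts, which cannot be compared directly.
        This method designates the 'number' field as the basis for comparison:
        sorted() passes in each dict, and sort_seed returns its viewer count
        as a number.
        :return: the numeric value of item['number'], used as the sort key
        """
        # Match the decimal part as well, so that '1.5万' is read as 1.5 rather than 1
        r = re.findall(r'\d+(?:\.\d+)?', item['number'])
        number = float(r[0])
        # Convert counts expressed in units of '万' (10,000)
        if '万' in item['number']:
            number *= 10000
        return number

    @classmethod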
    def sort_result(cls):
        """
        Sort the elements of cls.result_list by viewer count, in descending order.
        :return: None
        """
        # sorted(iterable, key=None, reverse=False)
        cls.result_list = sorted(cls.result_list, key=cls.sort_seed, reverse=True)

    @classmethod
    def show(cls):
        print("Total Uphost: " + str(len(cls.result_list)))
        print('='*45)
        # enumerate() yields the rank directly, avoiding a list.index() lookup per item
        for rank, item in enumerate(cls.result_list, start=1):
            print('Uphost:' + item['name'] + " ," + "Rank: " + str(rank) + ' Video Watched: ' + item['number'])

    @classmethod
    def go(cls):
        cls.fetch_content()
        cls.analysis()
        cls.sort_result()
        cls.show()


# Test code for the class
Spider.go()
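The extraction step in analysis can be tried without any network access. The sketch below applies the same three patterns to a small HTML fragment. The fragment itself, the video-nickname class, the names SAMPLE_HTML and extract, and the streamer names are all assumptions made for illustration; the real page markup is not reproduced in this article.

```python
import re

# Hypothetical HTML fragment. The real page markup is not shown in the
# article, so this structure is an assumption that merely fits the patterns.
SAMPLE_HTML = """
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        streamer_a
    </span>
    <span class="video-number">12.5万</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>
        streamer_b
    </span>
    <span class="video-number">8530</span>
</div>
"""

# The same three patterns as in the Spider class
ROOT_PATTERN = r'<div class="video-info">(.*?)</div>'
NAME_PATTERN = r'</i>(.*?)</span>'
NUMBER_PATTERN = r'<span class="video-number">(.*?)</span>'


def extract(html):
    """Return a list of {'name': ..., 'number': ...} dicts found in html."""
    results = []
    # re.S lets '.' match newlines, so each block may span several lines;
    # the non-greedy '.*?' stops at the first closing tag instead of the last
    for block in re.findall(ROOT_PATTERN, html, flags=re.S):
        names = re.findall(NAME_PATTERN, block, flags=re.S)
        numbers = re.findall(NUMBER_PATTERN, block, flags=re.S)
        if names and numbers:
            results.append({'name': names[0].strip(),
                            'number': numbers[0].strip()})
    return results


records = extract(SAMPLE_HTML)
for record in records:
    print(record)
```

With a greedy `.*` in ROOT_PATTERN the first match would run to the last </div> in the fragment and swallow both blocks, which is why the patterns use `.*?`.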
Partial output from an actual run of Spider.go():
Total Uphost:
=============================================
Uphost:即将拥有人鱼线的PDD ,Rank: Video Watched: .7万
Uphost:RNG丶MLXG ,Rank: Video Watched: .5万
Uphost:熊猫伏念 ,Rank: Video Watched: .7万
Uphost:药水哥s ,Rank: Video Watched: .3万
Uphost:WE丶Mystic丶 ,Rank: Video Watched: .0万
Uphost:叫我官人 ,Rank: Video Watched: .5万
Uphost:冠军锐雯 ,Rank: Video Watched: .5万
Uphost:熊猫丶蛮神 ,Rank: Video Watched: .3万
Uphost:起飛的辛德浪 ,Rank: Video Watched: .6万
Uphost:善言_ ,Rank: Video Watched: .9万
Uphost:左手QAQ ,Rank: Video Watched: .3万
Uphost:S7全球总决赛 ,Rank: Video Watched: .2万
Uphost:Pino一米八 ,Rank: Video Watched: .2万
Uphost:金三炮o金三岁 ,Rank: Video Watched:
Uphost:挽神z ,Rank: Video Watched:
Uphost:易小埋l ,Rank: Video Watched:
Uphost:主播毕老实 ,Rank: Video Watched:
Uphost:一剑西来QAQ ,Rank: Video Watched:
Uphost:英雄联盟活动直播间 ,Rank: Video Watched:
Uphost:超级提莫丶牛腩君 ,Rank: Video Watched:
Uphost:mid六安王 ,Rank: Video Watched:
Uphost:熊猫丶乐鱼阿卡丽 ,Rank: Video Watched:
Uphost:熊猫TV一休哥 ,Rank: Video Watched:
Uphost:小黑胖砸 ,Rank: Video Watched:
Uphost:或许这就是离岛吧 ,Rank: Video Watched:
Uphost:第一最寂寞1u ,Rank: Video Watched:
Uphost:李阿特 ,Rank: Video Watched:
Uphost:LOL日常活动直播间 ,Rank: Video Watched:
Uphost:LPL熊猫官方直播 ,Rank: Video Watched:
Uphost:熊猫TV丶小青龙 ,Rank: Video Watched:
Uphost:熊猫TV灬小豆豆 ,Rank: Video Watched:
Uphost:小啊雅大大大 ,Rank: Video Watched:
Uphost:小凯南zz ,Rank: Video Watched:
Uphost:拿铁不加糖 ,Rank: Video Watched:
Uphost:金克喵的猫珥朵丶 ,Rank: Video Watched:
Uphost:炽天使z1 ,Rank: Video Watched:
Uphost:小小小女人丶 ,Rank: Video Watched:
Uphost:東東東 ,Rank: Video Watched:
Uphost:纯纯小流_氓 ,Rank: Video Watched:
Uphost:熊猫tv芭比公主 ,Rank: Video Watched:
Uphost:big火鸡 ,Rank: Video Watched:
Uphost:机器猫mmm ,Rank: Video Watched:
Uphost:大家都叫我冷爷丶 ,Rank: Video Watched:
Uphost:栗子菌i ,Rank: Video Watched:
Uphost:星矢魔术 ,Rank: Video Watched:
Uphost:唐人leo ,Rank: Video Watched:
Uphost:十级浪 ,Rank: Video Watched:
Uphost:筱兮QAQ ,Rank: Video Watched:
Uphost:酥软迷妹小慢慢Zz ,Rank: Video Watched:
Uphost:小凡Aaaaaa ,Rank: Video Watched:
Uphost:小丸子爱吃樱桃丶 ,Rank: Video Watched:
Uphost:爱流血的兔斯基 ,Rank: Video Watched:
Uphost:凶残的喵绵绵 ,Rank: Video Watched:
Uphost:别叫凯隐叫隐神 ,Rank: Video Watched:
Uphost:Panda初心2018 ,Rank: Video Watched:
Uphost:熊猫丶大风6 ,Rank: Video Watched:
Uphost:顽皮ssssssssssss ,Rank: Video Watched:
Uphost:大表哥响尾蛇 ,Rank: Video Watched:
Uphost:告白White ,Rank: Video Watched:
Uphost:牌面之王丶火影劫 ,Rank: Video Watched:
Uphost:西湖仙境 ,Rank: Video Watched:
Uphost:飞不起来1 ,Rank: Video Watched:
Uphost:逗了个蛋 ,Rank: Video Watched:
Uphost:瓜皮球球 ,Rank: Video Watched:
Uphost:竹蜻蜓呀 ,Rank: Video Watched:
Uphost:少年阿超和阿斌 ,Rank: Video Watched:
Uphost:刚出土的i帕帕 ,Rank: Video Watched:
Uphost:小主播安旭 ,Rank: Video Watched:
Uphost:西决哟 ,Rank: Video Watched:
Uphost:Panda丶夏木 ,Rank: Video Watched:
Uphost:冰雪丶狐狸 ,Rank: Video Watched:
Uphost:夜魅丝 ,Rank: Video Watched:
Uphost:熊猫丶皮皮瓜 ,Rank: Video Watched:
Uphost:Panda灬刀刀 ,Rank: Video Watched:
Uphost:莫莫莫夏夏夏 ,Rank: Video Watched:
Uphost:皮皮翔i ,Rank: Video Watched:
Uphost:南表妹QAQ ,Rank: Video Watched:
Uphost:青蛙OB ,Rank: Video Watched:
Uphost:_Infi_ ,Rank: Video Watched:
Uphost:暴躁茹阿姨 ,Rank: Video Watched:
Uphost:整天打碟的DJ胖丶 ,Rank: Video Watched:
Uphost:熊猫丶一百 ,Rank: Video Watched:
Uphost:全蛋狮子喵 ,Rank: Video Watched:
Uphost:熊猫TV丶小66 ,Rank: Video Watched:
Uphost:电竞张全蛋长长 ,Rank: Video Watched:
Uphost:熊猫第一不亏哥 ,Rank: Video Watched:
Uphost:叫我东邪 ,Rank: Video Watched:
Uphost:熊猫TV丶一手绝 ,Rank: Video Watched:
Uphost:熊猫TV丶别勉强 ,Rank: Video Watched:
Uphost:提莫的小女朋友 ,Rank: Video Watched:
Uphost:王者蕾 ,Rank: Video Watched:
Uphost:日暮哟 ,Rank: Video Watched:
Uphost:颖妹er超甜的 ,Rank: Video Watched:
Uphost:熊猫TV丶成小七 ,Rank: Video Watched:
Uphost:熊猫tv丶马小越 ,Rank: Video Watched:
Uphost:柒柒天 ,Rank: Video Watched:
Uphost:Panda电竞白子画 ,Rank: Video Watched:
Uphost:熊猫TV_苏璞 ,Rank: Video Watched:
Uphost:你的小老虎哥哥 ,Rank: Video Watched:
Uphost:门徒zzzz ,Rank: Video Watched:
Uphost:李易钧 ,Rank: Video Watched:
Uphost:熊猫TV丶农药术士 ,Rank: Video Watched:
Uphost:熊猫贝乐 ,Rank: Video Watched:
Uphost:李小青盲僧 ,Rank: Video Watched:
Uphost:刘慕宸 ,Rank: Video Watched:
Uphost:寒风强袭 ,Rank: Video Watched:
Uphost:会蛙泳的饼干0 ,Rank: Video Watched:
Uphost:阿四德莱文丶 ,Rank: Video Watched:
Uphost:知道神龙摆尾吗 ,Rank: Video Watched:
Uphost:瓦罗兰的未来丶尨 ,Rank: Video Watched:
Uphost:JO丶欣欣 ,Rank: Video Watched:
Uphost:123ivan456 ,Rank: Video Watched:
Uphost:only丶提莫 ,Rank: Video Watched:
Uphost:情话好听但不暖心 ,Rank: Video Watched:
Uphost:小丸子真好吃 ,Rank: Video Watched:
Uphost:一只提莫送你回家 ,Rank: Video Watched:
Uphost:请叫我大腿岩丶 ,Rank: Video Watched:
Uphost:伊人芳泽瑞尔心i ,Rank: Video Watched:
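The ordering of the listing above depends on turning each count string into a number before comparing. As a minimal sketch, assuming counts appear either as plain digits or as a decimal followed by '万' (10,000), the helpers below convert the strings and number the sorted records. The names parse_viewers and rank are hypothetical, and the sample records are made up.

```python
import re


def parse_viewers(text):
    """Convert a count such as '8530' or '1.5万' into a float."""
    # Capture the integer part and an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0  # Assumption: treat an unparsable count as zero
    value = float(match.group())
    if '万' in text:
        value *= 10000
    return value


def rank(records):
    """Sort records by viewer count (descending) and attach 1-based ranks."""
    ordered = sorted(records,
                     key=lambda r: parse_viewers(r['number']),
                     reverse=True)
    return [(position, r['name'], r['number'])
            for position, r in enumerate(ordered, start=1)]


# Made-up sample data, not taken from the real site
sample = [
    {'name': 'streamer_a', 'number': '8530'},
    {'name': 'streamer_b', 'number': '12.5万'},
    {'name': 'streamer_c', 'number': '1.5万'},
]
ranking = rank(sample)
for position, name, number in ranking:
    print(position, name, number)
```

Comparing the raw strings instead would place '8530' ahead of '12.5万', since string comparison goes character by character.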