5. Hadoop-based nginx access log analysis: userAgent and spider
User agent:
Code (spiders not included):
# cat top_10_useragent.py
#!/usr/bin/env python
# coding=utf-8
from mrjob.job import MRJob
from mrjob.step import MRStep
from nginx_accesslog_parser import NginxLineParser
import heapq


class UserAgent(MRJob):
    nginx_line_parser = NginxLineParser()

    def mapper(self, _, line):
        # Parse one log line and emit (user agent, 1).
        self.nginx_line_parser.parse(line)
        field_item = self.nginx_line_parser.http_user_agent
        if field_item is not None:
            yield field_item, 1

    def reducer_sum(self, key, values):
        # Sum the hits per user agent. The None key sends every
        # (count, user agent) pair to the same reducer in step 2.
        yield None, (sum(values), key)

    def reducer_top100(self, _, values):
        # Despite the name, this keeps only the 10 largest counts.
        for count, path in heapq.nlargest(10, values):
            yield count, path
        # for count, path in sorted(values, reverse=True)[:10]:
        #     yield count, path

    def steps(self):
        return (
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum
                   ),
            MRStep(reducer=self.reducer_top100)
        )


def main():
    UserAgent.run()


if __name__ == '__main__':
    main()
Result:
# python3 top_10_useragent.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_10_useragent.root.20161228.090725.308144
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top_10_useragent.root.20161228.090725.308144/output...
85262 "IE"
79611 "Chrome"
48560 "Other"
10662 "Firefox"
7927 "Mobile Safari UI/WKWebView"
7182 "Sogou Explorer"
6681 "QQ Browser"
1988 "Mobile Safari"
1781 "Maxthon"
1404 "Edge"
Removing temp directory /tmp/top_10_useragent.root.20161228.090725.308144...
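The two steps of the job can be mimicked in plain Python to see why the top-10 selection works. This is only a minimal sketch and not part of the original job: the helper name top_n and the sample list are made up for illustration. Step 1 counts each key, and step 2 receives every (count, key) pair in one group and keeps the largest ones with heapq.nlargest, which compares tuples by their first element (the count) first.

```python
import heapq
from collections import Counter


def top_n(items, n=10):
    """Return the n most frequent items as (count, item) pairs, largest first."""
    # Step 1 (mapper + reducer_sum): count how often each key occurs.
    counts = Counter(items)
    # Step 2: with all (count, key) pairs in a single group, keep the n
    # largest. Tuples compare by count first, then by key on ties.
    return heapq.nlargest(n, ((count, key) for key, count in counts.items()))


if __name__ == "__main__":
    # Made-up sample data, not taken from the real access log.
    sample = ["IE", "Chrome", "IE", "Firefox", "Chrome", "IE"]
    for count, agent in top_n(sample, n=2):
        print(count, agent)
```

heapq.nlargest only has to keep n candidates at a time, whereas the commented-out sorted(values, reverse=True)[:10] alternative in the job sorts the whole list before slicing.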
Spiders:
#!/usr/bin/env python
# coding=utf-8
from mrjob.job import MRJob
from mrjob.step import MRStep
from nginx_accesslog_parser import NginxLineParser
import heapq


class Spider(MRJob):
    nginx_line_parser = NginxLineParser()

    def mapper(self, _, line):
        # Parse one log line and emit (user agent type, 1).
        self.nginx_line_parser.parse(line)
        field_item = self.nginx_line_parser.user_agent_type
        if field_item is not None:
            yield field_item, 1

    def reducer_sum(self, key, values):
        # Sum the hits per spider. The None key sends every
        # (count, spider) pair to the same reducer in step 2.
        yield None, (sum(values), key)

    def reducer_top100(self, _, values):
        # Despite the name, this keeps only the 10 largest counts.
        for count, path in heapq.nlargest(10, values):
            yield count, path
        # for count, path in sorted(values, reverse=True)[:10]:
        #     yield count, path

    def steps(self):
        return (
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum
                   ),
            MRStep(reducer=self.reducer_top100)
        )


def main():
    Spider.run()


if __name__ == '__main__':
    main()
Execution result:
# python3 top_10_spider.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_10_spider.root.20161228.091326.295972
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top_10_spider.root.20161228.091326.295972/output...
33542 "magpie-crawler"
25880 "Other"
16578 "Sogou web spider"
6383 "bingbot"
3688 "Baiduspider"
1487 "Yahoo! Slurp"
1096 "JikeSpider"
731 "YisouSpider"
648 "Baiduspider-image"
470 "Googlebot"
Removing temp directory /tmp/top_10_spider.root.20161228.091326.295972...