This post is based on an earlier article: http://blog.fens.me/hadoop-mapreduce-log-kpi/

I ported it from Hadoop 1.x to 2.x, although little actually needed to change. (Honestly, the accompanying videos are not worth watching; the article itself is the best reference.)


First, build the Hadoop project with Maven.

Download Maven, add it to the environment variables, point Eclipse at this Maven installation instead of its embedded one, change the default local repository location, and so on.

Unlike the original article, I did not create the Maven project with the mvn command line; creating it directly in Eclipse, a convenient IDE, is enough.

The important part is adding the dependencies to pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.admln</groupId>
  <artifactId>getKPI</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>getKPI</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-common</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.7</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
  </dependencies>
</project>

Then just let Maven download the JARs (the first build downloads a lot and is slow; after that the dependencies are cached locally and builds are fast).


Next come the MR jobs.

Their task is to extract a few KPI metrics from the access logs.

Log format:

 222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939
"http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

Useful fields (a short sketch after this list shows how they map onto the space-split line):

  • remote_addr: the client IP address, 222.68.172.190
  • remote_user: the client user name, "-" here
  • time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]
  • request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
  • status: the response status code (200 means success), 200
  • body_bytes_sent: the size of the response body sent to the client, 19939
  • http_referer: the page the request was linked from, "http://www.angularjs.cn/A00n"
  • http_user_agent: the client browser information, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
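
To make the field positions concrete, here is a small standalone sketch (my own addition, not from the original article) that splits the sample line above on single spaces and prints the fields; the array indices are the same ones KPI.parser relies on below.

public class LogLineSplitDemo {
    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
                + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 "
                + "\"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
        String[] arr = line.split(" ");
        System.out.println("remote_addr     = " + arr[0]);              // 222.68.172.190
        System.out.println("remote_user     = " + arr[1]);              // -
        System.out.println("time_local      = " + arr[3].substring(1)); // 18/Sep/2013:06:49:57
        System.out.println("request         = " + arr[6]);              // /images/my.jpg
        System.out.println("status          = " + arr[8]);              // 200
        System.out.println("body_bytes_sent = " + arr[9]);              // 19939
        System.out.println("http_referer    = " + arr[10]);             // "http://www.angularjs.cn/A00n"
        // the user agent is split across tokens, so the parser keeps arr[11] (plus arr[12] when present)
        System.out.println("http_user_agent = " + arr[11] + " " + arr[12]);
    }
}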

KPI targets:

  • PV (PageView): page view counts
  • IP: unique visitor IPs per page
  • Time: PV per hour
  • Source: referrer domain statistics
  • Browser: visitor browser (user agent) statistics

The MR code:

KPI.java

package org.admln.kpi;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * @author admln
 */
public class KPI {
    private String remote_addr;      // client IP address
    private String remote_user;      // client user name, ignored ("-")
    private String time_local;       // access time and time zone
    private String request;          // requested URL and HTTP protocol
    private String status;           // response status code; 200 means success
    private String body_bytes_sent;  // size of the response body sent to the client
    private String http_referer;     // the page the request was linked from
    private String http_user_agent;  // client browser information

    private boolean valid = true;    // whether this record is valid

    public static KPI parser(String line) {
        KPI kpi = new KPI();
        String[] arr = line.split(" ");
        if (arr.length > 11) { // a well-formed record splits into at least 12 tokens
            kpi.setRemote_addr(arr[0]);
            kpi.setRemote_user(arr[1]);
            kpi.setTime_local(arr[3].substring(1));
            kpi.setRequest(arr[6]);
            kpi.setStatus(arr[8]);
            kpi.setBody_bytes_sent(arr[9]);
            kpi.setHttp_referer(arr[10]);
            if (arr.length > 12) {
                kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
            } else {
                kpi.setHttp_user_agent(arr[11]);
            }
            if (Integer.parseInt(kpi.getStatus()) > 400) { // treat status codes above 400 as invalid
                kpi.setValid(false);
            }
        } else {
            kpi.setValid(false);
        }
        return kpi;
    }

    public static KPI filterPVs(String line) {
        KPI kpi = parser(line);
        Set<String> pages = new HashSet<String>();
        pages.add("/about");
        pages.add("/black-ip-list/");
        pages.add("/cassandra-clustor/");
        pages.add("/finance-rhive-repurchase/");
        pages.add("/hadoop-family-roadmap/");
        pages.add("/hadoop-hive-intro/");
        pages.add("/hadoop-zookeeper-intro/");
        pages.add("/hadoop-mahout-roadmap/");
        if (!pages.contains(kpi.getRequest())) {
            kpi.setValid(false);
        }
        return kpi;
    }

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public Date getTime_local_Date() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        return df.parse(this.time_local);
    }

    // for aggregating the data by hour
    public String getTime_local_Date_Hour() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
        return df.format(this.getTime_local_Date());
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }

    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }
}

KPIBrowser.java

package org.admln.kpi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 */
public class KPIBrowser {

    public static class browserMapper extends Mapper<Object, Text, Text, IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getHttp_user_agent());
                context.write(word, ONE);
            }
        }
    }

    public static class browserReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        int sum;

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/browser/output");

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "get KPI Browser");

        job.setJarByClass(KPIBrowser.class);
        job.setMapperClass(browserMapper.class);
        job.setCombinerClass(browserReducer.class);
        job.setReducerClass(browserReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

KPIIP.java

package org.admln.kpi;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 */
public class KPIIP {
    // mapper: emit (request URL, client IP)
    public static class ipMapper extends Mapper<Object, Text, Text, Text> {
        private Text word = new Text();
        private Text ips = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                ips.set(kpi.getRemote_addr());
                context.write(word, ips);
            }
        }
    }

    // reducer: count distinct IPs per URL
    public static class ipReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();
        private Set<String> count = new HashSet<String>();

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            count.clear(); // reset for each key, otherwise IPs leak across URLs
            for (Text val : values) {
                count.add(val.toString());
            }
            result.set(String.valueOf(count.size()));
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/ip/output");

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "get KPI IP");
        job.setJarByClass(KPIIP.class);

        job.setMapperClass(ipMapper.class);
        // no combiner here: emitting the distinct count locally would make the reducer
        // count count-strings instead of IP addresses
        job.setReducerClass(ipReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

KPIPV.java

package org.admln.kpi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 */
public class KPIPV {

    public static class pvMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text word = new Text();
        private final static IntWritable ONE = new IntWritable(1);

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.filterPVs(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                context.write(word, ONE);
            }
        }
    }

    public static class pvReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/pv/output");

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "get KPI PV");

        job.setJarByClass(KPIPV.class);
        job.setMapperClass(pvMapper.class);
        job.setCombinerClass(pvReducer.class);
        job.setReducerClass(pvReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

KPISource.java

package org.admln.kpi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 */
public class KPISource {

    public static class sourceMapper extends Mapper<Object, Text, Text, IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getHttp_referer());
                context.write(word, ONE);
            }
        }
    }

    public static class sourceReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        int sum;

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/source/output");

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "get KPI Source");

        job.setJarByClass(KPISource.class);
        job.setMapperClass(sourceMapper.class);
        job.setCombinerClass(sourceReducer.class);
        job.setReducerClass(sourceReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

KPITime.java

package org.admln.kpi;

import java.io.IOException;
import java.text.ParseException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 */
public class KPITime {

    public static class timeMapper extends Mapper<Object, Text, Text, IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if (kpi.isValid()) {
                try {
                    word.set(kpi.getTime_local_Date_Hour());
                } catch (ParseException e) {
                    e.printStackTrace();
                }
                context.write(word, ONE);
            }
        }
    }

    public static class timeReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        int sum;

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/time/output");

        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "get KPI Time");

        job.setJarByClass(KPITime.class);
        job.setMapperClass(timeMapper.class);
        job.setCombinerClass(timeReducer.class);
        job.setReducerClass(timeReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The five MR jobs are all essentially the same: slight variations on WordCount. (The original article seems to contain a couple of small mistakes, which I noticed and fixed.)
The Hadoop environment: Hadoop 2.2.0, JDK 1.7, a pseudo-distributed setup in a VM, IP 192.168.111.132.

The results:

Here the original article extracts a fixed set of pages; in practice you can extract whichever pages you need. A sketch of one way to do that follows.
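
As an illustration (not from the original article), the page whitelist could be read from the job Configuration instead of being hard-coded in KPI.filterPVs. The property name kpi.pv.pages and the ConfigurablePvMapper class below are my own assumptions, a sketch rather than a definitive implementation.

package org.admln.kpi;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical alternative to pvMapper: the page whitelist comes from the job
// Configuration (the property name "kpi.pv.pages" is my own choice) instead of
// being hard-coded in KPI.filterPVs.
public class ConfigurablePvMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();
    private Set<String> pages = new HashSet<String>();

    @Override
    protected void setup(Context context) {
        // the driver would call conf.set("kpi.pv.pages", "/about,/hadoop-hive-intro/") before job creation
        String list = context.getConfiguration().get("kpi.pv.pages", "");
        for (String p : list.split(",")) {
            if (!p.isEmpty()) {
                pages.add(p);
            }
        }
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        KPI kpi = KPI.parser(value.toString());
        if (kpi.isValid() && pages.contains(kpi.getRequest())) {
            word.set(kpi.getRequest());
            context.write(word, ONE);
        }
    }
}

With such a mapper, the KPIPV driver would set the page list on the Configuration and use it in place of pvMapper; KPI.filterPVs would no longer be needed.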


The full code and log data: http://pan.baidu.com/s/1qW5D63M

Log data for practice can also be obtained elsewhere, for example from Sogou Labs: http://www.sogou.com/labs/dl/q.html


About cron: one workable approach, I think, is this. Say the logs are produced by Tomcat, and Tomcat is configured to write each day's logs into a directory named after that date. A shell script then uses the hadoop command to copy the current day's Tomcat log directory up to HDFS (the naming scheme on HDFS also needs some thought), runs the MR jobs, and finally copies the results out, again via shell, to HBase, Hive, MySQL, Redis, or wherever the applications need them.


Corrections for anything I have gotten wrong are welcome.

