Gora Official Documentation, Part 2: Gora's Map-Reduce Support
Reference: the official documentation at http://gora.apache.org/current/tutorial.html
Project code: https://code.csdn.net/jediael_lu/mygorademo
For environment setup, see http://blog.csdn.net/jediael_lu/article/details/43272521
At this point, the data from the previous example has already been stored in HBase, as shown below:
\x00\x00\x00\x00\x00\x00\x00D column=common:ip, timestamp=1422529645469, value=85.100.75.104
\x00\x00\x00\x00\x00\x00\x00D column=common:timestamp, timestamp=1422529645469, value=\x00\x00\x01\x1F\xF1\xB5\x88\xA0
\x00\x00\x00\x00\x00\x00\x00D column=common:url, timestamp=1422529645469, value=/index.php?i=2&a=1__z_nccylulyu&k=238241
\x00\x00\x00\x00\x00\x00\x00D column=http:httpMethod, timestamp=1422529645469, value=GET
\x00\x00\x00\x00\x00\x00\x00D column=http:httpStatusCode, timestamp=1422529645469, value=\x00\x00\x00\xC8
\x00\x00\x00\x00\x00\x00\x00D column=http:responseSize, timestamp=1422529645469, value=\x00\x00\x00+
\x00\x00\x00\x00\x00\x00\x00D column=misc:referrer, timestamp=1422529645469, value=http://www.buldinle.com/index.php?i=2&a=1__Z_nccYlULyU&k=238241
\x00\x00\x00\x00\x00\x00\x00D column=misc:userAgent, timestamp=1422529645469, value=Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.0.7) Gecko/2009021
910 Firefox/3.0.7
\x00\x00\x00\x00\x00\x00\x00E column=common:ip, timestamp=1422529645469, value=85.100.75.104
\x00\x00\x00\x00\x00\x00\x00E column=common:timestamp, timestamp=1422529645469, value=\x00\x00\x01\x1F\xF1\xB5\xBFP
\x00\x00\x00\x00\x00\x00\x00E column=common:url, timestamp=1422529645469, value=/index.php?i=7&a=1__yxs0vome9p8&k=4924961
\x00\x00\x00\x00\x00\x00\x00E column=http:httpMethod, timestamp=1422529645469, value=GET
\x00\x00\x00\x00\x00\x00\x00E column=http:httpStatusCode, timestamp=1422529645469, value=\x00\x00\x00\xC8
\x00\x00\x00\x00\x00\x00\x00E column=http:responseSize, timestamp=1422529645469, value=\x00\x00\x00+
\x00\x00\x00\x00\x00\x00\x00E column=misc:referrer, timestamp=1422529645469, value=http://www.buldinle.com/index.php?i=7&a=1__YxS0VoME9P8&k=4924961
\x00\x00\x00\x00\x00\x00\x00E column=misc:userAgent, timestamp=1422529645469, value=Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.0.7) Gecko/2009021
910 Firefox/3.0.7
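The row keys and numeric cells in the scan output above are stored as big-endian binary: the 8-byte row key \x00...\x00D is the long 68 ('D' is ASCII 0x44), http:httpStatusCode \x00\x00\x00\xC8 is the int 200, and http:responseSize \x00\x00\x00+ is 43. A minimal standalone sketch (not part of the tutorial code; class and method names are mine) of decoding these bytes with java.nio.ByteBuffer:

```java
import java.nio.ByteBuffer;

public class HBaseBytesDemo {
    // Decodes an 8-byte big-endian row key into a long,
    // the same result HBase's Bytes.toLong would give.
    public static long toLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong(); // ByteBuffer is big-endian by default
    }

    // Decodes a 4-byte big-endian cell value into an int.
    public static int toInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] rowKey = {0, 0, 0, 0, 0, 0, 0, 0x44};   // \x00...\x00D
        byte[] status = {0, 0, 0, (byte) 0xC8};         // \x00\x00\x00\xC8
        System.out.println(toLong(rowKey)); // 68
        System.out.println(toInt(status));  // 200
    }
}
```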
This example uses MapReduce to read the data from HBase and analyze it, counting how many visits each URL received per day. The results are written back to HBase: the row key is a String in "url + time" form, and the value holds three columns: the url, the time, and the visit count.
0. Create a Java project and a gora.properties file with the following content:
##gora.datastore.default is the default datastore implementation to use
##if it is not passed to the DataStoreFactory#createDataStore() method.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

##whether to create schema automatically if not exists.
gora.datastore.autocreateschema=true
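This file is in standard Java properties format, so lines starting with '#' are comments and each setting is a key=value pair. A quick standalone illustration (independent of Gora) of how such a snippet parses with java.util.Properties:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class GoraPropsDemo {
    // Parses a gora.properties-style snippet; '#'-prefixed lines are comments.
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) { // cannot happen for an in-memory reader
            throw new RuntimeException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "##default datastore implementation",
            "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore",
            "gora.datastore.autocreateschema=true");
        Properties props = parse(text);
        System.out.println(props.getProperty("gora.datastore.default"));
        // prints org.apache.gora.hbase.store.HBaseStore
    }
}
```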
1. Create the JSON (Avro schema) file describing the input data and generate the corresponding class.
This was done in the previous example; see pageview.json and PageView.java:
{
"type": "record",
"name": "Pageview",
"namespace": "org.apache.gora.tutorial.log.generated",
"fields" : [
{"name": "url", "type": ["null","string"], "default":null},
{"name": "timestamp", "type": "long", "default":0},
{"name": "ip", "type": ["null","string"], "default":null},
{"name": "httpMethod", "type": ["null","string"], "default":null},
{"name": "httpStatusCode", "type": "int", "default":0},
{"name": "responseSize", "type": "int", "default":0},
{"name": "referrer", "type": ["null","string"], "default":null},
{"name": "userAgent", "type": ["null","string"], "default":null}
]
}
2. Create the class-to-table mapping file for the input data:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Gora Mapping file for HBase Backend -->
<gora-otd>
  <table name="Pageview"> <!-- optional descriptors for tables -->
    <family name="common"/> <!-- This can also have params like compression, bloom filters -->
    <family name="http"/>
    <family name="misc"/>
  </table>
  <class name="org.apache.gora.tutorial.log.generated.Pageview" keyClass="java.lang.Long" table="AccessLog">
    <field name="url" family="common" qualifier="url"/>
    <field name="timestamp" family="common" qualifier="timestamp"/>
    <field name="ip" family="common" qualifier="ip"/>
    <field name="httpMethod" family="http" qualifier="httpMethod"/>
    <field name="httpStatusCode" family="http" qualifier="httpStatusCode"/>
    <field name="responseSize" family="http" qualifier="responseSize"/>
    <field name="referrer" family="misc" qualifier="referrer"/>
    <field name="userAgent" family="misc" qualifier="userAgent"/>
  </class>
</gora-otd>
3. Create the JSON (Avro schema) file describing the output data and generate the corresponding class:
{
"type": "record",
"name": "MetricDatum",
"namespace": "org.apache.gora.tutorial.log.generated",
"fields" : [
{"name": "metricDimension", "type": "string"},
{"name": "timestamp", "type": "long"},
{"name": "metric", "type" : "long"}
]
}
liaoliuqingdeMacBook-Air:MyGoraDemo liaoliuqing$ gora goracompiler avro/metricdatum.json src/
Compiling: /Users/liaoliuqing/99_Project/git/MyGoraDemo/avro/metricdatum.json
Compiled into: /Users/liaoliuqing/99_Project/git/MyGoraDemo/src
Compiler executed SUCCESSFULL.
4. Write the class-to-table mapping for the output data and append it to the file created in step 2:
<class name="org.apache.gora.tutorial.log.generated.MetricDatum" keyClass="java.lang.String" table="Metrics">
<field name="metricDimension" family="common" qualifier="metricDimension"/>
<field name="timestamp" family="common" qualifier="ts"/>
<field name="metric" family="common" qualifier="metric"/>
</class>
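After appending this block, the mapping file (typically named gora-hbase-mapping.xml; the filename is an assumption here) holds two class mappings under a single gora-otd root. A quick standalone sanity check, using only the JDK's DOM parser, that such a merged mapping remains well-formed XML (class and method names are mine):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class MappingCheck {
    // Parses a mapping document and returns how many <class> elements it declares;
    // a malformed document (e.g. a missing close tag after the append) throws.
    public static int countClassMappings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("class").getLength();
        } catch (Exception e) {
            throw new RuntimeException("mapping file is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<gora-otd>"
            + "<class name=\"Pageview\" keyClass=\"java.lang.Long\" table=\"AccessLog\"/>"
            + "<class name=\"MetricDatum\" keyClass=\"java.lang.String\" table=\"Metrics\"/>"
            + "</gora-otd>";
        System.out.println(countClassMappings(xml)); // 2
    }
}
```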
5. Write the main class.
Key steps in the program:
(1) Obtain the input and output DataStores:
if (args.length > 0) {
  String dataStoreClass = args[0];
  inStore = DataStoreFactory.getDataStore(dataStoreClass, Long.class, Pageview.class, conf);
  if (args.length > 1) {
    dataStoreClass = args[1];
  }
  outStore = DataStoreFactory.getDataStore(dataStoreClass, String.class, MetricDatum.class, conf);
} else {
  inStore = DataStoreFactory.getDataStore(Long.class, Pageview.class, conf);
  outStore = DataStoreFactory.getDataStore(String.class, MetricDatum.class, conf);
}
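The argument handling above follows a simple fallback rule: args[0] names the input store class, args[1] the output store class; with only one argument the output reuses args[0], and with none both come from the gora.properties default. A standalone sketch of just that rule (class and helper names are mine, not Gora API):

```java
import java.util.Arrays;

public class StoreClassArgs {
    // Default taken from gora.properties in this tutorial's setup.
    static final String DEFAULT_STORE = "org.apache.gora.hbase.store.HBaseStore";

    // Mirrors the fallback logic: returns {inputStoreClass, outputStoreClass}.
    public static String[] chooseStores(String[] args) {
        String in = args.length > 0 ? args[0] : DEFAULT_STORE;
        String out = args.length > 1 ? args[1] : in;
        return new String[]{in, out};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chooseStores(new String[]{"A", "B"})));
        // prints [A, B]
        System.out.println(Arrays.toString(chooseStores(new String[]{"A"})));
        // prints [A, A]
    }
}
```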
(2) Set basic job properties:
Job job = new Job(getConf());
job.setJobName("Log Analytics");
log.info("Creating Hadoop Job: " + job.getJobName());
job.setNumReduceTasks(numReducer);
job.setJarByClass(getClass());
(3) Set the job's Mapper class and the map-phase input/output types:
GoraMapper.initMapperJob(job, inStore, TextLong.class, LongWritable.class,
LogAnalyticsMapper.class, true);
(4) Set the job's Reducer class and the reduce-phase input/output types:
GoraReducer.initReducerJob(job, outStore, LogAnalyticsReducer.class);
(5) Define the Mapper class:
public static class LogAnalyticsMapper extends GoraMapper<Long, Pageview, TextLong, LongWritable> {

  private LongWritable one = new LongWritable(1L);
  private TextLong tuple;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    tuple = new TextLong();
    tuple.setKey(new Text());
    tuple.setValue(new LongWritable());
  }

  @Override
  protected void map(Long key, Pageview pageview, Context context)
      throws IOException, InterruptedException {
    CharSequence url = pageview.getUrl();
    long day = getDay(pageview.getTimestamp());
    tuple.getKey().set(url.toString());
    tuple.getValue().set(day);
    context.write(tuple, one);
  }

  /** Rolls up the given timestamp to the day cardinality, so that
   *  data can be aggregated daily */
  private long getDay(long timeStamp) {
    return (timeStamp / DAY_MILIS) * DAY_MILIS;
  }
}
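The getDay() rollup simply truncates a millisecond timestamp to the start of its day using integer division, so every pageview within the same day maps to the same <url, day> key. A self-contained sketch of that arithmetic (class name is mine; the constant matches the tutorial's DAY_MILIS):

```java
public class DayRollupDemo {
    // Number of milliseconds in a day, as in the tutorial code.
    static final long DAY_MILIS = 1000L * 60 * 60 * 24;

    // Truncates a millisecond timestamp down to the start of its (UTC) day.
    public static long getDay(long timeStamp) {
        return (timeStamp / DAY_MILIS) * DAY_MILIS;
    }

    public static void main(String[] args) {
        long ts = 1422529645469L;            // a timestamp from the data above
        long day = getDay(ts);
        System.out.println(day % DAY_MILIS == 0);    // prints true: aligned to a day
        System.out.println(getDay(day) == day);      // prints true: idempotent
    }
}
```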
(6) Define the Reducer class:
public static class LogAnalyticsReducer extends GoraReducer<TextLong, LongWritable, String, MetricDatum> {

  private MetricDatum metricDatum = new MetricDatum();

  @Override
  protected void reduce(TextLong tuple, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0L; // sum up the values
    for (LongWritable value : values) {
      sum += value.get();
    }
    String dimension = tuple.getKey().toString();
    long timestamp = tuple.getValue().get();
    metricDatum.setMetricDimension(new Utf8(dimension));
    metricDatum.setTimestamp(timestamp);
    String key = metricDatum.getMetricDimension().toString();
    key += "_" + Long.toString(timestamp);
    metricDatum.setMetric(sum);
    context.write(key, metricDatum);
  }
}
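The reducer's output row key is the URL and the rolled-up day joined with an underscore, and the metric is simply the count of the 1L values the mapper emitted for that key. A standalone sketch of that aggregation (class and method names are mine):

```java
import java.util.Arrays;
import java.util.List;

public class MetricKeyDemo {
    // Builds the output row key the same way the reducer does: "<url>_<day>".
    public static String buildKey(String dimension, long timestamp) {
        return dimension + "_" + Long.toString(timestamp);
    }

    // Sums the 1L values grouped under one <url, day> tuple, i.e. the visit count.
    public static long sum(List<Long> values) {
        long sum = 0L;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(buildKey("/index.php", 1422489600000L));
        // prints /index.php_1422489600000
        System.out.println(sum(Arrays.asList(1L, 1L, 1L))); // prints 3
    }
}
```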
(7) Create a job from the input/output DataStores and run it:
Job job = createJob(inStore, outStore, 3);
boolean success = job.waitForCompletion(true);
The main differences between a Gora job and an ordinary MapReduce program are:
(1) The Mapper/Reducer extend GoraMapper/GoraReducer rather than Mapper/Reducer.
(2) GoraMapper.initMapperJob() and GoraReducer.initReducerJob() configure the input and output types, and a DataStore object can stand in for an explicit input/output key-value specification. In this example's mapper, passing inStore implicitly declares the input KV types as Long and Pageview; in the reducer, passing outStore implicitly declares the output types as String and MetricDatum.
Compare this with the basic properties required to run a job, as described in http://blog.csdn.net/jediael_lu/article/details/43416751:
GoraMapper.initMapperJob(job, inStore, TextLong.class, LongWritable.class, LogAnalyticsMapper.class, true);
GoraReducer.initReducerJob(job, outStore, LogAnalyticsReducer.class);
These two statements cover steps 2 through 5 of that list at once:
step 2, the Map/Reduce classes: LogAnalyticsMapper.class and LogAnalyticsReducer.class;
steps 3 and 4, the input format and content, plus step 5, the reduce output type: both input and output are DataStores, with their content coming from inStore and outStore;
and step 5's map output types (TextLong and LongWritable), which are also the reduce input types.
Appendix: full code.
(1)KeyValueWritable.java
package org.apache.gora.tutorial.log;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * A WritableComparable containing a key-value WritableComparable pair.
 * @param <K> the class of key
 * @param <V> the class of value
 */
public class KeyValueWritable<K extends WritableComparable, V extends WritableComparable>
    implements WritableComparable<KeyValueWritable<K, V>> {

  protected K key = null;
  protected V value = null;

  public KeyValueWritable() {
  }

  public KeyValueWritable(K key, V value) {
    this.key = key;
    this.value = value;
  }

  public K getKey() {
    return key;
  }

  public void setKey(K key) {
    this.key = key;
  }

  public V getValue() {
    return value;
  }

  public void setValue(V value) {
    this.value = value;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // key and value are expected to be initialized by the subclass constructor
    key.readFields(in);
    value.readFields(in);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    key.write(out);
    value.write(out);
  }

  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((key == null) ? 0 : key.hashCode());
    result = prime * result + ((value == null) ? 0 : value.hashCode());
    return result;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj)
      return true;
    if (obj == null)
      return false;
    if (getClass() != obj.getClass())
      return false;
    KeyValueWritable other = (KeyValueWritable) obj;
    if (key == null) {
      if (other.key != null)
        return false;
    } else if (!key.equals(other.key))
      return false;
    if (value == null) {
      if (other.value != null)
        return false;
    } else if (!value.equals(other.value))
      return false;
    return true;
  }

  @Override
  public int compareTo(KeyValueWritable<K, V> o) {
    int cmp = key.compareTo(o.key);
    if (cmp != 0)
      return cmp;
    return value.compareTo(o.value);
  }
}
(2) TextLong.java
package org.apache.gora.tutorial.log;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

/**
 * A {@link KeyValueWritable} of {@link Text} keys and
 * {@link LongWritable} values.
 */
public class TextLong extends KeyValueWritable<Text, LongWritable> {

  public TextLong() {
    key = new Text();
    value = new LongWritable();
  }
}
(3) LogAnalytics.java
package org.apache.gora.tutorial.log;

import java.io.IOException;

import org.apache.avro.util.Utf8;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.gora.mapreduce.GoraMapper;
import org.apache.gora.mapreduce.GoraReducer;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.gora.tutorial.log.generated.MetricDatum;
import org.apache.gora.tutorial.log.generated.Pageview;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * LogAnalytics is the tutorial class to illustrate Gora MapReduce API.
 * The analytics mapreduce job reads the web access data stored earlier by the
 * {@link LogManager}, and calculates the aggregate daily pageviews. The
 * output of the job is stored in a Gora compatible data store.
 *
 * <p>See the tutorial.html file in docs or go to the
 * <a href="http://incubator.apache.org/gora/docs/current/tutorial.html">
 * web site</a> for more information.</p>
 */
public class LogAnalytics extends Configured implements Tool {

  private static final Logger log = LoggerFactory.getLogger(LogAnalytics.class);

  /** The number of milliseconds in a day */
  private static final long DAY_MILIS = 1000 * 60 * 60 * 24;

  /**
   * The Mapper takes Long keys and Pageview objects, and emits
   * tuples of <url, day> as keys and 1 as values. Input values are
   * read from the input data store.
   * Note that all Hadoop serializable classes can be used as map output key and value.
   */
  // 6. Define the Mapper class
  public static class LogAnalyticsMapper extends GoraMapper<Long, Pageview, TextLong, LongWritable> {

    private LongWritable one = new LongWritable(1L);
    private TextLong tuple;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      tuple = new TextLong();
      tuple.setKey(new Text());
      tuple.setValue(new LongWritable());
    }

    @Override
    protected void map(Long key, Pageview pageview, Context context)
        throws IOException, InterruptedException {
      CharSequence url = pageview.getUrl();
      long day = getDay(pageview.getTimestamp());
      tuple.getKey().set(url.toString());
      tuple.getValue().set(day);
      context.write(tuple, one);
    }

    /** Rolls up the given timestamp to the day cardinality, so that
     *  data can be aggregated daily */
    private long getDay(long timeStamp) {
      return (timeStamp / DAY_MILIS) * DAY_MILIS;
    }
  }

  /**
   * The Reducer receives tuples of <url, day> as keys and a list of
   * values corresponding to the keys, and emits a combined keys and
   * {@link MetricDatum} objects. The metric datum objects are stored
   * as job outputs in the output data store.
   */
  // 7. Define the Reducer class
  public static class LogAnalyticsReducer extends GoraReducer<TextLong, LongWritable, String, MetricDatum> {

    private MetricDatum metricDatum = new MetricDatum();

    @Override
    protected void reduce(TextLong tuple, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0L; // sum up the values
      for (LongWritable value : values) {
        sum += value.get();
      }
      String dimension = tuple.getKey().toString();
      long timestamp = tuple.getValue().get();
      metricDatum.setMetricDimension(new Utf8(dimension));
      metricDatum.setTimestamp(timestamp);
      String key = metricDatum.getMetricDimension().toString();
      key += "_" + Long.toString(timestamp);
      metricDatum.setMetric(sum);
      context.write(key, metricDatum);
    }
  }

  /**
   * Creates and returns the {@link Job} for submitting to Hadoop mapreduce.
   * @param inStore
   * @param outStore
   * @param numReducer
   * @return
   * @throws IOException
   */
  public Job createJob(DataStore<Long, Pageview> inStore,
      DataStore<String, MetricDatum> outStore, int numReducer) throws IOException {
    // 3. Set basic job properties
    Job job = new Job(getConf());
    job.setJobName("Log Analytics");
    log.info("Creating Hadoop Job: " + job.getJobName());
    job.setNumReduceTasks(numReducer);
    job.setJarByClass(getClass());

    /* Mappers are initialized with GoraMapper.initMapper() or
     * GoraInputFormat.setInput() */
    // 4. Set the job's Mapper class and the map input/output types
    GoraMapper.initMapperJob(job, inStore, TextLong.class, LongWritable.class,
        LogAnalyticsMapper.class, true);

    /* Reducers are initialized with GoraReducer#initReducer().
     * If the output is not to be persisted via Gora, any reducer
     * can be used instead. */
    // 5. Set the job's Reducer class and the reduce output types
    GoraReducer.initReducerJob(job, outStore, LogAnalyticsReducer.class);
    return job;
  }

  @Override
  public int run(String[] args) throws Exception {
    DataStore<Long, Pageview> inStore;
    DataStore<String, MetricDatum> outStore;
    Configuration conf = new Configuration();

    // 1. Obtain the input and output DataStores
    if (args.length > 0) {
      String dataStoreClass = args[0];
      inStore = DataStoreFactory.getDataStore(dataStoreClass, Long.class, Pageview.class, conf);
      if (args.length > 1) {
        dataStoreClass = args[1];
      }
      outStore = DataStoreFactory.getDataStore(dataStoreClass, String.class, MetricDatum.class, conf);
    } else {
      inStore = DataStoreFactory.getDataStore(Long.class, Pageview.class, conf);
      outStore = DataStoreFactory.getDataStore(String.class, MetricDatum.class, conf);
    }

    // 2. Create a job from the input/output DataStores and run it
    Job job = createJob(inStore, outStore, 3);
    boolean success = job.waitForCompletion(true);

    inStore.close();
    outStore.close();

    log.info("Log completed with " + (success ? "success" : "failure"));
    return success ? 0 : 1;
  }

  private static final String USAGE = "LogAnalytics <input_data_store> <output_data_store>";

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println(USAGE);
      System.exit(1);
    }
    // run as any other MR job
    int ret = ToolRunner.run(new LogAnalytics(), args);
    System.exit(ret);
  }
}
6. Run the program
(1) Export the project as a runnable jar file and upload it to the server.
(2) Run the program:
$ java -jar MyGoraDemo.jar org.apache.gora.hbase.store.HBaseStore org.apache.gora.hbase.store.HBaseStore
(3) Check the results in HBase:
hbase(main):001:0> list
TABLE
AccessLog
Jan2814_webpage
Jan2819_webpage
Jan2910_webpage
Jan2920_webpage
Metrics
Passwd
member
8 row(s) in 2.6450 seconds
hbase(main):002:0> scan 'Metrics'