HBase Table Rowkey Design: Avoiding Data Hotspots

1. Case Analysis

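Common ways to avoid data hotspots are salting, hashing, and reversing the key, each used together with pre-splitting the table into regions.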
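The three key transformations named above can be sketched with the JDK alone. This is only an illustration: the class `RowkeyTechniques`, its method names, the two-digit bucket prefix and the four-character MD5 prefix are all assumptions, not part of the job shown in this post. The salt here is derived from the key's hash rather than drawn at random, so that a reader can recompute it when looking a key up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration; not part of the original job.
class RowkeyTechniques {

    // Salting: prepend a bucket number so that adjacent keys land in different regions.
    // The bucket comes from the key's hash code, so it can be recomputed on read.
    static String salt(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return String.format("%02d|%s", bucket, key);
    }

    // Hashing: prepend the first two digest bytes (four hex characters) of the key's MD5.
    static String hashPrefix(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x%02x|%s", digest[0] & 0xff, digest[1] & 0xff, key);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Reversing: put the fastest-changing characters of the key first.
    static String reverse(String key) {
        return new StringBuilder(key).reverse().toString();
    }

    public static void main(String[] args) {
        String phone = "13760778710";
        System.out.println(salt(phone, 10));
        System.out.println(hashPrefix(phone));
        System.out.println(reverse(phone));
    }
}
```

Salting and hashing spread writes well but make range scans over the original key order harder; reversing keeps all rows of one phone number adjacent, which is why it suits per-number lookups.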

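In the raw data, the first field is a timestamp and the second field is a phone number. Storing either one directly as the rowkey easily causes hotspots. The rowkey is therefore designed by combining techniques such as adding a random column, composing the key with the timestamp, and reversing a field, so that queries stay efficient while hotspots are avoided. The code in this post uses the reversed phone number followed by the formatted time.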
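As a minimal sketch of that composite key, the following builds `reversedPhone|formattedTime` from one tab-separated input line using only the JDK. The class `RowkeyBuilder`, the sample line, and the choice of the `Asia/Shanghai` time zone are assumptions made for illustration (the post's job uses the JVM default zone). The pattern `yyyyMMddHHsss` is the one used in the post's mapper; `yyyyMMddHHmmss` is shown alongside it for comparison.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper class for illustration; not part of the original job.
class RowkeyBuilder {

    // Builds "<reversed phone>|<formatted time>" from a line whose first field is an
    // epoch-millisecond timestamp and whose second field is a phone number.
    static String build(String line, String pattern, TimeZone zone) {
        String[] fields = line.split("\t");
        SimpleDateFormat sdf = new SimpleDateFormat(pattern);
        sdf.setTimeZone(zone);
        String time = sdf.format(new Date(Long.parseLong(fields[0].trim())));
        String reversedPhone = new StringBuilder(fields[1].trim()).reverse().toString();
        return reversedPhone + "|" + time;
    }

    public static void main(String[] args) {
        // Assumed sample line: timestamp, phone number, then the remaining fields.
        String line = "1363157988072\t13760778710\t00-FD-07-A4-7B-08:CMCC";
        TimeZone zone = TimeZone.getTimeZone("Asia/Shanghai");
        // Pattern as used in the mapper: no minutes, seconds padded to three digits.
        System.out.println(build(line, "yyyyMMddHHsss", zone));
        // Pattern with minutes and two-digit seconds.
        System.out.println(build(line, "yyyyMMddHHmmss", zone));
    }
}
```

Because the phone number leads the key, all records of one number sort together and are ordered by time within that number.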

2. Code

 package beifeng.hadoop.hbase;

 import java.io.IOException;
 import java.text.SimpleDateFormat;
 import java.util.Date;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.HColumnDescriptor;
 import org.apache.hadoop.hbase.HTableDescriptor;
 import org.apache.hadoop.hbase.MasterNotRunningException;
 import org.apache.hadoop.hbase.TableName;
 import org.apache.hadoop.hbase.ZooKeeperConnectionException;
 import org.apache.hadoop.hbase.client.HBaseAdmin;
 import org.apache.hadoop.hbase.client.Mutation;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
 import org.apache.hadoop.hbase.mapreduce.TableReducer;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

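 /**
  * Follows the rowkey design principles:
  *  1. Keep the rowkey short.
  *  2. Keep it unique (add a random column, e.g. an MD5 value).
  *  3. Avoid creating data hotspots.
  *  4. Serve as many query scenarios as possible.
  *
  * @author Administrator
  */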

 public class LoadData extends Configured implements Tool {

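     /**
      * Uses the time and the phone number together as a composite key,
      * which serves the application's query scenarios better.
      *
      * @author Administrator
      */
     public static class LoadDataMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

         // Converts the epoch-millisecond timestamp into a formatted date string.
         // Note: as written, this pattern has no minutes ("mm"), and "sss" prints the seconds
         // padded to three digits; "yyyyMMddHHmmss" was probably intended. It is left unchanged
         // because the scan output shown after the code was produced with this pattern.
         SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHsss");

         private Text mapOutputValue = new Text();

         @Override
         protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                 throws IOException, InterruptedException {
             String line = value.toString();
             String[] splited = line.split("\t");

             // Convert the first field (the timestamp) into the formatted time.
             String formatDate = sdf.format(new Date(Long.parseLong(splited[0].trim())));

             // Reverse the phone number (the second field).
             String phoneNumber = splited[1];
             String reversePhoneNumber = new StringBuilder(phoneNumber).reverse().toString();

             // Rowkey: reversed phone number + "|" + formatted time.
             String rowKeyString = reversePhoneNumber + "|" + formatDate;

             // Emit the rowkey followed by the whole original line, tab-separated.
             mapOutputValue.set(rowKeyString + "\t" + line);
             context.write(key, mapOutputValue);
         }
     }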

     public static class LoadDataReuducer extends TableReducer<LongWritable, Text, NullWritable>{

         // HBase column family that receives every column.
         private static final String COLUMN_FAMILY = "info";

         @Override
         protected void reduce(LongWritable key, Iterable<Text> values,
                 Reducer<LongWritable, Text, NullWritable, Mutation>.Context context)
                 throws IOException, InterruptedException {
             for (Text value : values) {
                 // Layout: [0] rowkey, [1] timestamp, [2] phone number, [3..11] remaining fields.
                 String[] splited = value.toString().split("\t");
                 String rowKey = splited[0];
                 Put put = new Put(rowKey.getBytes());

                 // put.add(family, qualifier, value) is the older client call;
                 // HBase 1.0 and later provide put.addColumn(...) with the same arguments.
                 // The phone number (splited[2]) is already part of the rowkey and is not stored again.
                 put.add(COLUMN_FAMILY.getBytes(), "reportTime".getBytes(), splited[1].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "apmac".getBytes(), splited[3].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "acmac".getBytes(), splited[4].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "host".getBytes(), splited[5].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "siteType".getBytes(), splited[6].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "upPackNum".getBytes(), splited[7].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "downPackNum".getBytes(), splited[8].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "unPayLoad".getBytes(), splited[9].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "downPayLoad".getBytes(), splited[10].getBytes());
                 put.add(COLUMN_FAMILY.getBytes(), "httpStatus".getBytes(), splited[11].getBytes());

                 context.write(NullWritable.get(), put);
             }
         }
     }

     public static void createTable(String tableName) throws MasterNotRunningException, ZooKeeperConnectionException, IOException {
         Configuration conf = HBaseConfiguration.create();
         conf.set("hbase.zookeeper.quorum", "beifeng01");

         HBaseAdmin admin = new HBaseAdmin(conf);
         try {
             TableName tName = TableName.valueOf(tableName);
             HTableDescriptor htd = new HTableDescriptor(tName);
             HColumnDescriptor hcd = new HColumnDescriptor("info");
             htd.addFamily(hcd);

             // Drop the table first if it already exists, so each run starts from an empty table.
             if (admin.tableExists(tName)) {
                 System.out.println(tableName + " already exists, trying to recreate the table");
                 admin.disableTable(tName);
                 admin.deleteTable(tName);
             }

             admin.createTable(htd);
             System.out.println("created new table " + tableName);
         } finally {
             // Release the admin connection.
             admin.close();
         }
     }

     public int run(String[] args) throws Exception {
         Configuration conf = this.getConf();
         conf.set("hbase.zookeeper.quorum", "beifeng01");
         conf.set(TableOutputFormat.OUTPUT_TABLE, "phoneLog");

         createTable("phoneLog");

         Job job = Job.getInstance(conf, this.getClass().getSimpleName());
         job.setJarByClass(this.getClass());
         job.setNumReduceTasks(1);

         // map class
         job.setMapperClass(LoadDataMapper.class);
         job.setMapOutputKeyClass(LongWritable.class);
         job.setMapOutputValueClass(Text.class);

         // reduce class
         job.setReducerClass(LoadDataReuducer.class);
         job.setOutputFormatClass(TableOutputFormat.class);

         Path inPath = new Path(args[0]);
         FileInputFormat.addInputPath(job, inPath);

         boolean isSucced = job.waitForCompletion(true);
         return isSucced ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();

         // HDFS location of the input data (hard-coded; replaces any command-line arguments).
         args = new String[] {"hdfs://hbase/data/input/HTTP_20130313143750.data"};

         int status = ToolRunner.run(conf, new LoadData(), args);
         System.exit(status);
     }

 }
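The case analysis mentions pre-splitting, while `createTable` above creates the table with a single region. Since every rowkey here starts with a digit of the reversed phone number, one possible set of split points is the nine boundaries "1" to "9", which gives ten regions. The sketch below only computes those boundaries and shows which region a key would fall into; the class `PreSplitKeys` and its method names are hypothetical, and the even spread relies on the assumption that the last digit of the phone numbers is roughly uniform. The resulting array could then be passed to the `createTable(HTableDescriptor, byte[][])` overload of `HBaseAdmin`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the original job.
class PreSplitKeys {

    // Nine boundaries "1".."9" split the key space into ten regions by leading digit.
    static byte[][] digitSplitKeys() {
        byte[][] keys = new byte[9][];
        for (int d = 1; d <= 9; d++) {
            keys[d - 1] = String.valueOf(d).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    // Index of the region a rowkey falls into: the number of split keys <= rowkey.
    static int regionIndex(String rowKey, byte[][] splitKeys) {
        byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
        int index = 0;
        for (byte[] split : splitKeys) {
            if (compare(row, split) >= 0) {
                index++;
            }
        }
        return index;
    }

    // Lexicographic comparison on unsigned bytes, the order HBase uses for rowkeys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[][] splits = digitSplitKeys();
        System.out.println(regionIndex("01787706731|2013031314048", splits));
        System.out.println(regionIndex("10007032831|2013031314045", splits));
    }
}
```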

After the program finishes, run scan to check the result:

hbase(main):004:0> scan 'phoneLog', {LIMIT => 2}
ROW                                COLUMN+CELL
 01787706731|2013031314048         column=info:acmac, timestamp=1544022103345, value=120.196.100.82
 01787706731|2013031314048         column=info:apmac, timestamp=1544022103345, value=00-FD-07-A4-7B-08:CMCC
 01787706731|2013031314048         column=info:downPackNum, timestamp=1544022103345, value=2
 01787706731|2013031314048         column=info:downPayLoad, timestamp=1544022103345, value=120
 01787706731|2013031314048         column=info:host, timestamp=1544022103345, value=
 01787706731|2013031314048         column=info:httpStatus, timestamp=1544022103345, value=200
 01787706731|2013031314048         column=info:reportTime, timestamp=1544022103345, value=1363157988072
 01787706731|2013031314048         column=info:siteType, timestamp=1544022103345, value=
 01787706731|2013031314048         column=info:unPayLoad, timestamp=1544022103345, value=120
 01787706731|2013031314048         column=info:upPackNum, timestamp=1544022103345, value=2
 10007032831|2013031314045         column=info:acmac, timestamp=1544022103345, value=120.196.100.99
 10007032831|2013031314045         column=info:apmac, timestamp=1544022103345, value=20-7C-8F-70-68-1F:CMCC
 10007032831|2013031314045         column=info:downPackNum, timestamp=1544022103345, value=3
 10007032831|2013031314045         column=info:downPayLoad, timestamp=1544022103345, value=180
 10007032831|2013031314045         column=info:host, timestamp=1544022103345, value=
 10007032831|2013031314045         column=info:httpStatus, timestamp=1544022103345, value=200
 10007032831|2013031314045         column=info:reportTime, timestamp=1544022103345, value=1363157985079
 10007032831|2013031314045         column=info:siteType, timestamp=1544022103345, value=
 10007032831|2013031314045         column=info:unPayLoad, timestamp=1544022103345, value=360
 10007032831|2013031314045         column=info:upPackNum, timestamp=1544022103345, value=6
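With this key layout, all rows of one phone number share the prefix `reversedPhone|`, so they can be read with a bounded range scan instead of a full table scan. The sketch below computes the start and stop rows for such a scan using only the JDK; the class `PhoneScanRange` and its method names are hypothetical. The stop row relies on `}` (0x7D) being the byte immediately after `|` (0x7C). In a real client these strings would be converted to bytes and set as the start and stop rows of a `Scan`.

```java
// Hypothetical helper class for illustration; not part of the original job.
class PhoneScanRange {

    static String reverse(String phone) {
        return new StringBuilder(phone).reverse().toString();
    }

    // Inclusive start row: the smallest key with the prefix "<reversed phone>|".
    static String startRow(String phone) {
        return reverse(phone) + "|";
    }

    // Exclusive stop row: '}' is the next byte after '|', so every key with the
    // prefix sorts before it.
    static String stopRow(String phone) {
        return reverse(phone) + "}";
    }

    // True if the rowkey lies in [startRow, stopRow) for the given phone number.
    static boolean inRange(String rowKey, String phone) {
        return rowKey.compareTo(startRow(phone)) >= 0
                && rowKey.compareTo(stopRow(phone)) < 0;
    }

    public static void main(String[] args) {
        String phone = "13760778710";
        System.out.println(startRow(phone));
        System.out.println(stopRow(phone));
        System.out.println(inRange("01787706731|2013031314048", phone));
        System.out.println(inRange("10007032831|2013031314045", phone));
    }
}
```

Appending a formatted time to the start and stop prefixes narrows the scan further to a time window for that number.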
