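HBase Secondary Index: HBase + HBase-indexer + Solr (CDH)

Recently my work has involved SQL queries over HBase and visualizing the results. HBase is a column-oriented store whose values are all plain byte arrays, so it is not good at SQL queries on its own. Hive can complement it for SQL querying and storage, but we need low-latency SQL and complex SQL queries (looking rows up by value), and that calls for an HBase secondary index. The approach used here is HBase + HBase-indexer + Solr; alternatives such as Phoenix also exist.

How it works: HBase is the underlying store. HBase-indexer builds the secondary index by mapping HBase columns into Solr as index data, and queries are then run directly against the Solr collection. When data is written to HBase, the operation is by default first written to the HLog (the write-ahead log); HBase-indexer continuously watches the HLog and synchronizes the written data to Solr. I have not yet tested whether deletes and updates are also synchronized to Solr; I will report back once I have.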
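To make the idea of a value-based lookup concrete, here is a toy in-memory model in plain Java. It is only a sketch and not how HBase-indexer or Solr are implemented: the class `ToySecondaryIndex` and its methods are hypothetical names, the first map stands in for the HBase table and the second for the Solr collection. The toy removes stale entries on update, which says nothing about how the real pipeline treats updates.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: the real pipeline is HBase HLog -> HBase-indexer -> Solr.
final class ToySecondaryIndex {
    // Primary store: row key -> (column -> value), standing in for the HBase table.
    private final Map<String, Map<String, String>> table = new HashMap<>();
    // Secondary index: "column=value" -> row keys, standing in for the Solr collection.
    private final Map<String, Set<String>> index = new HashMap<>();

    // A write goes to the primary store first and is then replayed into the index.
    void put(String rowKey, String column, String value) {
        Map<String, String> row = table.computeIfAbsent(rowKey, k -> new HashMap<>());
        String old = row.put(column, value);
        if (old != null) {
            // Drop the stale entry so the previous value no longer matches this row.
            Set<String> stale = index.get(column + "=" + old);
            if (stale != null) {
                stale.remove(rowKey);
            }
        }
        index.computeIfAbsent(column + "=" + value, k -> new TreeSet<>()).add(rowKey);
    }

    // Look up row keys by value without scanning the primary store.
    Set<String> lookup(String column, String value) {
        Set<String> keys = index.get(column + "=" + value);
        return keys == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(keys);
    }

    public static void main(String[] args) {
        ToySecondaryIndex idx = new ToySecondaryIndex();
        idx.put("row1", "info:firstname", "John");
        idx.put("row2", "info:firstname", "John");
        System.out.println(idx.lookup("info:firstname", "John"));
    }
}
```

Without the second map, answering "which rows have firstname John" would require scanning every row, which is the cost the Solr index is meant to avoid.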

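Advantage: Solr keeps the index data on the Solr servers, isolated from HBase, so the indexed data can still be queried when HBase is down.

Drawback: every new HBase table needs its own index configuration in both HBase-indexer and Solr, which is tedious. HBase-indexer has also not been updated for a long time, so the packages from a CDH release have to be used.

I. Installing the packages

The packages can be downloaded from: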

https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_516.html

Base environment:

OS: CentOS7.x-x86_64

JDK: jdk1.8

hadoop-2.6.0+cdh5.16.2

hbase-solr-1.5+cdh5.16.2

solr-4.10.3-cdh5.16.2

zookeeper-3.4.5-cdh5.16.2

hbase-1.0.0-cdh5.16.2

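All components just need to come from the same CDH version.

Node layout, as used in the steps below: node1 runs the HBase master and Solr; node2 and node3 run the RegionServers and HBase-indexer.

Unpack the hbase-solr-1.5+cdh5.16.2 tarball and find hbase-indexer-1.5-cdh5.16.2.tar.gz under hbase-solr-1.5-cdh5.16.2\hbase-indexer-dist\target; it will be needed later.

II. Deploying hbase-indexer

Install hbase-indexer on the nodes that run an HBase HRegionServer, where it will synchronize the data.

Edit the hbase-indexer parameters to point it at ZooKeeper: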

vim hbase-indexer-1.5-cdh5.16.2/conf/hbase-indexer-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbaseindexer.zookeeper.connectstring</name>
    <!-- Adjust to match your actual ZooKeeper ensemble -->
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- Adjust to match your actual ZooKeeper ensemble -->
    <value>node1,node2,node3</value>
  </property>
</configuration>

Configure hbase-indexer-env.sh to point it at Java:

vim hbase-indexer-1.5-cdh5.16.2/conf/hbase-indexer-env.sh

# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase-indexer process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase-indexer, etc.)

# The java implementation to use.  Java 1.6 required.
# Adjust the path to your actual environment.
export JAVA_HOME=/usr/java/jdk1.8.0/

III. HBase configuration notes

Edit hbase-site.xml and add the replication settings.

<property>
  <name>hbase.replication</name>
  <value>true</value>
  <description>SEP is basically replication, so enable it</description>
</property>
<property>
  <name>replication.source.ratio</name>
  <value>1.0</value>
  <description>Source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>1000</value>
  <description>Maximum number of hlog entries to replicate in one go. If this is large, and a consumer takes a while to process the events, the HBase rpc call will time out.</description>
</property>
<property>
  <name>replication.replicationsource.implementation</name>
  <value>com.ngdata.sep.impl.SepReplicationSource</value>
  <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
  <description>Comma-separated list of the hosts in the ZooKeeper ensemble.</description>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <!-- Note: this is the data directory of the ZooKeeper ensemble; see ZooKeeper's zoo.cfg -->
  <value>/home/HBasetest/zookeeperdata</value>
  <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
  </description>
</property>

Edit hbase-env.sh and set JAVA_HOME and HBASE_HOME:

export JAVA_HOME=/opt/jdk1.8.0_79

export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.16.2

Copy the following four files from the hbase-indexer/lib directory to the hbase/lib directory:

hbase-sep-api-1.5-cdh5.16.2.jar

hbase-sep-impl-1.5-hbase1.0-cdh5.16.2.jar

hbase-sep-impl-common-1.5-cdh5.16.2.jar

hbase-sep-tools-1.5-cdh5.16.2.jar

Configure the regionservers file:

node2

node3

IV. Testing

1. Start HBase

On node1, run:

./hbase-1.0.0-cdh5.16.2/bin/start-hbase.sh

2. Start HBase-indexer

On both node2 and node3, run:

./hbase-indexer-1.5-cdh5.16.2/bin/hbase-indexer server

To run it in the background, use screen or nohup.

3. Start Solr

On node1, change into the example subdirectory of the Solr installation and run:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar

Likewise, use screen or nohup to run it in the background.

Open http://node1:8983/solr/#/ to reach the Solr admin page.

V. Index test

Once the Hadoop cluster, HBase, HBase-Indexer and Solr are all running, first create a table in HBase:

On any node, from the HBase installation directory, run:

./bin/hbase shell

create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }

On a node where HBase-Indexer is deployed, change into the HBase-Indexer directory and create an indexer from the configuration file under demo:

Create an indexer:

./bin/hbase-indexer add-indexer -n myindexer -c demo/user_indexer.xml -cp solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1

List the indexers:

./hbase-indexer list-indexers -dump

Delete an indexer:

./hbase-indexer delete-indexer --name 'indexer_vip'

Edit the field definition file under hbase-indexer-1.5-cdh5.16.2/demo/:

<?xml version="1.0"?>
<indexer table="indexdemo-user">
  <field name="firstname_s" value="info:firstname"/>
  <field name="lastname_s" value="info:lastname"/>
  <field name="age_i" value="info:age" type="int"/>
</indexer>

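Save it as indexdemo-indexer.xml

The mapping also has to exist on the Solr side:

These fields are already covered by Solr's schema.xml and do not need to be added again. Note, however, that a field whose required attribute is true must always be supplied; otherwise indexing fails with an error.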

vim solr-4.10/example/solr/collection1/conf/schema.xml

<field name="firstname_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="lastname_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="age_i" type="string" indexed="true" stored="true" required="true" multiValued="false" />

Add the indexer instance

From the hbase-indexer-1.5-cdh5.16.2 directory, run:

bin/hbase-indexer add-indexer -n myindexer -c demo/user_indexer.xml -cp solr.zk=flzxldyjdata1:2181,flzxldyjdata2:2181,flzxldyjdata3:2181,flzxldyjdata4:2181,flzxldyjdata5:2181/solr -cp solr.collection=collection1
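Once rows are indexed, looking up row keys by value amounts to a query against Solr's standard /select handler. The sketch below only builds such a URL with the JDK, so it runs without a cluster. The class and method names are hypothetical, and it assumes the indexer stores the HBase row key in the Solr `id` field, which I believe is the HBase-indexer default but have not verified on this setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Solr /select URL for an exact-value lookup.
final class SolrQueryUrl {

    // Returns a URL asking Solr for the id of every document whose field equals value.
    static String selectByField(String solrBase, String collection,
                                String field, String value) {
        // Quote the value as a phrase and escape backslashes and quotes inside it.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        String query = field + ":\"" + escaped + "\"";
        try {
            String q = URLEncoder.encode(query, "UTF-8");
            // fl=id limits the response to the id field (assumed to hold the row key).
            return solrBase + "/" + collection + "/select?q=" + q + "&fl=id&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Host and collection are taken from the examples in this post.
        System.out.println(selectByField(
                "http://node1:8983/solr", "collection1", "firstname_s", "John"));
    }
}
```

The returned ids can then be used as row keys for ordinary HBase Get calls, so HBase itself never has to be scanned by value.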

VI. Java API

Dependency:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>4.10.3</version>
</dependency>

package com.ultrapower.hbase.solrhbase;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Scans an HBase table and writes selected columns of every row into Solr.
public class SolrIndexer {

    public static void main(String[] args) throws IOException, SolrServerException {
        // The server runs in the Jetty container bundled with Solr; its default port is 8983.
        HttpSolrServer solrServer = new HttpSolrServer("http://192.168.1.10:8983/solr");
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "hb_app_xxxxxx"); // HBase table name
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("d")); // HBase column family
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        ResultScanner ss = table.getScanner(scan);
        System.out.println("start ...");
        int i = 0;
        try {
            for (Result r : ss) {
                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.addField("rowkey", Bytes.toString(r.getRow()));
                for (KeyValue kv : r.raw()) {
                    String fieldName = Bytes.toString(kv.getQualifier());
                    String fieldValue = Bytes.toString(kv.getValue());
                    // Only these qualifiers are copied into the Solr document.
                    if (fieldName.equalsIgnoreCase("time")
                            || fieldName.equalsIgnoreCase("tebid")
                            || fieldName.equalsIgnoreCase("tetid")
                            || fieldName.equalsIgnoreCase("puid")
                            || fieldName.equalsIgnoreCase("mgcvid")
                            || fieldName.equalsIgnoreCase("mtcvid")
                            || fieldName.equalsIgnoreCase("smaid")
                            || fieldName.equalsIgnoreCase("mtlkid")) {
                        solrDoc.addField(fieldName, fieldValue);
                    }
                }
                solrServer.add(solrDoc);
                i = i + 1;
                System.out.println("Processed " + i + " rows");
            }
            // Commit once after the scan instead of once per document.
            solrServer.commit(true, true, true);
            System.out.println("done !");
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            System.out.println("error !");
            throw e;
        } finally {
            // Release the scanner, the table and the Solr client exactly once.
            ss.close();
            table.close();
            solrServer.shutdown();
        }
    }
}

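VII. To be optimized

According to the logs, hbase-indexer synchronizes data to Solr very quickly, but the Solr page does not show it. Does it only show up after a certain amount of data or time? It has already been synchronized, so why is it not displayed? I will look into it when I find time.

For reference: https://www.jianshu.com/p/a4657a06b09f

Sure enough, Solr applies updates based on document count and elapsed time: two conditions can be configured to trigger a commit. They are set in solrconfig.xml:

There are two kinds of commit. The first is the hard commit: as soon as either condition is met, the data is flushed to disk, and the commit blocks until a new searcher has been opened. When openSearcher is false, no new searcher is opened, so the committed documents are durable but not yet visible to searches.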

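<autoCommit>
    <!-- Maximum number of uncommitted documents -->
    <maxDocs>1000</maxDocs>
    <!-- Maximum time between commits, in milliseconds -->
    <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
    <!-- Whether to open a new searcher after the commit -->
    <openSearcher>false</openSearcher>
</autoCommit>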

The second is the soft commit, which makes new documents searchable in near real time as soon as either condition is met:

<autoSoftCommit>
  <maxDocs>1000</maxDocs>
  <maxTime>${solr.autoSoftCommit.maxTime:30000}</maxTime>
</autoSoftCommit>
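The "whichever condition is met first" rule explains the symptom above: a handful of test rows never reaches maxDocs, so they only become visible once maxTime has passed. The following is a simplified model of that rule in plain Java, not Solr's actual code; the class `AutoCommitModel` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
// Simplified model of an auto-commit trigger: commit when either the number of
// pending documents reaches maxDocs or the oldest pending document is maxTimeMs old.
final class AutoCommitModel {
    private final int maxDocs;
    private final long maxTimeMs;
    private int pendingDocs = 0;
    private long oldestPendingAtMs = -1;

    AutoCommitModel(int maxDocs, long maxTimeMs) {
        this.maxDocs = maxDocs;
        this.maxTimeMs = maxTimeMs;
    }

    // Records one added document at time nowMs; returns true if a commit fires.
    boolean add(long nowMs) {
        if (pendingDocs == 0) {
            oldestPendingAtMs = nowMs; // the time limit counts from the first pending document
        }
        pendingDocs++;
        return commitIfDue(nowMs);
    }

    // Checks the time condition as the clock advances; returns true if a commit fires.
    boolean tick(long nowMs) {
        return commitIfDue(nowMs);
    }

    private boolean commitIfDue(long nowMs) {
        if (pendingDocs == 0) {
            return false; // nothing to commit
        }
        boolean due = pendingDocs >= maxDocs || nowMs - oldestPendingAtMs >= maxTimeMs;
        if (due) {
            pendingDocs = 0;
            oldestPendingAtMs = -1;
        }
        return due;
    }

    public static void main(String[] args) {
        // Values mirror the autoSoftCommit settings shown above.
        AutoCommitModel soft = new AutoCommitModel(1000, 30000);
        System.out.println(soft.add(0));      // one document, far below maxDocs
        System.out.println(soft.tick(30000)); // 30 seconds later the time limit is reached
    }
}
```

With the settings shown above, a single test row would therefore take up to 30 seconds to appear in the Solr admin page; lowering the autoSoftCommit maxTime shortens that delay at the cost of more frequent searcher reopening.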

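If the service keeps crashing for no apparent reason, the Java heap may be too small; in the resource management settings, giving the Lily HBase Indexer Default Group 1 GB or more works better.

To be continued.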
