二 HTable 源码导读

户端调优的方法里面无非就这么几种:
1)关闭autoFlush
2)关闭WAL日志
3)把writeBufferSize设大一点，一般说是设置成5MB

经过实践，就第二条关闭日志的效果比较明显，其它的效果都不明显，因为提交的过程是异步的，所以提交的时候占用的时间并不多，提交到server端后，server还有一个写入的队列，(⊙o⊙)… 让人想起小米手机那恶心的排队了。。。所以大规模写入数据，别指望着用put来解决。。。mapreduce生成hfile，然后用bulk load的方式比较好。
　　不废话了，我们继续追踪ap.submit方法吧，F3进去。

提交

HTable

PUT

（通过表名 rowkey 找到）HRegionLocation，region。提交操作需要找到对应的 region

　　（1）把put操作添加到writeAsyncBuffer队列里面，符合条件（自动flush或者超过了阀值writeBufferSize）就通过AsyncProcess异步批量提交。

　　（2）在提交之前，我们要根据每个rowkey找到它们归属的region server，这个定位的过程是通过HConnection的locateRegion方法获得的，然后再把这些rowkey按照HRegionLocation分组。

　　（3）通过多线程，一个HRegionLocation构造MultiServerCallable<Row>，然后通过rpcCallerFactory.<MultiResponse> newCaller()执行调用，忽略掉失败重新提交和错误处理，客户端的提交操作到此结束。

DELETE

对于Delete，我们也可以通过以下代码执行一个delete操作

Delete del = new Delete(rowkey);

table.delete(del);

　　这个操作比较干脆，new一个RegionServerCallable<Boolean>,直接走rpc了，爽快啊。

RegionServerCallable<Boolean> callable = new RegionServerCallable<Boolean>(connection,

        tableName, delete.getRow()) {

      public Boolean call() throws IOException {

        try {

          MutateRequest request = RequestConverter.buildMutateRequest(

            getLocation().getRegionInfo().getRegionName(), delete);

          MutateResponse response = getStub().mutate(null, request);

          return Boolean.valueOf(response.getProcessed());

        } catch (ServiceException se) {

          throw ProtobufUtil.getRemoteException(se);

        }

      }

    };

rpcCallerFactory.<Boolean> newCaller().callWithRetries(callable, this.operationTimeout);

　　这里面注意一下这行MutateResponse response = getStub().mutate(null, request);

　　getStub()返回的是一个ClientService.BlockingInterface接口，实现这个接口的类是HRegionServer，这样子我们就知道它在服务端执行了HRegionServer里面的mutate方法。

3.Get操作

　　get操作也和delete一样简单

Get get = new Get(rowkey);

Result row = table.get(get);

　　get操作也没几行代码，还是直接走的rpc

public Result get(final Get get) throws IOException {

    RegionServerCallable<Result> callable = new RegionServerCallable<Result>(this.connection,

        getName(), get.getRow()) {

      public Result call() throws IOException {

        return ProtobufUtil.get(getStub(), getLocation().getRegionInfo().getRegionName(), get);

      }

    };

    return rpcCallerFactory.<Result> newCaller().callWithRetries(callable, this.operationTimeout);

}

　　注意里面的ProtobufUtil.get操作，它其实是构建了一个GetRequest，需要的参数是regionName和get，然后走HRegionServer的get方法，返回一个GetResponse

public static Result get(final ClientService.BlockingInterface client,

      final byte[] regionName, final Get get) throws IOException {

    GetRequest request =

      RequestConverter.buildGetRequest(regionName, get);

    try {

      GetResponse response = client.get(null, request);

      if (response == null) return null;

      return toResult(response.getResult());

    } catch (ServiceException se) {

      throw getRemoteException(se);

    }

}

4.批量操作

　　针对put、delete、get都有相应的操作的方式：

　　1.Put(list)操作，很多童鞋以为这个可以提高写入速度，其实无效。。。为啥？因为你构造了一个list进去，它再遍历一下list，执行doPut操作。。。。反而还慢点。

　　2.delete和get的批量操作走的都是connection.processBatchCallback(actions, tableName, pool, results, callback)，具体的实现在HConnectionManager的静态类HConnectionImplementation里面，结果我们惊人的发现：

AsyncProcess<?> asyncProcess = createAsyncProcess(tableName, pool, cb, conf);asyncProcess.submitAll(list);

asyncProcess.waitUntilDone();

　　它走的还是put一样的操作，既然是一样的，何苦代码写得那么绕呢？

//如果设置为READ_COMMITTED，它会取当前的时间作为读的检查点，在这个时间点之后的就排除掉了        scan.setIsolationLevel(IsolationLevel.READ_COMMITTED);

5.查询操作

　　现在讲一下scan吧，这个操作相对复杂点。还是老规矩，先上一下代码吧。

        Scan scan = new Scan();

        //scan.setTimeRange(new Date("20140101").getTime(), new Date("20140429").getTime());

        scan.setBatch(10);

        scan.setCaching(10);

        scan.setStartRow(Bytes.toBytes("cenyuhai-00000-20140101"));

        scan.setStopRow(Bytes.toBytes("cenyuhai-zzzzz-201400429"));

        //如果设置为READ_COMMITTED，它会取当前的时间作为读的检查点，在这个时间点之后的就排除掉了        scan.setIsolationLevel(IsolationLevel.READ_COMMITTED);

        RowFilter rowFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator("pattern"));

        ResultScanner resultScanner = table.getScanner(scan);

        Result result = null;

        while ((result = resultScanner.next()) != null) {

            //自己处理去吧...

        }

　　这个是带正则表达式的模糊查询的scan查询，Scan这个类是包括我们查询所有需要的参数，batch和caching的设置，在我的另外一篇文章里面有写《hbase客户端设置缓存优化查询》。

Scan查询的时候，设置StartRow和StopRow可是重头戏，假设我这里要查我01月01日到04月29日总共发了多少业务，中间是业务类型，但是我可能是所有的都查，或者只查一部分，在所有都查的情况下，我就不能设置了，那但是StartRow和StopRow我不能空着啊，所以这里可以填00000-zzzzz，只要保证它在这个区间就可以了，然后我们加了一个RowFilter，然后引入了正则表达式，之前好多人一直在问啊问的，不过我这个例子，其实不要也可以，因为是查所有业务的，在StartRow和StopRow之间的都可以要。

　　好的，我们接着看，F3进入getScanner方法

if (scan.isSmall()) {

      return new ClientSmallScanner(getConfiguration(), scan, getName(), this.connection);

}

return new ClientScanner(getConfiguration(), scan, getName(), this.connection);

　　这个scan还分大小, 没关系，我们进入ClientScanner看一下吧，在ClientScanner的构造方法里面发现它会去调用nextScanner去初始化一个ScannerCallable。好的，我们接着来到ScannerCallable里面，这里需要注意的是它的两个方法，prepare和call方法。在prepare里面它主要干了两个事情，获得region的HRegionLocation和ClientService.BlockingInterface接口的实例，之前说过这个继承这个接口的只有Region Server的实现类。

  public void prepare(final boolean reload) throws IOException {

    this.location = connection.getRegionLocation(tableName, row, reload); 　　 //HConnection.getClient()这个方法简直就是神器啊    setStub(getConnection().getClient(getLocation().getServerName()));

  }

　　ok，我们下面看看call方法吧

  public Result [] call() throws IOException {

    　// 第一次走的地方，开启scannerif (scannerId == -1L) {

        this.scannerId = openScanner();

      } else {

        Result [] rrs = null;

        ScanRequest request = null;

        try {

          request = RequestConverter.buildScanRequest(scannerId, caching, false, nextCallSeq);

          ScanResponse response = null; 　　　　　　
　　　　　　// 准备用controller去携带返回的数据，这样的话就不用进行protobuf的序列化了           
　　　　　　PayloadCarryingRpcController controller = new PayloadCarryingRpcController();          
　　　　　　controller.setPriority(getTableName());

          response = getStub().scan(controller, request);

          nextCallSeq++;

          long timestamp = System.currentTimeMillis();

          // Results are returned via controller

          CellScanner cellScanner = controller.cellScanner();

          rrs = ResponseConverter.getResults(cellScanner, response);

　　　   } catch (IOException e) { 　　　　　　 　　　　　　
        } 　　　　
　　　 } 　　　　return rrs;

    }

    return null;

  }

　　在call方法里面，我们可以看得出来，实例化ScanRequest，然后调用scan方法的时候把PayloadCarryingRpcController传过去，这里跟踪了一下，如果设置了codec的就从PayloadCarryingRpcController里面返回结果，否则从response里面返回。

　　好的，下面看next方法吧。

    @Override

    public Result next() throws IOException { if (cache.size() == 0) {

        Result [] values = null;

        long remainingResultSize = maxScannerResultSize;

        int countdown = this.caching; 　　　　 
　　　　 // 设置获取数据的条数         
　　　　 callable.setCaching(this.caching);

        boolean skipFirst = false;

        boolean retryAfterOutOfOrderException  = true;

        do {

  　　　　　　if (skipFirst) {

　　　　　　　　 // 上次读的最后一个，这次就不读了，直接跳过就是了

              callable.setCaching(1);

              values = this.caller.callWithRetries(callable);

              callable.setCaching(this.caching);

              skipFirst = false;

            }
　　　　　　　values = this.caller.callWithRetries(callable);

       　　 if (values != null && values.length > 0) {

            for (Result rs : values) { 　　　　　　　　 //缓存起来               cache.add(rs);

              for (Cell kv : rs.rawCells()) {//计算出keyvalue的大小，然后减去

                remainingResultSize -= KeyValueUtil.ensureKeyValue(kv).heapSize();

              }

              countdown--;

              this.lastResult = rs;

            }

           }

          // Values == null means server-side filter has determined we must STOP

        } while (remainingResultSize > 0 && countdown > 0 && nextScanner(countdown, values == null));

      　
　　　　 //缓存里面有就从缓存里面取       
　　　　 if (cache.size() > 0) {

          return cache.poll();

        }

　　　　 return null;

    }

　　从next方法里面可以看出来，它是一次取caching条数据，然后下一次获取的时候，先把上次获取的最后一个给排除掉，再获取下来保存在cache当中，只要缓存不空，就一直在缓存里面取。

　　好了，至此Scan到此结束。

来自为知笔记(Wiz)