Redis Data Structures: HyperLogLog
I. HyperLogLog
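HyperLogLog is used for cardinality estimation.
It counts distinct items using very little memory, which suits cases where exact numbers are not critical: registered IPs, daily visiting IPs, real-time page UV (PV needs only a plain string counter), online users, and so on.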
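Advantage of HyperLogLog:
Even when the number or size of the input elements is extremely large, the space needed to compute the cardinality is fixed and small.
Disadvantage of HyperLogLog:
It is an estimation algorithm, so the result is approximate, with a standard error of about 0.81%.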
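Each HyperLogLog key needs only 12 KB of memory to estimate the cardinality of close to 2^64 distinct elements. This is in sharp contrast to a set, which uses more memory the more elements it holds.
However, a HyperLogLog only uses the input elements to compute the cardinality and does not store the elements themselves. Unlike a set, it therefore cannot return the individual elements, so the details of what was counted are not available.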
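The 12 KB and 0.81% figures can be checked with a little arithmetic. The sketch below assumes the parameters given in the Redis source comment quoted later in this article (16384 registers of 6 bits each) and the standard error formula 1.04/sqrt(m) from the HyperLogLog paper; the variable names are illustrative only.

```python
import math

REGISTERS = 16384        # registers per HyperLogLog key (2**14)
BITS_PER_REGISTER = 6    # each register is a 6-bit counter

# Size of the dense register array, excluding the 16-byte header.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Standard error of the HyperLogLog estimator with m registers.
standard_error = 1.04 / math.sqrt(REGISTERS)

print(dense_bytes)               # 12288 bytes, i.e. 12 KB
print(f"{standard_error:.4%}")   # 0.8125%
```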
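II. Cardinality and Estimates
1. Cardinality
The cardinality is the number of distinct elements in a set.
For example, given the data set {1, 3, 5, 7, 5, 7, 8}, its set of distinct elements is {1, 3, 5, 7, 8}, so the cardinality (the number of non-repeated elements) is 5.
Cardinality estimation means computing the cardinality quickly, within an acceptable margin of error.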
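For comparison, the exact cardinality of the example data set can be computed with an ordinary set. This is only an illustration of the definition: a set stores every distinct element, which is exactly the memory cost that HyperLogLog avoids.

```python
data = [1, 3, 5, 7, 5, 7, 8]

# A set keeps one copy of each distinct element.
distinct = set(data)

print(sorted(distinct))  # [1, 3, 5, 7, 8]
print(len(distinct))     # 5
```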
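2. Estimates
The cardinality the algorithm returns is not exact. It may be slightly higher or slightly lower than the true value, but the error stays within a reasonable range.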
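To make the idea of an estimate concrete, here is a toy HyperLogLog in Python. It is a teaching sketch, not Redis's implementation: the class name `ToyHLL`, the use of SHA-1 as the 64-bit hash, and the simple small-range correction are all assumptions made for illustration.

```python
import hashlib
import math


class ToyHLL:
    """Teaching sketch of HyperLogLog; not the Redis implementation."""

    def __init__(self, p=14):
        self.p = p                  # number of bits used to pick a register
        self.m = 1 << p             # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def add(self, item):
        """Add a string; return True if a register changed."""
        # Take 64 bits of a SHA-1 digest as the hash (an assumption;
        # Redis uses a different hash function).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h & (self.m - 1)    # low p bits select the register
        rest = h >> self.p          # remaining 64 - p bits
        # Rank = number of leading zeros in `rest`, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return True
        return False

    def count(self):
        """Return the estimated number of distinct items added."""
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Linear counting works better for small cardinalities.
            return round(m * math.log(m / zeros))
        return round(raw)


hll = ToyHLL()
for i in range(100_000):
    hll.add(f"user-{i}")
estimate = hll.count()
print(estimate)  # close to 100000; the exact figure depends on the hash
```

Adding an item that is already present leaves the registers, and therefore the estimate, unchanged. The sketch stores only the 16384 small counters, never the items themselves.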
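III. Basic HyperLogLog Commands
The basic Redis HyperLogLog commands are:
1. PFADD key [element [element ...]]
Adds the specified elements to a HyperLogLog.
2. PFCOUNT key [key ...]
Returns the estimated cardinality of the given HyperLogLog(s).
3. PFMERGE destkey sourcekey [sourcekey ...]
Merges multiple HyperLogLogs into a single HyperLogLog.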
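PFADD
Adds any number of elements to the specified HyperLogLog. Running the command may update the internal structure of the HyperLogLog, and the return value reports whether it did.
If the cardinality estimate of the HyperLogLog changed as a result of the command, it returns 1; otherwise (the elements are considered already present) it returns 0.
A handy feature of this command is that it can be called with a key and no elements, which simply creates an empty HyperLogLog.
If the key already exists, nothing happens and 0 is returned; if it does not exist, it is created and 1 is returned.
The time complexity is O(1) per added element, so it is cheap to call.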
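PFCOUNT
When called with a single key, it returns the estimated cardinality for that key. If the key does not exist, it returns 0.
When called with multiple keys, it returns the approximate cardinality of the union of all the given HyperLogLogs, computed by merging them into a temporary HyperLogLog.
With a single key, the time complexity is O(1) with a very small average constant time. With N keys, it is O(N), and the constant time is much larger because the HyperLogLogs have to be merged first.
The cardinality of the observed set returned by the command is not exact; it is an approximation with a standard error of 0.81%.
For example, to record how many distinct search queries are run in a day, a program can call PFADD each time a search query is executed and call PFCOUNT to get the approximate result.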
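PFMERGE
Merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog approximates that of the union of the observed sets of all the input HyperLogLogs.
The first argument is the destination key and the remaining arguments are the HyperLogLogs to merge. The result is stored in destkey; if that key does not exist, an empty HyperLogLog is created for it before the merge runs.
The time complexity is O(N), where N is the number of HyperLogLogs being merged, although the constant time of this command is fairly high.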
An example session:
redis> PFADD ip:20170626 "192.168.0.10" "192.168.0.20" "192.168.0.30"
(integer) 1
redis> PFADD ip:20170626 "192.168.0.20" "192.168.0.40" "192.168.0.50" # only the elements not already present are added
(integer) 1
redis> PFCOUNT ip:20170626 # estimated number of distinct elements
(integer) 5
redis> PFADD ip:20170626 "192.168.0.20" # already present, so nothing changes
(integer) 0
redis> PFMERGE ip:20170626 ip:20170627 ip:20170628
OK
redis> PFCOUNT ip:20170626
(integer) 5
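Merging is cheap because, in the HyperLogLog algorithm, the union of two sketches is obtained by taking the larger value of each pair of corresponding registers. The sketch below shows that rule on tiny 4-register lists; the function name `merge_registers` and the sample values are hypothetical, and a real HyperLogLog has 16384 registers.

```python
def merge_registers(*hlls):
    """Merge equally sized register lists by taking the max per position."""
    return [max(column) for column in zip(*hlls)]


a = [0, 3, 1, 0]
b = [2, 1, 1, 0]
merged = merge_registers(a, b)
print(merged)  # [2, 3, 1, 0]
```

Because the maximum is idempotent, merging a HyperLogLog with itself or with an empty one leaves it unchanged, which is consistent with the count staying at 5 in the session above.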
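IV. The Description in hyperloglog.c
HyperLogLog is not used in many real-world scenarios, so it is not discussed in more depth here.
Here is how the hyperloglog.c source file describes HyperLogLog: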
/* The Redis HyperLogLog implementation is based on the following ideas:
*
* * The use of a 64 bit hash function as proposed in [1], in order to don't
* limited to cardinalities up to 10^9, at the cost of just 1 additional
* bit per register.
* * The use of 16384 6-bit registers for a great level of accuracy, using
* a total of 12k per key.
* * The use of the Redis string data type. No new type is introduced.
* * No attempt is made to compress the data structure as in [1]. Also the
* algorithm used is the original HyperLogLog Algorithm as in [2], with
* the only difference that a 64 bit hash function is used, so no correction
* is performed for values near 2^32 as in [1].
*
* [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
* Engineering of a State of The Art Cardinality Estimation Algorithm.
*
* [2] P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
* analysis of a near-optimal cardinality estimation algorithm.
*
* Redis uses two representations:
*
* 1) A "dense" representation where every entry is represented by
* a 6-bit integer.
* 2) A "sparse" representation using run length compression suitable
* for representing HyperLogLogs with many registers set to 0 in
* a memory efficient way.
*
*
* HLL header
* ===
*
* Both the dense and sparse representation have a 16 byte header as follows:
*
* +------+---+-----+----------+
* | HYLL | E | N/U | Cardin. |
* +------+---+-----+----------+
*
* The first 4 bytes are a magic string set to the bytes "HYLL".
* "E" is one byte encoding, currently set to HLL_DENSE or
* HLL_SPARSE. N/U are three not used bytes.
*
* The "Cardin." field is a 64 bit integer stored in little endian format
* with the latest cardinality computed that can be reused if the data
* structure was not modified since the last computation (this is useful
* because there are high probabilities that HLLADD operations don't
* modify the actual data structure and hence the approximated cardinality).
*
* When the most significant bit in the most significant byte of the cached
* cardinality is set, it means that the data structure was modified and
* we can't reuse the cached value that must be recomputed.
*
* Dense representation
* ===
*
* The dense representation used by Redis is the following:
*
* +--------+--------+--------+------// //--+
* |11000000|22221111|33333322|55444444 .... |
* +--------+--------+--------+------// //--+
*
* The 6 bits counters are encoded one after the other starting from the
* LSB to the MSB, and using the next bytes as needed.
*
* Sparse representation
* ===
*
* The sparse representation encodes registers using a run length
* encoding composed of three opcodes, two using one byte, and one using
* of two bytes. The opcodes are called ZERO, XZERO and VAL.
*
* ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
* by the six bits 'xxxxxx', plus 1, means that there are N registers set
* to 0. This opcode can represent from 1 to 64 contiguous registers set
* to the value of 0.
*
* XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
* integer represented by the bits 'xxxxxx' as most significant bits and
* 'yyyyyyyy' as least significant bits, plus 1, means that there are N
* registers set to 0. This opcode can represent from 0 to 16384 contiguous
* registers set to the value of 0.
*
* VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
* representing the value of a register, and a 2-bit integer representing
* the number of contiguous registers set to that value 'vvvvv'.
* To obtain the value and run length, the integers vvvvv and xx must be
* incremented by one. This opcode can represent values from 1 to 32,
* repeated from 1 to 4 times.
*
* The sparse representation can't represent registers with a value greater
* than 32, however it is very unlikely that we find such a register in an
* HLL with a cardinality where the sparse representation is still more
* memory efficient than the dense representation. When this happens the
* HLL is converted to the dense representation.
*
* The sparse representation is purely positional. For example a sparse
* representation of an empty HLL is just: XZERO:16384.
*
* An HLL having only 3 non-zero registers at position 1000, 1020, 1021
* respectively set to 2, 3, 3, is represented by the following three
* opcodes:
*
* XZERO:1000 (Registers 0-999 are set to 0)
* VAL:2,1 (1 register set to value 2, that is register 1000)
* ZERO:19 (Registers 1001-1019 set to 0)
* VAL:3,2 (2 registers set to value 3, that is registers 1020,1021)
* XZERO:15362 (Registers 1022-16383 set to 0)
*
* In the example the sparse representation used just 7 bytes instead
* of 12k in order to represent the HLL registers. In general for low
* cardinality there is a big win in terms of space efficiency, traded
* with CPU time since the sparse representation is slower to access:
*
* The following table shows average cardinality vs bytes used, 100
* samples per cardinality (when the set was not representable because
* of registers with too big value, the dense representation size was used
* as a sample).
*
* 100 267
* 200 485
* 300 678
* 400 859
* 500 1033
* 600 1205
* 700 1375
* 800 1544
* 900 1713
* 1000 1882
* 2000 3480
* 3000 4879
* 4000 6089
* 5000 7138
* 6000 8042
* 7000 8823
* 8000 9500
* 9000 10088
* 10000 10591
*
* The dense representation uses 12288 bytes, so there is a big win up to
* a cardinality of ~2000-3000. For bigger cardinalities the constant times
* involved in updating the sparse representation is not justified by the
* memory savings. The exact maximum length of the sparse representation
* when this implementation switches to the dense representation is
* configured via the define server.hll_sparse_max_bytes.
*/
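The sparse encoding described in the comment can be decoded in a few lines. The following Python sketch follows the opcode layout given above (ZERO as 00xxxxxx, XZERO as 01xxxxxx yyyyyyyy, VAL as 1vvvvvxx) and expands a sparse byte string into a plain list of registers. The function name `decode_sparse` is hypothetical, the 16-byte header is not handled, and the example bytes were encoded by hand from the five opcodes in the comment.

```python
def decode_sparse(data):
    """Expand sparse HLL opcodes (without the header) into a register list."""
    registers = []
    i = 0
    while i < len(data):
        b = data[i]
        if b & 0x80:
            # VAL: 1vvvvvxx, value = vvvvv + 1, run length = xx + 1
            value = ((b >> 2) & 0x1F) + 1
            run = (b & 0x03) + 1
            registers.extend([value] * run)
            i += 1
        elif b & 0x40:
            # XZERO: 01xxxxxx yyyyyyyy, 14-bit run length, plus 1
            run = (((b & 0x3F) << 8) | data[i + 1]) + 1
            registers.extend([0] * run)
            i += 2
        else:
            # ZERO: 00xxxxxx, 6-bit run length, plus 1
            run = (b & 0x3F) + 1
            registers.extend([0] * run)
            i += 1
    return registers


# XZERO:1000, VAL:2,1, ZERO:19, VAL:3,2, XZERO:15362
example = bytes([0x43, 0xE7, 0x84, 0x12, 0x89, 0x7C, 0x01])
regs = decode_sparse(example)
print(len(example))                                    # 7
print(len(regs), regs[1000], regs[1020], regs[1021])   # 16384 2 3 3
```

The seven bytes expand to all 16384 registers, matching the "just 7 bytes instead of 12k" figure in the comment.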