







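However, because HyperLogLog only computes the cardinality from the input elements and does not store the elements themselves, it cannot return the individual elements the way a set can. In other words, there is no way to inspect the detailed contents of what was counted.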




For example, given the data set {1, 3, 5, 7, 5, 7, 8}, the set of distinct elements is {1, 3, 5, 7, 8}, and the cardinality (the number of distinct elements) is 5.
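The same example can be checked with exact counting in Python. This is only a sketch of what "cardinality" means; an ordinary set stores every element, which is exactly the cost HyperLogLog avoids.

```python
# Exact cardinality of the example data set, using an ordinary set.
data = [1, 3, 5, 7, 5, 7, 8]

distinct = set(data)          # duplicates collapse: {1, 3, 5, 7, 8}
cardinality = len(distinct)   # number of distinct elements

print(sorted(distinct), cardinality)  # [1, 3, 5, 7, 8] 5
```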





Basic Redis HyperLogLog commands:

1 PFADD key element [element ...]

Adds the specified elements to the HyperLogLog.

2 PFCOUNT key [key ...]

Returns the estimated cardinality of the given HyperLogLog.

3 PFMERGE destkey sourcekey [sourcekey ...]

Merges multiple HyperLogLogs into a single HyperLogLog.


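PFADD adds any number of elements to the specified HyperLogLog. Running the command may update the HyperLogLog's internal structure, and the return value reports whether it did: 1 if at least one internal register was changed, 0 otherwise.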







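When PFCOUNT is applied to multiple keys, it returns the approximate cardinality of the union of all the given HyperLogLogs. This value is computed by merging all the given HyperLogLogs into a temporary HyperLogLog.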


The cardinality that the command returns for the observed set is not exact; it is an approximation with a standard error of 0.81%.
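The 0.81% figure is consistent with the usual HyperLogLog standard-error formula, 1.04 / sqrt(m), where m is the number of registers. The sketch below assumes m = 16384, the register count given in the Redis source comment quoted later in this post; the function name is made up for illustration.

```python
import math

def hll_standard_error(num_registers: int) -> float:
    """Relative standard error of a HyperLogLog with the given register count."""
    return 1.04 / math.sqrt(num_registers)

REGISTERS = 16384  # register count from the Redis source comment
error = hll_standard_error(REGISTERS)

print(f"{error:.4%}")  # 0.8125%
```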

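For example, to record how many distinct search queries are executed in a day, a program can call PFADD each time a search query runs, and then call PFCOUNT to get the approximate result.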


PFMERGE merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog approximates that of the union of the observed sets of all the input HyperLogLogs.

The merged HyperLogLog is stored in the destkey key. If that key does not exist, the command creates an empty HyperLogLog for it before running.
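To make the three commands concrete, here is a toy in-memory HyperLogLog in Python. It is a sketch under stated assumptions, not the Redis implementation: the hash (truncated SHA-1) is a stand-in, the estimator is the textbook one with linear counting for small sets, and the function names merely mirror the command names. The point it illustrates is that merging is a register-wise maximum, so the merged structure is identical to one built from the union of the inputs.

```python
import hashlib
import math

P = 14          # index bits: 2**14 = 16384 registers
M = 1 << P
Q = 64 - P      # remaining hash bits used to compute the rank

def hash64(element: str) -> int:
    """Stand-in 64-bit hash (an assumption; not the hash Redis uses)."""
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:8], "little")

def new_hll() -> list:
    return [0] * M

def pfadd(hll: list, element: str) -> bool:
    """Update one register; return True if it changed (like PFADD's 1/0)."""
    h = hash64(element)
    index = h & (M - 1)           # low P bits select the register
    rest = h >> P                 # remaining Q bits
    rank = 1                      # 1-based position of the first 1 bit
    while rank <= Q and (rest & 1) == 0:
        rest >>= 1
        rank += 1
    if rank > hll[index]:
        hll[index] = rank
        return True
    return False

def pfcount(hll: list) -> int:
    """Textbook estimator; linear counting when many registers are still 0."""
    alpha = 0.7213 / (1 + 1.079 / M)
    estimate = alpha * M * M / sum(2.0 ** -r for r in hll)
    zeros = hll.count(0)
    if estimate <= 2.5 * M and zeros:
        estimate = M * math.log(M / zeros)
    return round(estimate)

def pfmerge(*hlls: list) -> list:
    """Merge by taking the maximum of each register across the inputs."""
    return [max(regs) for regs in zip(*hlls)]

# Hypothetical data: two days of visitors with 500 visitors in common.
day1, day2, both = new_hll(), new_hll(), new_hll()
for i in range(1000):
    pfadd(day1, f"visitor-{i}")
    pfadd(both, f"visitor-{i}")
for i in range(500, 1500):
    pfadd(day2, f"visitor-{i}")
    pfadd(both, f"visitor-{i}")

merged = pfmerge(day1, day2)
print(pfcount(day1), pfcount(day2), pfcount(merged))
```

The three printed values should be close to 1000, 1000 and 1500; the exact numbers depend on the hash.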



redis> PFADD  ip:20170626  ""  ""  ""

(integer) 1

redis> PFADD  ip:20170626 ""  ""  ""  # elements already present are skipped; only new ones are added

(integer) 1

redis> PFCOUNT ip:20170626  # the estimated element count is unchanged

(integer) 5

redis> PFADD  ip:20170626 ""  # an element already present does not increase the count

(integer) 0

redis> PFMERGE ip:20170626   ip:20170627   ip:20170628


redis> PFCOUNT  ip:201706

(integer) 5

4. HyperLogLog description



/* The Redis HyperLogLog implementation is based on the following ideas:


* * The use of a 64 bit hash function as proposed in [1], in order not to be

*   limited to cardinalities up to 10^9, at the cost of just 1 additional

*   bit per register.

* * The use of 16384 6-bit registers for a great level of accuracy, using

*   a total of 12k per key.

* * The use of the Redis string data type. No new type is introduced.

* * No attempt is made to compress the data structure as in [1]. Also the

*   algorithm used is the original HyperLogLog Algorithm as in [2], with

*   the only difference that a 64 bit hash function is used, so no correction

*   is performed for values near 2^32 as in [1].


* [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic

*     Engineering of a State of The Art Cardinality Estimation Algorithm.


* [2] P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The

*     analysis of a near-optimal cardinality estimation algorithm.


* Redis uses two representations:


* 1) A "dense" representation where every entry is represented by

*    a 6-bit integer.

* 2) A "sparse" representation using run length compression suitable

*    for representing HyperLogLogs with many registers set to 0 in

*    a memory efficient way.



* HLL header

* ===


* Both the dense and sparse representation have a 16 byte header as follows:


* +------+---+-----+----------+

* | HYLL | E | N/U | Cardin.  |

* +------+---+-----+----------+


* The first 4 bytes are a magic string set to the bytes "HYLL".

* "E" is one byte encoding, currently set to HLL_DENSE or

* HLL_SPARSE. N/U are three not used bytes.


* The "Cardin." field is a 64 bit integer stored in little endian format

* with the latest cardinality computed that can be reused if the data

* structure was not modified since the last computation (this is useful

* because there are high probabilities that PFADD operations don't

* modify the actual data structure and hence the approximated cardinality).


* When the most significant bit in the most significant byte of the cached

* cardinality is set, it means that the data structure was modified and

* we can't reuse the cached value that must be recomputed.
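The header layout described above can be sketched with Python's `struct` module. This is an illustration only: the helper names are made up, and the numeric values chosen here for `HLL_DENSE` and `HLL_SPARSE` are assumptions, since the comment does not state them.

```python
import struct

HLL_DENSE, HLL_SPARSE = 0, 1     # assumed numeric values, for illustration only
INVALID_BIT = 1 << 63            # MSB of the most significant byte (little endian)
HEADER_FORMAT = "<4sB3xQ"        # magic, encoding, 3 unused bytes, 64-bit cardinality

def pack_header(encoding: int, cached_cardinality: int, valid: bool = True) -> bytes:
    """Build the 16-byte header; set the top bit when the cache is stale."""
    card = cached_cardinality if valid else cached_cardinality | INVALID_BIT
    return struct.pack(HEADER_FORMAT, b"HYLL", encoding, card)

def unpack_header(header: bytes):
    """Return (encoding, cached_cardinality, cache_is_valid)."""
    magic, encoding, card = struct.unpack(HEADER_FORMAT, header)
    if magic != b"HYLL":
        raise ValueError("not a HyperLogLog header")
    return encoding, card & ~INVALID_BIT, not (card & INVALID_BIT)

fresh = pack_header(HLL_SPARSE, 5)
stale = pack_header(HLL_SPARSE, 5, valid=False)
print(len(fresh), unpack_header(fresh), unpack_header(stale))
```

Because the integer is little endian, the "modified" flag ends up in the top bit of the last header byte.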


* Dense representation

* ===


* The dense representation used by Redis is the following:


* +--------+--------+--------+------//      //--+

* |11000000|22221111|33333322|55444444 ....     |

* +--------+--------+--------+------//      //--+


* The 6-bit counters are encoded one after the other starting from the

* LSB to the MSB, and using the next bytes as needed.
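The packing shown in the diagram can be sketched as a pair of get/set helpers over a byte array. This is a Python illustration of the layout, with hypothetical function names, not a transcription of the Redis macros.

```python
BITS = 6
REGISTERS = 16384
REGISTER_MAX = (1 << BITS) - 1          # 63, the largest 6-bit value

def get_register(buf: bytearray, regnum: int) -> int:
    """Read the 6-bit counter number `regnum` from the packed buffer."""
    byte, offset = divmod(regnum * BITS, 8)
    low = buf[byte] >> offset
    # The counter may spill into the next byte; guard the end of the buffer.
    high = buf[byte + 1] << (8 - offset) if byte + 1 < len(buf) else 0
    return (low | high) & REGISTER_MAX

def set_register(buf: bytearray, regnum: int, value: int) -> None:
    """Write a 6-bit counter, touching one or two bytes as needed."""
    byte, offset = divmod(regnum * BITS, 8)
    buf[byte] = (buf[byte] & ~(REGISTER_MAX << offset) & 0xFF) | ((value << offset) & 0xFF)
    if offset > 2:                      # counter crosses a byte boundary
        spill = 8 - offset              # bits that fit in the first byte
        buf[byte + 1] = (buf[byte + 1] & ~(REGISTER_MAX >> spill) & 0xFF) | (value >> spill)

dense = bytearray(REGISTERS * BITS // 8)    # 12288 bytes
set_register(dense, 0, 0b101010)
set_register(dense, 1, 0b110011)
print(len(dense), get_register(dense, 0), get_register(dense, 1))  # 12288 42 51
```

After the two writes, the first byte holds all six bits of register 0 plus the two lowest bits of register 1, matching the `|11000000|` cell in the diagram.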


* Sparse representation

* ===


* The sparse representation encodes registers using a run length

* encoding composed of three opcodes, two using one byte, and one using

* two bytes. The opcodes are called ZERO, XZERO and VAL.


* ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented

* by the six bits 'xxxxxx', plus 1, means that there are N registers set

* to 0. This opcode can represent from 1 to 64 contiguous registers set

* to the value of 0.


* XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit

* integer represented by the bits 'xxxxxx' as most significant bits and

* 'yyyyyyyy' as least significant bits, plus 1, means that there are N

* registers set to 0. This opcode can represent from 1 to 16384 contiguous

* registers set to the value of 0.


* VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer

* representing the value of a register, and a 2-bit integer representing

* the number of contiguous registers set to that value 'vvvvv'.

* To obtain the value and run length, the integers vvvvv and xx must be

* incremented by one. This opcode can represent values from 1 to 32,

* repeated from 1 to 4 times.


* The sparse representation can't represent registers with a value greater

* than 32, however it is very unlikely that we find such a register in an

* HLL with a cardinality where the sparse representation is still more

* memory efficient than the dense representation. When this happens the

* HLL is converted to the dense representation.


* The sparse representation is purely positional. For example a sparse

* representation of an empty HLL is just: XZERO:16384.


* An HLL having only 3 non-zero registers at position 1000, 1020, 1021

* respectively set to 2, 3, 3, is represented by the following three

* opcodes:


* XZERO:1000 (Registers 0-999 are set to 0)

* VAL:2,1    (1 register set to value 2, that is register 1000)

* ZERO:19    (Registers 1001-1019 set to 0)

* VAL:3,2    (2 registers set to value 3, that is registers 1020,1021)

* XZERO:15362 (Registers 1022-16383 set to 0)
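The example can be reproduced with a small encoder and decoder for the three opcodes. This is a Python sketch that follows the bit layouts described in the comment; the helper names are hypothetical.

```python
def zero(run: int) -> bytes:
    """ZERO: 00xxxxxx, a run of 1..64 registers set to 0."""
    assert 1 <= run <= 64
    return bytes([run - 1])

def xzero(run: int) -> bytes:
    """XZERO: 01xxxxxx yyyyyyyy, a run of 1..16384 registers set to 0."""
    assert 1 <= run <= 16384
    n = run - 1
    return bytes([0x40 | (n >> 8), n & 0xFF])

def val(value: int, run: int) -> bytes:
    """VAL: 1vvvvvxx, a value 1..32 repeated 1..4 times."""
    assert 1 <= value <= 32 and 1 <= run <= 4
    return bytes([0x80 | ((value - 1) << 2) | (run - 1)])

def decode(sparse: bytes) -> list:
    """Expand a sparse opcode stream into the full list of register values."""
    registers = []
    i = 0
    while i < len(sparse):
        b = sparse[i]
        if b & 0x80:                                   # VAL
            registers += [((b >> 2) & 0x1F) + 1] * ((b & 0x03) + 1)
            i += 1
        elif b & 0x40:                                 # XZERO
            registers += [0] * ((((b & 0x3F) << 8) | sparse[i + 1]) + 1)
            i += 2
        else:                                          # ZERO
            registers += [0] * ((b & 0x3F) + 1)
            i += 1
    return registers

# The five opcodes from the example above.
encoded = xzero(1000) + val(2, 1) + zero(19) + val(3, 2) + xzero(15362)
registers = decode(encoded)

print(len(encoded), len(registers))                        # 7 16384
print(registers[1000], registers[1020], registers[1021])   # 2 3 3
```

The encoding is 7 bytes long and expands to exactly 16384 registers, which is why the runs in the example must add up to the full register count.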


* In the example the sparse representation used just 7 bytes instead

* of 12k in order to represent the HLL registers. In general for low

* cardinality there is a big win in terms of space efficiency, traded

* with CPU time since the sparse representation is slower to access:


* The following table shows average cardinality vs bytes used, 100

* samples per cardinality (when the set was not representable because

* of registers with too big value, the dense representation size was used

* as a sample).


* 100 267

* 200 485

* 300 678

* 400 859

* 500 1033

* 600 1205

* 700 1375

* 800 1544

* 900 1713

* 1000 1882

* 2000 3480

* 3000 4879

* 4000 6089

* 5000 7138

* 6000 8042

* 7000 8823

* 8000 9500

* 9000 10088

* 10000 10591


* The dense representation uses 12288 bytes, so there is a big win up to

* a cardinality of ~2000-3000. For bigger cardinalities the constant times

* involved in updating the sparse representation are not justified by the

* memory savings. The exact maximum length of the sparse representation

* when this implementation switches to the dense representation is

* configured via the define server.hll_sparse_max_bytes.
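The size comparison can be checked with a little arithmetic. The sketch below derives the dense size from the register count and width, then looks up the table rows above to see which tabulated cardinalities stay under a given byte limit. The 3000-byte limit is a hypothetical value chosen for illustration, and the function name is made up.

```python
REGISTERS = 16384
BITS_PER_REGISTER = 6
DENSE_BYTES = REGISTERS * BITS_PER_REGISTER // 8    # 12288

# (cardinality, average sparse bytes) rows copied from the table above.
SPARSE_TABLE = [
    (100, 267), (200, 485), (300, 678), (400, 859), (500, 1033),
    (600, 1205), (700, 1375), (800, 1544), (900, 1713), (1000, 1882),
    (2000, 3480), (3000, 4879), (4000, 6089), (5000, 7138), (6000, 8042),
    (7000, 8823), (8000, 9500), (9000, 10088), (10000, 10591),
]

def largest_cardinality_under(max_bytes: int):
    """Largest tabulated cardinality whose average sparse size fits the limit."""
    fitting = [card for card, size in SPARSE_TABLE if size <= max_bytes]
    return max(fitting) if fitting else None

SPARSE_MAX_BYTES = 3000   # hypothetical limit, for illustration only
print(DENSE_BYTES, largest_cardinality_under(SPARSE_MAX_BYTES))  # 12288 1000
```

With this limit, the table suggests the sparse form is kept up to somewhere between 1000 and 2000 distinct elements on average.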



