在数据库中,常常会有Distinct Count的操作,比如,查看每一选修课程的人数: select course, count(distinct sid) from stu_table group by course; Hive 在大数据场景下,报表很重要一项是UV(Unique Visitor)统计,即某时间段内用户人数.例如,查看一周内app的用户分布情况,Hive中写HiveQL实现: select app, count(distinct uid) as uv from log_tabl…
Awesome Big Data A curated list of awesome big data frameworks, resources and other awesomeness. Inspired byawesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data. Your contributions are always welcome! Awesome Big Data Frameworks…
https://github.com/onurakpolat/awesome-bigdata A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data. Your contributions are always welco…
数据量一大,连统计基数也成了一个麻烦事.在使用kylin的时候,遇到对度量值进行基数统计,使用的是Hyperloglog算法,占用内存小,误差小,实乃不错的方法,但查阅网上的资料与内容,感觉未能理解的太明白.经过一番折腾,自己给整理出一个版本出来. 算法的论文是<HyperLogLog the analysis of a near-optimal cardinality estimation algorithm>,可以在谷歌学术上下载下来看看.具体论文的理论推导不详细介绍,简述下其思想核心.…