The current information explosion has resulted in an increasing number of applications that need to deal with large volumes of data. While many of the data contains useless redundancy data, especially in mass media, web crawler/analytic fields, wasted many precious resources (power, bandwidth, CPU and storage, etc.). This has resulted in an increased interest in algorithms that process the input data in restricted ways.

But traditional hash algorithms have two problems, first it assumes that the data fits in main memory, it is unreasonable when dealing with massive data such as multimedia data, web crawler/analytic repositories and so on. And second, traditional hash can only indentify the identical data. this brings to light the importance of simhash.

 

Simhash 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction

  • tokenize

    • tokenize your data, assign weights to each token, weights and tokenize function are depend on your business

  • hash (md5, SHA1)

    • calculate token's hash value and convert it to binary (101011 )

  • weigh values

    • for each hash value, do hash*w, in this way: (101011 ) -> (w,-w,w,-w,w,w)

  • merge

    • add up tokens' values, to merge to 1 hash, for example, merge (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) , results to (4+5 -4+-5 -4+5 4+-5 -4+5 4+5),which is (9 -9 1 -1 1)

  • Dimensionality Reduction

    • Finally, signs of elements of V corresponds to the bits of the final fingerprint, for example (9 -9 1 -1 1) -> (1 0 1 0 1), we get 10101 as the fingerprint.

How to use SimHash fingerprints?

Hamming distance can be used to find the similarity between two given data, calculate the Hamming distance between 2 fingerprints.

Based on my experience, for 64 bit SimHash values, with elaborate weight values,  distance of similar data often differ appreciably in magnitude from those unsimilar data.

how to calculate Hamming distance:

  XOR, 只有两个位不同时结果是1 ,否则为0,两个二进制value“异或”后得到1的个数 为海明距离 。

SimHash algorithm, introduced by Charikarand is patented by Google.

simhash 0.1.0 : Python Package Index

[SimHash] the Hash-based Similarity Detection Algorithm的更多相关文章

  1. A Node Influence Based Label Propagation Algorithm for Community detection in networks 文章算法实现的疑问

    这是我最近看到的一篇论文,思路还是很清晰的,就是改进的LPA算法.改进的地方在两个方面: (1)结合K-shell算法计算量了节点重重要度NI(node importance),标签更新顺序则按照NI ...

  2. VIPS: a VIsion based Page Segmentation Algorithm

    VIPS: a VIsion based Page Segmentation Algorithm VIPS: a VIsion based Page Segmentation Algorithm In ...

  3. MBMD(MobileNet-based tracking by detection algorithm)作者答疑

    If you fail to install and run this tracker, please email me (zhangyunhua@mail.dlut.edu.cn) Introduc ...

  4. anomaly detection algorithm

    anomaly detection algorithm 以上就是异常监测算法流程

  5. Floyd判圈算法 Floyd Cycle Detection Algorithm

    2018-01-13 20:55:56 Floyd判圈算法(Floyd Cycle Detection Algorithm),又称龟兔赛跑算法(Tortoise and Hare Algorithm) ...

  6. Floyd's Cycle Detection Algorithm

    Floyd's Cycle Detection Algorithm http://www.siafoo.net/algorithm/10 改进版: http://www.siafoo.net/algo ...

  7. 从时序异常检测(Time series anomaly detection algorithm)算法原理讨论到时序异常检测应用的思考

    1. 主要观点总结 0x1:什么场景下应用时序算法有效 历史数据可以被用来预测未来数据,对于一些周期性或者趋势性较强的时间序列领域问题,时序分解和时序预测算法可以发挥较好的作用,例如: 四季与天气的关 ...

  8. 个性探测综述阅读笔记——Recent trends in deep learning based personality detection

    目录 abstract 1. introduction 1.1 个性衡量方法 1.2 应用前景 1.3 伦理道德 2. Related works 3. Baseline methods 3.1 文本 ...

  9. 论文阅读笔记五十二:CornerNet-Lite: Efficient Keypoint Based Object Detection(CVPR2019)

    论文原址:https://arxiv.org/pdf/1904.08900.pdf github:https://github.com/princeton-vl/CornerNet-Lite 摘要 基 ...

随机推荐

  1. virtualbox+vagrant学习-2(command cli)-21-vagrant up命令

    Up 格式: vagrant up [options] [name|id] 这个命令根据你的Vagrantfile文件创建和配置客户机. 这是“vagrant”中最重要的一个命令,因为它是创建任何va ...

  2. python file的3中读法

    f.read()  整个文件读入到内存,全部放入到一个string中 f.readlines() 文件全部内容解析成行列表,自带\n,需要print i, f.readline()一行一行,返回字符串 ...

  3. webpack超详细配置, 使用教程(图文)

    流程 webpack安装 Step 1: 首先安装Node.js, 可以去Node.js官网下载. Step2: 在Git或者cmd中输入下面这段代码, 通过全局先将webpack指令安装进电脑中np ...

  4. RabbitMQ如何保证发送端消息的可靠投递-发生镜像队列发生故障转移时

    上一篇最后提到了mandatory这个参数,对于设置mandatory参数个人感觉还是很重要的,尤其在RabbitMQ镜像队列发生故障转移时. 模拟个测试环境如下: 首先在集群队列中增加两个镜像队列的 ...

  5. .NET Core中多语言支持

    在.NET Core项目中也是可以使用.resx资源文件,来为程序提供多语言支持.以下我们就以一个.NET Core控制台项目为例,来讲解资源文件的使用. 新建一个.NET Core控制台项目,然后我 ...

  6. FullCalendar Timeline View 使用

    FullCalendar  Timeline View(v4) The Scheduler add-on provides a new view called “timeline view” with ...

  7. MySQL 性能测试

    MySQL 查询优化器有几个目标,但是其中最主要的目标是尽可能地使用索引,并且使用最严格的索引来消除尽可能多的数据行.最终目标是提交 SELECT 语句查找数据行,而不是排除数据行.优化器试图排除数据 ...

  8. 新装Linux无法访问域名

    昨天新安装Linux,发现ping百度ping不通: 经查询,得知是系统没有配置DNS域名服务器,百度搜索DNS域名服务器列表: 编辑 /etc/resolv.conf 文件,添加查询到的DNS服务器 ...

  9. 【PHP开发规范】老生常谈的编码开发规范你懂多少?

    [PHP开发规范]老生常谈的编码开发规范你懂多少? 这几天看了一下阿里技术发布的一套Java开发规范<阿里巴巴Java开发手册>,里面写了阿里内部的Java开发规范标准,写的很好.这套Ja ...

  10. linux-2.6内核驱动学习——jz2440之按键

      //以下是学习完韦东山老师视屏教程后所做学习记录中断方式取得按键值: #include <linux/module.h> #include <linux/kernel.h> ...