Jaccard Similarity and Shingling

https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf

https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf

【可测空间 convert the data (homeworks, webpages, emails) into an object in an abstract space that we know how to measure distance 】

We will study how to define the distance between sets, specifically with the Jaccard distance. To illustrate and motivate this study, we will focus on using Jaccard distance to measure the distance between documents. This uses the common “bag of words” model, which is simplistic, but is sufficient for many applications. We start with some big questions. This lecture will only begin to answer them. • Given two homework assignments (reports) how can a computer detect if one is likely to have been plagiarized from the other without understanding the content? • In trying to index webpages, how does Google avoid listing duplicates or mirrors? • How does a computer quickly understand emails, for either detecting spam or placing effective advertisers? (If an ad worked on one email, how can we determine which others are similar?)

【词带将文本段落转化为数值集合 convert documents into sets】

4.2 Documents to Sets How do we apply this set machinery to documents? Bag of words vs. Shingles The first option is the bag of words model, where each document is treated as an unordered set of words. A more general approach is to shingle the document. This takes consecutive words and group them as a single object. A k-shingle is a consecutive set of k words. So the set of all 1-shingles is exactly the bag of words model. An alternative name to k-shingle is an k-gram. These mean the same thing. D1 : I am Sam. D2 : Sam I am. D3 : I do not like green eggs and ham. D4 : I do not like them, Sam I am. The (k = 1)-shingles of D1∪D2∪D3∪D4 are: {[I], [am], [Sam], [do], [not], [like], [green], [eggs], [and], [ham], [them]}.

The (k = 2)-shingles of D1∪D2∪D3∪D4 are: {[I am], [am Sam], [Sam Sam], [Sam I], [am I], [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham], [like them], [them Sam]}. The set of k-shingles of a document with n words is at most n − k. The takes space O(kn) to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.

The set of k-shingles of a document with n words is at most n − k. The takes space O(kn) to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.

【勘误--k n n-k+1 空间复杂度 space O(kn) 】

【Jaccard 对相似度的度量 Jaccard with Shingles】

4.3 Jaccard with Shingles So how do we put this together. Consider the (k = 2)-shingles for each D1, D2, D3, and D4: D1 : [I am], [am Sam] D2 : [Sam I], [I am] D3 : [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham] D4 : [I do], [do not], [not like], [like them], [them Sam], [Sam I], [I am]

Now the Jaccard similarity is as follows: JS(D1, D2) = 1/3 ≈ 0.333 JS(D1, D3) = 0 = 0.0 JS(D1, D4) = 1/8 = 0.125 JS(D2, D3) = 0 = 0.0 JS(D3, D4) = 2/7 ≈ 0.286 JS(D3, D4) = 3/11 ≈ 0.273 Next time we will see how to use this special abstract structure of sets to compute this distance (approximately) very efficiently and at extremely large scale.

Jaccard Similarity and Shingling的更多相关文章

jaccard similarity coefficient 相似度计算
Jaccard index From Wikipedia, the free encyclopedia The Jaccard index, also known as the Jaccard ...
Jaccard similarity(杰卡德相似度)和Abundance correlation（丰度相关性）
杰卡德距离(Jaccard Distance) 是用来衡量两个集合差异性的一种指标,它是杰卡德相似系数的补集,被定义为1减去Jaccard相似系数.而杰卡德相似系数(Jaccard similarit ...
基于jaccard相似度的LSH
使用Python通过LSH建立推荐引擎 LSH:一个可以用来处理成百上千行的算法前提: Python 基础 Pandas 学完本教程之后,解锁成就: 通过建立shingles 为LSH准备训练集和测 ...
机器学习中的相似性度量(Similarity Measurement)
机器学习中的相似性度量(Similarity Measurement) 在做分类时常常需要估算不同样本之间的相似性度量(Similarity Measurement),这时通常采用的方法就是计算样本间 ...
相似性度量(Similarity Measurement)与“距离”(Distance)
在做分类时常常需要估算不同样本之间的相似性度量(Similarity Measurement),这时通常采用的方法就是计算样本间的“距离”(Distance).采用什么样的方法计算距离是很讲究,甚至关 ...
相似性分析之Jaccard相似系数
Jaccard, 又称为Jaccard相似系数(Jaccard similarity coefficient)用于比较有限样本集之间的相似性与差异性.Jaccard系数值越大,样本相似度越高公式: ...
Dice Similarity Coefficent vs. IoU Dice系数和IoU
Dice Similarity Coefficent vs. IoU Several readers emailed regarding the segmentation performance of ...
相似系数_杰卡德距离(Jaccard Distance)
python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...
海量数据挖掘MMDS week2: 局部敏感哈希Locality-Sensitive Hashing, LSH
http://blog.csdn.net/pipisorry/article/details/48858661 海量数据挖掘Mining Massive Datasets(MMDs) -Jure Le ...

随机推荐

C语言对文件的读写操作以及处理CSV文件的方法
#include <stdio.h> #define F_PATH "d:\myfile\file.txt" int main(void) { FILE *fp = N ...
JSON API：用 JSON 构建 API 的标准指南中文版
译文地址:https://github.com/justjavac/json-api-zh_CN 假设你和你的团队以前争论过使用什么方式构建合理 JSON 响应格式, 那么 JSON API 就是你的 ...
Youtube深度学习推荐系统论文
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf https://zh ...
MFC MFC对话框滚动条的使用
对话框的(上下/左右)滚动事件,比如,把一个比较大的对话框放入tab控件的某一页时,就需要添加滚动条.在使用了java和qt等图形界面化的集成开发环境之后,再使用MFC,就会发现,想要让一个对话框 ...
Oracle递归操作
需求:找出代理商中没有挂商家的代理商简单SQL如下: select * from t_proxy tp where tp.id not in (SELECT tp.id as p_id FROM t ...
Linux学习之十二-Linux文件属性
Linux文件属性在Linux中,对于每个文件都有相应属性,以Linux中root用户家目录下新建文件a.txt为例,在a.txt中输入几个字符使用命令ls -ild a.txt查看文件的权限等 ...
Linux学习之十一-Linux字符集及乱码处理
Linux字符集及乱码处理 1.字符(Character)是各种文字和符号的总称,包括各国家文字.标点符号.图形符号.数字等.字符集(Character set)是多个字符的集合,字符集种类较多,每个 ...
Xcode中的变量模板(variable template)的使用方法
大熊猫猪·侯佩原创或翻译作品.欢迎转载,转载请注明出处. 假设认为写的不好请多提意见,假设认为不错请多多支持点赞.谢谢! hopy ;) 你可能常常会写一些小的代码片段,里面自然少不了一些关键的变量. ...
Unity3D开发基础组件提取总结
在游戏开发过程中,除了逻辑功能的开发之外,还有非常多基础的模块.这些模块,对大部分手机网络游戏来说都是一样的.所以,在上个游戏已经上线运营大半年之际,我认为有必要将这些模块整理出来.让后面其它游戏的开 ...
tp框架where条件查询数据库
tp框架where条件查询数据库 Where 条件表达式格式为: $map['字段名'] = array('表达式', '操作条件'); 其中 $map 是一个普通的数组变量,可以根据自己需求而命名. ...

Jaccard Similarity and Shingling

Jaccard Similarity and Shingling的更多相关文章

随机推荐

热门专题