Mining Massive Datasets (MMDS) week 2: Distance Measures for LSH
http://blog.csdn.net/pipisorry/article/details/48882167
Study notes on Jure Leskovec's Mining Massive Datasets (MMDS) course: distance measures for locality-sensitive hashing (LSH)
Distance Measures
{There are many other notions of similarity (beyond Jaccard similarity) or distance, and which one to use depends on what type of data we have and what our notion of "similar" is. Besides, it is possible to combine hash functions from a family to get the S-curve effect that we saw for LSH applied to min-hash matrices. In fact, the construction is essentially the same for any LSH family. We'll conclude this unit by seeing some particular LSH families and how they work for the cosine distance and the Euclidean distance.}
Euclidean distance vs. non-Euclidean distance
Note: a Euclidean space is dense: given any two points, their average is also a point in the space. In a non-Euclidean space there is no reasonable notion of the average of points. In other words, an average can always be computed under a Euclidean distance, but not necessarily under a non-Euclidean one.
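Axioms of Distance Measures
Properties that a distance measure must satisfy
Note: iff = if and only if [A summary of common Latin abbreviations in English-language literature (the most common ones in red)]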
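For reference, the axioms are usually stated as below. This is the standard textbook formulation rather than a copy of the course slide; the symbols d, x, y and z are notation assumed here. In order: non-negativity, identity (this is where "iff" appears), symmetry, and the triangle inequality.

```latex
\begin{align*}
& d(x, y) \ge 0 \\
& d(x, y) = 0 \iff x = y \\
& d(x, y) = d(y, x) \\
& d(x, y) \le d(x, z) + d(z, y)
\end{align*}
```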
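Euclidean distance
Note: norms
Given a vector x = (x1, x2, ..., xn):
L1 norm: the sum of the absolute values of the components; the corresponding distance is the Manhattan distance.
L2 norm: the square root of the sum of the squares of the components, also called the Euclidean norm; the corresponding distance is the Euclidean distance.
Lp norm: the sum of the p-th powers of the absolute values of the components, raised to the power 1/p.
L∞ norm: the largest absolute value among the components.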
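The norms above become distances when applied to the difference of two vectors. A minimal sketch, assuming plain Python sequences as vectors; the function name lp_distance and the sample points are illustrative and not from the course:

```python
import math

def lp_distance(x, y, p):
    """L_p distance: the L_p norm of the difference vector x - y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # L-infinity norm: the largest absolute component.
        return max(diffs)
    # General case: sum of p-th powers, then the 1/p-th power.
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (5, 5), (9, 8)  # illustrative points; the differences are 4 and 3
print(lp_distance(x, y, 1))         # Manhattan distance: 7.0
print(lp_distance(x, y, 2))         # Euclidean distance: 5.0
print(lp_distance(x, y, math.inf))  # L-infinity distance: 4
```

As p grows, the largest component difference dominates the sum, which is why the L∞ case reduces to a maximum.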
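Non-Euclidean distance
Note:
1. Cosine distance requires points to be vectors. If the vectors have real numbers as components, they are essentially points in a Euclidean space; but the vectors could have integer components, in which case the space is not Euclidean.
2. Edit distance can be defined in two ways: one allows a character to be replaced directly by another; the other allows only deleting a character and then inserting another one.
Non-Euclidean distances and proofs that they satisfy the axioms:
Jaccard distance
Note: the proof of the triangle inequality is by contradiction. Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must hold; if neither held, both would be equalities, and then minhash(x) = minhash(y).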
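The Jaccard distance is 1 minus the Jaccard similarity, |A ∩ B| / |A ∪ B|. The sketch below computes it on Python sets and spot-checks the triangle inequality on three sample sets. The sets and the function name are illustrative assumptions, and a spot check is of course not a proof.

```python
from itertools import permutations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as distance 0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Illustrative sets, not taken from the course material.
x, y, z = {1, 2, 3, 4}, {2, 3, 5}, {3, 4, 5, 6}

# Check d(p, r) <= d(p, q) + d(q, r) for every ordering of the three sets.
for p, q, r in permutations([x, y, z]):
    assert jaccard_distance(p, r) <= jaccard_distance(p, q) + jaccard_distance(q, r) + 1e-12

print(jaccard_distance(x, y))  # intersection {2, 3}, union of size 5
```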
Cosine distance
Cosine distance is useful for data that is in the form of vectors. Often the vectors have very high dimension.
Note:
1. The length of a vector from the origin is the ordinary Euclidean distance, which we call the L2 norm.
2. No matter how many dimensions the vectors have, any two lines that intersect lie in a plane, and P1 and P2 do intersect, at the origin; so the angle between them can be measured in that plane.
3. If you project P1 onto P2, the length of the projection is the dot product divided by the length of P2. The cosine of the angle between them is then the ratio of the adjacent side (the dot product divided by the length of P2) to the hypotenuse (the length of P1).
Note: vectors here are really directions, not magnitudes. So two vectors with the same direction and different magnitudes are really the same vector. Even a vector and its negation (the reverse of the vector) ought to be thought of as the same vector.
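The projection argument above gives cos(θ) = (P1 · P2) / (|P1| |P2|), and the cosine distance is the angle θ itself. A minimal sketch, assuming the angle is reported in degrees; the function name and sample vectors are illustrative:

```python
import math

def cosine_distance(p1, p2):
    """Angle between p1 and p2, in degrees."""
    dot = sum(a * b for a, b in zip(p1, p2))
    len1 = math.sqrt(sum(a * a for a in p1))  # L2 norm of p1
    len2 = math.sqrt(sum(b * b for b in p2))  # L2 norm of p2
    if len1 == 0 or len2 == 0:
        raise ValueError("the zero vector has no direction")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (len1 * len2)))
    return math.degrees(math.acos(cos_theta))

print(cosine_distance((1, 0), (1, 1)))  # roughly 45 degrees
print(cosine_distance((2, 0), (1, 1)))  # same direction as (1, 0): roughly 45 again
print(cosine_distance((1, 0), (0, 1)))  # orthogonal vectors: roughly 90 degrees
```

Scaling either vector leaves the result unchanged, which matches the remark that only directions matter.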
Edit distance
Definition of a subsequence: one string is a subsequence of another if we can get the first by deleting zero or more positions from the second. The positions of the deleted characters do not have to be consecutive.
Two ways to compute the edit distance between x and y
Note: in the first approach the edits can be reversed: we can get from y to x by doing the same edits in reverse, that is, delete u and v and then insert a to get x.
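When only insertions and deletions are allowed, the edit distance can also be obtained from the longest common subsequence (LCS): |x| + |y| - 2·|LCS(x, y)|. The sketch below uses that identity. The strings "abcde" and "bcduve" are assumed for illustration, chosen to match the a, u and v mentioned in the note; the function names are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def edit_distance(x, y):
    """Insert/delete edit distance: every character outside the LCS costs one edit."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

x, y = "abcde", "bcduve"    # assumed example strings
print(lcs_length(x, y))     # 4, the LCS is "bcde"
print(edit_distance(x, y))  # 3: delete a, insert u, insert v
```

The distance is symmetric because the LCS is the same in both directions, which is the reversibility the note points out.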
Hamming distance
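The Hamming distance is commonly defined as the number of positions in which two equal-length vectors differ. A minimal sketch under that definition; the bit strings are illustrative:

```python
def hamming_distance(u, v):
    """Number of positions at which u and v differ; lengths must match."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(a != b for a, b in zip(u, v))

print(hamming_distance("10101", "11110"))  # 3: positions 2, 4 and 5 differ
```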
Review
Note: distance matrix
       she   his   hers
he     1     3     2
she    -     4     3
his    -     -     3
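The values in the matrix are consistent with the insert/delete edit distance. The notes do not say which distance was used, so this is an assumption; it can be checked with a direct dynamic program over the edits. The function name indel_distance is hypothetical.

```python
def indel_distance(x, y):
    """Edit distance with insertions and deletions only (no substitutions)."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1])  # matching characters cost nothing
            else:
                curr.append(1 + min(prev[j], curr[j - 1]))  # delete or insert
        prev = curr
    return prev[-1]

words = ["he", "she", "his", "hers"]
for i, w in enumerate(words[:-1]):
    print(w, {v: indel_distance(w, v) for v in words[i + 1:]})
# he {'she': 1, 'his': 3, 'hers': 2}
# she {'his': 4, 'hers': 3}
# his {'hers': 3}
```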