http://blog.csdn.net/pipisorry/article/details/48882167

Mining Massive Datasets (MMDS), Jure Leskovec's course: study notes on distance measures for locality-sensitive hashing (LSH)

Distance Measures

{There are many other notions of similarity (beyond Jaccard similarity) or distance, and which one to use depends on what type of data we have and what our notion of "similar" is. Besides, it is possible to combine hash functions from a family to get the S-curve effect that we saw for LSH applied to min-hash matrices. In fact, the construction is essentially the same for any LSH family. We conclude this unit by looking at some particular LSH families and how they work for the cosine distance and the Euclidean distance.}

Euclidean Distance vs. Non-Euclidean Distance

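Note: a Euclidean space is dense: given any two points, their average is also a point in the space. In a non-Euclidean space there may be no reasonable notion of the average of points. In short, averages can always be computed under Euclidean distance, but not necessarily under a non-Euclidean distance.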

Axioms of Distance Measures

Properties that a distance measure must satisfy
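The axioms themselves are not reproduced in the text of these notes. The sketch below assumes the four standard ones: d(x,y) >= 0; d(x,y) = 0 iff x = y; d(x,y) = d(y,x); and the triangle inequality d(x,y) <= d(x,z) + d(z,y). The helper name satisfies_axioms, the tolerance and the sample points are illustrative choices, not part of the course material. Passing the check on a few points does not prove that a function is a distance measure, but a single failure disproves it.

```python
from itertools import product

def satisfies_axioms(points, d, tol=1e-12):
    """Check the four assumed distance axioms on every combination of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                         # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):        # d(x, y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:   # triangle inequality
            return False
    return True

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    # Not a distance measure: it violates the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0, 0), (1, 0), (2, 0), (1, 3)]
print(satisfies_axioms(points, manhattan))          # True
print(satisfies_axioms(points, squared_euclidean))  # False
```

The squared Euclidean distance fails because going from (0, 0) to (2, 0) directly costs 4, while going through (1, 0) costs 1 + 1 = 2.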

Note: iff = if and only if [Common Latin abbreviations in English-language literature]

皮皮blog

Euclidean Distance

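Note: norms

Given a vector x = (x1, x2, ..., xn):

L1 norm: the sum of the absolute values of the components; the distance it induces is the Manhattan distance.

L2 norm: the square root of the sum of the squared components; also called the Euclidean norm, and the distance it induces is the Euclidean distance.

Lp norm: the sum of the p-th powers of the absolute values of the components, raised to the power 1/p.

L∞ norm: the largest absolute value among the components.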
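The norm definitions above translate directly into code. This is a minimal sketch in plain Python; the function names lp_norm and lp_distance and the sample points are chosen here for illustration only. The distance between two points is taken as the norm of their difference.

```python
import math

def lp_norm(x, p):
    """Lp norm of a vector; p = math.inf gives the L-infinity norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def lp_distance(x, y, p):
    """Distance induced by the Lp norm: the norm of the difference x - y."""
    return lp_norm([a - b for a, b in zip(x, y)], p)

x, y = (0, 0), (3, 4)
print(lp_distance(x, y, 1))         # Manhattan distance: 7.0
print(lp_distance(x, y, 2))         # Euclidean distance: 5.0
print(lp_distance(x, y, math.inf))  # L-infinity distance: 4
```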


Non-Euclidean Distances

  

Note:

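1. Cosine distance requires points to be vectors. If the vectors have real numbers as components, they are essentially points in a Euclidean space. But the vectors could have integer components, in which case the space is not Euclidean.

2. Edit distance can be defined in two ways: one allows a character to be substituted directly by another; the other allows only deletions and insertions, so replacing a character means deleting it and then inserting the new one.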

Non-Euclidean distances and proofs that they satisfy the axioms:

Jaccard Distance

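Note: the proof of the triangle inequality argues by contradiction. Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must hold: if neither held, i.e. both were equalities, we would have minhash(x) = minhash(y).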
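As a small illustration, the sketch below assumes the usual definition of Jaccard distance, 1 minus the size of the intersection divided by the size of the union, and checks the triangle inequality on three sample sets. The function name, the sets and the convention for two empty sets are illustrative assumptions.

```python
def jaccard_distance(a, b):
    """Jaccard distance: 1 - |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

x = {1, 2, 3, 4}
y = {2, 3, 5}
z = {3, 4, 5, 6}

d_xy = jaccard_distance(x, y)  # 1 - 2/5
d_yz = jaccard_distance(y, z)  # 1 - 2/5
d_xz = jaccard_distance(x, z)  # 1 - 2/6

# Triangle inequality on this sample: d(x, z) <= d(x, y) + d(y, z)
print(d_xz <= d_xy + d_yz)  # True
```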

Cosine Distance

Cosine distance is useful for data that is in the form of vectors. Often the vectors have a very large number of dimensions.

  

Note:

1. The length of a vector from the origin is the ordinary Euclidean distance, which we call the L2 norm.

2. No matter how many dimensions the vectors have, any two lines that intersect lie in a plane, and P1 and P2 do intersect at the origin.

3. If you project P1 onto P2, the length of the projection is the dot product divided by the length of P2. The cosine of the angle between them is then the ratio of the adjacent side (the dot product divided by the length of P2) to the hypotenuse (the length of P1).

Note: vectors here are really directions, not magnitudes. So two vectors with the same direction and different magnitudes are really the same vector. Even a vector and its negation (the reverse of the vector) ought to be thought of as the same vector.
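The projection argument in the notes above gives cos(theta) = (P1 · P2) / (|P1| |P2|), and the cosine distance is the angle theta itself. The sketch below is a minimal plain-Python version; the function name, the choice to report the angle in degrees, and the clamping step are illustrative choices.

```python
import math

def cosine_distance(p1, p2):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))  # L2 norm of p1
    norm2 = math.sqrt(sum(b * b for b in p2))  # L2 norm of p2
    cos_theta = dot / (norm1 * norm2)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

print(cosine_distance((1, 0), (0, 1)))  # perpendicular vectors: about 90 degrees
print(cosine_distance((1, 2), (2, 4)))  # same direction, different magnitude: about 0
```

The second call shows the point made in the note: scaling a vector does not change its direction, so the distance is zero up to rounding error.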

Edit Distance



Definition of a subsequence: one string is a subsequence of another if we can get the first by deleting zero or more positions from the second. The positions of the deleted characters do not have to be consecutive.

Two ways to compute the edit distance between x and y

Note: in the first way the edits can be reversed: we can get from y to x by doing the same edits in reverse, i.e. delete u and v, and then insert a to get x.
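The two computations are not spelled out in the text of these notes. The sketch below assumes the insert/delete-only definition of edit distance and uses the identity d(x, y) = |x| + |y| - 2|LCS(x, y)|, where LCS is a longest common subsequence. The example strings "abcde" and "bcduve" are an assumption chosen to match the characters a, u and v mentioned in the note.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)           # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of x or y
        prev = curr
    return prev[-1]

def edit_distance(x, y):
    """Insert/delete-only edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

# LCS is "bcde" (length 4), so the distance is 5 + 6 - 2 * 4 = 3:
# delete a, then insert u and v.
print(edit_distance("abcde", "bcduve"))  # 3
```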

Hamming Distance
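The notes give no text for this section. As a minimal sketch, assuming the standard definition (the number of positions in which two equal-length vectors differ), Hamming distance can be computed as follows. The function name and the sample bit strings are illustrative.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))

# The strings differ in positions 2, 4 and 5.
print(hamming_distance("10101", "11110"))  # 3
```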

Review

Note: distance matrix

        he      she     his     hers
he              1       3       2
she                     4       3
his                             3
