http://blog.csdn.net/pipisorry/article/details/48882167

Mining Massive Datasets (MMDs), Jure Leskovec course study notes: distance measures for locality-sensitive hashing (LSH)

Distance Measures

{There are many other notions of similarity (beyond Jaccard similarity) or distance, and which one to use depends on what type of data we have and what our notion of "similar" is. Besides, it is possible to combine hash functions from a family to get the S-curve effect that we saw for LSH applied to min-hash matrices. In fact, the construction is essentially the same for any LSH family. We'll conclude this unit by seeing some particular LSH families and how they work for the cosine distance and the Euclidean distance.}

Euclidean distance vs. non-Euclidean distance

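Note: a Euclidean space is dense: given any two points, their average is also a point in the space. In a non-Euclidean space there may be no reasonable notion of the average of points. In other words, an average can always be computed in a Euclidean space, but not necessarily in a non-Euclidean one.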

Axioms of Distance Measures

Properties that a distance measure must satisfy

Note: iff = if and only if [see: A summary of Latin abbreviations common in English-language literature (the most common ones in red)]
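The usual axioms for a distance measure d are: d(x, y) >= 0; d(x, y) = 0 iff x = y; d(x, y) = d(y, x); and the triangle inequality d(x, y) <= d(x, z) + d(z, y). The sketch below is not from the course; it is a small brute-force check of these four axioms on a finite sample of points. The names `check_distance_axioms`, `manhattan`, `squared_euclidean`, and `sample` are made up for illustration. Passing the check on a sample does not prove that a function is a distance measure, but a single violation does disprove it.

```python
from itertools import product


def check_distance_axioms(d, points, tol=1e-12):
    """Return the axiom violations of d found on a finite sample of points."""
    violations = []
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        if dxy < -tol:
            violations.append(("non-negativity", x, y))
        # d(x, y) must be zero if and only if x == y.
        if (abs(dxy) <= tol) != (x == y):
            violations.append(("zero iff equal", x, y))
        if abs(dxy - d(y, x)) > tol:
            violations.append(("symmetry", x, y))
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violations.append(("triangle inequality", x, y, z))
    return violations


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    # Not a distance measure: it breaks the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))


sample = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(check_distance_axioms(manhattan, sample))               # empty list
print(len(check_distance_axioms(squared_euclidean, sample)))  # at least one violation
```

For example, the squared Euclidean "distance" from (0, 0) to (2, 0) is 4, which exceeds the sum 1 + 1 obtained by going through (1, 0).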

皮皮blog

Euclidean distance

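Note: norms

Given a vector x = (x1, x2, ..., xn):

L1 norm: the sum of the absolute values of the components. The distance it induces is the Manhattan distance.

L2 norm: the square root of the sum of the squares of the components, also called the Euclidean norm. The distance it induces is the Euclidean distance.

Lp norm: the sum of the p-th powers of the absolute values of the components, raised to the power 1/p.

L∞ norm: the largest absolute value among the components.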
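The norms above turn into distances by applying them to the difference x - y. A minimal sketch in plain Python (the function names `lp_norm` and `lp_distance` are chosen here for illustration):

```python
import math


def lp_norm(x, p):
    """Lp norm of a vector; p = math.inf gives the L-infinity norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def lp_distance(x, y, p):
    """Distance induced by the Lp norm: the norm of the difference x - y."""
    return lp_norm([a - b for a, b in zip(x, y)], p)


x, y = (0.0, 0.0), (3.0, 4.0)
print(lp_distance(x, y, 1))         # Manhattan distance: 7.0
print(lp_distance(x, y, 2))         # Euclidean distance: 5.0
print(lp_distance(x, y, math.inf))  # L-infinity distance: 4.0
```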


Non-Euclidean distances

  

Note:

1. Cosine distance requires points to be vectors. If the vectors have real numbers as components, then they are essentially points in a Euclidean space. But the vectors could have integer components, in which case the space is not Euclidean.

2. There are two versions of edit distance: one allows replacing a character directly with another character; the other allows only deleting a character and then inserting another one.

Non-Euclidean distances and proofs that they satisfy the axioms:

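Jaccard distance

Note: the proof of the triangle inequality is by contradiction. The Jaccard distance d(x, y) is the probability that minhash(x) ≠ minhash(y). Whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must hold: if neither held, that is, if both were equalities, we would have minhash(x) = minhash(y).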
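As a rough illustration, the Jaccard distance 1 - |x ∩ y| / |x ∪ y| can be compared with the exact fraction of permutations under which the two min-hash values differ. The sketch below enumerates every permutation of a tiny universe, so it is only usable for toy inputs. The function names and the example sets are made up here, and treating two empty sets as being at distance 0 is an assumption.

```python
from fractions import Fraction
from itertools import permutations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|, returned as an exact fraction."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return Fraction(0)  # assumption: two empty sets count as identical
    return 1 - Fraction(len(a & b), len(union))


def minhash_disagreement(a, b, universe):
    """Exact fraction of permutations of `universe` under which the
    min-hash values of the non-empty sets a and b differ."""
    differ = total = 0
    for perm in permutations(universe):
        rank = {elem: i for i, elem in enumerate(perm)}
        total += 1
        # The min-hash of a set is the smallest rank among its elements.
        if min(rank[e] for e in a) != min(rank[e] for e in b):
            differ += 1
    return Fraction(differ, total)


a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
print(jaccard_distance(a, b))                   # 2/3
print(minhash_disagreement(a, b, range(1, 7)))  # 2/3
```

The two values agree because the min-hash values coincide exactly when the first element of the union, in permuted order, belongs to the intersection.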

Cosine distance

Cosine distance is useful for data that is in the form of vectors. Often the vectors have very many dimensions.

  

Note:

1. The length of a vector from the origin is the ordinary Euclidean distance, which we call the L2 norm.

2. No matter how many dimensions the vectors have, any two lines that intersect define a plane, and P1 and P2 do intersect at the origin.

3. If you project P1 onto P2, the length of the projection is the dot product divided by the length of P2. The cosine of the angle between them is then the ratio of the adjacent side (the dot product divided by the length of P2) to the hypotenuse (the length of P1).

Note: vectors here are really directions, not magnitudes. So two vectors with the same direction and different magnitudes are really the same vector. Even a vector and its negation, the reverse of the vector, ought to be thought of as the same vector.
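The projection argument above gives cos(theta) = (P1 · P2) / (|P1| |P2|), and the cosine distance is the angle theta itself. A minimal sketch, assuming non-zero vectors and reporting the angle in degrees (the function name `cosine_distance` is chosen here for illustration):

```python
import math


def cosine_distance(p1, p2):
    """Angle between p1 and p2 in degrees (both vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(p1, p2))
    len1 = math.sqrt(sum(a * a for a in p1))  # L2 norm of p1
    len2 = math.sqrt(sum(b * b for b in p2))  # L2 norm of p2
    cos_theta = dot / (len1 * len2)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


print(cosine_distance((1, 0), (0, 1)))  # perpendicular vectors: about 90 degrees
print(cosine_distance((1, 2), (2, 4)))  # same direction: about 0 degrees
```

The second call shows that scaling a vector does not change the result, up to floating-point rounding.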

Edit distance



Definition of a subsequence: one string is a subsequence of another if we can get the first by deleting zero or more positions from the second. The positions of the deleted characters do not have to be consecutive.

Two ways to compute the edit distance between x and y

Note: in the first way, the edits can also be applied in reverse: we can get from y to x by doing the same edits in reverse. Delete u and v, and then insert a to get x.
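For the insert/delete version of edit distance, one standard way to compute it uses the longest common subsequence (LCS): d(x, y) = |x| + |y| - 2 |LCS(x, y)|. The notes do not spell out the formula, so treat this sketch as an assumption about what the second way refers to. The example strings "abcde" and "bcduve" are also an assumption, chosen to match the edits mentioned in the note (delete u and v, insert a).

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


# LCS("abcde", "bcduve") is "bcde", so the distance is 5 + 6 - 2 * 4.
print(edit_distance("abcde", "bcduve"))  # 3
```

Every character outside the LCS has to be deleted from x or inserted from y, which is why the two lengths minus twice the LCS length counts the edits.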

Hamming distance
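The Hamming distance between two vectors of equal length is the number of positions in which they differ. A minimal sketch (the function name and the example bit strings are chosen here for illustration):

```python
def hamming_distance(u, v):
    """Number of positions at which the equal-length sequences u and v differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(u, v))


# The strings differ in positions 2, 4 and 5.
print(hamming_distance("10101", "11110"))  # 3
```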

Review

Note: distance matrix

        he    she   his   hers
he            1     3     2
she                 4     3
his                       3


ref: Distance and similarity measures
