Chi Square Distance
The chi-squared distance d(x,y) is, as you already know, a distance between two histograms x=[x_1,...,x_n] and y=[y_1,...,y_n], each with n bins. Both histograms are assumed to be normalized, i.e. their entries sum to one.
The distance measure d is usually defined (although alternative definitions exist) as d(x,y) = sum( (x_i-y_i)^2 / (x_i+y_i) ) / 2, where the sum runs over the bins i=1,...,n. It is often used in computer vision to compute distances between bag-of-visual-words representations of images.
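The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `chi2_distance` and the sample histograms are made up for the example, and the handling of bins that are empty in both histograms (they are skipped and contribute zero) is an assumption, since the formula itself leaves 0/0 undefined.

```python
def chi2_distance(x, y):
    """Symmetric chi-squared distance between two normalized histograms."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for xi, yi in zip(x, y):
        denom = xi + yi
        # Assumption: a bin that is empty in both histograms contributes 0.
        if denom > 0:
            total += (xi - yi) ** 2 / denom
    return total / 2


# Illustrative histograms with four bins, each summing to one.
x = [0.5, 0.3, 0.2, 0.0]
y = [0.25, 0.25, 0.5, 0.0]
print(chi2_distance(x, y))
```

With this definition the distance of a histogram to itself is zero, and for normalized inputs the value never exceeds one.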
The name of the distance is derived from Pearson's chi-squared test statistic X²(x,y) = sum( (x_i-y_i)^2 / x_i ) for comparing discrete probability distributions (i.e. histograms). However, unlike the test statistic, d(x,y) is symmetric with respect to x and y, which is often useful in practice, e.g. when you want to construct a kernel out of the histogram distances.
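The difference in symmetry can be checked numerically. The sketch below is only illustrative: the function names and histograms are invented for the example, and the kernel exp(-gamma * d(x,y)) is one common way of turning a distance into a kernel, assumed here rather than taken from the text (gamma is a hypothetical scale parameter).

```python
import math


def pearson_statistic(x, y):
    # X²(x, y) = sum((x_i - y_i)^2 / x_i); needs x_i > 0 in every bin.
    return sum((xi - yi) ** 2 / xi for xi, yi in zip(x, y))


def chi2_distance(x, y):
    # d(x, y) = sum((x_i - y_i)^2 / (x_i + y_i)) / 2, skipping empty bins.
    return sum(
        (xi - yi) ** 2 / (xi + yi) for xi, yi in zip(x, y) if xi + yi > 0
    ) / 2


def chi2_kernel(x, y, gamma=1.0):
    # Assumed construction: exponentiate the negative distance.
    return math.exp(-gamma * chi2_distance(x, y))


x = [0.5, 0.3, 0.2]
y = [0.25, 0.25, 0.5]

# The test statistic changes when the arguments are swapped ...
print(pearson_statistic(x, y), pearson_statistic(y, x))
# ... while the distance, and hence the kernel, does not.
print(chi2_distance(x, y), chi2_distance(y, x))
print(chi2_kernel(x, y), chi2_kernel(y, x))
```

A symmetric distance yields a symmetric kernel matrix, which is what kernel methods generally expect.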
Chi-Square Distance
Consider a frequency table with n rows and p columns. From it we can calculate row profiles and column profiles, plot the n or p points corresponding to these profiles, and define distances between the points. The chi-square distance is the Euclidean distance between the components of the profiles in which each term is weighted by the inverse of its frequency. The name is derived from the fact that the mathematical expression defining the distance is identical to the one encountered in the chi-square goodness of fit test.
MATHEMATICAL ASPECTS
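Let $f_{ij}$ be the relative frequency in the ith row and jth column of a frequency table with n rows and p columns. The chi-square distance between two rows i and i' is given by

$$d(i,i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^{2}}$$

where
- $f_{i\cdot}$ is the sum of the components of the ith row;
- $f_{\cdot j}$ is the sum of the components of the jth column;
- $\left[ f_{ij} / f_{i\cdot} \right]$ is the ith row profile for j = 1,2,...,p.

Likewise, the distance between two columns j and j' is given by

$$d(j,j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i\cdot}} \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^{2}}$$

where $\left[ f_{ij} / f_{\cdot j} \right]$ is the jth column profile for i = 1,...,n.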
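As a rough sketch of how these definitions translate into code, the two functions below compute the distance between two rows or two columns of a table. The names `chi2_row_distance` and `chi2_col_distance` and the 2 x 2 table are hypothetical. The sketch assumes the table holds relative frequencies (entries summing to one) and that no row or column total is zero.

```python
import math


def chi2_row_distance(f, i, k):
    """Chi-square distance between rows i and k of a frequency table f."""
    row_sums = [sum(row) for row in f]
    col_sums = [sum(col) for col in zip(*f)]
    # Weighted Euclidean distance between the two row profiles;
    # each term is weighted by the inverse of its column total.
    return math.sqrt(sum(
        (f[i][j] / row_sums[i] - f[k][j] / row_sums[k]) ** 2 / col_sums[j]
        for j in range(len(col_sums))
    ))


def chi2_col_distance(f, j, l):
    """Chi-square distance between columns j and l of f."""
    # The columns of f are the rows of its transpose.
    transposed = [list(col) for col in zip(*f)]
    return chi2_row_distance(transposed, j, l)


# Hypothetical 2 x 2 table of relative frequencies (entries sum to one).
f = [
    [0.1, 0.2],
    [0.3, 0.4],
]
print(chi2_row_distance(f, 0, 1))
print(chi2_col_distance(f, 0, 1))
```

Defining the column distance through the transpose keeps a single implementation of the formula, since the two expressions differ only in the roles of rows and columns.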
DOMAINS AND LIMITATIONS
The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column), which increases the importance of small deviations in the rows (or columns) that have a small sum relative to those with a larger sum.
The chi-square distance has the property of distributional equivalence: the distances between rows (or between columns) are invariant when two columns (or two rows) with identical profiles are aggregated.
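Distributional equivalence can be checked on a small case. In the sketch below the table of counts is invented: its second column is exactly twice its first, so the two columns have identical profiles. The helper `row_distance` is a hypothetical name, and it assumes the distance is computed on relative frequencies, as in the example that follows.

```python
import math


def row_distance(counts, i, k):
    """Chi-square distance between rows i and k of a table of counts."""
    total = sum(sum(row) for row in counts)
    f = [[c / total for c in row] for row in counts]  # relative frequencies
    row_sums = [sum(row) for row in f]
    col_sums = [sum(col) for col in zip(*f)]
    return math.sqrt(sum(
        (f[i][j] / row_sums[i] - f[k][j] / row_sums[k]) ** 2 / col_sums[j]
        for j in range(len(col_sums))
    ))


# Hypothetical counts: column 1 is twice column 0, so both columns
# have the same column profile.
original = [
    [10, 20, 5],
    [30, 60, 7],
    [20, 40, 9],
]
# Same table after aggregating (summing) columns 0 and 1.
merged = [[row[0] + row[1], row[2]] for row in original]

for i, k in [(0, 1), (0, 2), (1, 2)]:
    before = row_distance(original, i, k)
    after = row_distance(merged, i, k)
    print(i, k, before, after)  # the two values agree up to rounding error
```

The row distances are the same before and after merging because the weight 1/f.j of the merged column compensates exactly for the larger profile differences it carries.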
EXAMPLES
Consider a contingency table recording how satisfied the employees of three different businesses are. Let us establish a distance table using the chi-square distance.
Values for the studied variable X can fall into one of three categories:
- X1: high satisfaction;
- X2: medium satisfaction;
- X3: low satisfaction.
The observations collected from samples of individuals from the three businesses are given below:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X1 | 20 | 55 | 30 | 105 |
| X2 | 18 | 40 | 15 | 73 |
| X3 | 12 | 5 | 5 | 22 |
| Total | 50 | 100 | 50 | 200 |
The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X1 | 0.1 | 0.275 | 0.15 | 0.525 |
| X2 | 0.09 | 0.2 | 0.075 | 0.365 |
| X3 | 0.06 | 0.025 | 0.025 | 0.11 |
| Total | 0.25 | 0.5 | 0.25 | 1 |
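The relative frequency table and its margins can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen for the example.

```python
# Observed counts: rows are X1, X2, X3; columns are Business 1, 2, 3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]

total = sum(sum(row) for row in counts)  # grand total of observations
rel = [[c / total for c in row] for row in counts]

row_totals = [sum(row) for row in rel]        # "Total" column of the table
col_totals = [sum(col) for col in zip(*rel)]  # "Total" row of the table

for row, row_total in zip(rel, row_totals):
    print([round(v, 3) for v in row], round(row_total, 3))
print([round(v, 3) for v in col_totals])
```

The row and column totals computed here are the weights that enter the chi-square distances between columns and between rows, respectively.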
We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix, obtained by dividing each column by its total, is given below:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X1 | 0.4 | 0.55 | 0.6 | 1.55 |
| X2 | 0.36 | 0.4 | 0.3 | 1.06 |
| X3 | 0.24 | 0.05 | 0.1 | 0.39 |
| Total | 1 | 1 | 1 | 3 |
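The distances between the columns are computed from the column profiles, weighting each term by the inverse of the corresponding row total of the relative frequency table. For businesses 1 and 2:

$$d^{2}(1,2) = \frac{(0.4-0.55)^{2}}{0.525} + \frac{(0.36-0.4)^{2}}{0.365} + \frac{(0.24-0.05)^{2}}{0.11} \approx 0.3754, \qquad d(1,2) \approx 0.613$$

We can calculate d(1,3) and d(2,3) in a similar way. The distances obtained are summarized in the following distance table: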
| | Business 1 | Business 2 | Business 3 |
|---|---|---|---|
| Business 1 | 0 | 0.613 | 0.514 |
| Business 2 | 0.613 | 0 | 0.234 |
| Business 3 | 0.514 | 0.234 | 0 |
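The distance table between the businesses can be recomputed from the observed counts. The sketch below assumes the distance is the weighted Euclidean distance between column profiles, with weights equal to the inverse row totals of the relative frequency table. The helper name `col_distance` is made up for the example.

```python
import math

# Observed counts: rows are X1, X2, X3; columns are Business 1, 2, 3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]
total = sum(sum(row) for row in counts)
f = [[c / total for c in row] for row in counts]
row_totals = [sum(row) for row in f]
col_totals = [sum(col) for col in zip(*f)]

# Column profiles: each column divided by its own total.
col_profiles = [
    [f[i][j] / col_totals[j] for i in range(len(f))]
    for j in range(len(col_totals))
]


def col_distance(j, l):
    """Chi-square distance between businesses j and l (0-based)."""
    return math.sqrt(sum(
        (col_profiles[j][i] - col_profiles[l][i]) ** 2 / row_totals[i]
        for i in range(len(row_totals))
    ))


distances = [
    [round(col_distance(j, l), 3) for l in range(3)] for j in range(3)
]
for row in distances:
    print(row)
```

Rounded to three decimals, the off-diagonal entries agree with the table: 0.613, 0.514 and 0.234.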
We can also calculate the distances between the rows, in other words the differences between the degrees of employee satisfaction. To do this we need the row profile table, obtained by dividing each row by its total:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X1 | 0.19 | 0.524 | 0.286 | 1 |
| X2 | 0.246 | 0.548 | 0.206 | 1 |
| X3 | 0.546 | 0.227 | 0.227 | 1 |
| Total | 0.982 | 1.299 | 0.719 | 3 |
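The distances between the rows are computed from the row profiles, weighting each term by the inverse of the corresponding column total of the relative frequency table. For satisfaction levels X1 and X2:

$$d^{2}(1,2) = \frac{(0.19-0.246)^{2}}{0.25} + \frac{(0.524-0.548)^{2}}{0.5} + \frac{(0.286-0.206)^{2}}{0.25} \approx 0.0393, \qquad d(1,2) \approx 0.198$$

We can calculate d(1,3) and d(2,3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table: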
| | X1 | X2 | X3 |
|---|---|---|---|
| X1 | 0 | 0.198 | 0.835 |
| X2 | 0.198 | 0 | 0.754 |
| X3 | 0.835 | 0.754 | 0 |
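The row distances can be checked in the same way. The sketch below computes them twice: once from the row profiles as printed (rounded to three decimals) and once from unrounded profiles derived from the counts. The printed table appears to correspond to the rounded profiles. Unrounded profiles give slightly different values, roughly 0.199, 0.833 and 0.752. The names used are illustrative only.

```python
import math

# Column totals of the relative frequency table, used as weights.
col_totals = [0.25, 0.5, 0.25]

# Row profiles as printed in the row profile table (rounded).
rounded_profiles = {
    "X1": [0.19, 0.524, 0.286],
    "X2": [0.246, 0.548, 0.206],
    "X3": [0.546, 0.227, 0.227],
}

# Unrounded row profiles recomputed from the observed counts.
counts = {"X1": [20, 55, 30], "X2": [18, 40, 15], "X3": [12, 5, 5]}
exact_profiles = {
    name: [c / sum(row) for c in row] for name, row in counts.items()
}


def distance(p, q, weights):
    """Weighted Euclidean distance between two profiles."""
    return math.sqrt(sum(
        (a - b) ** 2 / w for a, b, w in zip(p, q, weights)
    ))


for a, b in [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]:
    from_rounded = distance(rounded_profiles[a], rounded_profiles[b], col_totals)
    from_exact = distance(exact_profiles[a], exact_profiles[b], col_totals)
    print(a, b, round(from_rounded, 3), round(from_exact, 3))
```

The gap between the two variants is at most a few thousandths, so it does not change the conclusion that X1 and X2 are close to each other and both far from X3.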
http://www.researchgate.net/post/What_is_chi-squared_distance_I_need_help_with_the_source_code
http://www.springerreference.com/docs/html/chapterdbid/60817.html