The chi-squared distance d(x,y) is, as you already know, a distance between two histograms x=[x_1,...,x_n] and y=[y_1,...,y_n], each with n bins. Both histograms are assumed to be normalized, i.e. their entries sum to one.
The distance d is usually defined (although alternative definitions exist) as d(x,y) = sum( (x_i-y_i)^2 / (x_i+y_i) ) / 2. It is often used in computer vision to compute distances between bag-of-visual-words representations of images.
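
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `chi2_distance` and the example histograms are made up for this sketch, and skipping bins that are empty in both histograms is an assumption (the formula itself is undefined there).

```python
def chi2_distance(x, y):
    """Chi-squared distance between two normalized histograms with the same bins."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for xi, yi in zip(x, y):
        denom = xi + yi
        # Assumption: a bin that is empty in both histograms contributes nothing.
        if denom > 0:
            total += (xi - yi) ** 2 / denom
    return total / 2.0


# Hypothetical normalized histograms with three bins each.
x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
d = chi2_distance(x, y)
print(f"d(x, y) = {d:.4f}")
```

Identical histograms give a distance of zero, and the value grows as mass moves between bins.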

The name of the distance is derived from Pearson's chi-squared test statistic X²(x,y) = sum( (x_i-y_i)^2 / x_i ) for comparing discrete probability distributions (i.e. histograms). However, unlike the test statistic, d(x,y) is symmetric with respect to x and y, which is often useful in practice, e.g. when you want to construct a kernel out of the histogram distances.
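
The difference in symmetry can be checked numerically. The sketch below compares the statistic as written above with the distance d, and then builds a kernel from d. The exponential form `exp(-gamma * d)` and the parameter `gamma` are assumptions chosen for illustration (one common way to turn a distance into a kernel), not something the text prescribes; all function names are hypothetical.

```python
import math


def pearson_statistic(x, y):
    # X^2(x, y) = sum((x_i - y_i)^2 / x_i), as written in the text.
    # Assumption: bins with x_i = 0 are skipped to avoid division by zero.
    return sum((xi - yi) ** 2 / xi for xi, yi in zip(x, y) if xi > 0)


def chi2_distance(x, y):
    # d(x, y) = sum((x_i - y_i)^2 / (x_i + y_i)) / 2
    return 0.5 * sum(
        (xi - yi) ** 2 / (xi + yi) for xi, yi in zip(x, y) if xi + yi > 0
    )


def chi2_kernel(x, y, gamma=1.0):
    # Hypothetical kernel built from the distance: exp(-gamma * d(x, y)).
    return math.exp(-gamma * chi2_distance(x, y))


# Hypothetical normalized histograms.
x = [0.7, 0.2, 0.1]
y = [0.4, 0.4, 0.2]

print(pearson_statistic(x, y), pearson_statistic(y, x))  # the two values differ
print(chi2_distance(x, y), chi2_distance(y, x))          # the two values agree
print(chi2_kernel(x, y), chi2_kernel(y, x))              # symmetric as well
```

Because d is symmetric, the resulting kernel matrix is symmetric too, which the asymmetric test statistic would not guarantee.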

Chi-Square Distance

Consider a frequency table with n rows and p columns. From it we can calculate row profiles and column profiles, and plot the n or p points given by each set of profiles. We can then define distances between these points. The Euclidean distance between the components of the profiles, weighted so that each term has a weight equal to the inverse of its frequency, is called the chi-square distance. The name comes from the fact that the mathematical expression defining the distance is identical to the one encountered in the chi-square goodness of fit test.

MATHEMATICAL ASPECTS

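Let $f_{ij}$ be the relative frequency of the ith row and jth column in a frequency table with n rows and p columns. The chi-square distance between two rows i and i' is given by the formula:

$$d(i,i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^{2}}$$

where

$f_{i\cdot}$ is the sum of the components of the ith row;
$f_{\cdot j}$ is the sum of the components of the jth column;
$f_{ij}/f_{i\cdot}$, for j = 1,2,...,p, is the ith row profile.
Likewise, the distance between two columns j and j' is given by:

$$d(j,j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i\cdot}} \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^{2}}$$

where $f_{ij}/f_{\cdot j}$, for i = 1,...,n, is the jth column profile.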
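
As a rough sketch of how the row and column distances could be computed, the functions below take a table of counts stored as a list of lists. The names `row_chi2_distance` and `col_chi2_distance` and the 2x2 table are hypothetical, and the code assumes no row or column sums to zero. The column distance reuses the row distance on the transposed table, since the two cases are mirror images of each other.

```python
import math


def row_chi2_distance(table, i, k):
    """Chi-square distance between rows i and k of a table of counts."""
    total = sum(sum(row) for row in table)
    # Relative column sums: the weights applied to each squared difference.
    col_mass = [sum(col) / total for col in zip(*table)]
    row_i_sum = sum(table[i])
    row_k_sum = sum(table[k])
    acc = 0.0
    for j, mass in enumerate(col_mass):
        # Difference between the two row profiles in column j.
        diff = table[i][j] / row_i_sum - table[k][j] / row_k_sum
        acc += diff ** 2 / mass
    return math.sqrt(acc)


def col_chi2_distance(table, j, l):
    """Chi-square distance between columns j and l (rows of the transpose)."""
    transposed = [list(col) for col in zip(*table)]
    return row_chi2_distance(transposed, j, l)


# Hypothetical 2 x 2 table of counts.
table = [
    [10, 20],
    [30, 40],
]
print(row_chi2_distance(table, 0, 1))
print(col_chi2_distance(table, 0, 1))
```

Profiles are unchanged whether the table holds counts or relative frequencies, so only the weights need the division by the grand total.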

DOMAINS AND LIMITATIONS

The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column). This increases the importance of small deviations in rows (or columns) that have a small sum, relative to those with a larger sum.

The chi-square distance has the property of distributional equivalence, meaning that it ensures that the distances between rows and columns are invariant when two columns (or two rows) with identical profiles are aggregated.
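
Distributional equivalence can be illustrated numerically. In the hypothetical table below, the second column is exactly twice the first, so the two columns have identical profiles. Under that assumption, merging them should leave all row distances unchanged. The helper names `row_distance` and `merge_columns` are made up for this sketch.

```python
import math


def row_distance(table, i, k):
    """Chi-square distance between rows i and k of a table of counts."""
    total = sum(sum(row) for row in table)
    col_mass = [sum(col) / total for col in zip(*table)]
    si, sk = sum(table[i]), sum(table[k])
    return math.sqrt(
        sum(
            (table[i][j] / si - table[k][j] / sk) ** 2 / mass
            for j, mass in enumerate(col_mass)
        )
    )


def merge_columns(table, a, b):
    """Return a copy of the table with column b added into column a."""
    merged = []
    for row in table:
        new_row = [v for j, v in enumerate(row) if j != b]
        # Column a shifts left by one if the removed column came before it.
        new_row[a if a < b else a - 1] += row[b]
        merged.append(new_row)
    return merged


# Hypothetical counts: column 1 is twice column 0, so their profiles match.
table = [
    [10, 20, 5],
    [30, 60, 10],
    [20, 40, 25],
]
merged = merge_columns(table, 0, 1)

for i, k in [(0, 1), (0, 2), (1, 2)]:
    print(i, k, row_distance(table, i, k), row_distance(merged, i, k))
```

Each printed pair of distances should agree up to floating-point error.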

EXAMPLES

Consider a contingency table charting how satisfied employees working for three different businesses are. Let us establish a distance table using the chi-square distance.

Values for the studied variable X can fall into one of three categories:

  • 1: high satisfaction;
  • 2: medium satisfaction;
  • 3: low satisfaction.

The observations collected from samples of individuals from the three businesses are given below:

 

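| Satisfaction level | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| 1 | 20 | 55 | 30 | 105 |
| 2 | 18 | 40 | 15 | 73 |
| 3 | 12 | 5 | 5 | 22 |
| Total | 50 | 100 | 50 | 200 |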

The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:

 

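| Satisfaction level | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| 1 | 0.1 | 0.275 | 0.15 | 0.525 |
| 2 | 0.09 | 0.2 | 0.075 | 0.365 |
| 3 | 0.06 | 0.025 | 0.025 | 0.11 |
| Total | 0.25 | 0.5 | 0.25 | 1 |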

We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix is given below:

 

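| Satisfaction level | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| 1 | 0.4 | 0.55 | 0.6 | 1.55 |
| 2 | 0.36 | 0.4 | 0.3 | 1.06 |
| 3 | 0.24 | 0.05 | 0.1 | 0.39 |
| Total | 1 | 1 | 1 | 3 |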
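
The relative frequencies and column profiles can be derived from the observed counts as follows. This is a small sketch using plain Python lists; the variable names are chosen for illustration only.

```python
# Observed counts from the example: rows are satisfaction levels 1 to 3,
# columns are businesses 1 to 3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]

grand_total = sum(sum(row) for row in counts)
col_totals = [sum(col) for col in zip(*counts)]

# Relative frequencies: every cell divided by the grand total.
rel_freq = [[v / grand_total for v in row] for row in counts]

# Column profiles: every cell divided by the total of its own column.
col_profiles = [
    [v / col_totals[j] for j, v in enumerate(row)] for row in counts
]

for row in col_profiles:
    print("  ".join(f"{v:.2f}" for v in row))
```

Each column of `col_profiles` sums to one, matching the "Total" row of the column profile matrix.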

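This allows us to calculate the distances between the different columns. For businesses 1 and 2, using the row totals of the relative frequency table as weights:

$$d^{2}(1,2) = \frac{(0.4-0.55)^{2}}{0.525} + \frac{(0.36-0.4)^{2}}{0.365} + \frac{(0.24-0.05)^{2}}{0.11} \approx 0.375, \qquad d(1,2) \approx 0.613$$

We can calculate d(1,3) and d(2,3) in a similar way. The distances obtained are summarized in the following distance table: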

 

| | Business 1 | Business 2 | Business 3 |
|---|---|---|---|
| Business 1 | 0 | 0.613 | 0.514 |
| Business 2 | 0.613 | 0 | 0.234 |
| Business 3 | 0.514 | 0.234 | 0 |
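
The distances between businesses can be recomputed from the raw counts with a short script. This is a sketch under the weighting described in this example (squared differences between column profiles, each divided by the relative row total); the function name `column_distance_table` is hypothetical.

```python
import math

# Observed counts: rows are satisfaction levels, columns are businesses.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]


def column_distance_table(table):
    """Pairwise chi-square distances between the columns of a count table."""
    total = sum(sum(row) for row in table)
    row_mass = [sum(row) / total for row in table]  # relative row totals
    col_sums = [sum(col) for col in zip(*table)]
    p = len(col_sums)
    dist = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(a + 1, p):
            acc = 0.0
            for i, row in enumerate(table):
                # Difference between the two column profiles in row i.
                diff = row[a] / col_sums[a] - row[b] / col_sums[b]
                acc += diff ** 2 / row_mass[i]
            dist[a][b] = dist[b][a] = math.sqrt(acc)
    return dist


dist = column_distance_table(counts)
for row in dist:
    print("  ".join(f"{d:.3f}" for d in row))
```

The printed matrix should agree with the distance table between businesses to three decimals.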

We can also calculate the distances between the rows, in other words the differences between the degrees of employee satisfaction; to do this we need the row profile table:

 

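| Satisfaction level | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| 1 | 0.19 | 0.524 | 0.286 | 1 |
| 2 | 0.246 | 0.548 | 0.206 | 1 |
| 3 | 0.546 | 0.227 | 0.227 | 1 |
| Total | 0.982 | 1.299 | 0.719 | 3 |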

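This allows us to calculate the distances between the different rows. For satisfaction levels 1 and 2, using the column totals of the relative frequency table as weights:

$$d^{2}(1,2) = \frac{(0.19-0.246)^{2}}{0.25} + \frac{(0.524-0.548)^{2}}{0.5} + \frac{(0.286-0.206)^{2}}{0.25} \approx 0.0393, \qquad d(1,2) \approx 0.198$$

We can calculate d(1,3) and d(2,3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table: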

 

| Satisfaction level | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0 | 0.198 | 0.835 |
| 2 | 0.198 | 0 | 0.754 |
| 3 | 0.835 | 0.754 | 0 |
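
The row distances can be checked directly from the printed row profiles. The sketch below uses the profile values exactly as they appear in the row profile table, together with the column totals of the relative frequency table as weights. The helper name `profile_distance` is hypothetical. Recomputing the profiles from the raw counts without rounding may shift the third decimal slightly.

```python
import math

# Row profiles as printed in the row profile table (rounded values).
row_profiles = {
    1: [0.190, 0.524, 0.286],
    2: [0.246, 0.548, 0.206],
    3: [0.546, 0.227, 0.227],
}

# Column totals of the relative frequency table, used as weights.
col_mass = [0.25, 0.5, 0.25]


def profile_distance(p, q, mass):
    """Weighted Euclidean distance between two profiles (chi-square distance)."""
    return math.sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, mass)))


distances = {
    (i, k): profile_distance(row_profiles[i], row_profiles[k], col_mass)
    for i, k in [(1, 2), (1, 3), (2, 3)]
}

for (i, k), d in distances.items():
    print(f"d({i},{k}) = {d:.3f}")
```

The same helper applies to column profiles if the weights are replaced by the relative row totals.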

http://www.researchgate.net/post/What_is_chi-squared_distance_I_need_help_with_the_source_code

http://www.springerreference.com/docs/html/chapterdbid/60817.html
