ID3/C4.5/Gini Index

1 ID3

Select the attribute with the highest information gain.

1.1 Formula

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
The expected information needed to classify a tuple in D is:
$$Info(D) = -\sum\limits_{i=1}^n p_i\log_2 p_i$$
The information needed to classify D after using A to split D into v partitions is:
$$Info_A(D) = \sum\limits_{j=1}^v\frac{|D_j|}{|D|}Info(D_j)$$
So the information gained by branching on attribute A is:
$$Gain(A) = Info(D) - Info_A(D)$$

1.2 Example

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Class P: buys_computer = "yes"
Class N: buys_computer = "no"
The number of class P tuples is 9, while the number of class N tuples is 5.

So, $$Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

For the age <= 30 partition, the number of class P tuples is 2, while the number of class N tuples is 3.
So, $$Info(D_{\le 30}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$
Similarly,
$$Info(D_{31...40}) = 0$$, $$Info(D_{>40}) = 0.971$$
Then, $$Info_{age}(D) = \frac{5}{14}Info(D_{\le 30}) + \frac{4}{14}Info(D_{31...40}) + \frac{5}{14}Info(D_{>40}) = 0.694$$
Therefore, $$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
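As a concrete check, the gain computation above can be sketched in Python (a minimal illustration; the function names and the encoding of the table as tuples are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for a discrete attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a

# The 14 training tuples from the table: (age, income, student, credit_rating)
rows = [
    ("<=30", "high", "no", "fair"), ("<=30", "high", "no", "excellent"),
    ("31...40", "high", "no", "fair"), (">40", "medium", "no", "fair"),
    (">40", "low", "yes", "fair"), (">40", "low", "yes", "excellent"),
    ("31...40", "low", "yes", "excellent"), ("<=30", "medium", "no", "fair"),
    ("<=30", "low", "yes", "fair"), (">40", "medium", "yes", "fair"),
    ("<=30", "medium", "yes", "excellent"), ("31...40", "medium", "no", "excellent"),
    ("31...40", "high", "yes", "fair"), (">40", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(f"{entropy(labels):.3f}")             # Info(D)   -> 0.940
print(f"{info_gain(rows, labels, 0):.3f}")  # Gain(age) -> 0.247 (0.246 above comes from rounded intermediates)
```

Since age has the largest gain of the four attributes, ID3 would split on age first.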

1.3 Question

What if the attribute's values are continuous? How can we handle them?
1. Find the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$.
- The point with the minimum expected information requirement for A is selected as the split point for A.
2. Split:
- D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.
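The midpoint search above can be sketched as follows (the data and names here are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the midpoint (a_i + a_{i+1}) / 2 that minimizes Info_A(D)."""
    pairs = sorted(zip(values, labels))          # step 1: sort A in increasing order
    n = len(pairs)
    best_info, best_mid = float("inf"), None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # equal neighbors give no new midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [lab for v, lab in pairs if v <= mid]   # A <= split-point
        d2 = [lab for v, lab in pairs if v > mid]    # A > split-point
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if info < best_info:
            best_info, best_mid = info, mid
    return best_mid

ages = [25, 28, 35, 45, 50, 23, 40, 60]          # invented continuous attribute
buys = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
print(best_split_point(ages, buys))              # -> 31.5 (midpoint of 28 and 35)
```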

2 C4.5

C4.5 is a successor of ID3.

2.1 Formula

$$SplitInfo_A(D) = -\sum\limits_{j=1}^v\frac{|D_j|}{|D|}\log_2\frac{|D_j|}{|D|}$$
Then the gain ratio is
$$GainRatio(A) = Gain(A)/SplitInfo_A(D)$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
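The gain ratio can be sketched like this (a minimal illustration using the income column of the example table; the function and variable names are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D) for a discrete attribute A."""
    n = len(labels)
    parts = {}
    for v, lab in zip(values, labels):
        parts.setdefault(v, []).append(lab)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    gain = entropy(labels) - info_a
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info else 0.0

# income column and class labels from the example table (4 high, 6 medium, 4 low)
incomes = ["high", "high", "high", "medium", "low", "low", "low",
           "medium", "low", "medium", "medium", "medium", "high", "medium"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(f"{gain_ratio(incomes, labels):.3f}")  # SplitInfo = 1.557, GainRatio(income) -> 0.019
```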

3 Gini Index

3.1 Formula

If a data set D contains examples from n classes, the gini index gini(D) is defined as
$$gini(D) = 1 - \sum\limits_{j=1}^n p_j^2$$
where $p_j$ is the relative frequency of class j in D.
If data set D is split on A into k partitions $D_1, \dots, D_k$ (CART considers binary splits, so k = 2), then
$$gini_A(D) = \sum\limits_{i=1}^k\frac{|D_i|}{|D|}gini(D_i)$$
The reduction in impurity is
$$\Delta gini(A) = gini(D) - gini_A(D)$$
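A short sketch of these gini computations, using the class counts (9 yes, 5 no) from the example table and the binary income split {low, medium} vs {high} (the split choice and names are mine):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """gini_A(D) for a binary split of D into partitions d1 and d2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

labels = ["yes"] * 9 + ["no"] * 5   # class counts from the example table
d1 = ["yes"] * 7 + ["no"] * 3       # income in {low, medium}
d2 = ["yes"] * 2 + ["no"] * 2       # income = high

print(f"{gini(labels):.3f}")        # gini(D) -> 0.459
print(f"{gini_split(d1, d2):.3f}")  # -> 0.443, so the reduction is 0.459 - 0.443 = 0.016
```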

4 Summary

ID3 and C4.5 are not well suited to large training sets, because they must repeatedly sort and scan the training set many times, which costs more time than other classification algorithms.
The three measures, in general, return good results, but:
1.Information gain:
-biased towards multivalued attributes
2.Gain ratio:
-tends to prefer unbalanced splits in which one partition is much smaller than the other.
3.Gini index:
-biased towards multivalued attributes
-has difficulty when # of classes is large
-tends to favor tests that result in equal-sized partitions and purity in both partitions.

5 Other Attribute Selection Measures

1. CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence.
2. C-SEP: performs better than information gain and the gini index in certain cases.
3. G-statistic: has a close approximation to the χ² distribution.
4. MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree.
5. Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear combination of attributes.

Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others.

Author: mlhy

Created: 2015-10-06 二 14:39

Emacs 24.5.1 (Org mode 8.2.10)
