决策树Decision Tree 及实现




版权声明:本文为博主原创文章,未经博主允许不得转载。
本文基于python逐步实现Decision Tree(决策树),分为以下几个步骤:
- 加载数据集
- 熵的计算
- 根据最佳分割feature进行数据分割
- 根据最大信息增益选择最佳分割feature
- 递归构建决策树
- 样本分类
关于决策树的理论方面本文几乎不讲,详情请google keywords:“决策树 信息增益 熵”
将分别体现于代码。
本文只建一个.py文件,所有代码都在这个py里
1.加载数据集
我们选用UCI经典Iris为例
Brief of IRIS:
Data Set Characteristics: |
Multivariate |
Number of Instances: |
150 |
Area: |
Life |
Attribute Characteristics: |
Real |
Number of Attributes: |
4 |
Date Donated |
1988-07-01 |
Associated Tasks: |
Classification |
Missing Values? |
No |
Number of Web Hits: |
533125 |
Code:
- from numpy import *
- #load "iris.data" to workspace
- traindata = loadtxt("D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (0,1,2,3),dtype = float)
- trainlabel = loadtxt("D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data",delimiter = ',',usecols = (range(4,5)),dtype = str)
- feaname = ["#0","#1","#2","#3"] # feature names of the 4 attributes (features)
Result:
左图为实际数据集,四个离散型feature,一个label表示类别(有Iris-setosa, Iris-versicolor,Iris-virginica 三个类)
2. 熵的计算
entropy是香农提出来的(信息论大牛),定义见wiki
注意这里的entropy是H(C|X=xi)而非H(C|X), H(C|X)的计算见第下一个点,还要乘以概率加和
Code:
- from math import log
- def calentropy(label):
- n = label.size # the number of samples
- #print n
- count = {} #create dictionary "count"
- for curlabel in label:
- if curlabel not in count.keys():
- count[curlabel] = 0
- count[curlabel] += 1
- entropy = 0
- #print count
- for key in count:
- pxi = float(count[key])/n #notice transfering to float first
- entropy -= pxi*log(pxi,2)
- return entropy
- #testcode:
- #x = calentropy(trainlabel)
Result:
3. 根据最佳分割feature进行数据分割
假定我们已经得到了最佳分割feature,在这里进行分割(最佳feature为splitfea_idx)
第二个函数idx2data是根据splitdata得到的分割数据的两个index集合返回datal (samples less than pivot), datag(samples greater than pivot), labell, labelg。 这里我们根据所选特征的平均值作为pivot
- #split the dataset according to label "splitfea_idx"
- def splitdata(oridata,splitfea_idx):
- arg = args[splitfea_idx] #get the average over all dimensions
- idx_less = [] #create new list including data with feature less than pivot
- idx_greater = [] #includes entries with feature greater than pivot
- n = len(oridata)
- for idx in range(n):
- d = oridata[idx]
- if d[splitfea_idx] < arg:
- #add the newentry into newdata_less set
- idx_less.append(idx)
- else:
- idx_greater.append(idx)
- return idx_less,idx_greater
- #testcode:2
- #idx_less,idx_greater = splitdata(traindata,2)
- #give the data and labels according to index
- def idx2data(oridata,label,splitidx,fea_idx):
- idxl = splitidx[0] #split_less_indices
- idxg = splitidx[1] #split_greater_indices
- datal = []
- datag = []
- labell = []
- labelg = []
- for i in idxl:
- datal.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
- for i in idxg:
- datag.append(append(oridata[i][:fea_idx],oridata[i][fea_idx+1:]))
- labell = label[idxl]
- labelg = label[idxg]
- return datal,datag,labell,labelg
这里args是参数,决定分裂节点的阈值(每个参数对应一个feature,大于该值分到>branch,小于该值分到<branch),我们可以定义如下:
- args = mean(traindata,axis = 0)
测试:按特征2进行分类,得到的less和greater set of indices分别为:
也就是按args[2]进行样本集分割,<和>args[2]的branch分别有57和93个样本。
4. 根据最大信息增益选择最佳分割feature
信息增益为代码中的info_gain, 注释中是熵的计算
- #select the best branch to split
- def choosebest_splitnode(oridata,label):
- n_fea = len(oridata[0])
- n = len(label)
- base_entropy = calentropy(label)
- best_gain = -1
- for fea_i in range(n_fea): #calculate entropy under each splitting feature
- cur_entropy = 0
- idxset_less,idxset_greater = splitdata(oridata,fea_i)
- prob_less = float(len(idxset_less))/n
- prob_greater = float(len(idxset_greater))/n
- #entropy(value|X) = \sum{p(xi)*entropy(value|X=xi)}
- cur_entropy += prob_less*calentropy(label[idxset_less])
- cur_entropy += prob_greater * calentropy(label[idxset_greater])
- info_gain = base_entropy - cur_entropy #notice gain is before minus after
- if(info_gain>best_gain):
- best_gain = info_gain
- best_idx = fea_i
- return best_idx
- #testcode:
- #x = choosebest_splitnode(traindata,trainlabel)
这里的测试针对所有数据,分裂一次选择哪个特征呢?
5. 递归构建决策树
详见code注释,buildtree递归地构建树。
递归终止条件:
①该branch内没有样本(subset为空) or
②分割出的所有样本属于同一类 or
③由于每次分割消耗一个feature,当没有feature的时候停止递归,返回当前样本集中大多数sample的label
- #create the decision tree based on information gain
- def buildtree(oridata, label):
- if label.size==0: #if no samples belong to this branch
- return "NULL"
- listlabel = label.tolist()
- #stop when all samples in this subset belongs to one class
- if listlabel.count(label[0])==label.size:
- return label[0]
- #return the majority of samples' label in this subset if no extra features avaliable
- if len(feanamecopy)==0:
- cnt = {}
- for cur_l in label:
- if cur_l not in cnt.keys():
- cnt[cur_l] = 0
- cnt[cur_l] += 1
- maxx = -1
- for keys in cnt:
- if maxx < cnt[keys]:
- maxx = cnt[keys]
- maxkey = keys
- return maxkey
- bestsplit_fea = choosebest_splitnode(oridata,label) #get the best splitting feature
- print bestsplit_fea,len(oridata[0])
- cur_feaname = feanamecopy[bestsplit_fea] # add the feature name to dictionary
- print cur_feaname
- nodedict = {cur_feaname:{}}
- del(feanamecopy[bestsplit_fea]) #delete current feature from feaname
- split_idx = splitdata(oridata,bestsplit_fea) #split_idx: the split index for both less and greater
- data_less,data_greater,label_less,label_greater = idx2data(oridata,label,split_idx,bestsplit_fea)
- #build the tree recursively, the left and right tree are the "<" and ">" branch, respectively
- nodedict[cur_feaname]["<"] = buildtree(data_less,label_less)
- nodedict[cur_feaname][">"] = buildtree(data_greater,label_greater)
- return nodedict
- #testcode:
- #mytree = buildtree(traindata,trainlabel)
- #print mytree
Result:
mytree就是我们的结果,#1表示当前使用第一个feature做分割,'<'和'>'分别对应less 和 greater的数据。
6. 样本分类
根据构建出的mytree进行分类,递归走分支
- #classify a new sample
- def classify(mytree,testdata):
- if type(mytree).__name__ != 'dict':
- return mytree
- fea_name = mytree.keys()[0] #get the name of first feature
- fea_idx = feaname.index(fea_name) #the index of feature 'fea_name'
- val = testdata[fea_idx]
- nextbranch = mytree[fea_name]
- #judge the current value > or < the pivot (average)
- if val>args[fea_idx]:
- nextbranch = nextbranch[">"]
- else:
- nextbranch = nextbranch["<"]
- return classify(nextbranch,testdata)
- #testcode
- tt = traindata[0]
- x = classify(mytree,tt)
- print x
Result:
为了验证代码准确性,我们换一下args参数,把它们都设成0(很小)
args = [0,0,0,0]
建树和分类的结果如下:
可见没有小于pivot(0)的项,于是dict中每个<的key对应的value都为空。
本文中全部代码下载:决策树python实现
Reference: Machine Learning in Action
from: http://blog.csdn.net/abcjennifer/article/details/20905311
决策树Decision Tree 及实现的更多相关文章
- 机器学习算法实践:决策树 (Decision Tree)(转载)
前言 最近打算系统学习下机器学习的基础算法,避免眼高手低,决定把常用的机器学习基础算法都实现一遍以便加深印象.本文为这系列博客的第一篇,关于决策树(Decision Tree)的算法实现,文中我将对决 ...
- 数据挖掘 决策树 Decision tree
数据挖掘-决策树 Decision tree 目录 数据挖掘-决策树 Decision tree 1. 决策树概述 1.1 决策树介绍 1.1.1 决策树定义 1.1.2 本质 1.1.3 决策树的组 ...
- 用于分类的决策树(Decision Tree)-ID3 C4.5
决策树(Decision Tree)是一种基本的分类与回归方法(ID3.C4.5和基于 Gini 的 CART 可用于分类,CART还可用于回归).决策树在分类过程中,表示的是基于特征对实例进行划分, ...
- (ZT)算法杂货铺——分类算法之决策树(Decision tree)
https://www.cnblogs.com/leoo2sk/archive/2010/09/19/decision-tree.html 3.1.摘要 在前面两篇文章中,分别介绍和讨论了朴素贝叶斯分 ...
- 决策树decision tree原理介绍_python sklearn建模_乳腺癌细胞分类器(推荐AAA)
sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频) https://study.163.com/course/introduction.htm?courseId=1005269003& ...
- 机器学习方法(四):决策树Decision Tree原理与实现技巧
欢迎转载,转载请注明:本文出自Bin的专栏blog.csdn.net/xbinworld. 技术交流QQ群:433250724,欢迎对算法.技术.应用感兴趣的同学加入. 前面三篇写了线性回归,lass ...
- 机器学习-决策树 Decision Tree
咱们正式进入了机器学习的模型的部分,虽然现在最火的的机器学习方面的库是Tensorflow, 但是这里还是先简单介绍一下另一个数据处理方面很火的库叫做sklearn.其实咱们在前面已经介绍了一点点sk ...
- 决策树 Decision Tree
决策树是一个类似于流程图的树结构:其中,每个内部结点表示在一个属性上的测试,每个分支代表一个属性输出,而每个树叶结点代表类或类分布.树的最顶层是根结点.  决策树的构建 想要构建一个决策树,那么咱们 ...
- 【机器学习算法-python实现】决策树-Decision tree(2) 决策树的实现
(转载请注明出处:http://blog.csdn.net/buptgshengod) 1.背景 接着上一节说,没看到请先看一下上一节关于数据集的划分数据集划分.如今我们得到了每一个特征值得 ...
随机推荐
- jQuery 基本过滤选择器注意点
$(".add_shuxing_ul > li:first") 选择class为add_shuxing_ul 的子级li元素的第一个li $(".add_shux ...
- PHP面向对象学习三 类的抽象方法和类
一个类中至少有一个方法是抽象的,我们称之为抽象类. 所以如果定义抽象类首先定义抽象方法. 1.类中至少有一个抽象方法 2.抽象方法不允许有{ } 3.抽象方法前面必须要加abstract 抽象类的几个 ...
- C#后台如何获取客户端访问系统型号
ASP.NET获取客户端.服务器端基础信息 . 在ASP.NET中专用属性: 获取服务器电脑名:Page.Server.ManchineName 获取用户信息:Page.User 获取客户端电脑名:P ...
- 原生javascript封装ajax和jsonp
在我们请求数据时,完成页面跨域,利用原生JS封装的ajax和jsonp: <!DOCTYPE html> <html lang="en"> <head ...
- Java面向对象编程
面向对象的软件开发: 面向对象的开发把软件系统看成各种对象的集合,对象就是最小的子系统,一组相关的对象能够组合成复杂的子系统. 面向对象的开发方法具有以下优点: 1.把软件系统看成是各种对象的集合,更 ...
- [CareerCup] 16.5 Semphore 信号旗
16.5 Suppose we have the following code:public class Foo { public Foo() { . . . } public void first( ...
- HDU1019
Least Common Multiple Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Ot ...
- Pl/Sql 导入dmp文件时窗口一闪而过
做如下设置: 点击“导入”,ok
- Hibernate和Mybatis的对比
http://blog.csdn.net/jiuqiyuliang/article/details/45378065 Hibernate与Mybatis对比 1. 简介 Hibernate:Hiber ...
- js闭包初体验
/* 闭包的定义:一个内部函数里变量作用域生命周期延续,直接访问一个函数里面的私有属性 闭包的作用:解决变量作用域延续的问题,同时解决全局变量冲突的问题 */ //1.定义内部函数,私有函数 fu ...