Machine Learning in Action(2) 决策树算法

决策树也是有监督机器学习方法。电影《无耻混蛋》里有一幕游戏，在德军小酒馆里有几个人在玩20问题游戏，游戏规则是一个设迷者在纸牌中抽出一个目标（可以是人，也可以是物），而猜谜者可以提问题，设迷者只能回答是或者不是，在几个问题（最多二十个问题）之后，猜谜者通过逐步缩小范围就准确的找到了答案。这就类似于决策树的工作原理。（图一）是一个判断邮件类别的工作方式，可以看出判别方法很简单，基本都是阈值判断，关键是如何构建决策树，也就是如何训练一个决策树。

（图一）

构建决策树的伪代码如下：

Check if every item in the dataset is in the same class:

If so return the class label

Else

find the best feature to split the data

split the dataset

create a branch node

for each split

call create Branch and add the result to the branch node

return branch node

原则只有一个，尽量使得每个节点的样本标签尽可能少，注意上面伪代码中一句说：find the best feature to split the data，那么如何find thebest feature?一般有个准则就是尽量使得分支之后节点的类别纯一些，也就是分的准确一些。如（图二）中所示，从海洋中捞取的5个动物，我们要判断他们是否是鱼，先用哪个特征？

（图二）

为了提高识别精度，我们是先用“能否在陆地存活”还是“是否有蹼”来判断？我们必须要有一个衡量准则，常用的有信息论、基尼纯度等，这里使用前者。我们的目标就是选择使得分割后数据集的标签信息增益最大的那个特征，信息增益就是原始数据集标签基熵减去分割后的数据集标签熵，换句话说，信息增益大就是熵变小，使得数据集更有序。熵的计算如（公式一）所示：

（公式一）

有了指导原则，那就进入代码实战阶段，先来看看熵的计算代码：

 def calcShannonEnt(dataSet):
     numEntries = len(dataSet)
     labelCounts = {}
     for featVec in dataSet: #the the number of unique elements and their occurance
         currentLabel = featVec[-1]
         if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
         labelCounts[currentLabel] += 1  #收集所有类别的数目，创建字典
     shannonEnt = 0.0
     for key in labelCounts:
         prob = float(labelCounts[key])/numEntries
         shannonEnt -= prob * log(prob,2) #log base 2  计算熵
     return shannonEnt

有了熵的计算代码，接下来看依照信息增益变大的原则选择特征的代码：

 def splitDataSet(dataSet, axis, value):
     retDataSet = []
     for featVec in dataSet:
         if featVec[axis] == value:
             reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
             reducedFeatVec.extend(featVec[axis+1:])
             retDataSet.append(reducedFeatVec)
     return retDataSet
 
 def chooseBestFeatureToSplit(dataSet):
     numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
     baseEntropy = calcShannonEnt(dataSet)
     bestInfoGain = 0.0; bestFeature = -1
     for i in range(numFeatures):        #iterate over all the features
         featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
         uniqueVals = set(featList)       #get a set of unique values
         newEntropy = 0.0
         for value in uniqueVals:
             subDataSet = splitDataSet(dataSet, i, value)
             prob = len(subDataSet)/float(len(dataSet))
             newEntropy += prob * calcShannonEnt(subDataSet)
         infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
         if (infoGain > bestInfoGain):       #compare this to the best gain so far    #选择信息增益最大的代码在此
             bestInfoGain = infoGain         #if better than current best, set to best
             bestFeature = i
     return bestFeature                      #returns an integer

从最后一个if可以看出，选择使得信息增益最大的特征作为分割特征，现在有了特征分割准则，继续进入一下个环节，如何构建决策树，其实就是依照最上面的伪代码写下去，采用递归的思想依次分割下去，直到执行完成就构建了决策树。代码如下：

 def majorityCnt(classList):
     classCount={}
     for vote in classList:
         if vote not in classCount.keys(): classCount[vote] = 0
         classCount[vote] += 1
     sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
     return sortedClassCount[0][0]
 
 def createTree(dataSet,labels):
     classList = [example[-1] for example in dataSet]
     if classList.count(classList[0]) == len(classList):
         return classList[0]#stop splitting when all of the classes are equal
     if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
         return majorityCnt(classList)
     bestFeat = chooseBestFeatureToSplit(dataSet)
     bestFeatLabel = labels[bestFeat]
     myTree = {bestFeatLabel:{}}
     del(labels[bestFeat])
     featValues = [example[bestFeat] for example in dataSet]
     uniqueVals = set(featValues)
     for value in uniqueVals:
         subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
         myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
     return myTree

用图二的样本构建的决策树如（图三）所示：

（图三）

有了决策树，就可以用它做分类咯，分类代码如下：

 def classify(inputTree,featLabels,testVec):
     firstStr = inputTree.keys()[0]
     secondDict = inputTree[firstStr]
     featIndex = featLabels.index(firstStr)
     key = testVec[featIndex]
     valueOfFeat = secondDict[key]
     if isinstance(valueOfFeat, dict):
         classLabel = classify(valueOfFeat, featLabels, testVec)
     else: classLabel = valueOfFeat
     return classLabel

最后给出序列化决策树（把决策树模型保存在硬盘上）的代码：

 def storeTree(inputTree,filename):
     import pickle
     fw = open(filename,'w')
     pickle.dump(inputTree,fw)
     fw.close()
 
 def grabTree(filename):
     import pickle
     fr = open(filename)
     return pickle.load(fr)

优点：检测速度快

缺点：容易过拟合，可以采用修剪的方式来尽量避免

以上内容来至群友博客http://blog.csdn.net/marvin521/article/details/9255977

Ps:决策树算法以其简单、清晰、高效的性能，在传统行业应用非常广泛，常见流失预警，客户的目标响应模型等，同时也是我最喜欢的一个算法,经典的决策树算法有CART和C5.0，前者是二叉树，适合并行，之前在学校的时候也在cuda架构上写过这个算法的并行程序，后者可以是多叉树。在此算法基础上的ensemble集成(bagging,adaboost、gbdt,bagging+random subspace = random forest、(pca+subspace)+tree = rotation forest)性能有很大的提升，至于决策树剪枝则可以选用不同的优化指标，采用前向还是后向剪。

Machine Learning in Action(2) 决策树算法的更多相关文章

机器学习实战（Machine Learning in Action）学习笔记————05.Logistic回归
机器学习实战(Machine Learning in Action)学习笔记————05.Logistic回归关键字:Logistic回归.python.源码解析.测试作者:米仓山下时间:2018- ...
机器学习实战（Machine Learning in Action）学习笔记————03.决策树原理、源码解析及测试
机器学习实战(Machine Learning in Action)学习笔记————03.决策树原理.源码解析及测试关键字:决策树.python.源码解析.测试作者:米仓山下时间:2018-10-2 ...
Machine Learning in Action(5) SVM算法
做机器学习的一定对支持向量机(support vector machine-SVM)颇为熟悉,因为在深度学习出现之前,SVM一直霸占着机器学习老大哥的位子.他的理论很优美,各种变种改进版本也很多,比如 ...
《Machine Learning in Action》—— 剖析支持向量机，单手狂撕线性SVM
<Machine Learning in Action>-- 剖析支持向量机,单手狂撕线性SVM 前面在写NumPy文章的结尾处也有提到,本来是打算按照<机器学习实战 / Machi ...
《Machine Learning in Action》—— 剖析支持向量机，优化SMO
<Machine Learning in Action>-- 剖析支持向量机,优化SMO 薄雾浓云愁永昼,瑞脑销金兽. 愁的很,上次不是更新了一篇关于支持向量机的文章嘛,<Machi ...
《Machine Learning in Action》—— Taoye给你讲讲决策树到底是支什么“鬼”
<Machine Learning in Action>-- Taoye给你讲讲决策树到底是支什么"鬼" 前面我们已经详细讲解了线性SVM以及SMO的初步优化过程,具体 ...
《Machine Learning in Action》—— 小朋友，快来玩啊，决策树呦
<Machine Learning in Action>-- 小朋友,快来玩啊,决策树呦在上篇文章中,<Machine Learning in Action>-- Taoye ...
《Machine Learning in Action》—— 懂的都懂，不懂的也能懂。非线性支持向量机
说在前面:前几天,公众号不是给大家推送了第二篇关于决策树的文章嘛.阅读过的读者应该会发现,在最后排版已经有点乱套了.真的很抱歉,也不知道咋回事,到了后期Markdown格式文件的内容就解析出现问题了, ...
《Machine Learning in Action》—— Taoye给你讲讲Logistic回归是咋回事
在手撕机器学习系列文章的上一篇,我们详细讲解了线性回归的问题,并且最后通过梯度下降算法拟合了一条直线,从而使得这条直线尽可能的切合数据样本集,已到达模型损失值最小的目的. 在本篇文章中,我们主要是手撕 ...

随机推荐

Windwos2008如何关闭IE增强的安全配置
如题方法:
解决 ecshop 搜索特殊字符关键字（如：*，+，/）导致搜索结果乱码问题
病症:ecshop系统搜索会对搜索关键字进行分词,然后对关键字分词进行正则匹配,并且标红加粗处理,如果关键字分词有特殊字符,则正则匹配结果会导致乱码解决方法: 1.找到特殊字符串数组:$ts_str ...
洛谷——P2737 [USACO4.1]麦香牛块Beef McNuggets
https://www.luogu.org/problemnew/show/P2737 题目描述农夫布朗的奶牛们正在进行斗争,因为它们听说麦当劳正在考虑引进一种新产品:麦香牛块.奶牛们正在想尽一切办 ...
洛谷——P2176 [USACO14FEB]路障Roadblock
P2176 [USACO14FEB]路障Roadblock 题目描述每天早晨,FJ从家中穿过农场走到牛棚.农场由 N 块农田组成,农田通过 M 条双向道路连接,每条路有一定长度.FJ 的房子在 1 ...
SQLite的sqlite_master表
SQLite的sqlite_master表 sqlite_master表是SQLite的系统表.该表记录该数据库中保存的表.索引.视图.和触发器信息.每一行记录一个项目.在创建一个SQLIte数据 ...
【spring mvc】后台spring mvc接收List参数报错如下：org.springframework.beans.BeanInstantiationException: Failed to instantiate [java.util.List]: Specified class is an interface
后台spring mvc接收List参数报错如下:org.springframework.beans.BeanInstantiationException: Failed to instantiate ...
[反汇编练习] 160个CrackMe之032
[反汇编练习] 160个CrackMe之032. 本系列文章的目的是从一个没有任何经验的新手的角度(其实就是我自己),一步步尝试将160个CrackMe全部破解,如果可以,通过任何方式写出一个类似于注 ...
深度排序与alpha混合【转】
翻译:李现民最后修改:2012-07-03 原文:Depth sorting alpha blended objects 先说个题外话,本来我想回答在 Creators Club论坛上的一个常见 ...
自己封装的CMusic类【转】
http://www.cnblogs.com/zhangminaxiang/archive/2013/02/27/2936011.html 缘由: 在改正俄罗斯方块程序的功能的时候,想给这个程序增加一 ...
meta标签多种用法
<meta name=”google” content=”notranslate” /> <!-- 有时,Google在结果页面会提供一个翻译链接,但有时候你不希望出现这个链接,你可 ...

Machine Learning in Action(2) 决策树算法

Machine Learning in Action(2) 决策树算法的更多相关文章

随机推荐

热门专题