Machine Learning: Decision Trees in Python

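Today we look at one of the more commonly used classification algorithms in machine learning: the decision tree. A decision tree mimics the way people recognize and categorize things. Given a pile of seemingly disordered data, the question is how to classify it effectively using as few features as possible.

A decision tree classifies hierarchically: at each step it picks the single most discriminative feature and splits on it. For some data the first few features chosen are already enough to determine the label, while other data need more features. The depth of the tree therefore reflects how many features had to be examined before a label could be assigned.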

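Feature selection usually borrows from information theory: at each node we choose the feature with the largest information gain, that is, the feature whose split reduces the entropy of the labels the most.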

The decision tree described here is meant for classifying discrete data; continuous data can be discretized beforehand.
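As an illustration of discretizing a continuous feature, here is a minimal sketch that maps values to bin indices with fixed thresholds. The helper name `discretize`, the thresholds, and the height values are all hypothetical and chosen only for this example; the original post does not prescribe a particular binning scheme.

```python
from bisect import bisect_right


def discretize(value, thresholds):
    """Return the index of the bin that `value` falls into.

    `thresholds` must be sorted in ascending order. A value equal to a
    threshold is placed in the bin to the right of that threshold.
    """
    return bisect_right(thresholds, value)


# Hypothetical data: heights in cm, bucketed into 0 (short), 1 (medium), 2 (tall).
thresholds = [160.0, 180.0]
heights = [150.2, 160.0, 175.5, 192.3]
bins = [discretize(h, thresholds) for h in heights]
print(bins)  # [0, 1, 1, 2]
```

The resulting integer bin indices can then be used as discrete feature values in the dataset rows.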

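Before turning to decision trees, let us briefly review information entropy. The information carried by an outcome x_i is defined as:

$$En(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability that x belongs to class i. Entropy is the expectation of this quantity over all classes:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

Here n is the number of classes.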
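As a quick sanity check of the entropy formula (a worked example added for illustration, not taken from the original post), consider two extreme cases with n = 2: two equally likely classes, and a single class that occurs with probability 1.

```latex
H_{\text{even}} = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1,
\qquad
H_{\text{pure}} = -\left(1 \cdot \log_2 1\right) = 0
```

So an evenly mixed two-class set has an entropy of 1 bit, while a set containing only one class has zero entropy.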

We start by constructing some simple data:

import math
import operator

import matplotlib.pyplot as plt


def Create_data():
    # Each row is [value of feature 0, value of feature 1, class label].
    dataset = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no'],
               [3, 0, 'maybe']]
    # Names of the feature columns, in column order.
    feat_name = ['no surfacing', 'flippers']
    return dataset, feat_name

Next we define a function that computes the entropy:

def Cal_entrpy(dataset):
    # Shannon entropy (in bits) of the class labels stored in the last column.
    n_sample = len(dataset)
    n_label = {}
    # Count how many samples carry each label.
    for featvec in dataset:
        current_label = featvec[-1]
        if current_label not in n_label:
            n_label[current_label] = 0
        n_label[current_label] += 1
    shannonEnt = 0.0
    for key in n_label:
        prob = float(n_label[key]) / n_sample
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

Note that the larger the entropy, the more spread out the class labels are and the more disordered the data.
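To make the link between entropy and disorder concrete, the sketch below compares a pure label list with an evenly mixed one. The helper `label_entropy` and the two label lists are illustrative assumptions; it works on a bare list of labels rather than on full dataset rows.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy in bits of a list of class labels (illustrative helper)."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


pure = ['no', 'no', 'no', 'no']        # only one class: no disorder
mixed = ['yes', 'yes', 'no', 'no']     # two classes, evenly mixed

print(label_entropy(pure))
print(label_entropy(mixed))
```

The pure list evaluates to zero and the evenly mixed list to 1 bit, which matches the intuition that mixing classes raises the entropy.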

Next we define a function that splits the dataset:

def Split_dataset(dataset, axis, value):
    # Keep the rows whose feature `axis` equals `value`,
    # and drop that feature column from the rows that are kept.
    retDataSet = []
    for featVec in dataset:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
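The effect of such a split can be seen on the toy data. The sketch below uses a hypothetical list-comprehension helper, `split_rows`, that follows the same idea: select the rows with a given feature value and remove that feature column.

```python
def split_rows(rows, axis, value):
    """Keep rows where row[axis] == value and drop column `axis` (illustrative helper)."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [0, 1, 'no'], [0, 1, 'no'], [3, 0, 'maybe']]

# Select the samples whose first feature equals 1.
subset = split_rows(rows, 0, 1)
print(subset)  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
```

Dropping the column matters because a feature that has already been used on the path from the root should not be offered again further down the tree.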

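Combining the functions above, we can write a feature-selection function that picks the feature with the largest information gain: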

def Choose_feature(dataset):
    num_sample = len(dataset)
    num_feature = len(dataset[0]) - 1   # the last column is the label
    baseEntrpy = Cal_entrpy(dataset)
    best_Infogain = 0.0
    bestFeat = -1                       # stays -1 if no feature gives a positive gain
    for i in range(num_feature):
        featlist = [example[i] for example in dataset]
        uniquValus = set(featlist)
        # Weighted entropy of the subsets produced by splitting on feature i.
        newEntrpy = 0.0
        for value in uniquValus:
            subData = Split_dataset(dataset, i, value)
            prob = len(subData) / float(num_sample)
            newEntrpy += prob * Cal_entrpy(subData)
        info_gain = baseEntrpy - newEntrpy
        if info_gain > best_Infogain:
            best_Infogain = info_gain
            bestFeat = i
    return bestFeat
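To see the numbers behind this choice, the following self-contained sketch computes the information gain of each feature on the toy data. The helper names `entropy` and `info_gain` are assumptions made for this example and are not part of the original code.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy in bits of a list of labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def info_gain(rows, axis):
    """Entropy of all labels minus the weighted entropy after splitting on column `axis`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[axis]].append(row[-1])   # collect labels per feature value
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [0, 1, 'no'], [0, 1, 'no'], [3, 0, 'maybe']]

gains = [info_gain(rows, i) for i in range(2)]
best = gains.index(max(gains))
print(gains, best)
```

On this data the first feature has a gain of about 1.0 bit and the second about 0.46 bits, so the first feature (index 0) is the one to split on.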

Then we write a majority-vote function that counts the labels and returns the most frequent one:

def Major_cnt(classlist):
    # Count the votes for each class label.
    class_num = {}
    for vote in classlist:
        if vote not in class_num:
            class_num[vote] = 0
        class_num[vote] += 1
    # Python 3: dict.iteritems() no longer exists, use items().
    Sort_K = sorted(class_num.items(),
                    key=operator.itemgetter(1), reverse=True)
    return Sort_K[0][0]
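The same majority vote can be written more compactly with the standard library. This is an alternative sketch, not the author's implementation; the name `majority_label` is hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in `labels` (illustrative alternative)."""
    return Counter(labels).most_common(1)[0][0]


print(majority_label(['no', 'yes', 'no', 'maybe']))  # no
```

`Counter.most_common(1)` returns a list containing the single most frequent (label, count) pair, so indexing with `[0][0]` extracts the label.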

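With these pieces in place, we can build the decision tree, together with two helpers that count its leaves and measure its depth: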

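def Create_tree(dataset, featName):
    # Note: featName is modified in place, so pass a copy if you need it later.
    classlist = [example[-1] for example in dataset]
    # Stop when every sample has the same label.
    if classlist.count(classlist[0]) == len(classlist):
        return classlist[0]
    # Stop when no features are left; fall back to a majority vote.
    if len(dataset[0]) == 1:
        return Major_cnt(classlist)
    bestFeat = Choose_feature(dataset)
    # No feature gives a positive information gain: also use a majority vote.
    if bestFeat == -1:
        return Major_cnt(classlist)
    bestFeatName = featName[bestFeat]
    myTree = {bestFeatName: {}}
    del featName[bestFeat]
    featValues = [example[bestFeat] for example in dataset]
    uniqueVals = set(featValues)
    # Recurse into one subtree per value of the chosen feature.
    for value in uniqueVals:
        subLabels = featName[:]
        myTree[bestFeatName][value] = Create_tree(
            Split_dataset(dataset, bestFeat, value), subLabels)
    return myTree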

def Get_numleafs(myTree):
    numLeafs = 0
    # Python 3: dict.keys() is a view and cannot be indexed.
    firstStr = next(iter(myTree))
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            # A nested dict is a subtree; count its leaves recursively.
            numLeafs += Get_numleafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


def Get_treedepth(myTree):
    max_depth = 0
    firstStr = next(iter(myTree))
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            this_depth = 1 + Get_treedepth(secondDict[key])
        else:
            this_depth = 1
        if this_depth > max_depth:
            max_depth = this_depth
    return max_depth
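The tree is stored as nested dictionaries: each internal node is a one-key dict mapping a feature name to a dict of branches, and each leaf is a plain label. The sketch below writes out by hand the tree one would expect for the toy data and walks it with two hypothetical helpers, `count_leaves` and `tree_depth`, which are simplified stand-ins for the functions above.

```python
# Hand-written tree in the nested-dict layout: {feature name: {value: subtree or label}}.
tree = {'no surfacing': {0: 'no',
                         1: {'flippers': {0: 'no', 1: 'yes'}},
                         3: 'maybe'}}


def count_leaves(node):
    """Number of leaf labels below `node`."""
    if not isinstance(node, dict):
        return 1                            # a plain label is a single leaf
    branches = next(iter(node.values()))    # the {value: child} dict of this node
    return sum(count_leaves(child) for child in branches.values())


def tree_depth(node):
    """Number of feature tests on the longest path from `node` to a leaf."""
    if not isinstance(node, dict):
        return 0
    branches = next(iter(node.values()))
    return 1 + max(tree_depth(child) for child in branches.values())


print(count_leaves(tree), tree_depth(tree))  # 4 2
```

The four leaves and depth of two correspond to the two features of the toy data being tested one after the other on the longest path.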

We can also draw the decision tree with matplotlib, and classify new samples by walking the tree:

def Plot_node(nodeTxt, centerPt, parentPt, nodeType):
    # Draw a boxed node at centerPt, connected by an arrow to parentPt.
    # arrow_args, decisionNode and leafNode are module-level style dicts
    # that must be defined before Create_plot is called.
    Create_plot.ax1.annotate(nodeTxt, xy=parentPt,
                             xycoords='axes fraction',
                             xytext=centerPt, textcoords='axes fraction',
                             va="center", ha="center", bbox=nodeType,
                             arrowprops=arrow_args)


def Plot_tree(myTree, parentPt, nodeTxt):
    numLeafs = Get_numleafs(myTree)
    firstStr = next(iter(myTree))  # Python 3: dict.keys() cannot be indexed
    # Center this node horizontally above the leaves of its subtree.
    cntrPt = (Plot_tree.xOff + (1.0 + float(numLeafs)) / 2.0 / Plot_tree.totalW,
              Plot_tree.yOff)
    Plot_midtext(cntrPt, parentPt, nodeTxt)
    Plot_node(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # Move one level down before drawing the children.
    Plot_tree.yOff = Plot_tree.yOff - 1.0 / Plot_tree.totalD
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            Plot_tree(secondDict[key], cntrPt, str(key))
        else:
            Plot_tree.xOff = Plot_tree.xOff + 1.0 / Plot_tree.totalW
            Plot_node(secondDict[key], (Plot_tree.xOff, Plot_tree.yOff),
                      cntrPt, leafNode)
            Plot_midtext((Plot_tree.xOff, Plot_tree.yOff), cntrPt, str(key))
    # Move back up after finishing this subtree.
    Plot_tree.yOff = Plot_tree.yOff + 1.0 / Plot_tree.totalD


def Create_plot(myTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    Create_plot.ax1 = plt.subplot(111, frameon=False, **axprops)
    # Total width and depth are used to scale node positions into [0, 1].
    Plot_tree.totalW = float(Get_numleafs(myTree))
    Plot_tree.totalD = float(Get_treedepth(myTree))
    Plot_tree.xOff = -0.5 / Plot_tree.totalW
    Plot_tree.yOff = 1.0
    Plot_tree(myTree, (0.5, 1.0), '')
    plt.show()


def Plot_midtext(cntrPt, parentPt, txtString):
    # Write the branch value halfway between the parent and the child node.
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    Create_plot.ax1.text(xMid, yMid, txtString)

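def Classify(myTree, featLabels, testVec):
    firstStr = next(iter(myTree))  # Python 3: dict.keys() cannot be indexed
    secondDict = myTree[firstStr]
    # Find which column of the test vector holds this node's feature.
    featIndex = featLabels.index(firstStr)
    # Stays None if the feature value never appeared in the training data.
    classLabel = None
    for key in secondDict:
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = Classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel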
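Classification amounts to following one branch per feature test until a leaf is reached. The sketch below does this iteratively on a hand-written tree in the same nested-dict layout. The function `classify_iter` and its `default` parameter are assumptions made for this example; returning a default for unseen feature values is one possible design choice, not something the original post specifies.

```python
def classify_iter(tree, feature_names, sample, default=None):
    """Walk a nested-dict decision tree and return the predicted label.

    Returns `default` if the sample has a feature value with no matching branch.
    """
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                    # feature tested at this node
        value = sample[feature_names.index(feature)]  # the sample's value for it
        branches = node[feature]
        if value not in branches:
            return default
        node = branches[value]
    return node


tree = {'no surfacing': {0: 'no',
                         1: {'flippers': {0: 'no', 1: 'yes'}},
                         3: 'maybe'}}
names = ['no surfacing', 'flippers']

print(classify_iter(tree, names, [1, 0]))  # no
print(classify_iter(tree, names, [1, 1]))  # yes
print(classify_iter(tree, names, [2, 0]))  # None
```

The feature names must be listed in the original column order, because the tree stores names while the sample is indexed by position.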

Finally, we can test the decision-tree classifier we have built:

# Box and arrow styles used by the plotting functions.
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

myData, featName = Create_data()
S_entrpy = Cal_entrpy(myData)
new_data = Split_dataset(myData, 0, 1)
best_feat = Choose_feature(myData)
# Pass a copy of featName because Create_tree modifies the list in place.
myTree = Create_tree(myData, featName[:])
num_leafs = Get_numleafs(myTree)
depth = Get_treedepth(myTree)
Create_plot(myTree)
predict_label = Classify(myTree, featName, [1, 0])

print("the predict label is: ", predict_label)
print("the decision tree is: ", myTree)
print("the best feature index is: ", best_feat)
print("the new dataset: ", new_data)
print("the original dataset: ", myData)
print("the feature names are: ", featName)
print("the entropy is:", S_entrpy)
print("the number of leaves is: ", num_leafs)
print("the depth is: ", depth)
print("All is well.")

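The resulting decision tree splits on 'no surfacing' at the root: the value 0 leads to 'no', the value 3 leads to 'maybe', and the value 1 leads to a second split on 'flippers', where 0 gives 'no' and 1 gives 'yes'.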
