Learning with Trees
We are now going to consider a rather different approach to machine learning, starting with one of the most common and powerful data structures in the whole of computer science: the binary tree. The computational cost of making the tree is fairly low, but the cost of using it is even lower: for a reasonably balanced tree, reaching a leaf takes of the order of $O(\log N)$ comparisons, where $N$ is the number of datapoints.
For these reasons, classification by decision trees has grown in popularity over recent years. You are very likely to have been subjected to decision trees if you've ever phoned a helpline, for example for computer faults. The phone operators are guided through the decision tree by your answers to their questions.
The idea of a decision tree is that we break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision. The trees are very easy to understand, and can even be turned into a set of if-then rules, suitable for use in a rule induction system.
In terms of optimisation and search, decision trees use a greedy heuristic to perform search, evaluating the possible options at the current stage of learning and choosing the one that seems optimal at that point. This works well a surprisingly large proportion of the time.
12.1 USING DECISION TREES
As a student it can be difficult to decide what to do in the evening. There are four things that you actually quite enjoy doing, or have to do: going to the pub, watching TV, going to a party, or even (gasp) studying. The choice is sometimes made for you – if you have an assignment due the next day, then you need to study, if you are feeling lazy then the pub isn't for you, and if there isn't a party then you can't go to it. You are looking for a nice algorithm that will let you decide what to do each evening without having to think about it every night.
Each evening you start at the top (root) of the tree and check whether any of your friends know about a party that night. If there is one, then you need to go, regardless. Only if there is not a party do you worry about whether or not you have an assignment deadline coming up. If there is a crucial deadline, then you have to study, but if there is nothing that is urgent for the next few days, you think about how you feel. A sudden burst of energy might make you study, but otherwise you'll be slumped in front of the TV indulging your secret love of Shortland Street (or other soap opera of your choice) rather than studying. Of course, near the start of the semester when there are no assignments to do, and you are feeling rich, you'll be in the pub.
One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if … then rules) that then go into program code very simply - the first part of the tree above can be turned into:
if there is a party then go to it
if there is not a party and you have an urgent deadline then study
etc.
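As a minimal sketch of how such rules translate into code (the function name, arguments, and the final default branch are invented for illustration, not taken from the chapter), the first rules of the tree might look like this in Python:

```python
def what_to_do(party, deadline):
    """Follow the first rules of the evening decision tree (illustrative only)."""
    if party:
        return "go to the party"      # if there is a party then go to it
    if deadline == "urgent":
        return "study"                # no party and an urgent deadline: study
    # ... the remaining branches (energy level, feeling rich, and so on)
    # continue down the tree in the same nested if-then style.
    return "undecided"

print(what_to_do(party=False, deadline="urgent"))   # prints: study
```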
12.2 CONSTRUCTING DECISION TREES
In the example above, the three features that we need for the algorithm are the state of your energy level, the date of your nearest deadline, and whether or not there is a party tonight. The question we need to ask is how, based on those features, we can construct the tree. There are a few different decision tree algorithms, but they are almost all variants of the same principle: the algorithms build the tree in a greedy manner starting at the root, choosing the most informative feature at each step. We are going to start by focusing on the most common: Quinlan's ID3, although we'll also mention its extension, known as C4.5, and another known as CART.
There was an important word hidden in the sentence above about how the trees work, which was informative. Choosing which feature to use next in the decision tree can be thought of as playing the game '20 Questions', where you try to elicit the item your opponent is thinking about by asking questions about it. At each stage, you choose a question that gives you the most information given what you know already. Thus, you would ask 'Is it an animal?' before you ask 'Is it a cat?'. The idea is to quantify this question of how much information is provided to you by knowing certain facts. Encoding this mathematically is the task of information theory.
12.2.1 Quick Aside: Entropy in Information Theory
Information theory was 'born' in 1948 when Claude Shannon published a paper called "A Mathematical Theory of Communication". In that paper, he proposed the measure of information entropy, which describes the amount of impurity in a set of features. For a set whose members belong to class $i$ with probability $p_i$, the entropy is

$\mathrm{Entropy}(p) = -\sum_i p_i \log_2 p_i,$

where the logarithm is base 2 because we imagine encoding everything using binary digits (bits), and we define $0 \log 0 = 0$.
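For instance, a set that is split evenly between two classes has entropy $-0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit, while a set containing only one class has entropy 0; the entropy is largest exactly when we are most uncertain about the class, which is when learning it would tell us the most.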
For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one with the highest entropy. After using that feature, we re-evaluate the entropy of each feature and again pick the one with the highest entropy.
Information theory is a very interesting subject. It is possible to download Shannon's 1948 paper from the Internet, and also to find many resources showing where it has been applied. There are now whole journals devoted to information theory because it is relevant to so many areas such as computer and telecommunication networks, machine learning, and data storage.
12.2.2 ID3
Now that we have a suitable measure for choosing which feature to use next, entropy, we just have to work out how to apply it. The important idea is to work out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is known as the information gain, and it is defined as the entropy of the whole set minus the entropy when a particular feature is chosen:

$\mathrm{Gain}(S, F) = \mathrm{Entropy}(S) - \sum_{f \in \mathrm{values}(F)} \frac{|S_f|}{|S|}\,\mathrm{Entropy}(S_f),$

where $S$ is the set of examples, $F$ is a possible feature, and $S_f$ is the subset of $S$ for which feature $F$ takes the value $f$.
As an example, suppose that we have a small set of labelled data and a feature that splits it into subsets. Computing the gain for that feature is then just a matter of measuring the entropy of the whole set and of each subset, and plugging the numbers into the formula above, as in the sketch below.
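The following short Python sketch computes entropy and information gain for one feature; the toy evening dataset, the function names, and the feature values are made up for illustration and are not the chapter's own worked example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, labels, feature):
    """Entropy of the whole set minus the weighted entropy after splitting on `feature`."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# A made-up evening dataset: features -> activity
examples = [
    {"party": "yes", "deadline": "urgent"},
    {"party": "yes", "deadline": "none"},
    {"party": "no",  "deadline": "urgent"},
    {"party": "no",  "deadline": "none"},
]
labels = ["party", "party", "study", "pub"]

print(information_gain(examples, labels, "party"))     # 1.0 bit of gain
print(information_gain(examples, labels, "deadline"))  # 0.5 bits of gain
```

Here knowing about the party removes more uncertainty than knowing about the deadline, so the party question would be chosen first.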
The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value. In essence, that is all there is to the algorithm. It searches the space of possible trees in a greedy way by choosing the feature with the highest information gain at each stage. The output of the algorithm is the tree, i.e., a list of nodes, edges, and leaves. As with any tree in computer science, it can be constructed recursively. At each stage the best feature is selected and then removed from the dataset, and the algorithm is recursively called on the rest. The recursion stops when either there is only one class remaining in the data (in which case a leaf is added with that class as its label), or there are no features left, in which case the most common label in the remaining data is used.
The ID3 Algorithm

- if all the examples have the same label, return a leaf with that label
- else if there are no features left to test, return a leaf with the most common label
- else:
  - choose the feature that maximises the information gain to be the current node
  - add a branch from the node for each possible value of that feature
  - for each branch:
    - take the subset of examples that reach that branch and remove the chosen feature from its set of features
    - recursively call the algorithm on that subset to build the subtree below the branch
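A minimal recursive implementation in Python might look like the following sketch; the helper functions repeat those from the earlier gain example so that the code is self-contained, and the names and the nested-dictionary representation of the tree are my own choices, not the book's code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits (as in the earlier sketch)."""
    counts = Counter(labels)
    return -sum((c / len(labels)) * log2(c / len(labels)) for c in counts.values())

def information_gain(examples, labels, feature):
    """Entropy of the whole set minus the weighted entropy after splitting on `feature`."""
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(examples, labels, features):
    """Build a decision tree as nested dicts; leaves are plain class labels."""
    if len(set(labels)) == 1:                 # only one class left: make a leaf
        return labels[0]
    if not features:                          # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    # greedy step: pick the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining)
    return tree
```

With the toy evening dataset from the previous sketch, `id3(examples, labels, ["party", "deadline"])` asks about the party at the root and about the deadline below it, which matches the tree described at the start of the chapter.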
Owing to the focus on classification for real-world examples, trees are often used with categorical (text) features rather than continuous numeric values.
12.3 CLASSIFICATION AND REGRESSION TREES (CART)
There is another well-known tree-based algorithm, CART, whose name indicates that it can be used for both classification and regression. Classification is not wildly different in CART, although it is usually constrained to construct binary trees. This might seem odd at first, but there are sound computer science reasons why binary trees are good, as suggested in the computation cost discussion above, and it is not a real limitation. Even in the example that we started the chapter with, we can always turn questions into binary decisions by splitting the question up a little. Thus, a question that has three answers (say the question about when your nearest assignment deadline is, which is either 'urgent', 'near' or 'none') can be split into two questions: first, 'is the deadline urgent?', and then if the answer to that is 'no', second 'is the deadline near?' The only real difference with classification in CART is that a different information measure is commonly used.
12.3.1 Gini Impurity
The entropy that was used in ID3 as the information measure is not the only way to pick features. Another possibility is something known as the Gini impurity. The 'impurity' in the name suggests that the aim of the decision tree is to have each leaf node represent a set of datapoints that are in the same class, so that there are no mismatches. This is known as purity. If a leaf is pure then all of the training data within it have just one class. In which case, if we count the number of datapoints at the node (or better, the fraction of the number of datapoints) that belong to a class $i$ (call this fraction $p_i$), then it should be 1 for one class and 0 for all of the others at a pure node. The Gini impurity of the node is

$G = \sum_{i \ne j} p_i p_j = 1 - \sum_i p_i^2,$

where the sums run over the classes, so that $G = 0$ for a pure node and grows as the classes become more evenly mixed.
Either way, the Gini impurity is equivalent to computing the expected error rate if the classification was picked at random according to the class distribution at the node. The information gain for a split can then be measured in the same way as before, by subtracting the impurity of each child node (weighted by the fraction of the data that reaches it) from the impurity of the parent, and choosing the feature that gives the largest decrease.
The information measure can be changed in another way, which is to add a weight to the misclassifications. The idea is to consider the cost of misclassifying an instance of class $i$ as class $j$ (call it $\lambda_{ij}$) and to weight the corresponding term of the Gini impurity by that cost, so that mistakes that are more expensive contribute more to the impurity and the tree is steered away from making them.
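As a small illustrative sketch (the function names, example labels, and cost matrix here are made up, not taken from the chapter), both the plain and the cost-weighted Gini impurity are only a few lines of Python:

```python
from collections import Counter

def gini(labels):
    """Plain Gini impurity: 1 - sum_i p_i^2."""
    total = len(labels)
    props = [c / total for c in Counter(labels).values()]
    return 1.0 - sum(p * p for p in props)

def weighted_gini(labels, cost):
    """Cost-weighted Gini impurity: sum over i != j of cost[i][j] * p_i * p_j."""
    total = len(labels)
    props = {cls: c / total for cls, c in Counter(labels).items()}
    return sum(cost[i][j] * props[i] * props[j]
               for i in props for j in props if i != j)

labels = ["study", "study", "pub", "tv"]
print(gini(labels))           # 0.625: a fairly mixed node

# Hypothetical cost matrix: mislabelling a 'study' night is twice as bad
cost = {"study": {"pub": 2.0, "tv": 2.0},
        "pub":   {"study": 1.0, "tv": 1.0},
        "tv":    {"study": 1.0, "pub": 1.0}}
print(weighted_gini(labels, cost))
```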
12.3.2 Regression in Trees
The new part about CART is its application to regression. While it might seem strange to use trees for regression, it turns out to require only a simple modification to the algorithm. Suppose that the outputs are continuous, so that a regression model is appropriate. None of the node impurity measures that we have considered so far will work. Instead, we'll go back to our old favorite: the sum-of-squares error. To evaluate the choice of which feature to use next, we also need to find the value at which to split the dataset according to that feature. Remember that the output is a value at each leaf. In general, this is just a constant value for the output, computed as the mean of all the datapoints that end up in that leaf. The mean is the optimal constant for minimizing the sum-of-squares error, and it also means that we can choose the split point for a given feature quickly, by picking the split whose two halves give the smallest total sum-of-squares error around their means. We can then pick the feature whose best split point gives the lowest sum-of-squares error, and continue to use the algorithm as for classification, as sketched below.
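A minimal sketch of that split-point search for a single numeric feature (the brute-force scan over midpoints and the array names are illustrative choices, not the book's code) might look like this:

```python
import numpy as np

def best_split(x, y):
    """Find the split on one feature that minimizes the total sum-of-squares error.

    x -- 1-D array of feature values
    y -- 1-D array of continuous targets
    Returns (split_value, sse), where each side is predicted by the mean of its targets.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_value, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # no split possible between equal values
        split = 0.5 * (x[i] + x[i - 1])   # candidate split halfway between neighbours
        left, right = y[:i], y[i:]
        # each side is predicted by its mean, so the error is the within-side spread
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_value, best_sse = split, sse
    return best_value, best_sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits between 3.0 and 10.0, where the two means differ most
```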