Learning with Trees
We are now going to consider a rather different approach to machine learning, starting with one of the most common and powerful data structures in the whole of computer science: the binary tree. The computational cost of making the tree is fairly low, but the cost of using it is even lower: since a balanced binary tree containing N datapoints has depth of order log N, classifying a new example only requires one comparison at each level on the way down to a leaf.
For these reasons, classification by decision trees has grown in popularity over recent years. You are very likely to have been subjected to decision trees if you've ever phoned a helpline, for example for computer faults. The phone operators are guided through the decision tree by your answers to their questions.
The idea of a decision tree is that we break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision. The trees are very easy to understand, and can even be turned into a set of if-then rules, suitable for use in a rule induction system.
In terms of optimisation and search, decision trees use a greedy heuristic to perform search, evaluating the possible options at the current stage of learning and making the choice that seems optimal at that point. This works well a surprisingly large amount of the time.
12.1 USING DECISION TREES
As a student it can be difficult to decide what to do in the evening. There are four things that you actually quite enjoy doing, or have to do: going to the pub, watching TV, going to a party, or even (gasp) studying. The choice is sometimes made for you: if you have an assignment due the next day, then you need to study, if you are feeling lazy then the pub isn't for you, and if there isn't a party then you can't go to it. You are looking for a nice algorithm that will let you decide what to do each evening without having to think about it every night.
Each evening you start at the top (root) of the tree and check whether any of your friends know about a party that night. If there is one, then you need to go, regardless. Only if there is not a party do you worry about whether or not you have an assignment deadline coming up. If there is a crucial deadline, then you have to study, but if there is nothing that is urgent for the next few days, you think about how you feel. A sudden burst of energy might make you study, but otherwise you'll be slumped in front of the TV indulging your secret love of Shortland Street (or other soap opera of your choice) rather than studying. Of course, near the start of the semester when there are no assignments to do, and you are feeling rich, you'll be in the pub.
One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then rules) that go into program code very simply. The first part of the tree above can be turned into:
if there is a party then go to it
if there is not a party and you have an urgent deadline then study
etc.
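As a minimal sketch of what this looks like as program code (the variable names and the later branches are hypothetical, filling in the rest of the evening example), the rules might be written as:

```python
def choose_activity(party, deadline, energetic, feeling_rich):
    """Toy decision tree for the evening example, written as if-then rules.

    party: True if there is a party tonight
    deadline: one of 'urgent', 'near', 'none'
    energetic, feeling_rich: how you are feeling (illustrative only)
    """
    # Rule 1: if there is a party then go to it
    if party:
        return "party"
    # Rule 2: if there is not a party and you have an urgent deadline then study
    if deadline == "urgent":
        return "study"
    # The remaining branches depend on how you feel; the exact choices here
    # are an illustration rather than the tree from the text.
    if energetic:
        return "study"
    if feeling_rich:
        return "pub"
    return "watch TV"
```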
12.2 CONSTRUCTING DECISION TREES
In the example above, the three features that we need for the algorithm are the state of your energy level, the date of your nearest deadline, and whether or not there is a party tonight. The question we need to ask is how, based on those features, we can construct the tree. There are a few different decision tree algorithms, but they are almost all variants of the same principle: the algorithms build the tree in a greedy manner starting at the root, choosing the most informative feature at each step. We are going to start by focusing on the most common: Quinlan's ID3, although we'll also mention its extension, known as C4.5, and another algorithm known as CART.
There was an important word hidden in the sentence above about how the trees work: informative. Choosing which feature to use next in the decision tree can be thought of as playing the game '20 Questions', where you try to elicit the item your opponent is thinking about by asking questions about it. At each stage, you choose a question that gives you the most information given what you know already. Thus, you would ask 'Is it an animal?' before you ask 'Is it a cat?'. The idea is to quantify this question of how much information is provided to you by knowing certain facts. Encoding this mathematically is the task of information theory.
12.2.1 Quick Aside: Entropy in Information Theory
Information theory was 'born' in 1948 when Claude Shannon published a paper called “A Mathematical Theory of Communication”. In that paper, he proposed the measure of information entropy, which describes the amount of impurity in a set of features. The entropy $H$ of a set of probabilities $p_i$ is:

$$H(p) = -\sum_i p_i \log_2 p_i,$$

where the logarithm is base 2 because we imagine that we will encode everything using binary digits (bits), and we define $0 \log 0 = 0$.
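As a quick illustration of the formula (a minimal sketch, not the book's code), the entropy of a collection of class labels can be computed directly from the class fractions:

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    # classes with zero count never appear in p, which matches 0 log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy(["yes", "yes", "yes"]))        # 0 bits: a pure set
print(entropy(["yes", "no", "yes", "no"]))   # 1 bit: an even two-class split
```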
For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one with the highest entropy. After using that feature, we re-evaluate the entropy of each feature and again pick the one with the highest entropy.
Information theory is a very interesting subject. It is possible to download Shannon's 1948 paper from the Internet, and also to find many resources showing where it has been applied. There are now whole journals devoted to information theory because it is relevant to so many areas such as computer and telecommunication networks, machine learning, and data storage.
12.2.2 ID3
Now that we have a suitable measure for deciding which feature to choose next, entropy, we just have to work out how to apply it. The important idea is to work out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is known as the information gain, and it is defined as the entropy of the whole set minus the entropy when a particular feature is chosen:

$$\mathrm{Gain}(S, F) = \mathrm{Entropy}(S) - \sum_{f \in \mathrm{values}(F)} \frac{|S_f|}{|S|}\,\mathrm{Entropy}(S_f),$$

where $S$ is the set of examples, $F$ is a possible feature out of the set of all possible ones, and $|S_f|$ is the number of members of $S$ that have value $f$ for feature $F$.
As an example, suppose that we have data (with outcomes) $S$ and one feature $F$ that can take three different values. Computing the gain for $F$ just means splitting $S$ by feature value, computing the entropy of each subset, and subtracting their weighted sum from the entropy of the whole set; a small sketch of this computation follows.
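The dataset below is purely hypothetical (it is not the book's worked example): four outcomes and a single feature taking the values f1, f2, f3.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Entropy of the whole set minus the weighted entropy of each subset
    obtained by splitting on one feature."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    gain = entropy(labels)
    for f in np.unique(feature_values):
        subset = labels[feature_values == f]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Hypothetical data: outcomes for four evenings and one three-valued feature.
outcomes = ["true", "false", "false", "false"]
feature  = ["f2",   "f2",    "f3",    "f1"]
print(information_gain(feature, outcomes))   # roughly 0.31 bits for this toy split
```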
The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value. In essence, that is all there is to the algorithm. It searches the space of possible trees in a greedy way by choosing the feature with the highest information gain at each stage. The output of the algorithm is the tree, i.e., a list of nodes, edges, and leaves. As with any tree in computer science, it can be constructed recursively. At each stage the best feature is selected and then removed from the dataset, and the algorithm is recursively called on the rest. The recursion stops either when there is only one class remaining in the data (in which case a leaf is added with that class as its label), or when there are no features left, in which case the most common label in the remaining data is used.
The ID3 Algorithm

- if all examples have the same label:
  - return a leaf with that label
- else if there are no features left to test:
  - return a leaf with the most common label
- else:
  - choose the feature that maximises the information gain of the current set of examples to be the next node
  - add a branch from the node for each possible value of that feature
  - for each branch:
    - form the subset of examples with that feature value, and remove the chosen feature from the set of features
    - recursively call the algorithm on that subset
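A compact recursive implementation along these lines might look as follows. This is a sketch under simplifying assumptions (categorical features stored as a dictionary of columns, the tree returned as nested dictionaries), not the book's reference code.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(data, labels, feature):
    values = np.asarray(data[feature])
    labels = np.asarray(labels)
    g = entropy(labels)
    for v in np.unique(values):
        subset = labels[values == v]
        g -= (len(subset) / len(labels)) * entropy(subset)
    return g

def id3(data, labels, features):
    """data: dict mapping feature name -> list of values, one per example."""
    labels = list(labels)
    # Stop if only one class remains: make a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop if no features are left: use the most common label.
    if not features:
        return max(set(labels), key=labels.count)
    # Greedy step: pick the feature with the highest information gain.
    best = max(features, key=lambda f: gain(data, labels, f))
    tree = {best: {}}
    values = np.asarray(data[best])
    remaining = [f for f in features if f != best]
    for v in np.unique(values):
        mask = values == v
        sub_data = {f: np.asarray(data[f])[mask] for f in remaining}
        sub_labels = [l for l, keep in zip(labels, mask) if keep]
        tree[best][v] = id3(sub_data, sub_labels, remaining)
    return tree
```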
Owing to the focus on classification for real-world examples, trees are often used with text features rather than numeric values.
12.3 CLASSIFICATION AND REGRESSION TREES (CART)
There is another well-known tree-based algorithm, CART, whose name indicates that it can be used for both classification and regression. Classification is not wildly different in CART, although it is usually constrained to construct binary trees. This might seem odd at first, but there are sound computer science reasons why binary trees are good, as suggested in the computational cost discussion above, and it is not a real limitation. Even in the example that we started the chapter with, we can always turn questions into binary decisions by splitting the question up a little. Thus, a question that has three answers (say the question about when your nearest assignment deadline is, which is either 'urgent', 'near' or 'none') can be split into two questions: first, 'is the deadline urgent?', and then, if the answer to that is 'no', 'is the deadline near?' (a small sketch of this re-encoding appears below). The only real difference with classification in CART is that a different information measure is commonly used.
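As a minimal illustration of that re-encoding (the feature values are the ones from the evening example; the function name is just for this sketch):

```python
def deadline_to_binary(deadline):
    """Map the three-valued 'urgent' / 'near' / 'none' question onto two binary ones."""
    is_urgent = (deadline == "urgent")
    # The second question is only relevant when the answer to the first is no.
    is_near = (deadline == "near")
    return is_urgent, is_near

print(deadline_to_binary("near"))   # (False, True)
```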
12.3.1 Gini Impurity
The entropy that was used in ID3 as the information measure is not the only way to pick features. Another possibility is something known as the Gini impurity. The 'impurity' in the name suggests that the aim of the decision tree is to have each leaf node represent a set of datapoints that are in the same class, so that there are no mismatches. This is known as purity. If a leaf is pure then all of the training data within it have just one class. In which case, if we count the number of datapoints at the node (or better, the fraction of the number of datapoints) that belong to a class $i$ (call it $N(i)$), then it should be zero for all except one value of $i$. The Gini impurity of the node is then:

$$G = \sum_{i=1}^{c} \sum_{j \neq i} N(i) N(j),$$

where $c$ is the number of classes. Since $\sum_i N(i) = 1$, this is equivalent to:

$$G = 1 - \sum_{i=1}^{c} N(i)^2,$$

which is zero for a pure node and grows as the class mix becomes more even.
Either way, the Gini impurity is equivalent to computing the expected error rate if the classification was picked according to the class distribution. The information gain can then be measured in the same way as before: subtract the impurity of each child node, weighted by the fraction of the datapoints that reach it, from the impurity of the parent.
The information measure can be changed in another way, which is to add a weight to the misclassifications. The idea is to consider the cost of misclassifying an instance of class $i$ as class $j$, and to weight the corresponding term of the impurity by a risk $\lambda_{ij}$:

$$G = \sum_{i} \sum_{j \neq i} \lambda_{ij} N(i) N(j),$$

so that expensive mistakes contribute more to the impurity of a node.
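A minimal sketch of both measures (assuming $N(i)$ is just the class fraction at a node, with classes ordered as NumPy sorts them):

```python
import numpy as np

def gini_impurity(labels):
    """1 minus the sum of squared class fractions: 0 for a pure node."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(labels, cost):
    """Gini impurity with a misclassification cost matrix.

    cost[i][j] is a hypothetical penalty for confusing class i with class j;
    using 1 for every off-diagonal entry recovers the ordinary Gini impurity.
    """
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    total = 0.0
    for i in range(len(classes)):
        for j in range(len(classes)):
            if i != j:
                total += cost[i][j] * p[i] * p[j]
    return total

labels = ["pub", "pub", "study", "tv"]
print(gini_impurity(labels))                      # 0.625 for this mix
print(weighted_gini(labels, np.ones((3, 3))))     # same value with unit costs
```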
12.3.2 Regression in Trees
The new part about CART is its application to regression. While it might seem strange to use trees for regression, it turns out to require only a simple modification to the algorithm. Suppose that the outputs are continuous, so that a regression model is appropriate. None of the node impurity measures that we have considered so far will work. Instead, we'll go back to our old favorite: the sum-of-squares error. To evaluate the choice of which feature to use next, we also need to find the value at which to split the dataset according to that feature. Remember that the output is a value at each leaf. In general, this is just a constant value for the output, computed as the mean of all the datapoints that end up in that leaf. This is the optimal choice in order to minimize the sum-of-squares error, and it also means that we can choose the split point quickly for a given feature, by choosing it to minimize the sum-of-squares error. We can then pick the feature whose best split point gives the lowest sum-of-squares error, and continue to use the algorithm as for classification.
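To make the split-point search concrete, here is a small sketch (not the book's code, and the example data are made up) that finds, for one continuous feature, the threshold minimising the sum-of-squares error when each side of the split is predicted by its mean:

```python
import numpy as np

def sum_of_squares(y):
    """Sum-of-squares error when a group is predicted by its mean."""
    return np.sum((y - np.mean(y)) ** 2) if len(y) else 0.0

def best_split(x, y):
    """Return the threshold on feature x that gives the lowest total error."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best_thresh, best_err = None, np.inf
    # Candidate thresholds: midpoints between consecutive sorted feature values.
    xs = np.unique(x)
    for left, right in zip(xs[:-1], xs[1:]):
        t = (left + right) / 2.0
        err = sum_of_squares(y[x <= t]) + sum_of_squares(y[x > t])
        if err < best_err:
            best_thresh, best_err = t, err
    return best_thresh, best_err

# Hypothetical data: the best split separates the low outputs from the high ones.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
print(best_split(x, y))   # threshold around 6.5, with a small residual error
```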