sklearn Study Notes 3
Explaining Titanic hypothesis with decision trees
Decision trees are a simple yet powerful supervised learning method: a decision tree model is built from the training data and then used to make predictions.
The main advantage of this model is that a human being can easily understand and reproduce the sequence of decisions (especially if the number of attributes is small) taken to predict the target class of a new instance. This is very important for tasks such as medical diagnosis or credit approval, where we want to show a reason for the decision, rather than just saying this is what the training data suggests (which is, by definition, what every supervised learning method does).
In this section, we will show you through a working example what decision trees look like, how they are built, and how they are used for prediction.
The problem we would like to solve is to determine whether a Titanic passenger would have survived, given the passenger's age, class, and sex. We will work with the Titanic dataset, which can be downloaded from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt. Each instance in the dataset has the following form:
"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","StLouis, MO","B-5","24160 L221","2","female"
The list of attributes is: Ordinal, Class, Survived (0=no, 1=yes), Name, Age, Port of Embarkation, Home/Destination, Room, Ticket, Boat, and Sex. We will start by loading the dataset into a numpy array.
import csv
import numpy as np

with open('C:/Users/Administrator/Desktop/data/titanic.csv', 'rt') as csvfile:
    titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # Header contains feature names
    row = next(titanic_reader)
    feature_names = np.array(row)

    # Load dataset, and target classes
    titanic_X, titanic_y = [], []
    for row in titanic_reader:
        titanic_X.append(row)
        titanic_y.append(row[2])  # The target value is "survived"

titanic_X = np.array(titanic_X)
titanic_y = np.array(titanic_y)
The code shown uses the Python csv module to load the data.
print(feature_names)
print(titanic_X[0], titanic_y[0])
['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest' 'room'
'ticket' 'boat' 'sex']
['1' '1st' '1' 'Allen, Miss Elisabeth Walton' '29.0000' 'Southampton'
'St Louis, MO' 'B-5' '24160 L221' '2' 'female'] 1
Preprocessing the data
The first step we must take is to select the attributes we will use for learning:
# we keep class, age and sex
titanic_X = titanic_X[:, [1, 4, 10]]
feature_names = feature_names[[1, 4, 10]]
We have selected feature numbers 1, 4, and 10 (that is, class, age, and sex), based on the assumption that the remaining attributes have no effect on the passenger's survival.
Sometimes feature selection is done by hand, based on our knowledge of the problem domain and the machine learning method we have chosen. At other times, it is done with automated tools.
Very specific attributes (such as Name in our case) could result in overfitting (consider a tree that just asks if the name is X, she survived); attributes where there is a small number of instances with each value present a similar problem (they might not be useful for generalization). We will use class, age, and sex because, a priori, we expect them to have influenced the passenger's survival.
Now, our learning data looks like:
print(feature_names)
print(titanic_X[12], titanic_y[12])
['pclass' 'age' 'sex']
['1st' 'NA' 'female'] 1
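Here we printed sample number 12 because it raises a problem we need to solve: one of its features (age) is missing. Missing values are a very common problem in datasets. In this case, we decided to replace the missing values with the mean age in the training set. We could also have taken a different approach, for example, using the mode or the median of the training set. When we replace missing values, we must be aware that we are modifying the original problem, so we have to be very careful about what we are doing. This is a general rule in machine learning: when we change the data, we should be very clear about what we are changing, to avoid affecting the accuracy of the final results.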
# We have missing values for age
# Assign the mean value
ages = titanic_X[:, 1]
mean_age = np.mean(titanic_X[ages != 'NA', 1].astype(float))
titanic_X[titanic_X[:, 1] == 'NA', 1] = mean_age
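The text before this snippet mentions the median as an alternative to the mean. A minimal sketch of that variant follows, using a small made-up list of ages rather than the real dataset; the helper name `impute_missing` and the sample values are assumptions for illustration only.

```python
from statistics import mean, median

def impute_missing(values, strategy=mean, missing="NA"):
    """Replace `missing` markers with a statistic of the known values."""
    known = [float(v) for v in values if v != missing]
    fill = strategy(known)
    return [fill if v == missing else float(v) for v in values]

# Toy ages (hypothetical, not taken from the Titanic file)
ages = ["29.0", "NA", "2.0", "30.0", "NA", "71.0"]

ages_mean = impute_missing(ages, strategy=mean)      # gaps filled with 33.0
ages_median = impute_missing(ages, strategy=median)  # gaps filled with 29.5
print(ages_mean)
print(ages_median)
```

With skewed data (a few very old or very young passengers), the median is less affected by extreme values than the mean, which is one reason to consider it.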
The implementation of decision trees in scikit-learn expects as input a list of real-valued features, and the decision rules of the model would be of the form:
Feature <= value
For example, age <= 20.0. Our attributes (except for age) are categorical; that is, they correspond to a value taken from a discrete set such as male and female. So, we have to convert categorical data into real values. Let's start with the sex feature. The preprocessing module of scikit-learn includes a LabelEncoder class, whose fit and transform methods convert a categorical set into integers 0..K-1, where K is the number of different classes in the set (in the case of sex, just 0 or 1):
# Encode sex
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(titanic_X[:, 2])
print("Categorical classes:",label_encoder.classes_)
Categorical classes: ['female' 'male']
integer_classes = label_encoder.transform(label_encoder.classes_)
print("Integer classes:", integer_classes)
t = label_encoder.transform(titanic_X[:, 2])
titanic_X[:, 2] = t
Integer classes: [0 1]
The last two statements transform the values of the sex attribute into 0-1 values, and modify the training set.
print(feature_names)
print(titanic_X[12], titanic_y[12])
['pclass' 'age' 'sex']
['1st' '31.19418104265403' '0'] 1
We still have a categorical attribute: class. We could use the same approach and convert its three classes into 0, 1, and 2. This transformation implicitly introduces an ordering between classes, something that is not an issue in our problem. However, we will try a more general approach that does not assume an ordering and is widely used to convert categorical classes into real-valued attributes. We will introduce an additional encoder and convert the class attribute into three new binary features, each of them indicating whether the instance has that feature value (1) or not (0). This is called one-hot encoding, and it is a very common way of managing categorical attributes for methods that expect real-valued input:
from sklearn.preprocessing import OneHotEncoder
# Fit a new label encoder on the class column (the earlier one was fitted on sex)
enc = LabelEncoder()
label_encoder = enc.fit(titanic_X[:, 0])
integer_classes = label_encoder.transform(label_encoder.classes_).reshape(3, 1)
enc = OneHotEncoder()
one_hot_encoder = enc.fit(integer_classes)
# First, convert classes to 0-(N-1) integers using label_encoder
num_of_rows = titanic_X.shape[0]
t = label_encoder.transform(titanic_X[:, 0]).reshape(num_of_rows, 1)
# Second, create a sparse matrix with three columns, each one indicating if the instance belongs to the class
new_features = one_hot_encoder.transform(t)
# Add the new features to titanic_X
titanic_X = np.concatenate([titanic_X, new_features.toarray()], axis=1)
# Eliminate converted columns
titanic_X = np.delete(titanic_X, [0], 1)
# Update feature names
feature_names = ['age', 'sex', 'first_class', 'second_class', 'third_class']
# Convert to numerical values
titanic_X = titanic_X.astype(float)
titanic_y = titanic_y.astype(float)
The preceding code first converts the classes into integers and then uses the OneHotEncoder class to create the three new attributes that are added to the array of features. It finally eliminates the original class feature from the training data.
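To make the idea of one-hot encoding concrete, here is a small library-free sketch of what the transformation does to a column of class labels. The function name `one_hot` and the four sample values are illustrative assumptions; this is not how scikit-learn implements OneHotEncoder internally.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))                  # e.g. ['1st', '2nd', '3rd']
    index = {c: i for i, c in enumerate(categories)}  # category -> column position
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0                           # exactly one column is set
        rows.append(row)
    return categories, rows

# Toy column of passenger classes (made up for illustration)
categories, encoded = one_hot(["1st", "3rd", "2nd", "1st"])
print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1.0, so no ordering between the three classes is implied.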
print(feature_names)
print(titanic_X[0], titanic_y[0])
['age', 'sex', 'first_class', 'second_class', 'third_class']
[ 29. 0. 1. 0. 0.] 1.0
We now have a suitable learning set for scikit-learn to learn a decision tree. Also, standardization is not an issue for decision trees because the relative magnitude of features does not affect the classifier performance.
The preprocessing step is usually underestimated in machine learning methods, but as we can see even in this very simple example, it can take some time to make data look as our methods expect. It is also very important in the overall machine learning process; if we fail in this step (for example, incorrectly encoding attributes, or selecting the wrong features), the following steps will fail, no matter how good the method we use for learning.
Training a decision tree classifier
Now to the interesting part; let's build a decision tree from our training data. As usual, we will first separate training and testing data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_X,titanic_y, test_size=0.25, random_state=33)
Now, we can create a new DecisionTreeClassifier and use the fit method of the classifier to do the learning job.
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,min_samples_leaf=5)
clf = clf.fit(X_train,y_train)
DecisionTreeClassifier accepts (as most learning methods) several hyperparameters that control its behavior. In this case, we used the Information Gain (IG) criterion for splitting learning data, told the method to build a tree of at most three levels, and to accept a node as a leaf if it includes at least five training instances. To explain this and show how decision trees work, let's visualize the model built. The following code assumes you are using IPython and that your Python distribution includes the pydotplus module. It generates Graphviz code from the tree and assumes that Graphviz itself is installed. For more information about Graphviz, please refer to http://www.graphviz.org/.
import pydotplus
from io import StringIO
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=['age','sex','1st_class','2nd_class', '3rd_class'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('titanic.png')
from IPython.core.display import Image
Image(filename='titanic.png')
You might be asking how our method decides which question should be asked at each step. The answer is Information Gain (IG) (or the Gini index, which is a similar measure of disorder used by scikit-learn). IG measures how much entropy we lose if we answer the question or, alternatively, how much surer we are after answering it. Entropy is a measure of disorder in a set. Zero entropy means all values are the same (in our case, all instances belong to the same target class), while entropy reaches its maximum when there is an equal number of instances of each class (in our case, when half of the instances correspond to survivors and the other half to non-survivors). At each node, we have a certain number of instances (starting from the whole dataset), and we measure its entropy. Our method selects the question that yields the most homogeneous partitions (those with the lowest entropy) when we consider separately the instances for which the answer is yes and those for which it is no, that is, the question after which the entropy decreases the most.
Interpreting the decision tree
As you can see in the tree, at the beginning of the decision tree growing process, you have the 984 instances in the training set, 662 of them corresponding to class 0 (fatalities), and 322 of them to class 1 (survivors). The measured entropy for this initial group is about 0.9121. From the possible list of questions we can ask, the one that produces the greatest information gain is: Was she a woman? (remember that the female category was encoded as 0). If the answer is yes, entropy is almost the same, but if the answer is no, it is greatly reduced (the proportion of men who died was much greater than the general proportion of casualties). In this sense, the woman question seems to be the best to ask. After that, the process continues, working in each node only with the instances that have feature values that correspond to the questions in the path to the node.
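The entropy value quoted for the root node can be checked by hand. The sketch below computes Shannon entropy from class counts and shows how information gain compares a parent node with its children. The root counts (662 and 322) come from the text; the split used for the gain example is a made-up toy case, since the child counts of the real tree are not listed here, and the function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class-count list such as [662, 322]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Root node from the text: 662 fatalities and 322 survivors
root_entropy = entropy([662, 322])
print(round(root_entropy, 4))   # 0.9121

# Toy split (made-up counts): 10 instances divided into two groups of 5
toy_gain = information_gain([5, 5], [[4, 1], [1, 4]])
print(round(toy_gain, 4))
```

A perfectly balanced two-class node has an entropy of exactly 1 bit, so the root node here is already somewhat skewed towards fatalities.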
If you look at the tree, in each node we have: the question, the initial Shannon entropy, the number of instances we are considering, and their distribution with respect to the target class. In each step, the number of instances gets reduced to those that answer yes (the left branch) and no (the right branch) to the question posed by that node. The process continues until a certain stopping criterion is met (in our case, until we have a fourth-level node, or the number of considered samples is lower than five).
At prediction time, we take an instance and start traversing the tree, answering the questions based on the instance features, until we reach a leaf. At this point, we look at how many instances of each class we had in the training set, and select the class to which most instances belonged.
For example, consider the question of determining if a 10-year-old girl from first class would have survived. The answer to the first question (was she female?) is yes, so we take the left branch of the tree. In the two following questions the answers are no (was she from third class?) and yes (was she from first class?), so we take the left and right branch respectively. At this time, we have reached a leaf. In the training set, we had 102 people with these attributes, 97 of them survivors. So, our answer would be survived.
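The traversal just described can be sketched with a hand-written tree. Only the path followed by the first-class girl (female, not third class, first class, ending in a leaf with 5 fatalities and 97 survivors) is taken from the text. Every other leaf count, the 0.5 thresholds, and the dictionary layout are placeholders assumed for illustration, not the actual fitted tree.

```python
# Feature order used in the text: age, sex, first_class, second_class, third_class
FEATURES = ["age", "sex", "first_class", "second_class", "third_class"]

def leaf(counts):
    return {"counts": counts}            # [n_class_0, n_class_1] seen in training

def node(feature, threshold, left, right):
    return {"feature": feature, "threshold": threshold, "left": left, "right": right}

# Only the leaf [5, 97] follows the text; all other leaves are made-up placeholders.
toy_tree = node("sex", 0.5,
    node("third_class", 0.5,
        node("first_class", 0.5, leaf([10, 60]), leaf([5, 97])),
        leaf([80, 70])),
    leaf([560, 95]))

def predict(tree, instance):
    """Follow `feature <= threshold` tests down to a leaf, then take the majority class."""
    while "counts" not in tree:
        value = instance[FEATURES.index(tree["feature"])]
        tree = tree["left"] if value <= tree["threshold"] else tree["right"]
    counts = tree["counts"]
    return counts.index(max(counts))

girl = [10.0, 0.0, 1.0, 0.0, 0.0]        # 10 years old, female (0), first class
print(predict(toy_tree, girl))           # 1, i.e. survived
```

Because female is encoded as 0, the test `sex <= 0.5` is true for women, which is why a "yes" answer corresponds to the left branch.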
In general, we found reasonable results: the group with more casualties (449 from 496) corresponded to adult men from second or third class, as you can check in the tree. Most girls from first class, on the other hand, survived. Let's measure the accuracy of our method in the training set (we will first define a helper function to measure the performance of a classifier):
from sklearn import metrics

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

measure_performance(X_train, y_train, clf, show_classification_report=False, show_confusion_matrix=False)
Accuracy:0.838
Our tree has an accuracy of 0.838 on the training set. But remember that this is not a good indicator. This is especially true for decision trees as this method is highly susceptible to overfitting. Since we did not separate an evaluation set, we should apply cross-validation. For this example, we will use an extreme case of cross-validation, named leave-one-out cross-validation. For each instance in the training sample, we train on the rest of the sample, and evaluate the model built on the only instance left out. After performing as many classifications as training instances, we calculate the accuracy simply as the proportion of times our method correctly predicted the class of the left-out instance, and find that it is a little lower (as we expected) than the resubstitution accuracy on the training set.
from sklearn.model_selection import LeaveOneOut
from scipy.stats import sem

def loo_cv(X_train, y_train, clf):
    # Perform leave-one-out cross-validation:
    # one fit and one prediction per training instance
    loo = LeaveOneOut()
    scores = np.zeros(X_train.shape[0])
    for train_index, test_index in loo.split(X_train):
        X_train_cv, X_test_cv = X_train[train_index], X_train[test_index]
        y_train_cv, y_test_cv = y_train[train_index], y_train[test_index]
        clf = clf.fit(X_train_cv, y_train_cv)
        y_pred = clf.predict(X_test_cv)
        scores[test_index] = metrics.accuracy_score(y_test_cv.astype(int), y_pred.astype(int))
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

loo_cv(X_train, y_train, clf)
Mean score: 0.837 (+/-0.012)
The main advantage of leave-one-out cross-validation is that it allows almost as much data for training as we have available, so it is particularly well suited for those cases where data is scarce. Its main problem is that training a different classifier for each instance could be very costly in terms of computation time.
A big question remains here: how did we select the hyperparameters for our method instantiation? This is a general problem, called model selection.
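The mechanics of leave-one-out can be seen in a few lines of plain Python. This sketch replaces the decision tree with a trivial stand-in that always predicts the most frequent training label, and it runs on made-up labels; both are assumptions chosen to keep the example self-contained.

```python
from collections import Counter

def leave_one_out(n):
    """Yield (train_indices, test_index) pairs, one per instance."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def majority_class(labels):
    """Stand-in 'classifier': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]

# Toy labels (made up); a real run would fit the decision tree instead
y = [0, 0, 0, 1, 1, 0, 1, 0]

hits = 0
for train_idx, test_idx in leave_one_out(len(y)):
    prediction = majority_class([y[j] for j in train_idx])
    hits += int(prediction == y[test_idx])

loo_accuracy = hits / len(y)
print(loo_accuracy)   # 0.625
```

The loop runs once per instance, which is exactly why the cost grows with the size of the training set.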
Random Forests – randomizing decisions
A common criticism of decision trees is that once the training set is divided after answering a question, it is not possible to reconsider this decision. For example, if we divide men and women, every subsequent question would be only about men or women, and the method could not consider another type of question (say, age less than a year, irrespective of the gender). Random Forests try to introduce some level of randomization in each step, proposing alternative trees and combining them to get the final prediction. These types of algorithms that consider several classifiers answering the same question are called ensemble methods. In the Titanic task, it is probably hard to see this problem because we have very few features, but consider the case when the number of features is in the order of thousands.
Random Forests propose to build a decision tree based on a subset of the training instances (selected randomly, with replacement), but using a small random number of features at each split from the feature set. This tree growing process is repeated several times, producing a set of classifiers. At prediction time, each grown tree, given an instance, predicts its target class exactly as decision trees do. The class that most of the trees vote for (that is, the class most predicted by the trees) is the one suggested by the ensemble classifier.
In scikit-learn, using Random Forests is as simple as importing RandomForestClassifier from the sklearn.ensemble module, and fitting the training data as follows:
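The three ingredients described above (sampling instances with replacement, restricting the candidate features, and majority voting) can be sketched independently of any library. The function names, the seed, and the vote list below are illustrative assumptions; this is not scikit-learn's implementation.

```python
import random
from collections import Counter

def bootstrap_sample(n_instances, rng):
    """Draw n_instances indices with replacement (some repeat, some are left out)."""
    return [rng.randrange(n_instances) for _ in range(n_instances)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Return the class predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(33)
print(bootstrap_sample(8, rng))          # indices for one tree's training sample
print(random_feature_subset(5, 2, rng))  # candidate features for one split

# Made-up votes from five trees for one passenger
print(majority_vote([1, 0, 1, 1, 0]))    # 1
```

Because each tree sees a different sample and different candidate features, the trees make different mistakes, and voting averages some of those mistakes away.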
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=33)
clf = clf.fit(X_train, y_train)
loo_cv(X_train, y_train, clf)
Mean score: 0.817 (+/-0.012)
We find that results are actually worse for Random Forests. It seems that introducing randomization was, after all, not a good idea because the number of features was too small. However, for bigger datasets, with a bigger number of features, Random Forests is a very fast, simple, and popular method to improve accuracy, retaining the virtues of decision trees. Actually, in the next section, we will use them for regression.
Evaluating the performance
The final step in every supervised learning task should be to evaluate our best classifier on the previously unseen data, to get an idea of its prediction performance. Remember, this step should not be used to select among competing methods or parameters. That would be cheating (because again, we risk overfitting the new data). So, in our case, let's measure the performance of decision trees on the testing data.
clf_dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf_dt.fit(X_train, y_train)
measure_performance(X_test, y_test, clf_dt)
Accuracy:0.793

Classification report
             precision    recall  f1-score   support

        0.0       0.77      0.96      0.85       202
        1.0       0.88      0.54      0.67       127

avg / total       0.81      0.79      0.78       329

Confusion matrix
[[193   9]
 [ 59  68]]
From the classification results and the confusion matrix, it seems that our method tends to predict non-survival too often: 59 of the 127 actual survivors in the test set were classified as fatalities.
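The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the helper name `precision_recall` is illustrative.

```python
# Confusion matrix from the text: rows are true classes, columns are predictions
confusion = [[193, 9],
             [59, 68]]

def precision_recall(matrix, cls):
    """Precision and recall for class index `cls` of a square confusion matrix."""
    true_positive = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)   # column total
    actual = sum(matrix[cls])                     # row total
    return true_positive / predicted, true_positive / actual

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total
print(round(accuracy, 3))                          # 0.793

for cls in (0, 1):
    p, r = precision_recall(confusion, cls)
    print(cls, round(p, 2), round(r, 2))
```

The low recall for class 1 (survivors) is the numerical form of the observation above: the tree misses almost half of the people who actually survived.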