[Machine Learning in Practice] Working with the Breast-Cancer Dataset

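In this post we take one dataset from the UCI repository, the Breast-Cancer dataset, and build a classifier for it using existing machine learning methods.

Link to the code for this post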

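Dataset overview

The dataset is available at: link

On that page, Data Set Description opens the documentation for the data, and Data Folder leads to the download location for the dataset files.

The files we use here are: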

breast-cancer-wisconsin.data
breast-cancer-wisconsin.names

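Of these two files, the first (link) is the data file and the second (link) is the documentation for the data.

With data like this, the first step should be to read the documentation to get a basic understanding of the data.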

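Preprocessing the data

The documentation tells us that the file has 11 columns: column 1 is the ID number, columns 2-10 are the features, and column 11 is the label (2 for benign, 4 for malignant). The meaning of each feature is given in the documentation, but we do not need to be concerned with the medical interpretation here. The relevant part of the documentation reads: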

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

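The documentation also tells us a few other things:

The dataset contains 699 records
The dataset has 16 missing values, each denoted by "?"
The dataset has 458 benign records and 241 malignant records

Handling missing values and splitting the dataset

Because only a few records are incomplete (16 rows), for now we simply drop every row that contains a "?". Together with reading the data and adding the column headers, the code is as follows: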

# import the packages
import numpy as np
import pandas as pd

DATA_PATH = "breast-cancer-wisconsin.data"

# create the column names
columnNames = [
    'Sample code number',
    'Clump Thickness',
    'Uniformity of Cell Size',
    'Uniformity of Cell Shape',
    'Marginal Adhesion',
    'Single Epithelial Cell Size',
    'Bare Nuclei',
    'Bland Chromatin',
    'Normal Nucleoli',
    'Mitoses',
    'Class'
]

# the data file has no header row, so the column names are passed explicitly
data = pd.read_csv(DATA_PATH, names=columnNames)

# show the shape of the data
print(data.shape)

# replace "?" with the standard missing-value marker
data = data.replace(to_replace="?", value=np.nan)

# then drop every row that has a missing value
data = data.dropna(how='any')
print(data.shape)

The output is:

(699, 11)

(683, 11)

As we can see, the rows with missing values have now been dropped.
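As an alternative to replacing "?" after loading, pandas can treat it as missing while parsing. The sketch below assumes the same comma-separated layout and uses a small inline sample (three made-up rows, not taken from the real file) so that it runs without the download; `na_values` is a standard `read_csv` parameter, while the names `sample`, `sample_data` and `sample_clean` are only for illustration.

```python
import io

import pandas as pd

column_names = [
    'Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
    'Uniformity of Cell Shape', 'Marginal Adhesion',
    'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
    'Normal Nucleoli', 'Mitoses', 'Class',
]

# Three made-up rows in the same layout as the data file;
# the second row has a "?" in the Bare Nuclei column.
sample = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,4,5,1,2,?,7,3,1,4\n"
    "3,3,1,1,1,2,2,3,1,1,2\n"
)

# na_values="?" turns every "?" into NaN while the file is parsed
sample_data = pd.read_csv(sample, names=column_names, na_values="?")

# count the missing values per column, then drop incomplete rows
missing_per_column = sample_data.isna().sum()
sample_clean = sample_data.dropna(how="any")

print(sample_data.shape, "->", sample_clean.shape)
print(missing_per_column["Bare Nuclei"])
print(sample_clean["Bare Nuclei"].dtype)
```

One side effect of parsing this way is that the Bare Nuclei column is read as a numeric column, whereas with the replace-after-loading approach it stays a column of strings until something converts it.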

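We can access a specific attribute with an expression such as data['Class'].

Next we split the dataset into two parts, a training set and a test set, using train_test_split, which performs the random split for us (function documentation).

The split looks like this: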

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[columnNames[1:10]],  # features
    data[columnNames[10]],    # labels
    test_size=0.25,
    random_state=33
)

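The resulting variables are:

X_train: features of the training set
X_test: features of the test set
y_train: labels of the training set
y_test: labels of the test set

Because this is supervised learning, every record has a label, and we assume the labels are 100% correct.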
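To see what this split produces, the sketch below runs train_test_split with the same test_size and random_state on a placeholder table of 683 rows (the row count after dropping missing values). The feature values and labels are synthetic stand-ins, so only the resulting sizes are meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 683  # rows left after dropping the records with missing values
rng = np.random.default_rng(0)

# synthetic stand-ins: 9 integer features in 1..10 and labels 2 or 4
features = pd.DataFrame(rng.integers(1, 11, size=(n_rows, 9)))
labels = pd.Series(rng.choice([2, 4], size=n_rows))

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    features, labels, test_size=0.25, random_state=33
)

# test_size=0.25 is rounded up to a whole number of rows
print(Xs_train.shape, Xs_test.shape)  # (512, 9) (171, 9)
```

The 171 test rows agree with the total support shown in the classification report at the end of the post.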

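Applying machine learning models

Before applying a model, each feature should be rescaled to zero mean and unit variance, so that the trained model is not dominated by dimensions with large values.

Here we use StandardScaler from scikit-learn (doc link).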

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # fit the mean and std on the training data, then scale it
X_test = ss.transform(X_test)        # reuse the training statistics for the test data
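What the scaler computes is z = (x - mean) / std per feature, with the mean and standard deviation taken from the training rows only. The following sketch checks this by hand on a tiny made-up array; the arrays `train` and `test` are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny made-up example: 3 training rows, 1 test row, 2 features
train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 60.0]])
test = np.array([[3.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# the same computation by hand, using training statistics only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

Calling fit_transform on the test data instead would compute a separate mean and standard deviation for it, so the two sets would no longer be on the same scale.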

Next we build the machine learning models. Here we use Logistic Regression and SVM:

# use logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y = lr.predict(X_test)

# use a linear SVM
from sklearn.svm import LinearSVC

lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
svm_y = lsvc.predict(X_test)
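As an optional variation, not what the code above does, the scaling and the classifier can be chained with scikit-learn's make_pipeline so that the scaler is always fitted on the training data only. The sketch uses synthetic data from make_classification as a stand-in for the real features, so the accuracy it prints says nothing about the Breast-Cancer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data with 9 features, like the real feature table
Xd, yd = make_classification(n_samples=200, n_features=9, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.25, random_state=33
)

# fit() scales the training data and trains the model;
# score() applies the already-fitted scaler before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(Xd_train, yd_train)
pipe_accuracy = pipe.score(Xd_test, yd_test)

print(pipe_accuracy)
```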

Evaluating the classifiers

First we print the accuracy using each classifier's built-in .score method:



# now we will check the performance of the classifiers
# classification_report is imported here and used later for a per-class report
from sklearn.metrics import classification_report

# the `.score` method reports the accuracy on the given data
print('Accuracy of the LogisticRegression: ', lr.score(X_test, y_test))
# print('Accuracy on the train dataset: ', lr.score(X_train, y_train))
# print('Accuracy on the predict result (should be 1.0): ', lr.score(X_test, lr_y))
print('Accuracy of the SVM: ', lsvc.score(X_test, y_test))

The output is:

Accuracy of the LogisticRegression:  0.953216374269

Accuracy of the SVM:  0.959064327485
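Accuracy here is simply the fraction of test samples whose predicted label equals the true label. The sketch below shows that computation on a few made-up labels, then converts the two reported accuracies back into counts, assuming the test set has 171 rows (25% of 683, rounded up). The helper `accuracy` is a hypothetical name, not part of scikit-learn.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# made-up labels: 5 of the 6 predictions are correct
toy_true = [2, 2, 4, 4, 2, 4]
toy_pred = [2, 2, 4, 2, 2, 4]
toy_accuracy = accuracy(toy_true, toy_pred)

# convert the reported accuracies into numbers of correct predictions
n_test = 171
lr_correct = round(0.953216374269 * n_test)
svm_correct = round(0.959064327485 * n_test)

print(toy_accuracy)
print(lr_correct, svm_correct)  # 163 164
```

So the two classifiers differ by a single test sample, which is worth keeping in mind before reading much into the gap between them.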

In addition, we can use classification_report to get a more detailed breakdown of a classifier's performance:

print(classification_report(y_test, svm_y, target_names=['Benign', 'Malignant']))

The result is:

             precision    recall  f1-score   support

     Benign       0.96      0.98      0.97       111

  Malignant       0.96      0.92      0.94        60

avg / total       0.96      0.96      0.96       171
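The figures in the report all derive from the confusion matrix. The post does not print that matrix, so the one below is a reconstruction: it is inferred from the reported SVM accuracy (0.959064327485, i.e. 164 of 171 correct) together with the per-class recall and support, and should be read as such. From it, precision, recall and F1 can be recomputed as a check.

```python
import numpy as np

# rows: true class, columns: predicted class, order: Benign, Malignant
# (reconstructed from the reported numbers, not printed in the original)
cm = np.array([
    [109, 2],   # 111 benign samples: 109 predicted benign, 2 malignant
    [5, 55],    # 60 malignant samples: 5 predicted benign, 55 malignant
])

correct = cm.diagonal()
precision = correct / cm.sum(axis=0)  # correct / all predicted as that class
recall = correct / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
cm_accuracy = cm.trace() / cm.sum()

for name, p, r, f in zip(['Benign', 'Malignant'], precision, recall, f1):
    print(f"{name:>10} {p:.2f} {r:.2f} {f:.2f}")
print(f"accuracy {cm_accuracy:.4f}")
```

Under this reconstruction, 5 malignant samples are classified as benign, which is the kind of error that matters most in this setting and is what the lower recall of 0.92 for the Malignant class reflects.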