Examples of Scikit-learn Usages
Examples of Machine Learning Toolkit Usage
Scikit-learn
KFold: K-fold cross-validation
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
Reference : http://scikit-learn.org/stable/modules/cross_validation.html#k-fold
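As a small sketch of how the splits might be consumed, the snippet below collects the test indices from each fold and checks that every sample is used for testing exactly once. The variable names (`train_idx`, `test_idx`, `covered`) are illustrative and not from the original.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(["a", "b", "c", "d"])
kf = KFold(n_splits=2)

test_indices = []
for train_idx, test_idx in kf.split(X):
    # The index arrays can be used directly to slice the data.
    print("train:", X[train_idx], "test:", X[test_idx])
    test_indices.extend(test_idx.tolist())

# Across all folds, each sample index appears in a test set exactly once.
covered = sorted(test_indices)
print(covered)
```

Without `shuffle=True`, the folds are consecutive blocks of the data, which is why the first two samples form the first test fold.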
Decision Tree Classification
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
Reference : http://scikit-learn.org/stable/modules/tree.html#classification
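Beyond the predicted label, a fitted tree can also report class probabilities. The following is a minimal sketch on the same toy data; `predict_proba` is part of the `DecisionTreeClassifier` API, while the variable name `proba` is only illustrative.

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# One row per query sample, one column per class (ordered as in clf.classes_).
proba = clf.predict_proba([[2.0, 2.0]])
print(clf.classes_)
print(proba)
```

With only two perfectly separable training points, the leaf that the query falls into contains a single class, so the probability of that class is 1.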
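KNN: k-nearest neighbors
A Chinese idiom helps to understand this algorithm: "one who stays near vermilion turns red, one who stays near ink turns black". In other words, a sample is assigned the label that dominates among its nearest neighbors.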
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
knc.fit(X_train, y_train)
y_pred = knc.predict(X_test)
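To make the idea concrete, here is a hand-written sketch of the majority vote among nearest neighbors, using NumPy only. It is not how scikit-learn implements `KNeighborsClassifier` internally; the helper `knn_predict`, the toy data, and the choice of `k=3` with Euclidean distance are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())       # count the labels among those neighbors
    return votes.most_common(1)[0][0]


# Two small clusters: label 0 near the origin, label 1 near (1, 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
pred_near_ones = knn_predict(X_train, y_train, np.array([1.0, 0.95]))
print(pred_near_origin, pred_near_ones)  # prints: 0 1
```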
Logistic Regression
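>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
>>> # Note: multi_class is deprecated in recent scikit-learn releases (1.5+) and may need to be dropped there.
>>> model = LogisticRegression(penalty='l2', random_state=0, solver='newton-cg', multi_class='multinomial')
>>> model = model.fit(x_train, y_train)  # fit is a method of the estimator and returns it
>>> y_pred = model.predict(x_test)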
Leave One Out
>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Reference : http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo
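Leave One Out can be seen as the extreme case of K-fold cross-validation in which the number of folds equals the number of samples. The sketch below checks this on the same four-element list; the names `loo_splits`, `kf_splits`, and `same` are illustrative.

```python
from sklearn.model_selection import KFold, LeaveOneOut

X = [1, 2, 3, 4]

# Collect (train indices, test indices) pairs as plain lists for easy comparison.
loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X)]
kf_splits = [(train.tolist(), test.tolist())
             for train, test in KFold(n_splits=len(X)).split(X)]

same = loo_splits == kf_splits
print(same)
```

Because one model is trained per sample, this scheme becomes expensive on large data sets.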
train_test_split: random split
Randomly splits arrays or matrices into a training set and a test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
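Parameter test_size
If float, it should be between 0 and 1 and represents the proportion of the data set to include in the test split.
If int, it represents the absolute number of test samples.
If None, the value is set to the complement of the train size.
Parameter random_state
If int, random_state is the seed used by the random number generator.
If a RandomState instance, random_state is the random number generator itself.
If None, the random number generator is the RandomState instance used by np.random.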
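The effect of `test_size` and `random_state` can be checked on the Iris data (150 samples). This is a small sketch; the variable names and the integer value 30 are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# float test_size: fraction of the samples placed in the test set
x_tr, x_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

# int test_size: absolute number of test samples
x_tr_n, x_te_n, y_tr_n, y_te_n = train_test_split(
    iris.data, iris.target, test_size=30, random_state=33)

# Same integer seed and same arguments: the split is reproduced exactly.
x_tr_again, x_te_again, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

print(x_tr.shape, x_te.shape)      # (112, 4) (38, 4)
print(x_tr_n.shape, x_te_n.shape)  # (120, 4) (30, 4)
print((x_te == x_te_again).all())  # True
```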
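StandardScaler: feature standardization
Standardizes the features so that each dimension has zero mean and unit variance. This keeps the predictions from being dominated by features whose values are much larger than the others.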
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
Reference: 《Python机器学习及实践》 https://book.douban.com/subject/26886337
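What the scaler computes can be verified by hand: each column has its training mean subtracted and is divided by its training standard deviation. The sketch below uses a made-up toy matrix (an assumption, not data from the original) and compares the result with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the second feature is on a much larger scale than the first.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)  # learns mean and std from the training data
X_test_std = ss.transform(X_test)        # reuses the training mean and std

# Manual equivalent: (x - mean) / std, computed per column (population std, ddof=0).
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_std, manual))
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

Calling `transform` rather than `fit_transform` on the test set keeps test-set statistics out of the preprocessing step.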
In practice
StandardScaler does not perform well on the Iris data set. Without StandardScaler applied to the features, the results are:
accuracy 0.947368
avg precision 0.96
avg recall 0.95
f1-score 0.95
The code is as follows:
# -*- encoding=utf8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
if __name__ == '__main__':
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    knc = KNeighborsClassifier()
    knc.fit(X_train, y_train)
    y_pred = knc.predict(X_test)
    print("accuracy is %f" % (knc.score(X_test, y_test)))
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
After applying StandardScaler, all four metrics actually dropped:
accuracy 0.894737
avg precision 0.92
avg recall 0.89
f1-score 0.90
The code that uses StandardScaler is as follows:
# -*- encoding=utf8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
if __name__ == '__main__':
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
    # Standardize the features so that each dimension has unit variance and zero mean,
    # which keeps the predictions from being dominated by features with large values.
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    knc = KNeighborsClassifier()
    knc.fit(X_train, y_train)
    y_pred = knc.predict(X_test)
    print("accuracy is %f" % (knc.score(X_test, y_test)))
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
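This is a strange result that needs further investigation. Note that the gap amounts to 2 of the 38 test samples (36/38 ≈ 0.947 versus 34/38 ≈ 0.895) on a single train/test split.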
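One way to probe this further is to compare the two setups over several splits instead of a single one. The sketch below does so under assumptions that are not in the original: 5-fold stratified cross-validation, a pipeline built with `make_pipeline`, and seed 0.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# The same folds are used for both setups so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

raw_scores = cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv=cv)

# The pipeline refits the scaler on the training part of every fold.
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_scores = cross_val_score(scaled_model, iris.data, iris.target, cv=cv)

print("raw:    %.3f +/- %.3f" % (raw_scores.mean(), raw_scores.std()))
print("scaled: %.3f +/- %.3f" % (scaled_scores.mean(), scaled_scores.std()))
```

If the two means lie within each other's spread, the difference seen on one split is probably not meaningful. The four Iris features are all measured in centimetres on similar scales, so standardization may simply have little to offer on this data set.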
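shuffle: random shuffling
This function randomly shuffles the given arrays in the same order (for example the samples and their labels), so that corresponding entries stay aligned.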
from sklearn.utils import shuffle
x = [1,2,3,4]
y = [1,2,3,4]
x,y = shuffle(x,y)
Out:
x : [1,4,3,2]
y : [1,4,3,2]
Reference : http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
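That the correspondence between the arrays survives the shuffle can be checked directly. In this sketch the data values and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.utils import shuffle

x = [10, 20, 30, 40]
y = ["a", "b", "c", "d"]
pairs_before = set(zip(x, y))

# Both sequences are permuted with the same random order.
x_s, y_s = shuffle(x, y, random_state=0)
pairs_after = set(zip(x_s, y_s))

print(x_s, y_s)
print(pairs_before == pairs_after)  # each x value still sits next to its original y value
```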
Classification Report
Precision, recall and F1-score.
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred, target_names=iris.target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.79      1.00      0.88        11
   virginica       1.00      0.84      0.91        19

    accuracy                           0.92        38
   macro avg       0.93      0.95      0.93        38
weighted avg       0.94      0.92      0.92        38
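The per-class numbers in such a report follow from counts of true positives (TP), false positives (FP), and false negatives (FN): precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below recomputes them by hand for one class on made-up labels (an assumption, not the Iris predictions above) and compares with `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for a three-class problem.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Manual computation for class 1.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # predicted 1, truly 1
fp = sum(t != 1 and p == 1 for t, p in zip(y_true, y_pred))  # predicted 1, truly something else
fn = sum(t == 1 and p != 1 for t, p in zip(y_true, y_pred))  # truly 1, predicted something else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn returns one value per class, in sorted label order.
p, r, f, support = precision_recall_fscore_support(y_true, y_pred)

print(precision, recall, f1)  # 0.75 0.75 0.75
print(p[1], r[1], f[1])
```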
XGBoost
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
if __name__ == '__main__':
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)
    xgb = XGBClassifier()
    xgb.fit(x_train, y_train)
    y_pred = xgb.predict(x_test)
    print(classification_report(y_test, y_pred))
Experimental results
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        14
          1       0.93      1.00      0.97        14
          2       1.00      0.90      0.95        10

avg / total       0.98      0.97      0.97        38