kaggle Cross-Validation
The Cross-Validation Procedure
In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments. We divide the data into 5 pieces, each being 20% of the full dataset.

We run an experiment called experiment 1 which uses the first fold as a holdout set, and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we got from using the simple train-test split.
We then run a second experiment, where we hold out data from the second
fold (using everything except the 2nd fold for training the model.) This
gives us a second estimate of model quality.
We repeat this process, using every fold once as the holdout. Putting
this together, 100% of the data is used as a holdout at some point.
Returning to our example above from train-test split, if we have 5000
rows of data, we end up with a measure of model quality based on 5000
rows of holdout (even if we don't use all 5000 rows simultaneously.
Trade-offs Between Cross-Validation and Train-Test Split¶
Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take more time to run, because it estimates models once for each fold. So it is doing more total work.
Given these tradeoffs, when should you use each approach? On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation.
For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
There's no simple threshold for what constitutes a large vs small dataset. If your model takes a couple minute or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment gives the same results, train-test split is probably sufficient.
# First we read the data
import pandas as pd
data = pd.read_csv('../input/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price # Then specify a pipeline of our modeling steps (It can be very difficult to do cross-validation properly if you arent't using pipelines)
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer
my_pipeline = make_pipeline(Imputer(), RandomForestRegressor()) # Finally get the cross-validation scores:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
print(scores) # It is a little surprising that we specify negative mean absolute error in this case. Scikit-learn has a convention where all
# metrics are defined so a high number is better. Using negatives here allows them to be consistent with that convention, though
# negative MAE is almost unheard of elsewhere. print('Mean Absolute Error %2f' %(-1 * scores.mean()))
kaggle Cross-Validation的更多相关文章
- validation set以及cross validation的常见做法
如果给定的样本充足,进行模型选择的一种简单方法是随机地将数据集切分成三部分,分为训练集(training set).验证集(validation set)和测试集(testing set).训练集用来 ...
- 交叉验证(Cross Validation)原理小结
交叉验证是在机器学习建立模型和验证模型参数时常用的办法.交叉验证,顾名思义,就是重复的使用数据,把得到的样本数据进行切分,组合为不同的训练集和测试集,用训练集来训练模型,用测试集来评估模型预测的好坏. ...
- 交叉验证 Cross validation
来源:CSDN: boat_lee 简单交叉验证 hold-out cross validation 从全部训练数据S中随机选择s个样例作为训练集training set,剩余的作为测试集testin ...
- Cross Validation done wrong
Cross Validation done wrong Cross validation is an essential tool in statistical learning 1 to estim ...
- 交叉验证(cross validation)
转自:http://www.vanjor.org/blog/2010/10/cross-validation/ 交叉验证(Cross-Validation): 有时亦称循环估计, 是一种统计学上将数据 ...
- 10折交叉验证(10-fold Cross Validation)与留一法(Leave-One-Out)、分层采样(Stratification)
10折交叉验证 我们构建一个分类器,输入为运动员的身高.体重,输出为其从事的体育项目-体操.田径或篮球. 一旦构建了分类器,我们就可能有兴趣回答类似下述的问题: . 该分类器的精确率怎么样? . 该分 ...
- Cross Validation(交叉验证)
交叉验证(Cross Validation)方法思想 Cross Validation一下简称CV.CV是用来验证分类器性能的一种统计方法. 思想:将原始数据(dataset)进行分组,一部分作为训练 ...
- S折交叉验证(S-fold cross validation)
S折交叉验证(S-fold cross validation) 觉得有用的话,欢迎一起讨论相互学习~Follow Me 仅为个人观点,欢迎讨论 参考文献 https://blog.csdn.net/a ...
- 交叉验证(Cross Validation)简介
参考 交叉验证 交叉验证 (Cross Validation)刘建平 一.训练集 vs. 测试集 在模式识别(pattern recognition)与机器学习(machine lea ...
- cross validation笔记
preface:做实验少不了交叉验证,平时常用from sklearn.cross_validation import train_test_split,用train_test_split()函数将数 ...
随机推荐
- HTTP协议状态代码和错误状态含义的解释
面试互联网公司经常被问的就是HTTP协议的知识,甚至比TCP/IP问的还多,其中HTTP代码的知识也是开发过程中经常会接触的,今天学习所有 HTTP 状态代码及其定义. 代码 指示 2xx ...
- Redis底层探秘(六):对象多态及回收
本篇是我们redis系列的最后一篇,整个系列其实是我学习<redis设计与实现>的笔记,这本书感觉不错,推荐使用redis的小伙伴都可以看看. 整个系列的文字都比较干,很多数据结构和C语言 ...
- DWZ富客户端HTML框架
一.了解 概述:是中国人自己开发的基于jQuery实现的Ajax RIA开源框架. 目的:简单实用.扩展方便(在原有架构基础上扩展方便).快速开发.RIA思路.轻量级 使用:用html扩展的方式来代替 ...
- 服务器上传大小限制 --- 来自 FastAdmin 项目开发的引发的问题 (TODO)
服务器上传大小限制 --- 来自 FastAdmin 项目开发的引发的问题 服务器上传有几个地方修改. FastAdmin 的配置. php.ini 的配置. NGINX 的配置.
- 扒站工具Teleport Pro教程
1.下载软件 http://www.jb51.net/softs/44134.html 2.安装 3.界面 先点开帮助点注册(类似于破解要不全站扒不全) 下面请看ppt, http://www.doc ...
- Gwt第三方组件、框架介绍
介绍一下我接触过的Gwt第三方组件.框架及项目 1. Mygwt 曾经的大名鼎鼎的gwt第三方框架,在某些gwt框架的排名中排名第一.这个框架完全用gwt的方式实现了ext-js的功能,不依赖于ext ...
- 一个分类,两个问题之ArrayList
前段时间,在做一个商品的分类,分类有3级,类似于以下这种形式的: ---食物 ---蔬菜 ---白菜 ---材料 ---鸡肉 ....... 而我需要做的是将取得的一个商品的字符串类型的分类ID集,然 ...
- jenkins学习 03 jenkins配置Maven项目
我们的产品使用Git作为版本管理工具,而jenkins需要git插件来支持git,所以我们需要为jenkins添加git插件. 在Available tab页中找到Git Plugin 点击下方的In ...
- linux串口基本编程
Linux的串口表现为设备文件.Linux的串口设备文件命名一般为/dev/ttySn(n=0.1.2„„),若串口是USB扩展的,则串口设备文件命名多为/dev/ttyUSBn(n=0.1.2„„) ...
- pycharm中 unittests in xxxx 运行模式
pycham中 当你运行时 ,使用的 是 Run "unittests in xxxx" 模式时候,if __name__ == '__main__': 后面的代码是不执行的 ...