1) A Simple Option: Drop Columns with Missing Values

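If those columns hold useful information (in the places where values were not missing), the model loses access to that information when the columns are dropped. Also, if your test data has missing values in places where your training data did not, this will result in an error.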

    data_without_missing_values = original_data.dropna(axis=1)

    # Drop the same columns from both the training and test data
    cols_with_missing = [col for col in original_data.columns
                         if original_data[col].isnull().any()]
    reduced_original_data = original_data.drop(cols_with_missing, axis=1)
    reduced_test_data = test_data.drop(cols_with_missing, axis=1)
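
To make the test-data caveat concrete, here is a minimal sketch on made-up data. The frames `train` and `test` and their column names are hypothetical, chosen only for illustration. The columns to drop are chosen from the training data alone, so a gap that appears only in the test data survives the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'rooms' is complete in train but has a gap in test.
train = pd.DataFrame({"rooms": [2.0, 3.0, 4.0], "area": [50.0, np.nan, 80.0]})
test = pd.DataFrame({"rooms": [np.nan, 3.0], "area": [60.0, 70.0]})

# Columns are selected by looking at the training data only.
cols_with_missing = [col for col in train.columns if train[col].isnull().any()]
reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)

print(cols_with_missing)                  # ['area']
print(reduced_test.isnull().any().any())  # True: 'rooms' still has a NaN in test
```

An estimator that does not accept NaN inputs would fail when asked to predict on `reduced_test`.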

2) A Better Option: Imputation

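Imputation fills in each missing value with a number. The default behavior fills in the mean value of the column. Statisticians have researched more complex strategies, but those strategies typically give no benefit once you plug the results into sophisticated machine learning models.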
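
As a small illustration of the default, the sketch below imputes a made-up array (the values are hypothetical) with the default settings and, for comparison, with `strategy="median"`, which is one of the other strategies `SimpleImputer` accepts.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array with one gap in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [8.0, 60.0]])

mean_filled = SimpleImputer().fit_transform(X)                     # default: column mean
median_filled = SimpleImputer(strategy="median").fit_transform(X)  # column median instead

print(mean_filled[1, 0], mean_filled[2, 1])      # 4.0 30.0
print(median_filled[1, 0], median_filled[2, 1])  # 3.0 20.0
```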

One (of many) nice things about Imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.

    from sklearn.impute import SimpleImputer

    my_imputer = SimpleImputer()
    data_with_imputed_values = my_imputer.fit_transform(original_data)
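
The Pipeline remark above can be sketched as follows. This is only an illustration on synthetic data: the arrays `X` and `y`, the roughly 10% missing rate, and the choice of `RandomForestRegressor` with 10 trees are assumptions, not part of the original example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data (assumed for illustration): y depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Imputation and the model are fitted together as one estimator.
pipeline = make_pipeline(SimpleImputer(),
                         RandomForestRegressor(n_estimators=10, random_state=0))
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

Because the imputer lives inside the pipeline, its fill values are learned only from the data passed to `fit`, and the same fill values are reused whenever `predict` is called.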

3) An Extension To Imputation

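Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which were not collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.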

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # make copy to avoid changing original data (when Imputing)
    new_data = original_data.copy()

    # make new columns indicating what will be imputed
    cols_with_missing = [col for col in new_data.columns
                         if new_data[col].isnull().any()]
    for col in cols_with_missing:
        new_data[col + '_was_missing'] = new_data[col].isnull()

    # Imputation; keep the column names, including the new indicator columns
    my_imputer = SimpleImputer()
    new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                            columns=new_data.columns,
                            index=new_data.index)
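
As an alternative to building the indicator columns by hand, `SimpleImputer` accepts an `add_indicator=True` argument that appends them automatically. The sketch below uses a made-up frame `toy` (hypothetical data) and is a swapped-in variant, not the method used in the example that follows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: only column 'a' has a missing value.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, after the imputed feature columns.
imputer = SimpleImputer(add_indicator=True)
result = imputer.fit_transform(toy)

print(result.tolist())
# [[1.0, 4.0, 0.0], [2.0, 5.0, 1.0], [3.0, 6.0, 0.0]]
```

The first two output columns are the imputed `a` and `b`; the third marks the rows where `a` was originally missing.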

Example (Comparing All Solutions)

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Load data
    melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

    melb_target = melb_data.Price
    melb_predictors = melb_data.drop(['Price'], axis=1)

    # For the sake of keeping the example simple, we'll use only numeric predictors.
    melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

    X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                        melb_target,
                                                        train_size=0.7,
                                                        test_size=0.3,
                                                        random_state=0)

    def score_dataset(X_train, X_test, y_train, y_test):
        """Fit a random forest and return its mean absolute error on the test set."""
        model = RandomForestRegressor()
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        return mean_absolute_error(y_test, preds)

    # Get Model Score from Dropping Columns with Missing Values
    # (simply discard every column that contains a missing value)
    cols_with_missing = [col for col in X_train.columns
                         if X_train[col].isnull().any()]
    reduced_X_train = X_train.drop(cols_with_missing, axis=1)
    reduced_X_test = X_test.drop(cols_with_missing, axis=1)
    print("Mean Absolute Error from dropping columns with Missing Values:")
    print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

    # Get Model Score from Imputation
    # (fill in the missing values)
    my_imputer = SimpleImputer()
    imputed_X_train = my_imputer.fit_transform(X_train)
    imputed_X_test = my_imputer.transform(X_test)
    print("Mean Absolute Error from Imputation:")
    print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

    # Get Score from Imputation with Extra Columns Showing What Was Imputed
    # (add extra columns that flag the missing values)
    imputed_X_train_plus = X_train.copy()
    imputed_X_test_plus = X_test.copy()

    cols_with_missing = [col for col in X_train.columns
                         if X_train[col].isnull().any()]
    for col in cols_with_missing:
        imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
        imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

    # Imputation
    my_imputer = SimpleImputer()
    imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
    imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

    print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
    print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
