Here I list some useful pandas functions in Python for getting familiar with your data. As an example, we load a dataset named housing, which is a DataFrame object. Usually, the first thing to do is look at the top five rows of the dataset with the head() method:

housing = load_housing_data()
housing.head()
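load_housing_data() is not defined in this post, so the sketch below uses a small made-up DataFrame as a stand-in (the name toy and all values are hypothetical). It only illustrates that head() returns the first five rows by default and accepts a row count:

```python
import pandas as pd

# Hypothetical toy data standing in for the housing dataset; values are made up.
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.6, 3.1],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND",
                        "INLAND", "<1H OCEAN", "NEAR BAY"],
})

first_rows = toy.head()   # first five rows by default
first_two = toy.head(2)   # pass a number to get fewer or more rows

print(first_rows)
print(first_two)
```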

The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

housing.info()
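Note that info() prints its report and returns None. If you need the same facts as values, a rough equivalent, sketched here on a made-up DataFrame (toy and its values are hypothetical), is to combine len(), count() and dtypes:

```python
import pandas as pd

# Hypothetical toy data with one missing value; not the real housing dataset.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, float("nan"), 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

toy.info()               # prints row count, dtypes and non-null counts

n_rows = len(toy)        # total number of rows
non_null = toy.count()   # non-null values per column
dtypes = toy.dtypes      # type of each column

print(n_rows)
print(non_null)
print(dtypes)
```

A column whose non-null count is lower than the number of rows has missing values, which is usually the first thing to check in this output.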

The describe() method returns summary statistics for each numeric feature: count, mean, std, min, max and the 25%, 50% (median) and 75% percentiles.

housing.describe()
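As a small illustration on made-up data (toy and its values are hypothetical), the sketch below shows that describe() skips non-numeric columns by default and that the 50% row is the median:

```python
import pandas as pd

# Hypothetical toy data; values are chosen so the statistics are easy to check.
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 10.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "INLAND"],
})

summary = toy.describe()                 # numeric columns only by default
median = toy["median_income"].median()   # same value as the "50%" row

print(summary)
print(median)
```

Here the mean (4.0) is pulled above the median (3.0) by the single large value, which is the kind of skew this table helps you spot.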

For categorical variables, we usually want to see the labels and the count for each label. The value_counts() method does this:

housing["ocean_proximity"].value_counts()
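A minimal sketch on a made-up Series (the labels mimic the ocean_proximity column, but the data is hypothetical) shows the default counts, sorted from most to least frequent, and the normalize=True option for proportions:

```python
import pandas as pd

# Hypothetical categorical column; not the real housing data.
proximity = pd.Series(
    ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR BAY"],
    name="ocean_proximity",
)

counts = proximity.value_counts()                 # count per label, descending
shares = proximity.value_counts(normalize=True)   # fraction per label

print(counts)
print(shares)
```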

That’s it. I’ll add more functions as I come across them in further study.
