1: The Competition

We'll be learning how to generate a submission for a Kaggle competition. Kaggle is a site where you build machine learning algorithms and compete against practitioners around the world. Your algorithm wins if it's the most accurate on a given dataset. Kaggle is a fun way to practice your machine learning skills.

Kaggle has several different competitions on their site. One of them is about predicting which passengers survived the sinking of the Titanic. In this and the next mission, we'll explore the data, train our first model, and prepare our first submission to the competition. However, if you want to actually submit to the Kaggle.com website for evaluation, you'll have to do this on your own computer.

Our data is in .csv format. You can get started with the competition and download the data here.

A good first step is to think logically about the columns and what we're trying to predict. Which variables might logically affect whether a passenger survived? (Reading more about the Titanic might help here.)

We know that women and children were more likely to survive. Thus, Age and Sex are probably good predictors. It's also logical to think that passenger class might affect the outcome, as first class cabins were closer to the deck of the ship. Fare is tied to passenger class, and will probably be highly correlated with it, but might add some additional information. Number of siblings and parents/children will probably be correlated with survival one way or the other, as either there are more people to help you, or more people to think about and try to save.

There's a less clear link between survival and columns like Embarked (maybe there is some information about how close to the top of the ship people's cabins were here), Ticket, and Name.

This step is generally known as acquiring domain knowledge, and it is fairly important to most machine learning tasks. We're looking to engineer the features so that we maximize the information we have about what we're trying to predict. One quick way to sanity-check these hunches is to compare survival rates across groups with pandas, as shown below.
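Here's a minimal sketch, assuming the training csv has already been read into a dataframe called titanic (we'll do that on the next screen):

# Survival rate by sex -- we expect women to survive at a higher rate.
print(titanic.groupby("Sex")["Survived"].mean())
# Survival rate by passenger class -- we expect first class to survive at a higher rate.
print(titanic.groupby("Pclass")["Survived"].mean())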

2: Looking At The Data

We'll be using Python 3, the pandas library, and scikit-learn to analyze our data and create a submission. We'll be coding interactively in code boxes, like the one you see below. If you aren't familiar with Python yet, you might want to look at our courses.

A good second step is to look at high level descriptors of the data. In this case, we can use the pandas .describe() method to look at different characteristics of each numeric column.

Instructions

  • Describe the dataset using print(titanic.describe()).
  • Hit Check to run your code.
  • Then hit Next Step to move on to the next step.

Hint

Type print(titanic.describe()) in the code area, and hit Check.

# We can use the pandas library in Python to read in the csv file.
import pandas

# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pandas.read_csv("titanic_train.csv")
# Print the first 5 rows of the dataframe.
print(titanic.head(5))
print(titanic.describe())

3: Missing Data

When you used .describe() on the titanic dataframe in the last screen, you might have noticed that the Age column has a count of 714, while all the other columns have a count of 891. This indicates that the Age column has missing values -- the count only includes non-missing values, and missing entries show up as null, NA, or NaN.

This means that the data isn't perfectly clean, and we're going to have to clean it ourselves. We don't want to remove the rows with missing values, because more data helps us train a better algorithm. We also don't want to get rid of the whole column, as age is probably fairly important to our analysis.
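If you'd rather count the missing values directly than infer them from the counts in .describe(), pandas can do that too. A quick sketch:

# Count the missing values in the Age column -- should be 891 - 714 = 177.
print(titanic["Age"].isnull().sum())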

There are many strategies for cleaning up missing data, but a simple one is to just fill in all the missing values with the median of all the values in the column.

We can select a single column by indexing the dataframe like a dictionary. This gives us a pandas series:

titanic["Age"]

We can then use the .fillna method on the series to replace any missing values. We pass .fillna the value we want the missing values replaced with.

In our case, we want to fill with the median of the column:

.fillna(titanic["Age"].median())

We then have to assign the result back to the column to replace it:

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

Instructions

Replace all the missing values in the Age column of titanic.

Hint

Use the code

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# The titanic variable is available here.
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

4: Non-Numeric Columns

When we used .describe() two screens ago, you might have also noticed that not all the columns were shown. Only the numeric columns were shown. Several of our columns are non-numeric, which is a problem when it comes time to make predictions -- we can't feed non-numeric columns into a machine learning algorithm and expect it to make sense of them.

We have to either exclude our non-numeric columns when we train our algorithm (Name, Sex, Cabin, Embarked, and Ticket), or find a way to convert them to numeric columns.

We'll ignore the Ticket, Cabin, and Name columns. There isn't much information we can extract from them. Most of the values in the Cabin column are missing (only 204 values out of 891 rows), and it likely isn't a particularly informative column in the first place. The Ticket and Name columns are unlikely to tell us much without some domain knowledge about what the ticket numbers mean, and about which names correlate with characteristics like large or rich families.
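If you want to confirm which columns pandas treats as non-numeric, and just how sparse Cabin is, here's a quick sketch:

# Columns with dtype "object" hold strings, not numbers.
print(titanic.dtypes)
# Count the non-missing values in each column -- Cabin should show 204.
print(titanic.count())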

5: Converting The Sex Column

The Sex column is non-numeric, but we want to keep it around -- it could be very informative. We can convert it to a numeric column by replacing each gender with a numeric code. A machine learning algorithm will then be able to use these categories to make predictions.

To do this, we first have to find all the unique genders in the column (we know male and female are there, but did whoever recorded the dataset use another code for missing values?). We'll also assign a code of 0 to male, and a code of 1 to female.

We can select all the male values in the Sex column with:

titanic.loc[titanic["Sex"] == "male", "Sex"]

We can then replace all these values with 0:

titanic.loc[titanic["Sex"] == "male", "Sex"] = 0

Instructions

Replace all the female values in the Sex column with 1.

Hint

You can use the same code as we used for replacing the male values, but switch the gender and the code.

# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())
# Replace all the occurrences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

6: Converting The Embarked Column

We can now convert the Embarked column to codes the same way we converted the Sex column. The unique values in Embarked are S, C, Q, and missing (nan). Each letter is an abbreviation of an embarkation port name.

Instructions

  • The first step is to replace the missing values in the column.

    • The most common embarkation port is S, so let's assume everyone got on there.
    • Replace all the missing values in the Embarked column with S.
  • We'll assign the code 0 to S, 1 to C and 2 to Q.
    • Replace each value in the Embarked column with its corresponding code.

Hint

You can use the .fillna method to replace missing values. You can use the code from the last screen to replace values with codes.

# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S") titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

7: On To Machine Learning!

Now that we've cleaned up our data a bit, we're ready to explore some machine learning. Let's say our first few rows of data look like this:

Age     Sex     Survived
10      0       0
5       1       1
30      0       0

If we wanted to make predictions about whether someone survived or not from the Age column, we could use a technique called linear regression. Linear regression follows the equation y = mx + b, where y is the value we're trying to predict, m is a coefficient called the slope, x is the value of a column, and b is a constant called the intercept.

We could make predictions by assigning -2 to m, and 20 to b. We'd get this:

Age     Sex     Survived    Predictions
10      0       0           -2 * 10 + 20 = 0
5       1       1           -2 * 5 + 20 = 10
30      0       0           -2 * 30 + 20 = -40

If we turn any prediction greater than 0 into 1, and any prediction 0 or less into 0, we'll end up with this:

Age     Sex     Survived    Predictions
10      0       0           0
5       1       1           1
30      0       0           0
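Here is the same toy calculation written out in code -- a minimal sketch using the made-up coefficients from above, not a fitted model:

import numpy as np

age = np.array([10, 5, 30])
survived = np.array([0, 1, 0])

m = -2  # slope (made up for this example)
b = 20  # intercept (made up for this example)
raw_predictions = m * age + b                    # [0, 10, -40]
predictions = (raw_predictions > 0).astype(int)  # [0, 1, 0]
print(predictions == survived)                   # [ True  True  True]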

This simple model predicts survival fairly well. Linear regression can be a very powerful algorithm, but it has a few downsides:

  • If a column and an outcome aren't related linearly, it won't work well. For example, if survival rates drop with age but rise again for women over 80, a linear model can't pick up that non-linear pattern.
  • It can't give you survival probabilities, only raw values that we have to threshold into whether or not someone will survive.

We'll talk about how to address both of these issues later on. For now, we'll learn how to calculate linear regression coefficients automatically, and how to use multiple columns to predict an outcome.

8: Cross Validation

We can now use linear regression to make predictions on our training set.

We want to train the algorithm on different data than we make predictions on. This is critical if we want to avoid overfitting. Overfitting is what happens when a model fits itself to "noise", not signal. Every dataset has its own quirks that don't exist in the full population. For example, if I asked you to predict the top speed of a car from its horsepower and other characteristics, and gave you a dataset that happened to contain cars with unusually high top speeds, you would create a model that overstated speed. The way to figure out if your model is doing this is to evaluate its performance on data it wasn't trained on.

Every machine learning algorithm can overfit, although some (like linear regression) are much less prone to it. If you evaluate your algorithm on the same dataset that you train it on, it's impossible to know if it's performing well because it overfit itself to the noise, or if it actually is a good algorithm.

Luckily, cross validation is a simple way to avoid overfitting. To cross validate, you split your data into some number of parts (or "folds"). Let's use 3 as an example. You then do this:

  • Combine the first two parts, train a model, make predictions on the third.
  • Combine the first and third parts, train a model, make predictions on the second.
  • Combine the second and third parts, train a model, make predictions on the first.

This way, we generate predictions for the whole dataset without ever evaluating accuracy on the same data we train our model using.
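To make the scheme concrete, here's a minimal sketch of 3-fold splitting done by hand with numpy (the sklearn helper on the next screen does this bookkeeping for us):

import numpy as np

n_rows = 891  # number of rows in the training set
folds = np.array_split(np.arange(n_rows), 3)
for i in range(3):
    test_indices = folds[i]
    train_indices = np.concatenate([folds[j] for j in range(3) if j != i])
    # Train a model on train_indices, then predict on test_indices.
    print(len(train_indices), len(test_indices))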

9: Making Predictions

We can use the excellent scikit-learn library to make predictions. We'll use a helper from sklearn to split the data up into cross validation folds, and then train an algorithm for each fold, and make predictions. At the end, we'll have a list of predictions, with each list item containing predictions for the corresponding fold.

Instructions

Read the code, then hit "Next Step" to move forward.

Hint

Just hit "Next".

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold.
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

10: Evaluating Error

Now that we have predictions, we can evaluate our error.

We'll first need to define an error metric, so we can figure out how accurate our model is. From the Kaggle competition description, the error metric is percentage of correct predictions. We'll use this same metric to evaluate our performance locally.

The metric will basically involve finding the number of values in predictions that are the exact same as their counterparts in titanic["Survived"], and then dividing by the total number of passengers.

Before we can do this, we need to combine the 3 sets of predictions into one column. Since each set of predictions is a numpy array (numpy is a Python scientific computing library), we can use a numpy function to concatenate them into one.

Instructions

  • Figure out what proportion of the values in predictions are the exact same as the values in titanic["Survived"].

    • This calculation should be left as a float (decimal) and assigned to the variable accuracy.

Hint

Find the number of values in predictions that exactly match the values in the same position in titanic["Survived"]. Then divide by the total number of predictions.

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (the only possible outcomes are 1 and 0).
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Count the predictions that exactly match the true values, then divide by the total.
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)

11: Logistic Regression

We have our first predictions! They aren't very good, though, with only 78.3% accuracy. We can instead use logistic regression to output values between 0 and 1.

One good way to think of logistic regression is that it takes the output of a linear regression and maps it to a probability between 0 and 1. The mapping is done using the logistic function (the inverse of the logit). Passing any value through the logistic function maps it to a value between 0 and 1 by "squeezing" the extreme values. This is perfect for us, because we only care about two outcomes.
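If it helps to see the squeezing, here's a minimal sketch of the logistic function:

import numpy as np

def logistic(x):
    # Maps any real number into the range (0, 1).
    return 1 / (1 + np.exp(-x))

print(logistic(-10), logistic(0), logistic(10))
# Prints roughly 0.0000454 0.5 0.9999546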

Sklearn has a class for logistic regression that we can use. We'll also make things easier by using an sklearn helper function to do all of our cross validation and evaluation for us.

Instructions

This step is a demo. Play around with code or advance to the next step.

from sklearn import cross_validation
# The logistic regression class wasn't imported earlier, so import it here.
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

12: Processing The Test Set

Our accuracy is decent, but not great. We can still try a few things to make it better, which we'll talk about in the next mission.

But, we need to make a submission to the competition. To do this, we need to take the exact same steps on the test data that we took on the training data. If we don't do the exact same operations, then we won't be able to make valid predictions on it.

These operations are all the changes we made to the columns before.

Instructions

Process titanic_test the same way we processed titanic:

  • Replace the missing values in the "Age" column with the median age from the training set. It has to be the exact same value we used to fill the missing ages in the training set (not the median of the test set, which is different). Use titanic["Age"].median() to find it.
  • Replace any male values in the Sex column with 0, and any female values with 1.
  • Fill any missing values in the Embarked column with S.
  • In the Embarked column, replace S with 0, C with 1, and Q with 2.

We'll also need to replace a missing value in the Fare column.

  • Use .fillna with the median of the column in the test set to replace this.
  • There are no missing values in the Fare column of the training set, but test sets can sometimes be different.

Hint

You can look back on what we did in the previous screens to see what we did.

titanic_test = pandas.read_csv("titanic_test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S") titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

13: Generating A Submission File

Now we have everything we need to generate a submission for the competition!

First, we have to train an algorithm on the training data. Then, we make predictions on the test set. Finally, we'll generate a csv file with the predictions and passenger ids.

Once you finish all the steps in the code below, you can output to a csv by using submission.to_csv("kaggle.csv", index=False). This will give you everything you need to make a first submission -- it won't give you great accuracy, though (~.75).

Instructions

This step is a demo. Play around with code or advance to the next step.

# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions
})
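Finally, as mentioned above, write the submission out to a csv file:

submission.to_csv("kaggle.csv", index=False)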

14: Next Steps

We just generated a submission file, but the accuracy isn't great when we submit (~.75). The score is lower on the test set than in our cross validation because we're predicting on different data. In the next mission, we'll learn how to generate better features and use better models to increase our score.

Reposted from https://www.dataquest.io/mission/74/getting-started-with-kaggle/3/missing-data
