Getting started with machine learning in Python

Machine learning is a field that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an algorithm, and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it’s also starting to be widely used, making it an extremely practical skill to learn.

In this tutorial, we’ll guide you through the basic principles of machine learning, and how to get started with machine learning in Python. Luckily for us, Python has an amazing ecosystem of libraries that make machine learning easy to get started with. We’ll be using the excellent Scikit-learn, Pandas, and Matplotlib libraries in this tutorial.

If you want to dive more deeply into machine learning, and apply algorithms in your browser, check out our courses here.

The dataset

Before we dive into machine learning, we’re going to explore a dataset, and figure out what might be interesting to predict. The dataset is from BoardGameGeek, and contains data on 80000 board games. Here’s a single boardgame on the site. This information was kindly scraped into csv format by Sean Beck, and can be downloaded here.

The dataset contains several data points about each board game. Here’s a list of the interesting ones:

  • name – name of the board game.
  • playingtime – the playing time (given by the manufacturer).
  • minplaytime – the minimum playing time (given by the manufacturer).
  • maxplaytime – the maximum playing time (given by the manufacturer).
  • minage – the minimum recommended age to play.
  • users_rated – the number of users who rated the game.
  • average_rating – the average rating given to the game by users. (0-10)
  • total_weights – the number of “weights” given by users. Weight is a subjective measure made up by BoardGameGeek. It’s how “deep” or involved a game is. Here’s a full explanation.
  • average_weight – the average of all the subjective weights (0-5).

Introduction to Pandas

The first step in our exploration is to read in the data and print some quick summary statistics. In order to do this, we’ll use the Pandas library. Pandas provides data structures and data analysis tools that make manipulating data in Python much quicker and more effective. The most common data structure is called a dataframe. A dataframe is an extension of a matrix, so we’ll talk about what a matrix is before coming back to dataframes.

Our data file looks like this (we removed some columns to make it easier to look at):

id,type,name,yearpublished,minplayers,maxplayers,playingtime
12333,boardgame,Twilight Struggle,2005,2,2,180
120677,boardgame,Terra Mystica,2012,2,5,150

This is in a format called csv, or comma-separated values, which you can read more about here. Each row of the data is a different board game, and different data points about each board game are separated by commas within the row. The first row is the header row, and describes what each data point is. The entire set of one data point, going down, is a column.

We can easily conceptualize a csv file as a matrix:

    1        2           3                    4
1   id       type        name                 yearpublished
2   12333    boardgame   Twilight Struggle    2005
3   120677   boardgame   Terra Mystica        2012

We removed some of the columns here for display purposes, but you can still get a sense of how the data looks visually. A matrix is a two-dimensional data structure, with rows and columns. We can access elements in a matrix by position. The first row starts with id, the second row starts with 12333, and the third row starts with 120677. The first column is id, the second is type, and so on. Matrices in Python can be used via the NumPy library.
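
To make positional access concrete, here’s a minimal sketch using NumPy with a toy array (not our actual data file):

import numpy

# A matrix can hold only one datatype, so the numbers here
# are stored as strings alongside the text.
matrix = numpy.array([
    ["id", "type", "name", "yearpublished"],
    ["12333", "boardgame", "Twilight Struggle", "2005"],
    ["120677", "boardgame", "Terra Mystica", "2012"],
])

# Access elements by position: row 1, column 2 is "Twilight Struggle".
print(matrix[1, 2])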

A matrix has some downsides, though. You can’t easily access columns and rows by name, and each column has to have the same datatype. This means that we can’t effectively store our board game data in a matrix – the name column contains strings, and the yearpublished column contains integers, which means that we can’t store them both in the same matrix.

A dataframe, on the other hand, can have different datatypes in each column. It also has a lot of built-in niceties for analyzing data, such as looking up columns by name. Pandas gives us access to these features, and generally makes working with data much simpler.
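
Here’s a quick sketch of that difference, using a tiny hand-built dataframe rather than our real data:

import pandas

# A dataframe can mix datatypes: strings in one column, integers in another.
df = pandas.DataFrame({
    "name": ["Twilight Struggle", "Terra Mystica"],
    "yearpublished": [2005, 2012],
})

# Look up a column by name.
print(df["yearpublished"])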

Reading in our data

We’ll now read in our data from a csv file into a Pandas dataframe, using the read_csv method.

# Import the pandas library.
import pandas

# Read in the data.
games = pandas.read_csv("board_games.csv")
# Print the names of the columns in games.
print(games.columns)
Index(['id', 'type', 'name', 'yearpublished', 'minplayers', 'maxplayers',
'playingtime', 'minplaytime', 'maxplaytime', 'minage', 'users_rated',
'average_rating', 'bayes_average_rating', 'total_owners',
'total_traders', 'total_wanters', 'total_wishers', 'total_comments',
'total_weights', 'average_weight'],
dtype='object')

The code above reads the data in, and shows us all of the column names. The columns in the data that weren’t described earlier should be fairly self-explanatory.

print(games.shape)
(81312, 20)

We can also see the shape of the data, which shows that it has 81312 rows, or games, and 20 columns, or data points describing each game.
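
If you want a quick feel for the data itself, a couple of built-in dataframe methods help (output omitted here):

# Show the first few rows of the dataframe.
print(games.head())

# Print summary statistics for each numeric column.
print(games.describe())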

Plotting our target variables

It could be interesting to predict the average score that a human would give to a new, unreleased, board game. This is stored in the average_rating column, which is the average of all the user ratings for a board game. Predicting this column could be useful to board game manufacturers who are thinking of what kind of game to make next, for instance.

We can access a column in a dataframe with Pandas using games["average_rating"]. This will extract a single column from the dataframe.

Let’s plot a histogram of this column so we can visualize the distribution of ratings. We’ll use Matplotlib to generate the visualization. Matplotlib is the main plotting infrastructure in Python, and most other plotting libraries, like seaborn and ggplot, are built on top of Matplotlib.

We import Matplotlib’s plotting functions with import matplotlib.pyplot as plt. We can then draw and show plots.

# Import matplotlib
import matplotlib.pyplot as plt

# Make a histogram of all the ratings in the average_rating column.
plt.hist(games["average_rating"])

# Show the plot.
plt.show()

What we see here is that there are quite a few games with a 0 rating. Apart from those, there’s a fairly normal distribution of ratings, with some right skew, and a mean rating around 6 (once the zeros are removed).
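
As a quick check on that claim, we can compute the mean rating with the zeros filtered out (a sketch – the exact number will depend on your copy of the dataset):

# Mean of average_rating, ignoring games rated exactly 0.
nonzero_ratings = games[games["average_rating"] > 0]["average_rating"]
print(nonzero_ratings.mean())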

Exploring the 0 ratings

Are there truly so many terrible games that were given a 0 rating? Or is something else happening? We’ll need to dive into the data a bit more to check on this.

With Pandas, we can select subsets of data using Boolean series (in Pandas, a one-dimensional vector – a single column or row of data – is known as a series). Here’s an example:

games[games["average_rating"] == 0]

The code above will create a new dataframe, with only the rows in games where the value of the average_rating column equals 0.

We can then index the resulting dataframe to get the values we want. There are two ways to index in Pandas – we can index by the name of the row or column, or we can index by position. Indexing by name looks like games["average_rating"] – this will return the whole average_rating column of games. Indexing by position looks like games.iloc[0] – this will return the whole first row of the dataframe. We can also pass in multiple index values at once – games.iloc[0,0] will return the value in the first row and first column of games. Read more about Pandas indexing here.
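
Here’s a short sketch contrasting these styles of indexing on our games dataframe:

# Index by name -- the whole average_rating column.
ratings = games["average_rating"]

# Index by position -- the whole first row.
first_row = games.iloc[0]

# Index by position in both dimensions -- the value in the
# first row and first column.
first_value = games.iloc[0, 0]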

# Print the first row of all the games with zero scores.
# The .iloc method on dataframes allows us to index by position.
print(games[games["average_rating"] == 0].iloc[0])
# Print the first row of all the games with scores greater than 0.
print(games[games["average_rating"] > 0].iloc[0])
id                                    318
type                            boardgame
name                           Looney Leo
users_rated                             0
average_rating                          0
bayes_average_rating                    0
Name: 13048, dtype: object

id                                  12333
type                            boardgame
name                    Twilight Struggle
users_rated                         20113
average_rating                    8.33774
bayes_average_rating              8.22186
Name: 0, dtype: object

This shows us that the main difference between a game with a 0 rating and a game with a rating above 0 is that the 0 rated game has no reviews – its users_rated column is 0. By filtering out any board games with 0 reviews, we can remove much of the noise.

Removing games without reviews

# Remove any rows without user reviews.
games = games[games["users_rated"] > 0]
# Remove any rows with missing values.
games = games.dropna(axis=0)

We just filtered out all of the rows without user reviews. While we were at it, we also took out any rows with missing values. Many machine learning algorithms can’t work with missing values, so we need some way to deal with them. Filtering them out is one common technique, but it means that we may potentially lose valuable data. Other techniques for dealing with missing data are listed here.
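
One common alternative is imputation – filling missing values in with a statistic such as the column mean instead of dropping the rows. A minimal sketch, shown for illustration only (we stick with dropping rows in this tutorial):

# Fill missing values in each numeric column with that column's mean.
numeric = games._get_numeric_data()
filled = numeric.fillna(numeric.mean())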

Clustering games

We’ve seen that there may be distinct sets of games. One set (which we just removed) was the set of games without reviews. Another set could be a set of highly rated games. One way to figure out more about these sets of games is a technique called clustering. Clustering enables you to find patterns within your data easily by grouping similar rows (in this case, games), together.

We’ll use a particular type of clustering called k-means clustering. Scikit-learn has an excellent implementation of k-means clustering that we can use. Scikit-learn is the primary machine learning library in Python, and contains implementations of most common algorithms, including random forests, support vector machines, and logistic regression. Scikit-learn has a consistent API for accessing these algorithms.

# Import the kmeans clustering model.
from sklearn.cluster import KMeans

# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=5, random_state=1)
# Get only the numeric columns from games.
good_columns = games._get_numeric_data()
# Fit the model using the good columns.
kmeans_model.fit(good_columns)
# Get the cluster assignments.
labels = kmeans_model.labels_

In order to use the clustering algorithm in Scikit-learn, we first initialize it using two parameters – n_clusters defines how many clusters of games we want, and random_state is a random seed we set in order to reproduce our results later. Here’s more information on the implementation.

We then get only the numeric columns from our dataframe. Most machine learning algorithms can’t directly operate on text data, and can only take numbers as input. Getting only the numeric columns removes type and name, which aren’t usable by the clustering algorithm.

Finally, we fit our kmeans model to our data, and get the cluster assignment labels for each row.
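
To get a feel for the clusters before plotting, we can count how many games landed in each one (a quick sketch):

import pandas

# Count the number of games assigned to each cluster label.
print(pandas.Series(labels).value_counts())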

Plotting clusters

Now that we have cluster labels, let’s plot the clusters. One sticking point is that our data has many columns – it’s outside of the realm of human understanding and physics to be able to visualize things in more than 3 dimensions. So we’ll have to reduce the dimensionality of our data, without losing too much information. One way to do this is a technique called principal component analysis, or PCA. PCA takes multiple columns, and turns them into fewer columns while trying to preserve the unique information in each column. To simplify, say we have two columns, total_owners, and total_traders. There is some correlation between these two columns, and some overlapping information. PCA will compress this information into one column with new numbers while trying not to lose any information.

We’ll try to turn our board game data into two dimensions, or columns, so we can easily plot it out.

# Import the PCA model.
from sklearn.decomposition import PCA

# Create a PCA model.
pca_2 = PCA(2)
# Fit the PCA model on the numeric columns from earlier.
plot_columns = pca_2.fit_transform(good_columns)
# Make a scatter plot of each game, shaded according to cluster assignment.
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
# Show the plot.
plt.show()

We first initialize a PCA model from Scikit-learn. PCA isn’t itself a machine learning technique, but Scikit-learn contains a number of tools like this that are useful when preparing data for machine learning. Dimensionality reduction techniques like PCA are widely used when preprocessing data for machine learning algorithms.

We then turn our data into 2 columns, and plot the columns. When we plot the columns, we shade them according to their cluster assignment.

The plot shows us that there are 5 distinct clusters. We could dive more into which games are in each cluster to learn more about what factors cause games to be clustered.
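
Here’s one way to start that dive – pull out the names of a few games from a single cluster (a sketch; which games end up in cluster 0 is arbitrary):

# The cluster labels line up row-for-row with the games dataframe,
# so a Boolean mask on labels selects the games in cluster 0.
print(games[labels == 0]["name"].head())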

Figuring out what to predict

There are two things we need to determine before we jump into machine learning – how we’re going to measure error, and what we’re going to predict. We thought earlier that average_rating might be good to predict on, and our exploration reinforces this notion.

There are a variety of ways to measure error (many are listed here). Generally, when we’re doing regression, and predicting continuous variables, we’ll need a different error metric than when we’re performing classification, and predicting discrete values.

For this, we’ll use mean squared error – it’s easy to calculate, and simple to understand. It shows us how far, on average, our predictions are from the actual values.

Finding correlations

Now that we know we want to predict average_rating, let’s see what columns might be interesting for our prediction. One way is to find the correlation between average_rating and each of the other columns. This will show us which other columns might predict average_rating the best. We can use the corr method on Pandas dataframes to easily find correlations. This will give us the correlation between each column and every other column. Since the result of this is a dataframe, we can index it and only get the correlations for the average_rating column.

games.corr()["average_rating"]
id                      0.304201
yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64

We see that the average_weight and id columns correlate best with average_rating. ids are presumably assigned when the game is added to the database, so this likely indicates that games created later score higher in the ratings. Maybe reviewers were not as nice in the early days of BoardGameGeek, or older games were of lower quality. average_weight indicates the “depth” or complexity of a game, so it may be that more complex games are reviewed better.

Picking predictor columns

Before we get started predicting, let’s only select the columns that are relevant when training our algorithm. We’ll want to remove certain columns that aren’t numeric.

We’ll also want to remove columns that can only be computed if you already know the average rating. Including these columns would defeat the purpose of the model, which is to predict the rating without any prior knowledge. Using columns that can only be computed with knowledge of the target can lead to overfitting, where your model performs well on the training set, but doesn’t generalize well to future data.

The bayes_average_rating column appears to be derived from average_rating in some way, so let’s remove it.

# Get all the columns from the dataframe.
columns = games.columns.tolist()
# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["bayes_average_rating", "average_rating", "type", "name"]]

# Store the variable we'll be predicting on.
target = "average_rating"

Splitting into train and test sets

We want to be able to figure out how accurate an algorithm is using our error metric. However, evaluating the algorithm on the same data it was trained on will lead to overfitting. We want the algorithm to learn generalized rules to make predictions, not memorize specific answers. An example is learning math. If you memorize that 1+1=2 and 2+2=4, you’ll be able to answer any question about 1+1 and 2+2 perfectly, with 0 error. However, the moment anyone asks you something outside your training set, like 3+3, you won’t be able to solve it. On the other hand, if you’re able to generalize and learn addition, you’ll make occasional mistakes because you haven’t memorized the solutions – maybe you’ll get 3453 + 353535 off by one – but you’ll be able to solve any addition problem thrown at you.

If your error looks surprisingly low when you’re training a machine learning algorithm, you should always check to see if you’re overfitting.

In order to prevent overfitting, we’ll train our algorithm on a set consisting of 80% of the data, and test it on another set consisting of the remaining 20% of the data. To do this, we first randomly sample 80% of the rows to be in the training set, then put everything else in the testing set.

# Generate the training set. Set random_state to be able to replicate results.
train = games.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set.
test = games.loc[~games.index.isin(train.index)]
# Print the shapes of both sets.
print(train.shape)
print(test.shape)
(45515, 20)
(11379, 20)

Above, we exploit the fact that every Pandas row has a unique index to select any row not in the training set to be in the testing set.
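
In current versions of Scikit-learn, the train_test_split helper (now in sklearn.model_selection) does the same job in one call. A sketch, assuming a reasonably recent Scikit-learn:

from sklearn.model_selection import train_test_split

# Split 80/20; random_state makes the split reproducible.
train, test = train_test_split(games, test_size=0.2, random_state=1)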

Fitting a linear regression

Linear regression is a powerful and commonly used machine learning algorithm. It predicts the target variable using a linear combination of the predictor variables. Let’s say we have two values, 3 and 4. A linear combination would be 3 * .5 + 4 * .5 = 3.5. A linear combination involves multiplying each number by a constant and adding the results. You can read more here.

Linear regression only works well when the predictor variables and the target variable are linearly correlated. As we saw earlier, a few of the predictors are correlated with the target, so linear regression should work well for us.

We can use the linear regression implementation in Scikit-learn, just as we used the k-means implementation earlier.

# Import the linear regression model.
from sklearn.linear_model import LinearRegression

# Initialize the model class.
model = LinearRegression()
# Fit the model to the training data.
model.fit(train[columns], train[target])

When we fit the model, we pass in the predictor matrix, which consists of all the columns from the dataframe that we picked earlier. If you pass a list to a Pandas dataframe when you index it, it will generate a new dataframe with all of the columns in the list. We also pass in the target variable, which we want to make predictions for.

The model learns the equation that maps the predictors to the target with minimal error.
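
Because the learned equation is just a set of coefficients, we can inspect it after fitting (a sketch – the exact numbers will depend on your data):

# Each predictor column gets one learned coefficient, plus an intercept.
for name, coef in zip(columns, model.coef_):
    print(name, coef)
print("intercept:", model.intercept_)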

Predicting error

After we train the model, we can make predictions on new data with it. This new data has to be in the exact same format as the training data, or the model won’t make accurate predictions. Our testing set has the same columns as the training set – the rows just contain different board games. We select the same subset of columns from the test set, and then make predictions on it.

# Import the scikit-learn function to compute error.
from sklearn.metrics import mean_squared_error

# Generate our predictions for the test set.
predictions = model.predict(test[columns])

# Compute error between our test predictions and the actual values.
mean_squared_error(predictions, test[target])
1.8239281903519875

Once we have the predictions, we can compute the error between the test set predictions and the actual values. Mean squared error has the formula \(\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\). Basically, we subtract each predicted value from the actual value, square the differences, and add them together. Then we divide the result by the total number of predictions, which gives us the average squared error per prediction.
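
We can verify the formula by computing the error manually and checking that it matches the Scikit-learn result:

# Mean squared error by hand: subtract, square, then average.
manual_mse = ((test[target] - predictions) ** 2).mean()
print(manual_mse)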

Trying a different model

One of the nice things about Scikit-learn is that it enables us to try more powerful algorithms very easily. One such algorithm is the random forest. The random forest algorithm can find nonlinearities in data that a linear regression wouldn’t be able to pick up on. Say, for example, that if the minage of a game is less than 5, the rating is low; if it’s between 5 and 10, the rating is high; and if it’s between 10 and 15, the rating is low again. A linear regression can’t capture this, because there isn’t a linear relationship between the predictor and the target. Predictions made with a random forest usually have less error than predictions made with a linear regression.

# Import the random forest model.
from sklearn.ensemble import RandomForestRegressor

# Initialize the model with some parameters.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=1)
# Fit the model to the data.
model.fit(train[columns], train[target])
# Make predictions.
predictions = model.predict(test[columns])
# Compute the error.
mean_squared_error(predictions, test[target])
1.4144905030983794
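
Random forests also report how much each predictor contributed to the predictions, which can guide further feature selection. A quick sketch using the fitted model:

# Pair each column with its importance score (the scores sum to 1).
for name, importance in zip(columns, model.feature_importances_):
    print(name, importance)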

Further exploration

We’ve managed to go from data in csv format to making predictions. Here are some ideas for further exploration:

  • Try a support vector machine.
  • Try ensembling multiple models to create better predictions.
  • Try predicting a different column, such as average_weight.
  • Generate features from the text, such as the length of the name of the game, the number of words, etc. – see the sketch after this list.
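
As an example of that last idea, here’s a sketch of generating one simple text feature (a hypothetical new column – we haven’t tested how predictive it is):

# Add the length of each game's name as a new numeric feature.
games["name_length"] = games["name"].str.len()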

Want to learn more about machine learning?

At Dataquest, we offer interactive lessons on machine learning and data science. We believe in learning by doing, and you’ll learn interactively in your browser by analyzing real data and building projects. Check out our machine learning lessons here.
