Similarity-based approaches to machine learning come from the idea that the best way to make a prediction is to simply look at what has worked well in the past and predict the same thing. The fundamental concepts required to build a system based on this idea are feature spaces and measures of similarity, and these are covered in the fundamentals section of this chapter. These concepts allow us to understand the standard approach to building similarity-based models: the nearest neighbor algorithm. After covering the standard algorithm, we then look at extensions and variations that allow us to handle noisy data (the k nearest neighbor, or k-NN, algorithm), to make predictions more efficiently (k-d trees), to predict continuous targets, and to handle different kinds of descriptive features with varying measures of similarity. We also take the opportunity to introduce the use of data normalization and feature selection in the context of similarity-based learning. These techniques are generally applicable to all machine learning algorithms but are especially important when similarity-based approaches are used.

5.1 Big Idea

The year is 1798, and you are Lieutenant-Colonel David Collins of HMS Calcutta, exploring the region around the Hawkesbury River in New South Wales. One day, after an expedition up the river has returned to the ship, one of the men from the expedition tells you that he saw a strange animal near the river. You ask him to describe the animal to you, and he explains that he didn't see it very well because, as he approached it, the animal growled at him, so he didn't approach too closely. However, he did notice that the animal had webbed feet and a duck-billed snout.

In order to plan the expedition for the next day, you decide that you need to classify the animal so that you can determine whether it is dangerous to approach it or not. You decide to do this by thinking about the animals you can remember coming across before and comparing the features of these animals with the features the sailor described to you. We illustrate this process by listing some of the animals you have encountered before and how they compare with the growling, web-footed, duck-billed animal that the sailor described. For each known animal, you count how many features it has in common with the unknown animal. At the end of this process, you decide that the unknown animal is most similar to a duck, so that is what it must be. A duck, no matter how strange, is not a dangerous animal, so you tell the men to get ready for another expedition up the river the next day.
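As a minimal sketch of this matching process, we can count how many of the described features each remembered animal shares with the unknown one and predict the label of the best match. The remembered animals and their feature values below are illustrative assumptions for the sketch, not the original listing:

```python
# The unknown animal as described by the sailor.
query = {"growls": True, "webbed_feet": True, "duck_billed": True}

# Some remembered animals (illustrative assumptions, not the original list).
remembered = {
    "dog":  {"growls": True,  "webbed_feet": False, "duck_billed": False},
    "duck": {"growls": False, "webbed_feet": True,  "duck_billed": True},
    "frog": {"growls": False, "webbed_feet": True,  "duck_billed": False},
}

# Count how many features each remembered animal shares with the query,
# then predict the animal with the highest count.
def shared_features(animal):
    return sum(query[f] == remembered[animal][f] for f in query)

best_match = max(remembered, key=shared_features)
print(best_match, shared_features(best_match))  # duck 2
```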

The process of classifying an unknown animal by matching the features of the animal against the features of animals you have encountered before neatly encapsulates the big idea underpinning similarity-based learning: if you are trying to make a prediction for a current situation, then you should search your memory to find situations that are similar to the current one and make a prediction based on what was true for the most similar situation in your memory.

5.2 Fundamentals

As the name similarity-based learning suggests, a key component of this approach to prediction is defining a computational measure of similarity between instances. Often this measure of similarity is actually some form of distance measure. A consequence of this, and a somewhat less obvious requirement of similarity-based learning, is that if we are going to compute distances between instances, we need to have a concept of space in the representation of the domain used by our model. In this section we introduce the concept of a feature space as a representation for a training dataset and then illustrate how we can compute measures of similarity between instances in a feature space.

5.2.1 Feature Space

We list an example dataset containing two descriptive features, the SPEED and AGILITY ratings for college athletes (both measured out of 10), and one target feature that lists whether the athletes were drafted to a professional team. We can represent this dataset in a feature space by taking each of the descriptive features to be an axis of a coordinate system. We can then place each instance within the feature space based on the values of its descriptive features. A scatter plot illustrates the resulting feature space when we do this using the data: SPEED is plotted on the horizontal axis, AGILITY is plotted on the vertical axis, and the value of the DRAFT feature is indicated by the shape representing each instance as a point in the feature space, triangles for no and crosses for yes.

There is always one dimension for every descriptive feature in a dataset. In this example, there are only two descriptive features, so the feature space is two-dimensional. Feature spaces can, however, have many more dimensions; in document classification tasks, for example, it is not uncommon to have thousands of descriptive features and therefore thousands of dimensions in the associated feature space. Although we can't easily draw feature spaces beyond three dimensions, the ideas underpinning them remain the same.

We can formally define a feature space as an abstract m-dimensional space that is created by making each descriptive feature in a dataset an axis of an m-dimensional coordinate system and mapping each instance in the dataset to a point in the coordinate system based on the values of its descriptive features.

For similarity-based learning, the nice thing about the way feature spaces work is that if the values of the descriptive features of two or more instances in the dataset are the same, then these instances will be mapped to the same point in the feature space. Also, as the difference between the values of the descriptive features of two instances grows, so too does the distance between the points in the feature space that represent these instances. So the distance between two points in the feature space is a useful measure of the similarity of the descriptive features of the two instances.
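As a minimal sketch of this mapping (the SPEED/AGILITY values and DRAFT labels below are illustrative assumptions, not the actual college athlete dataset), each instance simply becomes a point whose coordinates are its descriptive feature values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative SPEED/AGILITY ratings and DRAFT labels; these values are
# assumptions for the sketch, not the actual college athlete dataset.
speed   = np.array([5.00, 2.75, 5.25, 7.50, 3.25])
agility = np.array([2.50, 7.50, 9.50, 8.00, 4.50])
draft   = np.array(["no", "no", "yes", "yes", "no"])

# Each instance maps to the point (SPEED, AGILITY) in a 2-D feature space.
for label, marker in [("no", "^"), ("yes", "x")]:
    mask = draft == label
    plt.scatter(speed[mask], agility[mask], marker=marker, label=f"DRAFT = {label}")

plt.xlabel("SPEED")
plt.ylabel("AGILITY")
plt.legend()
plt.show()
```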

5.2.2 Measuring Similarity Using Distance Metrics

The simplest way to measure the similarity between two instances, $\mathbf{a}$ and $\mathbf{b}$, in a dataset is to measure the distance between the instances in a feature space. We can use a distance metric to do this: a metric, $metric(\mathbf{a}, \mathbf{b})$, is a function that returns the distance between two instances $\mathbf{a}$ and $\mathbf{b}$ and must conform to the following four criteria:

1. Non-negativity: $metric(\mathbf{a}, \mathbf{b}) \geq 0$
2. Identity: $metric(\mathbf{a}, \mathbf{b}) = 0 \iff \mathbf{a} = \mathbf{b}$
3. Symmetry: $metric(\mathbf{a}, \mathbf{b}) = metric(\mathbf{b}, \mathbf{a})$
4. Triangular inequality: $metric(\mathbf{a}, \mathbf{b}) \leq metric(\mathbf{a}, \mathbf{c}) + metric(\mathbf{c}, \mathbf{b})$

One of the best known distance metrics is Euclidean distance, which computes the length of the straight line between two points. The Euclidean distance between two instances $\mathbf{a}$ and $\mathbf{b}$ in an m-dimensional feature space is defined as

$$Euclidean(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{m} \left(\mathbf{a}[i] - \mathbf{b}[i]\right)^2} \qquad (1)$$

The descriptive features in the college athlete dataset are both continuous, which means that the feature space representing this data is technically known as a Euclidean coordinate space, and we can compute the distance between instances in it using Euclidean distance. For example, the Euclidean distance between the instance $\mathbf{a}$ = (SPEED = 5.00, AGILITY = 2.50) and the instance $\mathbf{b}$ = (SPEED = 2.75, AGILITY = 7.50) from the college athlete dataset is

$$Euclidean(\mathbf{a}, \mathbf{b}) = \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2} = \sqrt{30.0625} = 5.48$$

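As a quick check of this calculation, here is a minimal sketch of Equation (1) in Python, using the two instances from the worked example above:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (Equation 1)."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

a = (5.00, 2.50)  # SPEED, AGILITY
b = (2.75, 7.50)
print(round(euclidean(a, b), 2))  # 5.48
```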
Another, less well-known, distance metric is the Manhattan distance. The Manhattan distance between two instances $\mathbf{a}$ and $\mathbf{b}$ in a feature space with m dimensions is defined as

$$Manhattan(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{m} abs\left(\mathbf{a}[i] - \mathbf{b}[i]\right) \qquad (2)$$

where the $abs()$ function returns the absolute value. For example, the Manhattan distance between the instances $\mathbf{a}$ = (SPEED = 5.00, AGILITY = 2.50) and $\mathbf{b}$ = (SPEED = 2.75, AGILITY = 7.50) from the college athlete dataset is

$$Manhattan(\mathbf{a}, \mathbf{b}) = abs(5.00 - 2.75) + abs(2.50 - 7.50) = 2.25 + 5.00 = 7.25$$

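The same kind of check works for Equation (2); again, a minimal sketch using the instances from the worked example:

```python
def manhattan(a, b):
    """Manhattan distance between two equal-length feature vectors (Equation 2)."""
    return sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))

print(manhattan((5.00, 2.50), (2.75, 7.50)))  # 7.25
```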
We illustrate the differences between the Manhattan and Euclidean distances between two points in a two-dimensional feature space. If we compare Equation (1) and Equation (2), we can see that both distance metrics are essentially functions of the differences between the values of the features. Indeed, the Euclidean and Manhattan distances are special cases of the Minkowski distance, which defines a family of distance metrics based on differences between feature values.

The Minkowski distance between two instances $\mathbf{a}$ and $\mathbf{b}$ in a feature space with m dimensions is defined as

$$Minkowski(\mathbf{a}, \mathbf{b}) = \left( \sum_{i=1}^{m} abs\left(\mathbf{a}[i] - \mathbf{b}[i]\right)^{p} \right)^{\frac{1}{p}} \qquad (3)$$

where the parameter p is typically set to a positive value and defines the behavior of the distance metric. Different distance metrics result from adjusting the value of p. For example, the Minkowski distance with p = 1 is the Manhattan distance, and the Minkowski distance with p = 2 is the Euclidean distance. Continuing in this manner, we can define an infinite number of distance metrics.
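A minimal sketch of Equation (3) makes this relationship concrete: with the instances from the earlier worked examples, p = 1 reproduces the Manhattan distance and p = 2 reproduces the Euclidean distance.

```python
def minkowski(a, b, p):
    """Minkowski distance with parameter p (Equation 3)."""
    return sum(abs(a_i - b_i) ** p for a_i, b_i in zip(a, b)) ** (1 / p)

a = (5.00, 2.50)
b = (2.75, 7.50)
print(minkowski(a, b, p=1))            # 7.25  (Manhattan)
print(round(minkowski(a, b, p=2), 2))  # 5.48  (Euclidean)
```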

The fact that we can define an infinite number of distance metrics is not merely an academic curiosity. In fact, the predictions produced by a similarity-based model will change depending on the exact Minkowski distance used (i.e., the value chosen for p). Larger values of p place more emphasis on large differences between feature values than smaller values of p do, because all differences are raised to the power of p. Consequently, the Euclidean distance (with p = 2) is more strongly influenced by a single large difference in one feature than the Manhattan distance (with p = 1).

We can see this if we compare the Euclidean and Manhattan distances between the instances $\mathbf{a}$ = (SPEED = 5.00, AGILITY = 2.50) and $\mathbf{b}$ = (SPEED = 2.75, AGILITY = 7.50) used above with the Euclidean and Manhattan distances between $\mathbf{a}$ and a third instance, $\mathbf{c}$ = (SPEED = 5.25, AGILITY = 9.50).

The Manhattan distances between both pairs of instances are the same: 7.25. It is striking, however, that the Euclidean distance between $\mathbf{a}$ and $\mathbf{c}$ is 7.00, which is greater than the Euclidean distance between $\mathbf{a}$ and $\mathbf{b}$, which is just 5.48. This is because the maximum difference between $\mathbf{a}$ and $\mathbf{c}$ for any single feature is 7 units (for AGILITY), whereas the maximum difference between $\mathbf{a}$ and $\mathbf{b}$ on any single feature is just 5 units (also for AGILITY). Because these differences are squared in the Euclidean distance calculation, the larger maximum single difference between $\mathbf{a}$ and $\mathbf{c}$ results in a larger overall distance being calculated for this pair of instances. Overall, the Euclidean distance weights features with larger differences in values more heavily than features with smaller differences in values. This means that the Euclidean distance is more influenced by a single large difference in one feature than by a lot of small differences across a set of features, whereas the opposite is true of the Manhattan distance.
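A minimal check of these numbers, redefining the two distance functions so the snippet stands alone:

```python
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b, c = (5.00, 2.50), (2.75, 7.50), (5.25, 9.50)

# The two pairs are equally far apart under Manhattan distance ...
print(manhattan(a, b), manhattan(a, c))                      # 7.25 7.25
# ... but not under Euclidean distance, which is dominated by the
# single large AGILITY difference between a and c.
print(round(euclidean(a, b), 2), round(euclidean(a, c), 2))  # 5.48 7.0
```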

Although we have an infinite number of Minkowski-based distance metrics to choose from, Euclidean distance and Manhattan distance are the most commonly used of these. The question of which is the best one to use, however, still remains. From a computational perspective, the Manhattan distance has a slight advantage over the Euclidean distance (the computation of the squaring and the square root is saved), and computational considerations can become important when dealing with very large datasets. Computational considerations aside, Euclidean distance is often used as the default.
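In practice, rather than hand-rolling these functions, we would typically rely on a library implementation. As a sketch, assuming SciPy is available, its scipy.spatial.distance module provides all three metrics directly:

```python
from scipy.spatial import distance

a = (5.00, 2.50)
b = (2.75, 7.50)

print(distance.euclidean(a, b))       # ~5.48
print(distance.cityblock(a, b))       # 7.25 (Manhattan)
print(distance.minkowski(a, b, p=1))  # 7.25 (Minkowski with p = 1)
```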
