Support Vector Machines for classification

To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:

“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”

Professor Ng covers SVMs in his excellent Machine Learning MOOC, a gateway for many into the realm of data science, but leaves out some details, motivating us to put together some notes here to answer the question:

“What are the support vectors in support vector machines?”

We also provide python code using scikit-learn’s svm module to fit a binary classification problem using a custom kernel, along with code to generate the (awesome!) interactive plots in Part 3.

This post consists of three sections:

  • Part 1 sets up the problem from a geometric point of view and then shows how it can be framed as an optimization problem.
  • Part 2 transforms the optimization problem and uncovers the support vectors in the process.
  • Part 3 discusses how kernels can be used to separate non-linearly separable data.

Part 1: Defining the margin

Maximizing the margin

The figure below is a binary classification problem (points labeled yi=±1) that is linearly separable.

There are many possible decision boundaries that would perfectly separate the two classes, but an SVM will choose the line in 2-d (or “hyperplane”, more generally) that maximizes the margin around the boundary.

Intuitively, we can be very confident about the labels of points that fall far from the boundary, but we’re less confident about points near the boundary.

Formulating the margin with geometry

Any point x lying on the separating hyperplane satisfies:
w⋅x+b=0
w is the vector normal to the plane, and b is a constant that describes how much the plane is shifted relative to the origin.  The distance of the plane from the origin is b/∥w∥.

Now draw parallel planes on either side of the decision boundary, so we have what looks like a road, with the decision boundary as the median, and the additional planes as gutters.  The margin, i.e. the width of the road, is (d++d−) and is restricted by the data points closest to the boundary, which lie on the gutters.

The half-spaces bounded by the planes on the gutters are:

w⋅x+b≥+a, for yi=+1

w⋅x+b≤−a, for yi=−1

These two conditions can be put more succinctly:

yi(w⋅x+b)≥a,∀i

Some arithmetic leads to the equation for the margin:

d++d−=2a/∥w∥

Without loss of generality, we can set a=1, since it only sets the scale (units) of b and w.  So to maximize the margin, we have to maximize 1/∥w∥.  However, this is an unpleasant (non-convex) objective function.  Instead we minimize ∥w∥2, which is convex.

The optimization problem

Maximizing the margin boils down to a constrained optimization problem: minimize some quantity f(w), subject to constraints g(w,b).  This optimization problem is particularly nice because it is convex; the objective ∥w∥2 is convex, as are the constraints, which are linear.

In other words, we are faced with a quadratic programming problem.  The standard format of the optimization problem for the separable case is

minimizesubject tof(w)≡(1/2)∥w∥2g(w,b)≡−yi(w⋅x+b)+1≤0,i=1…m(1)

Before we address how to solve this optimization problem in Part 2, let’s first consider the case when data is non-separable.

Soft margin SVM: the non-separable problem and regularization

For non-separable data, we relax the constraints in (1) while penalizing misclassified points via a cost parameter C and slack variables ξi that define the amount by which data points are on the wrong side of the margin.

minimizesubject to(1/2)∥w∥2+C∑imξiyi(w⋅x+b)≥1–ξi,i=1…mξi≥0,i=1…m(2)

A large penalty — large C — for misclassifications will lead to learning a lower bias, higher variance SVM, and vice versa for small C.

The soft margin is used in practice; even in the separable case, it can be desirable to allow tradeoffs between the size of the margin and number of misclassifications. Outliers can skew the decision boundary learned by (1) towards a model with small margins + perfect classification, in contrast to a possibly more robust model learned by (2) with large margins + some misclassified points.


Part 2: Solving the optimization problem

In Part 1, we showed how to set up SVMs as an optimization problem. In this section, we’ll see how the eponymous support vectors emerge when we rephrase the minimization problem as an equivalent maximization problem.

To recap: Given m training points that are labeled yi=±1, our goal is to maximize the margin of the hyperplane defined by w⋅x+b=0.

We’ll use the separable case (1) as our starting point, but the steps in the procedure and final result are similar for the non-separable case (also worked out in ref [3]).

The Lagrangian formulation

How do we solve this optimization problem? Minimizing a function without constraints is probably familiar: set the derivative of the function (the objective) to zero and solve.

With constraints, the procedure is similar to setting the derivative of the objective equal to zero. Instead of taking the derivative of the objective itself, however, we’ll operate on the Lagrangian L, which combines the objective and inequality constraints into one function:

L(w,b,α)=f(w)+∑imαigi(w,b)(3)

We’ve just introduced additional variables αi, Lagrange multipliers, that make it easier to work with the constraints (see Wikipedia about the method of Lagrange multipliers). Note, a more general form for the Lagrangian would include another summation term in (3) to uphold equality constraints. Since there are only inequality constraints here, we’ll omit the extra term.

Constructing the dual problem

Much of the following discussion is based off ref [2], which has a nice introduction to duality in the context of SVMs.

First, let’s make the following observation:

maxαL(w,b,α)={f(w),∞,if gi(w)≤0,(constraints satisfied)if gi(w)>0,(constraints violated)(Obs. 1)

Basically, if any constraint j is violated, i.e. gj(w)>0, then the Lagrange multiplier αj that is multiplying gj(w) can be made arbitrarily large (→∞) in order to maximize L.

On the other hand, if all the constraints are satisfied, gi(w)≤0 ∀i, then L is maximized by setting the αis that are multiplying negative quantities equal to zero. However, Lagrangian multipliers multiplying gi(w) that satisfy the constraints with equality, gi(w)=0, can be non-zero without diminishing L.

The last statement amounts to the property of “complementary slackness” in the Karush-Kuhn-Tucker conditions for the solution:

αigi(w)=0(4)

Recall from the original geometric picture: only a few points lie exactly on the margins, and those points are described by gi(w)=0 (and thus have non-zero Lagrange multipliers). The points on the margin are the support vectors.

Next, we make use of the Max-Min inequality:

maxαminw,bL(w,b,α)≤minw,bmaxαL(w,b,α)

This inequality is an equality under certain conditions, which our problem satisfies (convex f and g). The left side of the inequality is called the dual problem, and the right side is the primal problem.

Now we can put it all together: Observation 1 tells us that solving the right side (primal problem) of the Max-Min inequality is the same as solving the original problem. Because our problem is convex, solving the left side (dual) is equivalent to solving the primal problem by the Max-Min inequality.

Thus we’re set to approach the solution via the dual problem, which is useful for dealing with nonlinear decision boundaries.

Solving the dual problem

The dual problem to solve is maxαminw,bL(w,b,α), subject to constraints* on the Lagrange multipliers: αi≥0.

Let’s work out the inner part of the expression explicitly. We obtain minw,bL(w,b,α) by setting:

∇wL=0;∂bL=0

These equations for the partial derivatives give us, respectively:

w=∑iαiyixi;∑iαiyi=0

w is a linear combination of the coordinates of the training data. Only the support vectors, which have non-zero αi, contribute to the sum. To predict the label for a new test point xt, simply evaluate the sign of

w⋅x+b=∑iαiyixi⋅xt+b(5)

where b can be computed from the KKT complementarity condition (4) by plugging in the values for any support vector. The equation for the separating hyperplane is entirely determined by the support vectors.

Plugging the last two equations into L leads to the dual formulation of the problem maxαLD:

maxαsubject to∑iαi–(1/2)∑i,jαiαjyiyjxi⋅xjαi≥0,i=1…m∑iαiyi=0(6)

The dual for the non-separable primal Lagrangian (2) — derived using the same procedure we just followed — looks just like the dual for the separable case (6), except that the Lagrange multipliers are bounded from above by the regularization constant: 0≤αi≤C. Notably, the slack variables ξi do not appear in the dual of the soft margin SVM.

The dual (called the Wolfe dual) is easier to solve because of the simpler form of its inequality constraints and is the form used in algorithms such as the Sequential Minimal Optimizationalgorithm, which is implemented in the popular SVM solver, LIBSVM. The key feature of the dual is that training vectors only appear as dot products xi⋅xj. This property allows us to generalize to the nonlinear case via the “kernel trick” discussed in Part 3 of this post.

* Some of you may be familiar with using Lagrangian multipliers to optimize some function f(x)subject to equality constraints g(x)=0, in which case the Lagrangian multipliers are unconstrained. The Karush-Kuhn-Tucker conditions generalize the method to include inequality constraints g(x)≤0, which results in additional constraints on the associated Lagrangian multipliers (as we have here).


Part 3: Kernels

Data that is not linearly separable in the original input space may be separable if mapped to a different space. Consider the following example of nonlinearly separable, two-dimensional data:

However, if we map the 2-d input data x=(x,y) to 3-d feature space by a function Φ(x)=(x,y,x2+y2), the blue and red points can be separated with a plane in the new (3-d) space. See the plot below of the decision boundary, the mapped points, as well as the the original data points in the x-y plane. Drag the figure to rotate it, or zoom in and out with your mouse wheel!

Code to generate and fit the data in this example with scikit-learn’s SVM module, as well as code to create the plot.ly interactive plots above, is available in IPython notebooks on github.

From maps to kernels

So how do we incorporate mapping the data into the formulation of the problem?

Recall that the data appears as a dot product in the dual Lagrangian (6). If we decide to train an SVM on the mapped data, then the dot product of the input data in (6) is replaced by the dot product of the mapped data: xi⋅xj→Φ(xi)⋅Φ(xj)

The kernel is simply the dot product of the mapping functions. In the example above, the inner product of the mapping function is an instance of a polynomial kernel:

K(xi,xj)=Φ(xi)⋅Φ(xj)=xixj+yiyj+(x2i+y2i)(x2j+y2j)

In practice, we work directly with the kernel K(xi,xj) rather than explicitly computing the map of the data points**. Computing the kernel directly allows us to sidestep the computationally expensive operation of mapping data to a high dimensional space and then taking a dot product (see ref [2] for examples comparing computational times of the two methods).

Using a kernel, the second term in the objective of the dual problem (6) becomes

∑i,jαiαjyiyjK(xi,xj)

The kernel also appears in the evaluation of (5) to predict the classification of a test point xt:

sgn(∑iαiyiK(xi,xt)+b)(8)

Which functions are valid kernels to use in the kernel trick? i.e. given K(xi,xj), does some feature map Φ exist such that K(xi,xj)=Φ(xi)⋅Φ(xj) for any i, j? Mercer’s condition states that a necessary and sufficient condition for K to be a valid kernel is that it is symmetric and positive semi-definite.

Some popular kernels are:

polynomial:Gaussian radial basis function:(xi⋅xj+c)pexp(−∥xi–xj∥2/2σ2)

The optimal parameters for the degree of the polynomial p and spread of the Gaussian σ (as well as the regularization parameter) are determined by cross-validation. Computing the above kernels takes O(d) time, where d is the dimension of the input space, since we have to evaluate xi⋅xj in the polynomial kernel and xi–xj in the Gaussian kernel.

Comparing runtimes of linear and nonlinear kernels

The computational complexity for classification/prediction, i.e. at test time, can be obtained by eyeballing (5) and (8). Let d be the dimension of the input space and n be the size of the training set, and assume the number of support vectors nS is some fraction of n, nS∼O(n).

In the case of working with the linear kernel/original input space, w can be explicitly evaluated to obtain the separating hyperplane parameters, so that classification in (5) takes O(d) time. On the other hand, with the kernel trick, the hyperplane parameters are not explicitly evaluated. Assume calculating a kernel takes O(d) time, cf. the polynomial and Gaussian kernels; then test time for a nonlinear K in (8) takes O(nd) time.

Estimating the computational complexity for training is complicated, so we defer the discussion to refs [4a4b] and simply state the result: training for linear kernels is O(nd) while training for nonlinear kernels using the Sequential Minimal Optimization algorithm is O(n2) to O(n3)(dependent on the regularization parameter C), making nonlinear kernel SVMs impractical for larger datasets (a couple of 10,000 samples according to scikit-learn).

** More than one mapping and feature space (dimension) may exist for a particular kernel. See section 4 of ref [1] for examples.

† See ref [2] for a simple proof in terms of the Kernel (Gram) matrix, i.e. the kernel function evaluated on a finite set of points.


Discussion

We’ve glimpsed the elegant theory behind the construction of SVMs and seen how support vectors pop out of the mathematical machinery. Geometrically, the support vectors are the points lying on the margins of the decision boundary.

How about using SVMs in practice?

In his Coursera course, Professor Ng recommends linear and Gaussian kernels for most use cases. He also provides some rules of thumb (based on the current state of SVM algorithms) for different sample sizes n and input data dimension/number of features d, restated here:

case method n d
n≪d, e.g. genomics, bioinformatics data linear kernel SVM or logistic regression 10 – 1000 10,000
n intermediate, d small Gaussian kernel SVM 10 – 10,000 1 – 1000
n≫d create features, then linear kernel SVM or logistic regression 50,000+ 1 – 1000

The creators of the LIBSVM and LIBLINEAR packages also provide a user’s guide for novices, which includes a study of when to use linear instead of radial basis function kernels. They recommend linear SVMs when d and n are both large, often encountered in document classification problems where bag-of-words approaches can generate huge numbers of features (in their example n=20,000, d= 47,000).

The idea is that if the input data is already high-dimensional, then it shouldn’t be necessary to apply nonlinear transformations to it in order to obtain a separating hyperplane.

Tip: LIBLINEAR is specifically optimized for linear kernels and should be used instead of LIBSVM in the linear case.

Further reading

In addition to the many excellent written tutorials on SVMs online, we highly recommend viewing lectures 14 and 15 of Yaser Abu-Mostafa’s MOOC, Learning from Data, which cover SVMs at about the same level as this post, with the considerable added benefit of Professor Abu-Mostafa’s explanations. He also discusses the generalization performance of SVMs as a function of the number of support vectors using VC theory (also see [1]).

There is a lot more theory on SVMs that we haven’t touched upon. For example, SVMs can be framed as a penalization method [3] or “regularization network”, c.f. ridge regression, but with a hinge loss rather than squared error. Insights about the choice of a kernel have also been developed in that framework.


[1] Burges, C. J.C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining 2 (2) 121-167.

[2] Ng, A. Support Vector Machines [PDF document]. Retrieved from lecture notes online: http://cs229.stanford.edu/notes/cs229-notes3.pdf
Lecture notes by Andrew Ng for a more advanced class (but still in his signature intuitive style).

[3] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning.
See section 12.2.1, page 420, for derivation of the dual Lagrangian for the nonseparable case.

[4a] Bottou, L. and Lin C-J., (2006). Support Vector Machine Solvers.
[4b] Chang, C-C. and Lin C-J., (2013). LIBSVM: A Library for Support Vector Machines.

[5] Ben-Hur, A. and Weston, J. (2009). A User’s Guide to Support Vector Machines. In Carugo, O. and Eisenhaber, F. (Eds.), Methods in Molecular Biology 609, 223-229.

Andrew Ng photo credit: InverseHypercubecreative commons license.

Support Vector Machines for classification的更多相关文章

  1. [Scikit-learn] 1.4 Support Vector Machines - Linear Classification

    Outline: 作为一种典型的应用升维的方法,内容比较多,自带体系,以李航的书为主,分篇学习. 函数间隔和几何间隔 最大间隔 凸最优化问题 凸二次规划问题 线性支持向量机和软间隔最大化 添加的约束很 ...

  2. Machine Learning - 第7周(Support Vector Machines)

    SVMs are considered by many to be the most powerful 'black box' learning algorithm, and by posing构建 ...

  3. Introduction to One-class Support Vector Machines

    Traditionally, many classification problems try to solve the two or multi-class situation. The goal ...

  4. 【原】Coursera—Andrew Ng机器学习—课程笔记 Lecture 12—Support Vector Machines 支持向量机

    Lecture 12 支持向量机 Support Vector Machines 12.1 优化目标 Optimization Objective 支持向量机(Support Vector Machi ...

  5. 【Supervised Learning】支持向量机SVM (to explain Support Vector Machines (SVM) like I am a 5 year old )

    Support Vector Machines 引言 内核方法是模式分析中非常有用的算法,其中最著名的一个是支持向量机SVM 工程师在于合理使用你所拥有的toolkit 相关代码 sklearn-SV ...

  6. (原创)Stanford Machine Learning (by Andrew NG) --- (week 7) Support Vector Machines

    本栏目内容来源于Andrew NG老师讲解的SVM部分,包括SVM的优化目标.最大判定边界.核函数.SVM使用方法.多分类问题等,Machine learning课程地址为:https://www.c ...

  7. Coursera 机器学习 第7章 Support Vector Machines 学习笔记

    7 Support Vector Machines7.1 Large Margin Classification7.1.1 Optimization Objective支持向量机(SVM)代价函数在数 ...

  8. [C7] 支持向量机(Support Vector Machines) (待整理)

    支持向量机(Support Vector Machines) 优化目标(Optimization Objective) 到目前为止,你已经见过一系列不同的学习算法.在监督学习中,许多学习算法的性能都非 ...

  9. Ng第十二课:支持向量机(Support Vector Machines)(三)

    11 SMO优化算法(Sequential minimal optimization) SMO算法由Microsoft Research的John C. Platt在1998年提出,并成为最快的二次规 ...

随机推荐

  1. Go Walk教程 - 流程控制( switch)

    Go的 switch 非常灵活,表达式不必是常量或整数,执行的过程从上至下,直到找到匹配项,不要break: var score =98 var result string switch score/ ...

  2. C#委托Action、Action<T>、Func<T>、Predicate<T>

    CLR环境中给我们内置了几个常用委托Action. Action<T>.Func<T>.Predicate<T>,一般我们要用到委托的时候,尽量不要自己再定义一 个 ...

  3. Windows下MemCache多端口安装配置

    Windows下MemCache环境安装配置的文章很多,但大部分都是用的默认端口11211,如何修改默认端口.如何在一台服务器上配置多个MemCache端口?这正式本文要解决的问题. 1.从微软官网下 ...

  4. 信息安全系统设计基础实验一 20135210&20135218

    北京电子科技学院(BESTI) 实     验    报     告 课程: 密码系统设计基础                                                     ...

  5. KMP算法的Next数组详解

    转载请注明来源,并包含相关链接. 网上有很多讲解KMP算法的博客,我就不浪费时间再写一份了.直接推荐一个当初我入门时看的博客吧:http://www.cnblogs.com/yjiyjige/p/32 ...

  6. 16.C#初见Lambda表达式及表达式树(九章9.1-9.3)

    在说明Lambda相关知识前,我们需要了解Lambda表达式常用于LINQ,那么我们来聊下LINQ. LINQ的基本功能就是创建操作管道,以及这些操作需要的任何状态.这些操作表示了各种关于数据的逻辑: ...

  7. 第四十一课:CSS3 animation详解

    animation是css3的另一个重要的模块,它成型比transition晚,吸取了Flash的关键帧的理念,实用性高. animation是一个复合样式,它可以细分为8个更细的样式. (1)ani ...

  8. ubuntu的命令day1

    ls -i    显示所有的文件,包括隐藏的文件. 以  .  开头的文件都是隐藏文件,可以在终端用ls -i显示所有的文件.比如.ssh linux生成密钥的命令如下: 1. cd .ssh/    ...

  9. 每天一个linux命令(11):less 命令

    less 工 具也是对文件或其它输出进行分页显示的工具,应该说是linux正统查看文件内容的工具,功能极其强大.less 的用法比起 more 更加的有弹性. 在 more 的时候,我们并没有办法向前 ...

  10. Java死锁的例子

    死锁 死锁是这样一种情形:多个线程同时被阻塞,它们中的一个或者全部都在等待某个资源被释放.由于线程被无限期地阻塞,因此程序不可能正常终止. 导致死锁的根源在于不适当地运用“synchronized”关 ...