What are the advantages of different classification algorithms?
How large is your training set?
If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN or logistic regression), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.
You can also think of this as a generative model vs. discriminative model distinction.
Advantages of some particular algorithms
Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, a NB classifier still often performs surprisingly well in practice. A good bet if you want to do some kind of semi-supervised learning, or want something embarrassingly simple that performs pretty well.
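To make the "bunch of counts" remark concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier built from nothing but counters. The function names (`train_nb`, `predict_nb`), the toy documents, and the choice of add-one (Laplace) smoothing are assumptions made for illustration; they are not taken from the answer above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Training is just counting."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token counts
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log prior + sum of log likelihoods (conditional independence assumption)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            count = word_counts[label][tok]
            # add-one smoothing so unseen tokens do not zero out the product
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data
toy_docs = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "now", "cheap"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["lunch", "tomorrow"], "ham"),
]
model = train_nb(toy_docs)
print(predict_nb(model, ["cheap", "buy"]))
```

Because the model is only a set of counts, adding more labeled documents later amounts to incrementing the same counters.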
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
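The online-update point can be sketched as follows: each new example triggers one stochastic gradient step, so the model never needs to be retrained from scratch. The class name `OnlineLogisticRegression`, the learning rate, and the toy stream are assumptions chosen for illustration, not a reference implementation.

```python
import math

def sigmoid(z):
    # Clamp the score to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogisticRegression:
    def __init__(self, n_features, learning_rate=0.1, l2=0.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate
        self.l2 = l2  # optional L2 regularization strength

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, y):
        """One stochastic gradient step on a single example, y in {0, 1}."""
        error = self.predict_proba(x) - y
        self.w = [wi - self.lr * (error * xi + self.l2 * wi)
                  for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

# Hypothetical stream of (features, label) pairs
clf = OnlineLogisticRegression(n_features=1)
stream = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
for _ in range(200):
    for x, y in stream:
        clf.update(x, y)

print(round(clf.predict_proba([2.0]), 3), round(clf.predict_proba([-2.0]), 3))
```

The output of `predict_proba` is what makes threshold adjustment straightforward: the decision rule can change without touching the weights.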
Advantages of Decision Trees: Easy to interpret and explain (for some people -- I'm not sure I fall into this camp). Non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). Their main disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
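The A / B / A example can be checked with a small sketch. On a single feature, a linear classifier is just one threshold, so the code compares the best possible single threshold against a hand-written two-split tree. The data, the thresholds 3.5 and 6.5, and the function names are hypothetical.

```python
def best_single_threshold_accuracy(xs, ys):
    """Best accuracy of any rule 'one class below t, the other class above'."""
    labels = sorted(set(ys))
    best = 0.0
    for t in sorted(set(xs)) + [max(xs) + 1]:
        for low, high in [(labels[0], labels[1]), (labels[1], labels[0])]:
            correct = sum((low if x < t else high) == y for x, y in zip(xs, ys))
            best = max(best, correct / len(xs))
    return best

def two_split_tree(x, t1, t2):
    """A depth-2 tree on one feature: A below t1, B in between, A above t2."""
    if x < t1:
        return "A"
    return "B" if x < t2 else "A"

# Class A at the low end, B in the mid-range, A again at the high end.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = ["A", "A", "A", "B", "B", "B", "A", "A", "A"]

single = best_single_threshold_accuracy(xs, ys)
tree = sum(two_split_tree(x, 3.5, 6.5) == y for x, y in zip(xs, ys)) / len(xs)
print(single, tree)
```

No single threshold gets past two thirds accuracy here, while two splits on the same feature classify every point correctly.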
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
To go back to the particular question of logistic regression vs. decision trees (which I'll assume to be a question of logistic regression vs. random forests) and summarize a bit: both are fast and scalable, random forests tend to beat out logistic regression in terms of accuracy, but logistic regression can be updated online and gives you useful probabilities. And since you're at Square (not quite sure what an inference scientist is, other than the embodiment of fun) and possibly working on fraud detection: having probabilities associated to each classification might be useful if you want to quickly adjust thresholds to change false positive/false negative rates, and regardless of the algorithm you choose, if your classes are heavily imbalanced (as often happens with fraud), you should probably resample the classes or adjust your error metrics to make the classes more equal.
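As a rough illustration of the threshold point, the sketch below computes false positive and false negative rates from predicted probabilities at two different thresholds. The scores and labels are made up, and `error_rates` is a hypothetical helper.

```python
def error_rates(probs, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold."""
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / negatives, fn / positives

# Hypothetical fraud scores: 1 = fraud, 0 = legitimate (imbalanced on purpose).
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.30, 0.80]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

strict = error_rates(probs, labels, threshold=0.5)
lenient = error_rates(probs, labels, threshold=0.25)
print(strict, lenient)
```

Lowering the threshold catches both fraud cases but flags half of the legitimate ones, which is the trade-off a probabilistic output lets you tune after training.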
But...
Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, your choice of classification algorithm might not really matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
And if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize and Middle Earth, just use an ensemble method to choose them all!
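Selecting among classifiers by cross-validation can be sketched in a few lines. The two candidates here (a majority-class baseline and a 1-nearest-neighbour rule), the interleaved fold assignment, and the toy data are all assumptions for illustration; in practice the candidates would be the real models under consideration.

```python
from collections import Counter

def k_fold_accuracy(fit, data, k=5):
    """Average held-out accuracy of `fit` over k interleaved folds."""
    scores = []
    for fold in range(k):
        test = data[fold::k]
        train = [row for i, row in enumerate(data) if i % k != fold]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / k

def fit_majority(train):
    # Baseline: always predict the most common training label.
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def fit_1nn(train):
    # 1-nearest neighbour on a single numeric feature.
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

# Toy 1-D data: the label is 1 when x >= 10.
data = [(x, int(x >= 10)) for x in range(20)]

candidates = {"majority": fit_majority, "1-NN": fit_1nn}
scores = {name: k_fold_accuracy(fit, data) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```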
When choosing a classification algorithm, there are several questions to consider:
- Number of training examples
- Dimensionality of the feature space
- Do I expect the problem to be linearly separable?
- Are features independent?
- Are features expected to be in a linear scale?
- Is overfitting expected to be a problem?
- What are the system's requirements in terms of speed/performance/memory usage...?
- ...
This list may seem a bit daunting because many of these questions are not straightforward to answer. The good news, though, is that, as with many problems in life, you can address this question by following the Occam's Razor principle: use the least complicated algorithm that can address your needs and only go for something more complicated if strictly necessary.
Logistic Regression
As a general rule of thumb, I would recommend starting with Logistic Regression. Logistic regression is a pretty well-behaved classification algorithm that can be trained as long as you expect your features to be roughly linear and the problem to be linearly separable. You can do some feature engineering to turn most non-linear features into linear ones pretty easily. It is also pretty robust to noise, and you can avoid overfitting and even do feature selection by using L2 or L1 regularization. Logistic regression can also be used in Big Data scenarios since it is pretty efficient and can be distributed using, for example, ADMM (see logreg). A final advantage of LR is that the output can be interpreted as a probability. This comes as a nice side effect since you can use it, for example, for ranking instead of classification.
Even in a case where you would not expect Logistic Regression to work 100%, do yourself a favor and run a simple l2-regularized LR to come up with a baseline before you go into using "fancier" approaches.
Ok, so now that you have set your baseline with Logistic Regression, what should be your next step? I would basically recommend two possible directions: (1) SVMs, or (2) Tree Ensembles. If I knew nothing about your problem, I would definitely go for (2), but I will start by describing why SVMs might be worth considering.
Support Vector Machines
Support Vector Machines (SVMs) use a different loss function (hinge) from LR. They are also interpreted differently (maximum-margin). However, in practice, an SVM with a linear kernel is not very different from a Logistic Regression (if you are curious, you can see how Andrew Ng derives SVMs from Logistic Regression in his Coursera Machine Learning Course). The main reason you would want to use an SVM instead of a Logistic Regression is that your problem might not be linearly separable. In that case, you will have to use an SVM with a nonlinear kernel (e.g. RBF). The truth is that a Logistic Regression can also be used with a different kernel, but at that point you might be better off going for SVMs for practical reasons. Another related reason to use SVMs is if you are in a high-dimensional space. For example, SVMs have been reported to work better for text classification.
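To see why a linear SVM and Logistic Regression behave similarly, it helps to compare the two losses on the same margin m = y * f(x), with y in {-1, +1}. This is a small sketch with hypothetical function names, printing both losses for a few margins.

```python
import math

def hinge_loss(margin):
    """SVM loss: zero once the example is beyond the margin (m >= 1)."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: the negative log-likelihood of the label."""
    return math.log(1.0 + math.exp(-margin))

for m in (-2.0, 0.0, 1.0, 3.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

Both losses penalize misclassified points roughly linearly and flatten out for well-classified ones; the hinge loss reaches exactly zero, while the logistic loss only approaches it.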
Unfortunately, the major downside of SVMs is that they can be painfully inefficient to train. So, I would not recommend them for any problem where you have many training examples. I would actually go even further and say that I would not recommend SVMs for most "industry scale" applications. Anything beyond a toy/lab problem might be better approached with a different algorithm.
Tree Ensembles
This gets me to the third family of algorithms: Tree Ensembles. This basically covers two distinct algorithms: Random Forests and Gradient Boosted Trees. I will talk about the differences later, but for now let me treat them as one for the purpose of comparing them to Logistic Regression.
Tree Ensembles have several advantages over LR. One main advantage is that they do not expect linear features or even features that interact linearly. Something I did not mention about LR is that it can hardly handle categorical (binary) features. Tree Ensembles, because they are nothing more than a bunch of Decision Trees combined, can handle these very well. The other main advantage is that, because of how they are constructed (using bagging or boosting), these algorithms handle high-dimensional spaces as well as large numbers of training examples very well.
As for the difference between Random Forests (RF) and Gradient Boosted Decision Trees (GBDT), I won't go into many details, but one easy way to understand it is that GBDTs will usually perform better, but they are harder to get right. More concretely, GBDTs have more hyper-parameters to tune and are also more prone to overfitting. RFs can almost work "out of the box" and that is one reason why they are very popular.
Deep Learning
Last but not least, this answer would not be complete without at least a minor reference to Deep Learning. I would definitely not recommend this approach as a general-purpose technique for classification. But you have probably heard how well these methods perform in some cases, such as image classification. If you have gone through the previous steps and still feel you can squeeze something out of your problem, you might want to use a Deep Learning approach. The truth is that if you use an open source implementation such as Theano, you can get an idea of how some of these approaches perform on your dataset pretty quickly.
Summary
So, recapping, start with something simple like Logistic Regression to set a baseline and only make it more complicated if you need to. At that point, tree ensembles, and in particular Random Forests since they are easy to tune, might be the right way to go. If you feel there is still room for improvement, try GBDT or get even fancier and go for Deep Learning.
You can also take a look at the Kaggle Competitions. If you search for the keyword "classification" and select those that are completed, you will get a good sense of what people used to win competitions that might be similar to your problem at hand. At that point you will probably realize that using an ensemble is always likely to make things better. The only problem with ensembles, of course, is that they require maintaining all the independent methods working in parallel. That might be your final step to get as fancy as it gets.
Does it require variables to be normally distributed?
Does it suffer from multicollinearity issues?
Does it do as well with categorical variables as with continuous variables?
Does it calculate CI without CV?
Does it perform variable selection without stepwise procedures?
Does it apply to sparse data?
Here is the comparison:
Logistic regression: no distribution requirement, performs well with categorical variables that have few categories, computes the logistic distribution, easy to interpret, computes CI, suffers from multicollinearity
Decision Trees: no distribution requirement, heuristic, good for categorical variables with few categories, do not suffer from multicollinearity (they choose one of the correlated variables)
NB: generally no requirements, good for categorical variables with few categories, computes the product of independent distributions, suffers from multicollinearity
LDA (linear discriminant analysis, not latent Dirichlet allocation): requires normality, not good for categorical variables with few categories, computes the addition of multivariate distributions, computes CI, suffers from multicollinearity
SVM: no distribution requirement, computes hinge loss, flexible selection of kernels for nonlinear correlation, does not suffer from multicollinearity, hard to interpret
Lasso: no distribution requirement, computes L1 loss, variable selection, suffers from multicollinearity
Ridge: no distribution requirement, computes L2 loss, no variable selection, does not suffer from multicollinearity
Bagging, boosting, ensemble methods (RF, Ada, etc.): generally outperform any single algorithm listed above.
Above all, logistic regression is still the most widely used for its good features, but if the variables are normally distributed and the categorical variables all have 5+ categories, you may be surprised by the performance of LDA. If the correlations are mostly nonlinear, you can't beat a good SVM. And if sparsity and multicollinearity are a concern, I would recommend Adaptive Lasso with Ridge (weights) + Lasso; this would suffice for most scenarios without much tuning.
And in the end, if you need one fine-tuned model, go for ensemble methods.
PS: I just saw the sub-question: with 10000 instances and more than 100000 features, the quick answer is Lasso.
Random Forests:
1. Almost always have lower classification error and better f-scores than decision trees.
2. Almost always perform as well as or better than SVMs, but are far easier for humans to understand.
3. Deal really well with uneven data sets that have missing variables.
4. Give you a really good idea of which features in your data set are the most important for free.
5. Generally train faster than SVMs (though this obviously depends on your implementation).
More info at Random forests - classification description
http://www.amsta.leeds.ac.uk/~ch...
D. Michie, D.J. Spiegelhalter, C.C. Taylor (eds). Machine Learning, Neural and Statistical Classification
- Decision Trees are fast to train and easy to evaluate and interpret.
- Support vector machines give good accuracy and gain flexibility from kernels.
- Neural networks are slow to converge and their parameters are hard to set, but if done with care they work well.
- Bayesian classifiers are easy to understand.
- kNN should be avoided in my case since evaluation is quite heavy if your training dataset contains several thousand elements, although it gives really good results.
- Naive Bayes is very simple and quick to evaluate, but I had to tweak it to handle unbalanced classes.
- Rocchio seems very naive but it works surprisingly well and is very efficient.
Finally, I use a combination of Naive Bayes and Rocchio to gain accuracy on the same principle as boosting (linear mixing obtained by cross-validation). You can also use EM on NB or Rocchio, since the formulation is very simple in these cases. This could help.
All in all, I'd say that this is very data-dependent. The best technique depends on the data and on the accuracy/efficiency trade-off you are expecting.
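For readers unfamiliar with Rocchio, the classifier amounts to computing one centroid per class and assigning a new vector to the most similar centroid. The sketch below assumes cosine similarity and plain term-count vectors; the names (`fit_centroids`, `predict_centroid`) and the toy data are hypothetical, and this is not the author's own implementation.

```python
import math

def fit_centroids(samples):
    """samples: list of (vector, label). Returns {label: mean vector}."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_centroid(centroids, vec):
    # Assign the class whose centroid is most similar to the input vector.
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Hypothetical term counts over the vocabulary [goal, match, vote, law].
samples = [
    ([3, 2, 0, 0], "sports"),
    ([2, 3, 0, 1], "sports"),
    ([0, 0, 3, 2], "politics"),
    ([0, 1, 2, 3], "politics"),
]
centroids = fit_centroids(samples)
print(predict_centroid(centroids, [1, 1, 0, 0]))
```

Training is a single pass over the data and prediction costs one similarity per class, which is where the efficiency mentioned above comes from.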
Support Vector Machines work very well in many circumstances and perform very well with large amounts of data.
Association-rule methods such as Apriori perform excellently because of how the algorithm is built, and it always reaches the proper solution.
The Naive Bayes mechanism is very simple to understand; it also performs well and is easy to implement.
see these older references
An Empirical Comparison of Supervised Learning Algorithms
Page on cornell.edu
An Empirical Comparison of Supervised Learning Algorithms Using Different Performance Metrics
Page on niculescu-mizil.org
and the video lecture
Which Supervised Learning Method Works Best for What? An Empirical Comparison of Learning Methods and Metrics
Advanced methods like Deep Learning are now available but probably pretty hard to use for a novice. An exception would be SVM based Factorization Machines (see below)
Many modern methods can be formulated as regularization problems that solve a convex optimization with some loss function. So Logistic Regression, SVM, L1-SVM, Boosting, etc., can be viewed as variations of the same method and, these days, are usually implemented with a common solver.
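One common way to write this shared formulation, using notation introduced here for illustration (a score function f_w, labels y_i in {-1, +1}, a regularizer R, and a trade-off parameter lambda), is:

```latex
\min_{w} \; \lambda \, R(w) + \sum_{i=1}^{n} \ell\bigl(y_i \, f_w(x_i)\bigr),
\qquad
\begin{aligned}
\ell_{\text{hinge}}(m) &= \max(0,\, 1 - m) && \text{(SVM)} \\
\ell_{\text{logistic}}(m) &= \log\bigl(1 + e^{-m}\bigr) && \text{(logistic regression)} \\
\ell_{\text{exp}}(m) &= e^{-m} && \text{(AdaBoost-style boosting)}
\end{aligned}
```

Choosing R(w) as the squared L2 norm or the L1 norm gives the L2 and L1 variants; swapping the loss while keeping the solver is what makes these methods look like variations on one theme.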
I would hesitate to distinguish between logistic regression and an SVM as different methods for the purposes of classification. Many packages, like LibSVM, provide a variety of methods and can be applied to the same data set for comparing.
MultiClass or MultiLabel:
SVMs are very good at basic and multiclass classification, and Random Forests are probably better for multilabel classification.
There are specific SVM implementations for Multiclass (the Crammer & Singer algorithm) and Structural (SvmLight) problems, and even MultiLabel SVMs (M3L). These are a bit esoteric, but it is important to be able to solve more than just binary classification problems.
Still, it is non-trivial to modify an SVM for multilabel problems whereas Random Forests are easily adapted to this.
see:
Multi-Label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages
Page on microsoft.com
Instance Weights:
Sometimes we may have more information than just labels; we may have weights on instances. Here, there is an open source implementation of LibLinear that provides instance weights.
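If you are using scikit-learn, you can pass instance (sample) weights to Random Forests through the sample_weight argument of fit. (Current scikit-learn releases also accept sample_weight in SVC.fit, so this is no longer a reason to avoid SVMs there.)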
Incomplete Information:
Machine Learning methods do very poorly when we don't have enough labels for our data, or the labels are wrong.
SVMs can even be applied in situations when the labels are only partially or weakly known, but we have additional information about the global statistics:
Convex Relaxations of Transductive Learning
And this works particularly well for text classification when choosing an extended basis set, such as word2vec or GloVe.
Kernels and the nature of your features:
Using an SVM or any Kernel Method requires choosing a regularizer, and maybe a Kernel. Generally you should not use an RBF Kernel unless the underlying problem is like a signal processing problem:
Kernels Part 1: What is an RBF Kernel? Really?
That is, you are looking for something like a smooth solution.
A recent extension to SVM kernels is Factorization Machines (FMs), and while this method is typically thought of as a recommender system, it can be used for SVM-like classification as well.
FMs would be useful for picking up very weak correlations between sparse, discrete features.
Regularization:
Most modern methods include some basic regularizers like L1, L2, or some combination of the two (elastic net).
Why would one choose, say, an L1 SVM over an L2 SVM? The L1 SVM presupposes that the features are very, very sparse. For example, I have used an L1 SVM to reduce a model with 50K features to a few hundred. An L2 SVM can capture models that require many tiny features.
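The different behaviour of the two penalties can be illustrated with their proximal (shrinkage) steps: the L1 step sets small weights exactly to zero, while the squared-L2 step only scales every weight down. This is a sketch of the shrinkage operators alone, not of an SVM solver; the function names, the penalty strength 0.1, and the example weights are assumptions.

```python
def l1_prox(weights, strength):
    """Soft-thresholding: proximal step for the penalty strength * |w|."""
    out = []
    for w in weights:
        shrunk = abs(w) - strength
        # Weights smaller than the threshold are set exactly to zero.
        out.append(0.0 if shrunk <= 0 else (shrunk if w > 0 else -shrunk))
    return out

def l2_prox(weights, strength):
    """Proximal step for the penalty (strength / 2) * w**2: uniform scaling."""
    return [w / (1.0 + strength) for w in weights]

# Hypothetical weight vector with two large and three tiny entries.
weights = [2.0, -0.05, 0.08, -1.5, 0.02]
sparse = l1_prox(weights, strength=0.1)
dense = l2_prox(weights, strength=0.1)
print(sparse)
print(dense)
```

Here the L1 step keeps only the two large weights, which mirrors the feature-reduction effect described above, whereas the L2 step keeps all five, including the tiny ones.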
Numerical performance:
SVMs, like many convex methods, have been optimized for the past 15 years and these days scale very well. I routinely use linear SVMs to solve classification problems with tens of millions of instances and, say, half a million features. I can do this on my laptop. I am a bit puzzled why others would claim these are hard to train (unless they are trying to use a non-linear / RBF kernel and have a poor implementation). In my experience using them for over 15 years in production, they are very easy to train at very large scale for most commercial applications. They are convex methods with only 1-2 adjustable parameters.
I have routinely trained linear SVMs with nearly 10M instances and 1M features--on my laptop in near real time.
For more than, say, 10M instances and 1M features, the training can be done in parallel using lock-free stochastic coordinate descent, or in online mode. Modern open source implementations include libsvm/liblinear, Graphlab/Dato, and Vowpal Wabbit. There are even GPU-accelerated versions like BIDMach, and there are distributed-memory, parallel implementations running on up to 1000 nodes.
Interpretation:
Linear SVMs are very easy to interpret; it is the non-linear case that is a bit tricky. Any non-linear problem is going to require some work to interpret, in that you would need to find a compact basis set to represent the non-linearities. Heck, if you know the basis set a priori, you can just project your data onto this basis and run a linear SVM, and the problem is trivial.
what you need to do, transparency and governance, regulatory needs, business interpretation issues and others will outweigh maths in a business.
2. Data determines the result more than the algorithm
If you have lots of good data, all algorithms are good. Multicollinearity is less of a problem with large datasets.