Everything You Wanted to Know About Machine Learning

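This is a translation of ten important ideas for understanding machine learning, with my own notes added. These principles probably hold in most cases, but analyzing each concrete problem on its own terms is what really counts; applying them without thinking leaves you with only a half understanding.

That is why Zhang Xiaolong (张小龙) said, "Everything I say is wrong."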

note by 王犇


1. How Does Machine Learning Work?
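Generally speaking, a machine learning algorithm relies on three things to build a model:
  1. A set of possible models to look through
  2. A way to test whether a model is good
  3. A clever way to find a really good model with only a few tests
a) Define the set of candidate models (the search space, which is usually very large)
b) Define a way to decide whether a model is good (the evaluation method)
c) Find an efficient method that reaches a reasonably good model with as few trials as possible (the optimization algorithm, model selection)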
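
The three ingredients above can be sketched on a toy problem. This is only an illustration under assumptions of my own, not the original article's code: the data is synthetic, the model family is lines through the origin, the test is mean squared error, and the search is a ternary search (which works here only because the error is convex in the single parameter `w`).

```python
import random

# Toy data: y is roughly 3 * x plus a little noise (made up for illustration).
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]

# 1. A set of possible models: lines through the origin, y = w * x.
def predict(w, x):
    return w * x

# 2. A way to test whether a model is good: mean squared error on the data.
def mse(w):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# 3. A clever way to find a good model with few tests: ternary search over w.
#    Each step discards a third of the interval, so 60 steps are plenty.
def search(lo=-10.0, hi=10.0, steps=60):
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if mse(m1) < mse(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

best_w = search()
print(f"best w = {best_w:.3f}, mse = {mse(best_w):.4f}")
```

The search evaluates only 120 candidate models out of a continuous range, which is the point of the third ingredient.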

2. Overfitting Has Many Faces
-- Beware of overfitting
The general moral of this section of the paper is to always measure the performance of your classifier on out-of-sample data.
Always build a test set to test your model.

You cannot do too many of these training and testing splits. You should even make some predictions on data you imagine yourself, to see what the model does in certain situations.
Sometimes you need to construct the test data yourself.
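
A minimal sketch of why out-of-sample measurement matters, under assumptions of my own (synthetic one-dimensional data with 20% label noise, and a 1-nearest-neighbour classifier chosen because it memorizes its training set):

```python
import random

random.seed(1)

def make_data(n, noise=0.2):
    """Points x in [0, 1]; the true label is x > 0.5, flipped with prob. `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

train = make_data(200)
test = make_data(200)

train_acc = accuracy(train, train)  # in-sample: the model has memorized these points
test_acc = accuracy(train, test)    # out-of-sample: the honest estimate
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

The in-sample score looks perfect because every training point is its own nearest neighbour; only the held-out score reveals how much of that is memorized noise.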

3. Intuition Fails in High Dimensions
-- Do not rely on intuition with high-dimensional data
What this means in practice is that as you add more and more input fields, you must also add more and more training data to “fill up” the space created by the additional inputs if you want to use them accurately.
If you hope to improve your model by adding more features, you must add more data at the same time.

Otherwise the noise may well make your model worse.
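
To get a feel for how quickly the input space empties out, here is a small sketch under my own assumptions: 1000 uniformly random points, each input field cut into 4 bins, and "coverage" defined as the fraction of grid cells that contain at least one point.

```python
import random

random.seed(2)

def coverage(n_points, n_dims, bins=4):
    """Fraction of grid cells in [0, 1]^n_dims that contain at least one point."""
    occupied = set()
    for _ in range(n_points):
        cell = tuple(int(random.random() * bins) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins ** n_dims

results = {d: coverage(1000, d) for d in (1, 2, 5, 10)}
for d, frac in results.items():
    print(f"{d:2d} input fields: {frac:.4%} of cells contain data")
```

With the amount of data held fixed, the same 1000 points that saturate one or two dimensions can touch at most 1000 of the 4^10 cells in ten dimensions, so most of the space has no training data near it.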


4. Theoretical Guarantees Are Not What They Seem
-- Theoretical error bounds differ from what you see in practice
The only certain way (that we know of now) to know if an algorithm will model your data well is to try it out.

Do not pass judgment on any model before you have tried it.

5. Feature Engineering is the Key
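-- Feature engineering is especially important
The problem here is that no single input field, or even any single pair of fields, is closely correlated with the objective.

If your input features are only weakly related to the target (for example, the relationship is nonlinear, or only appears after a complex transformation), your results will probably be poor.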


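It’s when you use your knowledge about the data to create fields that make machine learning algorithms work better.

Feature engineering, as used here, means using what you know about the data to construct features that make machine learning algorithms work better!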

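In my career, I would say an average of 70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.
Most of the work: 70% of the time goes into feature engineering, 20% into working out how to evaluate results properly and comprehensively, and only 10% into choosing and tuning algorithms.

The original article gives an example where the raw inputs are latitudes and longitudes that need to be converted into the distance between two cities.

Getting from two latitude/longitude pairs to the distance between them takes a fairly complex transformation.

Once transformed, the distance correlates strongly with whether a user is willing to drive between the two cities in a single day.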
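
The latitude/longitude example can be sketched with the haversine formula. This is an assumption on my part about how one might compute the distance, not necessarily what the original article did: it treats the Earth as a sphere, gives great-circle rather than driving distance, and the field names (`lat_a`, `lon_a`, `lat_b`, `lon_b`, `distance_km`) are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def add_distance_feature(row):
    """Turn four raw coordinate fields into one engineered field."""
    out = dict(row)
    out["distance_km"] = haversine_km(
        row["lat_a"], row["lon_a"], row["lat_b"], row["lon_b"]
    )
    return out

# Approximate coordinates for Beijing and Shanghai, for illustration only.
row = {"lat_a": 39.9, "lon_a": 116.4, "lat_b": 31.2, "lon_b": 121.5}
features = add_distance_feature(row)
print(f"distance_km = {features['distance_km']:.0f}")
```

None of the four raw fields says much on its own about a same-day drive, but the single derived field does, which is what makes it a useful engineered feature.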


6. More Data Beats A Cleverer Algorithm
-- Once again, the importance of data
There’s increasingly good evidence that, in a lot of problems, very simple machine learning techniques can be levered into incredibly powerful classifiers with the addition of loads of data.

More and more evidence shows that some simple machine learning techniques can produce very powerful models when given more data.

A big reason for this is that, once you’ve defined your input fields, there’s only so much analytic gymnastics you can do. Computer algorithms trying to learn models have only a relatively few tricks they can do efficiently, and many of them are not so very different. Thus, as we have said before, performance differences between algorithms are typically not large. Thus, if you want better classifiers, you should spend your time:

  1. Engineering better features
  2. Getting your hands on more high-quality data
The reason is that, in general, once you have defined your features you have also bounded the space you can explore. (In other words, the data sets the ceiling on the final result: the information in it is finite, and models and algorithms only search for a better solution within that space.)

Moreover, many models actually work on similar principles. (Think of the many Learning to Rank algorithms.) So if you want a better classifier, prioritize the following:

1. Better feature engineering
2. Getting more data of higher quality

7. Learn Many Models, Not Just One
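-- The power of ensembles!

One can often make a more powerful model by learning multiple classifiers over different random subsets of the data.

Learn multiple classifiers on several randomly sampled subsets to obtain a more powerful model. (The power of ensembles has been demonstrated countless times; this is why the recently popular GBDT and RF work so well. The theoretical explanation given by zhangtong is that ensembling reduces the variance at generalization time.)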
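
One common way to learn many classifiers over random subsets is bagging (bootstrap aggregating). The sketch below is a toy version under my own assumptions: synthetic one-dimensional data with 20% label noise, a decision stump of the form `x > t` as the base classifier, bootstrap samples drawn with replacement, and a simple majority vote.

```python
import random
from collections import Counter

random.seed(3)

def make_data(n, noise=0.2):
    """x in [0, 1]; the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        y = (x > 0.5) != (random.random() < noise)
        data.append((x, y))
    return data

def fit_stump(sample):
    """Pick the threshold t that minimises the training error of the rule x > t."""
    def errors(t):
        return sum((x > t) != y for x, y in sample)
    return min((x for x, _ in sample), key=errors)

def fit_bagged(train, n_models=25):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        boot = [random.choice(train) for _ in train]
        models.append(fit_stump(boot))
    return models

def predict(models, x):
    """Majority vote over the individual stumps (n_models is odd, so no ties)."""
    votes = Counter(x > t for t in models)
    return votes.most_common(1)[0][0]

train = make_data(200)
models = fit_bagged(train)

# Evaluate against the noise-free rule on an evenly spaced grid.
grid = [i / 100 for i in range(101)]
acc = sum(predict(models, x) == (x > 0.5) for x in grid) / len(grid)
print(f"{len(models)} stumps, accuracy on noise-free grid = {acc:.2f}")
```

Each stump sees a slightly different resampling of the noisy data and so picks a slightly different threshold; voting averages those differences out, which is the variance reduction mentioned in the note above.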

8. Simplicity Does Not Imply Accuracy
-- Occam's razor is not always right
So too in machine learning. If we have two models that fit the data equally well, many machine learning algorithms have a way of mathematically preferring the simpler of the two. The folk wisdom here is that a simpler model will perform better on out-of-sample testing data, because it has fewer parameters to fit, and thus is less likely to be overfit.
Generally, if two models describe the data equally well, we prefer the simpler one (think of regularization). A simple model usually does better on the test set and is less prone to overfitting.

One should not take this rule too far. There are many places in machine learning where additional complexity can benefit performance. On top of that, it is not quite accurate to say that model complexity leads to overfitting. More accurate is that the procedure used to fit all that complexity leads to overfitting if it is not very clever. But there are plenty of cases where the complexity is brought to heel by cleverness in the model fitting process.

Thus, prefer simple models because they are smaller, faster to fit, and more interpretable, but not necessarily because they will lead to better performance; the only way to know that is to evaluate your model on test data.

Do not put too much faith in this principle either. There are many places where extra complexity brings extra gains.

It is not accurate to say that overly complex models cause overfitting. Sometimes the extra complexity is a deliberate and clever choice in model training (a complex structure may actually fit the problem better; if it only performs as well as a simple model, perhaps there is just not enough data yet).

So prefer simple models because they are smaller, faster to train, and easier to interpret, but not necessarily because they will perform better.

Only an actual test can give you the answer.

8. Representable Does Not Imply Learnable

-- Representable does not mean learnable

The creators of many machine learning algorithms are fond of saying that the function representing an accurate prediction on your data is representable by the learning algorithm. This means that it is possible for the algorithm to build a good model on your data.
Roughly speaking, "representable" means that the algorithm could possibly build a good model of your data. (Can a multilayer neural network represent any function??)

Unfortunately, this possibility is rarely comforting by itself. Building a good model may require much more data than you have, or the good model might simply never be found by the algorithm. Just because there’s a good model out there that the algorithm could find does not mean that it will find it.

Unfortunately, building a good model with this algorithm may require more data than you have; or, just because it "can" find a good model does not mean it "will" (the computation may take too long, the search space may be too large, and so on).

This is another great argument for feature engineering:
If the algorithm can’t find a good model, but you are pretty sure that a good model exists, try engineering features that will make that model a little more obvious to the algorithm.

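Back to feature engineering again. If the algorithm cannot find a good model but you are sure one exists, try a better feature representation so that the algorithm can make better sense of the data.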

9. Correlation Does Not Imply Causation
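-- Correlation is not causation. One of the three big theorems of big data?
The point of this common saying is that modeling observational data can only show us that two variables are related, but it cannot tell us the “why”.

Modeling observational data can only tell us that two variables are associated; it cannot tell us why.
For example, a survey shows that children with more books at home get better grades.

But the books are not the cause of the good grades; you cannot raise those children's grades by giving them books. The real cause may be that parents in households with many books are better educated and also educate their children better. The books are merely an indicator.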


You should take similar care when interpreting your models. Just because one thing predicts another doesn’t mean it causes another, and making business (or public policy) decisions based on some imagined causal relationship should be done with extreme caution.

So be careful when interpreting your model: do not mistake correlation for causation when making business decisions, or you may make a serious mistake. (A common problem in statistics.)
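
The books-and-grades example can be simulated. Everything here is a made-up toy model, not real survey data: parents' education is a hidden common cause that drives both the number of books and the grades, and by construction books have no direct effect on grades.

```python
import random

random.seed(4)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 5000
# Hidden common cause: parents' education level (arbitrary units).
education = [random.gauss(0, 1) for _ in range(n)]

# Observational world: education drives both books at home and grades.
books = [e + random.gauss(0, 1) for e in education]
grades = [e + random.gauss(0, 1) for e in education]

# Intervention: hand out books at random, independent of education.
# Grades are generated exactly as before, because books have no causal
# effect in this toy model.
gifted_books = [random.gauss(0, 1) for _ in range(n)]
grades_after = [e + random.gauss(0, 1) for e in education]

r_observed = pearson(books, grades)
r_intervened = pearson(gifted_books, grades_after)
print(f"observed correlation   = {r_observed:.2f}")
print(f"after handing out books = {r_intervened:.2f}")
```

A model trained on the observational data would rightly use books to predict grades, yet acting on books changes nothing. In real data the causal structure is not known in advance, which is why the caution above applies.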

10. The Big Picture
Machine learning is an awfully powerful tool, and like any powerful tool, misuses of it can cause a lot of damage. Understanding how machine learning works and some of the potential pitfalls can go a long way towards keeping you out of trouble.

Machine learning is very powerful, but the cost of using it wrongly is also very high. A good tool only does its job in the hands of a good engineer.
