在机器学习中,导致overfitting的原因之一是noise,这个noise可以分为两种,即stochastic noise,随机噪声来自数据产生过程,比如测量误差等,和deterministic noise,确定性噪声来自added complexity,即model too complex。这两种类型的造成来源不同,但是对于学习的影响是相似的,large noise总会导致overfitting。


This is a very subtle question!

The most important thing to realize is that in learning, H is fixed and D is given, and so can be assumed fixed. Now we can ask, what is going on in this learning scenario. Here is what we can say:

i) If there is stochastic noise with ‘magnitude’ σ2, then you are in trouble.

ii) If there deterministic noise then you are in trouble.

The stochastic noise can be viewed as one part of the data generation process (eg. measurement errors). The deterministic noise can similarly be viewed as another part of the data generation process, namely f. The deterministic and stochastic noise are fixed. In your analogy, you can increase the stochastic noise by increasing the noise variance and you get into deeper trouble. Similarly, you can increase the deterministic noise by making f more complex and you will get into deeper trouble.

I just need to tell you what ‘trouble’ means. Well, we actually use another word instead of ‘trouble’ - overfitting.

This means you may be likely to make an inferior choice over the superior choice because the inferior choice has lower in-sample error. Doing stuff that looks good in-sample that leads to disasters out-of-sample is the essence of overfitting. An example of this is trying to choose the regularization parameter. If you pick a lower regularization parameter, then you have lower in-sample error, but it leads to higher out-of-sample error - you picked the λ with lowerEinbut it gave higher Eout. We call that overfitting. Underfitting is just the name we give to the opposite process in the context of picking the regularization parameter. Once the regularization parameter gets too high, as you pick a higher λ you get both higher Einand higher Eout. It also turns out that this means you over regularized and obtained an over-simplistic g - i.e. you ‘underfitted’, you didn’t fit the data enough. The underfitting and overfitting are just terms. The substance of what is going on under the hood is how the deterministic and stochastic noise are affecting what you should and should not do in-sample.

Now let’s get back to the subtle part of your question. There is actually another way to decrease the deterministic noise - increase the complexity of H (the other way is to decrease the complexity of f which we discussed above). Now is where the difference with stochastic noise pops up. With stochastic noise, it either goes up or down; if down, then things get better. With deterministic noise, if you just tell me that it went down, I need to ask you how. Did your target function get simpler - if yes, then great, it is just as if the stochastic noise went down. If it is that your H got more complicated, then things get interesting. To understand what is going on, the Bias Variance decomposition helps (bottom of page 125 in the textbook).

Eout=σ2+bias+var

σ2is the direct impact of the stochastic noise. bias is the direct impact of the deterministic noise. The var term is interesting and is the indirect impact of the noise, through H. The var term is mostly controlled by the size of H in relation to the number of data points. So getting back to the point, if you make H more complex, you will decrease the det. noise (bias) but you will increase the var (its indirect impact). Usually the latter dominates (overfitting, not because of the direct impact of the noise, but because of its indirect impact) … unless you are in the underfitting regime when the former dominates.

上面一段主要摘自《learning from data》一书,主要说明的内容是overfitting的含义以及noise对于overfitting的效用。

下面是对overfitting的很好的总结:

VC维大=>模型复杂度高=>error in sample 小=>模型不够平滑=>generalization能力弱=>error out of sample大=>overfitting=>模型并没有卵用。

总的来说,deterministic noise是由于你选择的H中的最好的hypothesis h∗对于不在H中的function f进行估计时的差。在给定x后,这个deterministic noise就确定了。

deterministic function可用来生成伪随机数(pseudo-random generator)。

详细的论述可以参看《learning from data》


2015-8-27

艺少

stochastic noise and deterministic noise的更多相关文章

  1. Perlin Noise 及其应用

    Perlin Noise 可以用来表现自然界中无法用简单形状来表达的物体的形态,比如火焰.烟雾.表面纹路等.要生成 Perlin Noise 可以使用工具离线生成,也可以使用代码运行时生成.最简单常用 ...

  2. python perlin noise

    python 利用 noise 生成纹理. # -*- coding: utf-8 -*- """ Created on Mon Apr 23 20:04:41 2018 ...

  3. GraphicsLab Project 之 Curl Noise

    作者:i_dovelemon 日期:2020-04-25 主题:Perlin Noise, Curl Noise, Finite Difference Method 引言 最近在研究流体效果相关的模拟 ...

  4. 台大《机器学习基石》课程感受和总结---Part 1(转)

    期末终于过去了,看看别人的总结:http://blog.sina.com.cn/s/blog_641289eb0101dynu.html 接触机器学习也有几年了,不过仍然只是个菜鸟,当初接触的时候英文 ...

  5. 过度拟合(overfitting)

    我们之前解决过一个理论问题:机器学习能不能起作用?现在来解决另一个理论问题:过度拟合. 正如之前我们看到的,很多时候我们必须进行nonlinear transform.但是我们又无法确定Q的值.Q过小 ...

  6. 【Regularization】林轩田机器学习基石

    正则化的提出,是因为要解决overfitting的问题. 以Linear Regression为例:低次多项式拟合的效果可能会好于高次多项式拟合的效果. 这里回顾上上节nonlinear transf ...

  7. 【Hazard of Overfitting】林轩田机器学习基石

    首先明确了什么是Overfitting 随后,用开车的例子给出了Overfitting的出现原因 出现原因有三个: (1)dvc太高,模型过于复杂(开车开太快) (2)data中噪声太大(路面太颠簸) ...

  8. 过拟合产生的原因(Root of Overfitting)

    之前在<过拟合和欠拟合(Over fitting & Under fitting)>一文中简要地介绍了过拟合现象,现在来详细地分析一下过拟合产生的原因以及相应的解决办法. 过拟合产 ...

  9. {ICIP2014}{收录论文列表}

    This article come from HEREARS-L1: Learning Tuesday 10:30–12:30; Oral Session; Room: Leonard de Vinc ...

随机推荐

  1. piplinedb 团队加入confluen

    这个消息对于使用pipelinedb 的人来说,可能有点不好,因为官方已经明确说明了,pipelinedb 截止到1.0 版本,将不再维护了, 基本就要靠社区了,但是pipelinedb 团队还是比较 ...

  2. 游戏 DP

    游戏 DP [题意描述] 小喵喵喜欢玩 RPG 游戏.在这款游戏中,玩家有两个属性,攻击和防御,现在小喵喵的攻击和防御都是 1,接下来小喵喵会依次遇到 n 个事件.事件有两种. 1.小喵喵经过修炼,角 ...

  3. 34、spark1.5.1

    一.Spark 1.4.x的新特性 1.Spark Core 1.1 提供REST API供外界开发者获取Spark内部的各种信息(jobs / stages / tasks / storage in ...

  4. 洛谷P1731[NOI1999]生日蛋糕

    题目 搜索+剪枝,主要考察细节和搜索的顺序,首先可以发现所有数据均为整数,所以初始化每层的蛋糕R和H是整数,然后从高层向低层搜索,然后预处理出各层向低层的最小面积和体积用来剪枝. 就可以每层从当前最大 ...

  5. We found potential security vulnerabilities in your dependencies. Only the owner of this reposito...

    删除package-lock.json并同步到git 定义的依赖项./package-lock.json具有已知的安全漏洞 找到一个叫做.gitignore,把package-lock.json贴在这 ...

  6. esp8266 + dht11 + 两路继电器 实现pc远程控制开关机温度监控.并配置zabbix监控

    事因:翻了翻自己之前的硬件小箱子,几年前买的一些小东西,想用用起来. 正好我有些数据放在机器上,有时候需要机器启动,我使用完成后在断开. 其实网络唤醒也能做到,但是机器一直给电也不好,在说家里有小孩A ...

  7. linux设置定时任务的方法步骤

    一,首先登录 二,找到文件夹 三,查看定时任务 crontab -l 四,vi root 编辑定时任务 编辑完成后,点ESC,然后:wq 时间格式 分钟 小时 日期 月份 周 命令 数字范围 0-59 ...

  8. IDEA中新建子模块

    在IDEA中新建子模块简单步骤: 找到父模块 ->new Module ,然后: next之后,输入ArtifactId: next之后,再输入子模块名,其中,要注意,在contentRoot和 ...

  9. T-MAX组--项目冲刺(第二天)

    THE SECOND DAY 项目相关 作业相关 具体描述 所属班级 2019秋福大软件工程实践Z班 作业要求 团队作业第五次-项目冲刺 作业正文 T-MAX组--项目冲刺(第二天) 团队名称 T-M ...

  10. Laravel--文件管理及上传自定义目录及文件名

    laravel 上传 php 需要开启 fileinfo 扩展 先看一个例子: $file = $request->file('shopimg'); $path = $file->stor ...