Learning Notes (6): Regularization for Simplicity
Overcrossing?
Complete this exercise, which explores the overuse of feature crosses.
Task 1: Run the model as is, with all of the given cross-product features. Are there any surprises in the way the model fits the data? What is the issue?
Figure: the model trained with all of the feature crosses (overcrossing).
Answer to Task 1:
Surprisingly, the model's decision boundary looks kind of crazy. In particular, there's a region in the upper left that's hinting towards blue, even though there's no visible support for that in the data.
Notice the relative thickness of the five lines running from INPUT to OUTPUT. These lines show the relative weights of the five features. The lines emanating from X1 and X2 are much thicker (larger weights) than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features.
Task 2: Try removing various cross-product features to improve performance (albeit only slightly). Why would removing features improve performance?
Figure: removing some of the feature crosses.
Figure: removing all of the feature crosses.
Answer to Task 2:
Removing all the feature crosses gives a saner model (there is no longer a curved boundary suggestive of overfitting) and makes the test loss converge.
After 1,000 iterations, test loss should be a slightly lower value than when the feature crosses were in play (although your results may vary a bit, depending on the data set).
The data in this exercise is basically linear data plus noise. If we use a model that is too complicated, such as one with too many crosses, we give it the opportunity to fit to the noise in the training data, often at the cost of making the model perform badly on test data.
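To make the "fitting the noise" point concrete outside the Playground, here is a rough scikit-learn sketch. Everything in it (the generated data, the degree-5 crosses, the classifier settings) is an illustrative assumption, not part of the exercise: the labels follow a linear rule plus noise, and the model given many crossed features usually does no better, and often slightly worse, on held-out data.

```python
# Rough sketch (assumes NumPy and scikit-learn): a model given many crossed
# (polynomial) features is free to fit the noise in linear-plus-noise data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                                    # raw features x1, x2
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=200) > 0).astype(int)   # linear rule + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain model: x1 and x2 only (regularization effectively off for both models).
plain = LogisticRegression(C=1e6, max_iter=10_000).fit(X_train, y_train)

# "Overcrossed" model: x1, x2 plus many crosses such as x1*x2, x1^2*x2, ...
cross = PolynomialFeatures(degree=5, include_bias=False)
Xc_train = cross.fit_transform(X_train)
Xc_test = cross.transform(X_test)
crossed = LogisticRegression(C=1e6, max_iter=10_000).fit(Xc_train, y_train)

print("plain   test accuracy:", plain.score(X_test, y_test))
print("crossed test accuracy:", crossed.score(Xc_test, y_test))  # often a bit lower
```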
L₂ Regularization
Consider the following generalization curve, which shows the loss for both the training set and validation set against the number of training iterations.
Figure 1. Loss on training set and validation set.
Figure 1 shows a model in which training loss gradually decreases, but validation loss eventually goes up. In other words, this generalization curve shows that the model is overfitting to the data in the training set. Channeling our inner Ockham, perhaps we could prevent overfitting by penalizing complex models, a principle called regularization.
In other words, instead of simply aiming to minimize loss (empirical risk minimization):

minimize(Loss(Data|Model))

we'll now minimize loss + complexity, which is called structural risk minimization:

minimize(Loss(Data|Model) + complexity(Model))
Our training optimization algorithm is now a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity.
Machine Learning Crash Course focuses on two common (and somewhat related) ways to think of model complexity:
- Model complexity as a function of the weights of all the features in the model.
- Model complexity as a function of the total number of features with nonzero weights. (A later module covers this approach.)
If model complexity is a function of weights, a feature weight with a high absolute value is more complex than a feature weight with a low absolute value.
We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:

L2 regularization term = ||w||₂² = w₁² + w₂² + … + wₙ²
In this formula, weights close to zero have little effect on model complexity, while outlier weights can have a huge impact.
For example, a linear model with weights w1 = 0.2, w2 = 0.5, w3 = 5, w4 = 1, w5 = 0.25, w6 = 0.75 has an L2 regularization term of 26.915. But w3, with a squared value of 25, contributes nearly all the complexity. The sum of the squares of the other five weights adds just 1.915 to the L2 regularization term.
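A few lines of Python make the arithmetic explicit for the example weights above (the exact floating-point output may differ in the last decimal places):

```python
# Sum of squared weights for the example above; w3's square (25) dominates.
weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]
l2_term = sum(w ** 2 for w in weights)
print(l2_term)              # ≈ 26.915
print(l2_term - 5.0 ** 2)   # ≈ 1.915 from the other five weights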
Lambda
Model developers tune the overall impact of the regularization term by multiplying its value by a scalar known as lambda (also called the regularization rate). That is, model developers aim to do the following:

minimize(Loss(Data|Model) + λ complexity(Model))
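Written out as code, that objective is just the data loss plus lambda times the complexity term. The following is a minimal sketch assuming a linear model with squared-error loss; the function and argument names are illustrative, not from any particular library:

```python
import numpy as np

def regularized_loss(weights, X, y, lam):
    """Structural risk: data loss plus lambda times the L2 penalty (sketch)."""
    predictions = X @ weights
    data_loss = np.mean((predictions - y) ** 2)  # how well the model fits the data
    l2_penalty = np.sum(weights ** 2)            # model complexity: sum of squared weights
    return data_loss + lam * l2_penalty
```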
Performing L2 regularization has the following effect on a model:
- Encourages weight values toward 0 (but not exactly 0)
- Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
Increasing the lambda value strengthens the regularization effect. For example, the histogram of weights for a high value of lambda might look as shown in Figure 2.
Figure 2. Histogram of weights.
Lowering the value of lambda tends to yield a flatter histogram, as shown in Figure 3.
Figure 3. Histogram of weights produced by a lower lambda value.
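The histograms themselves are not reproduced here, but the effect they show is easy to recreate: fit the same model with a high and a low regularization rate and compare how spread out the learned weights are. This is a hedged sketch using scikit-learn's Ridge (whose alpha parameter plays the role of lambda); the data and the two lambda values are made up for illustration.

```python
# Sketch: a higher regularization rate pulls the weight distribution toward 0.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
true_w = rng.normal(size=50)
y = X @ true_w + rng.normal(scale=0.5, size=200)

for lam in (1.0, 1000.0):
    coef = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam}: mean weight {coef.mean():+.3f}, weight std {coef.std():.3f}")
# The high-lambda fit typically shows weights clustered much more tightly around 0.
```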
When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:
If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.
Note: Setting lambda to zero removes regularization completely. In this case, training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.
The ideal value of lambda produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value of lambda is data-dependent, so you'll need to do some tuning.
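In practice, "some tuning" usually means sweeping a handful of lambda values and keeping the one with the lowest validation loss. A rough sketch, again using scikit-learn's Ridge with alpha standing in for lambda; the data and the candidate values are illustrative assumptions.

```python
# Sketch: choose lambda (Ridge's alpha) by validation loss.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=500)   # one informative feature, 19 noise features

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

best_lam, best_loss = None, float("inf")
for lam in [0.01, 0.1, 0.3, 1.0, 3.0, 10.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    if val_loss < best_loss:
        best_lam, best_loss = lam, val_loss
print("best lambda:", best_lam, "validation MSE:", best_loss)
```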
Learn more about L2 regularization and learning rate:
There's a close connection between learning rate and lambda. Strong L2 regularization values tend to drive feature weights closer to 0. Lower learning rates (with early stopping) often produce the same effect because the steps away from 0 aren't as large. Consequently, tweaking learning rate and lambda simultaneously may have confounding effects.
Early stopping means ending training before the model fully reaches convergence. In practice, we often end up with some amount of implicit early stopping when training in an online (continuous) fashion. That is, some new trends just haven't had enough data yet to converge.
As noted, the effects from changes to regularization parameters can be confounded with the effects from changes in learning rate or number of iterations. One useful practice (when training across a fixed batch of data) is to give yourself a high enough number of iterations that early stopping doesn't play into things.
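One way to see the confounding: write out a single gradient-descent step for a loss of the form data_loss + λ·Σw². The L2 term contributes 2·λ·w to the gradient, so the amount each step shrinks a weight toward 0 depends on the product of the learning rate and lambda. This is a sketch of the idea, not the Playground's exact update rule.

```python
def sgd_step(w, grad_data_loss, learning_rate, lam):
    """One gradient step on data_loss + lam * (w ** 2) for a single weight.

    The shrink-toward-zero amount is learning_rate * 2 * lam * w, so a smaller
    learning rate (with early stopping) can mimic a larger lambda, and vice versa.
    """
    return w - learning_rate * (grad_data_loss + 2 * lam * w)
```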
Examining L2 regularization
Increasing the regularization rate from 0 to 0.3 produces the following effects:
1. Test loss drops significantly.
Note: While test loss decreases, training loss actually increases. This is expected, because you've added another term to the loss function to penalize complexity. Ultimately, all that matters is test loss, as that's the true measure of the model's ability to make good predictions on new data.
2. The delta between Test loss and Training loss drops significantly.
3. The weights of the features and some of the feature crosses have lower absolute values, which implies that model complexity drops.
Check Your Understanding
1. L2 Regularization
Imagine a linear model with 100 input features:
- 10 are highly informative.
- 90 are non-informative.
Assume that all features have values between -1 and 1. Which of the following statements are true?
(T) L2 regularization will encourage many of the non-informative weights to be nearly (but not exactly) 0.0.
(T) L2 regularization may cause the model to learn a moderate weight for some non-informative features. (Surprisingly, this can happen when a non-informative feature happens to be correlated with the label. In this case, the model incorrectly gives such non-informative features some of the "credit" that should have gone to informative features.)
(F) L2 regularization will encourage most of the non-informative weights to be exactly 0.0. (L2 regularization penalizes larger weights more than smaller weights. As a weight gets close to 0.0, L2 "pushes" less forcefully toward 0.0.)
2. L2 Regularization and Correlated Features
Imagine a linear model with two strongly correlated features; these two features are nearly identical copies of one another but one feature contains a small amount of random noise. If we train this model with L2 regularization, what will happen to the weights for these two features?
(F) One feature will have a large weight; the other will have a weight of almost 0.0.
(T) Both features will have roughly equal, moderate weights. (L2 regularization will force the features toward roughly equivalent weights that are approximately half of what they would have been had only one of the two features been in the model. A quick code check of this appears below.)
(F) One feature will have a large weight; the other will have a weight of exactly 0.0.
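The quick empirical check referenced in the answer above: duplicate a feature (adding a little noise to the copy), fit a ridge-regularized linear model, and the learned weight typically gets split roughly in half between the two copies. The data and the alpha value are illustrative assumptions; Ridge's alpha plays the role of lambda.

```python
# Sketch: two nearly identical features end up sharing the weight under L2.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x = rng.normal(size=500)
X = np.column_stack([x, x + rng.normal(scale=0.01, size=500)])  # near-duplicate feature
y = 2.0 * x + rng.normal(scale=0.1, size=500)                   # the "true" weight is 2.0

model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)  # typically both coefficients come out near 1.0, i.e. about half of 2.0
```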
Glossary
1. structural risk minimization (SRM):
An algorithm that balances two goals:
- The desire to build the most predictive model (for example, lowest loss).
- The desire to keep the model as simple as possible (for example, strong regularization).
For example, a function that minimizes loss+regularization on the training set is a structural risk minimization algorithm.
For more information, see http://www.svms.org/srm/.
Contrast with empirical risk minimization.
2. regularization:
The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization include:
- L1 regularization
- L2 regularization
- dropout regularization
- early stopping (this is not a formal regularization method, but can effectively limit overfitting)
3. L1 regularization:
A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.
4. L2 regularization:
A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. (Contrast with L1 regularization.) L2 regularization always improves generalization in linear models.
5. lambda:
Synonym for regularization rate.
(This is an overloaded term. Here we're focusing on the term's definition within regularization.)
6. regularization rate:
A scalar value, represented as lambda, specifying the relative importance of the regularization function. The following simplified loss equation shows the regularization rate's influence:
minimize(loss function + λ(regularization function))
Raising the regularization rate reduces overfitting but may make the model less accurate.
7. early stopping:
A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation data set starts to increase, that is, when generalization performance worsens.