1. Several kinds of matrix multiplication in numpy:


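Let x1 be a x n and x2 be n x b.

np.dot(x1, x2): matrix product, (a x n) * (n x b) -> (a x b)

np.outer(x1, x2): outer product, for two length-n vectors (n x 1) * (1 x n) -> (n x n). In effect both inputs are flattened first: np.ravel(x1)[:, None] * np.ravel(x2)[None, :]

np.multiply(x1, x2): element-wise product of arrays with the same (or broadcastable) shape: [[x1[0][0]*x2[0][0], x1[0][1]*x2[0][1], ...], ...]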
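As a quick check of the three products, here is a minimal sketch. The array values and variable names are made up for illustration only.

```python
import numpy as np

x1 = np.arange(6).reshape(2, 3)      # a x n = 2 x 3
x2 = np.arange(12).reshape(3, 4)     # n x b = 3 x 4

# Matrix product: (2 x 3) * (3 x 4) -> (2 x 4)
prod = np.dot(x1, x2)

# Outer product of two vectors: outer[i, j] = u[i] * v[j]
u = np.array([1, 2, 3])
v = np.array([10, 20])
outer = np.outer(u, v)               # shape (3, 2)

# np.outer flattens its inputs, so a 2-D argument is treated as a vector.
outer_2d = np.outer(x1, v)           # shape (6, 2)

# Element-wise product: same result as x1 * x1
elementwise = np.multiply(x1, x1)    # shape (2, 3)

print(prod.shape, outer.shape, outer_2d.shape, elementwise.shape)
```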

2. Bugs' hometown

Many software bugs in deep learning come from having matrix/vector dimensions that don't fit. If you can keep your matrix/vector dimensions straight you will go a long way toward eliminating many bugs.
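One way to keep dimensions straight is to avoid rank-1 arrays and to assert shapes right after each computation. The sketch below assumes hypothetical layer sizes (n_x, n_h) and batch size m; none of these names come from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 array is neither a row nor a column vector; its transpose is itself.
a = rng.standard_normal(5)           # shape (5,)
col = a.reshape(5, 1)                # explicit column vector, shape (5, 1)

# Hypothetical sizes: n_x inputs, n_h hidden units, m examples.
n_x, n_h, m = 4, 3, 10
W = rng.standard_normal((n_h, n_x))
b = np.zeros((n_h, 1))
X = rng.standard_normal((n_x, m))

Z = np.dot(W, X) + b                 # b broadcasts across the m columns
assert Z.shape == (n_h, m)           # fail fast if a dimension is off
```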

3. Common steps for pre-processing a new dataset are:

- Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, ...)
- Reshape the datasets such that each example is now a vector of size (num_px \* num_px \* 3, 1)
- "Standardize" the data
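The three steps above can be sketched as follows. The random array stands in for a real image dataset, and dividing by 255 is one simple way to "standardize" 8-bit image data; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in dataset (assumption): m_train RGB images of size num_px x num_px.
m_train, num_px = 8, 64
train_x_orig = rng.integers(0, 256, size=(m_train, num_px, num_px, 3), dtype=np.uint8)

# 1. Figure out the dimensions.
print(train_x_orig.shape)            # (m_train, num_px, num_px, 3)

# 2. Flatten each image into a column: result has shape (num_px*num_px*3, m_train).
train_x_flat = train_x_orig.reshape(m_train, -1).T

# 3. "Standardize": scale pixel values from [0, 255] to [0, 1].
train_x = train_x_flat / 255.0
```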

4. Unstructured data:

Unstructured data is a generic label for data that is not contained in a database or some other type of data structure. Unstructured data can be textual or non-textual. Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software and instant messages. Non-textual unstructured data is generated in media like JPEG images, MP3 audio files and Flash video files.

5. Activation functions:

  • Choice of activation function:

    1. If the output is binary (0 or 1), use sigmoid for the output layer and ReLU for all other units.
    2. Everywhere except the output layer, tanh does better than sigmoid.
    3. ReLU can be upgraded to leaky ReLU.
  • Why are ReLU and leaky ReLU often superior to sigmoid and tanh?

    -- For positive inputs the derivative of ReLU and leaky ReLU stays at 1, whereas the derivatives of sigmoid and tanh approach 0 when |z| is large, so learning is usually much faster.

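  • A linear (identity) activation in a hidden layer is more or less useless, because a stack of linear layers is still a linear function; the output layer (e.g. for regression) is an exception.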
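To make the comparison of derivatives concrete, here is a small sketch of the four activations and their gradients. The function names and the leaky slope of 0.01 are illustrative choices, not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # at most 0.25, tends to 0 for large |z|

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2     # at most 1, tends to 0 for large |z|

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)     # 1 for z > 0, 0 otherwise

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)   # never exactly 0

z = np.array([-10.0, -1.0, 0.5, 10.0])
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                   ("relu", relu_grad), ("leaky_relu", leaky_relu_grad)]:
    print(name, grad(z))
```

For large positive z the sigmoid and tanh gradients are close to zero while the ReLU gradients are still 1, which is the effect described above.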

6. Regularization:

With L2 regularization, the cost is \(J(w, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}||w||_2^2\)

  1. L2 regularization: \(\frac{\lambda}{2m}\sum_{j=1}^{n_x}w_j^2 = \frac{\lambda}{2m}||w||_2^2\)

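    Why small weights help, using tanh as the example: when z is very close to 0, tanh(z) is almost linear (tanh(z) ≈ z). A large \(\lambda\) keeps the weights, and hence z, small, so each unit behaves almost linearly and the network is less able to fit an overly complex function. (The sigmoid is also roughly linear near 0, but with slope 0.25 rather than 1, and its output is centered at 0.5 rather than 0.)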

  2. Dropout:

     Method: Randomly set some activations (units) of a layer to zero, e.g. A = np.multiply(A, D), where D is a random 0-1 mask; then divide A by keep_prob so the expected value is unchanged (inverted dropout).

     Caution: Don't use dropout at test time -- the predictions would be random, and averaging over many runs is costly.

     How it works:

       Intuition: A unit can't rely on any one feature, so it has to spread out its weights (which shrinks the weights).

       Besides, you can set a different dropout rate per layer through "keep_prob", e.g. a lower keep_prob on layers with more parameters.

  3. Data augmentation:

     Apply label-preserving operations to your images, such as flipping, rotation, zooming, etc., to keep the model from over-fitting to incidental aspects such as the direction of faces or the size of cats.

  4. Early stopping.
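The L2 penalty, inverted dropout and a simple flip augmentation described above can be sketched as follows. The function names (l2_penalty, dropout_forward, horizontal_flip) and the (m, height, width, channels) image layout are assumptions made for illustration.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / (2m)) * sum of squared entries over all weight matrices."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob,
    then rescale so the expected activation is unchanged."""
    D = rng.random(A.shape) < keep_prob      # boolean 0-1 mask
    return A * D / keep_prob, D              # keep D to mask the gradients too

def horizontal_flip(images):
    """Flip a batch of images left-right; labels stay the same.
    Assumes layout (m, height, width, channels)."""
    return images[:, :, ::-1, :]

rng = np.random.default_rng(0)
A = np.ones((100, 50))
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=rng)
print(A_drop.mean())                         # close to 1 because of the rescaling
```

At test time the dropout step is simply skipped, and no extra scaling is needed because of the division by keep_prob during training.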

7. Solution to "gradient vanishing or exploding":

     Set \(W^{[L]}\) = np.random.randn(shape) * np.sqrt(\(\frac{2}{n^{[L-1]}}\)) if activation_function == "ReLU" (He initialization)

       else: np.random.randn(shape) * np.sqrt(\(\frac{1}{n^{[L-1]}}\)) or np.random.randn(shape) * np.sqrt(\(\frac{2}{n^{[L-1]}+n^{[L]}}\)) (Xavier initialization)
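A minimal sketch of these scaling rules is shown below. The function name init_weights and the string values of its activation argument are hypothetical; only the scale factors come from the formulas above.

```python
import numpy as np

def init_weights(n_prev, n_curr, activation="relu", rng=None):
    """Return a (n_curr, n_prev) weight matrix scaled for the given activation."""
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)              # variance 2 / n^{[L-1]}
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_prev)              # variance 1 / n^{[L-1]}
    else:
        scale = np.sqrt(2.0 / (n_prev + n_curr))   # variance 2 / (n^{[L-1]} + n^{[L]})
    return rng.standard_normal((n_curr, n_prev)) * scale

W = init_weights(1000, 500, activation="relu", rng=np.random.default_rng(0))
print(W.shape, W.std())
```

Scaling by the fan-in keeps the variance of z roughly constant from layer to layer, which is what keeps activations and gradients from shrinking or blowing up.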

8. Gradient Checking:

  1. for i in range(len(\(\theta\))):

    check whether \(d\theta_{approx}[i] = \frac{J(\theta_1, \theta_2, ..., \theta_i+\epsilon, ...) - J(\theta_1, \theta_2, ..., \theta_i-\epsilon, ...)}{2\epsilon}\) \(\approx\) \(d\theta[i] = \frac{\partial{J}}{\partial{\theta_i}}\)

             <==> \(d\theta_{approx} \approx d\theta\)

             <==> \(\frac{||d\theta_{approx} - d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}\) is in an acceptable range: around \(10^{-7}\) is great, while \(10^{-3}\) or larger suggests a bug.

  2. Tips:

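    • Use it only for debugging, not during training (it is slow).
    • If the algorithm fails grad check, look at the individual components (\(db^{[L]}, dw^{[L]}\)) to try to identify the bug.
    • Remember to include the regularization term in both J and the gradients.
    • It doesn't work together with dropout, so turn dropout off (keep_prob = 1) while checking.
    • Run at random initialization; perhaps again after some training.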
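The check above can be sketched on a toy cost whose gradient is known in closed form. The function name grad_check and the toy cost \(J(\theta) = \sum_i \theta_i^2\) are assumptions for illustration; in practice J and its gradient would come from the network's forward and backward pass.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Relative difference between the analytic and the two-sided numerical gradient."""
    grad = grad_J(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        grad_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(grad_approx - grad)
    den = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return num / den

def J(theta):
    return np.sum(theta ** 2)

def grad_ok(theta):
    return 2 * theta          # correct gradient

def grad_buggy(theta):
    return 3 * theta          # deliberately wrong gradient

theta = np.array([1.0, -2.0, 3.0])
print(grad_check(J, grad_ok, theta))      # tiny: gradient is correct
print(grad_check(J, grad_buggy, theta))   # large: something is wrong
```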

9. Exponentially weighted averages:

  Definition: let \(V_{t} = {\beta}V_{t-1} + (1 - \beta)\theta_t\), with \(V_0 = 0\) (the \(V_t\) are the averages, and the \(\theta_t\) are the raw data points).

and \(V_{t}^{corrected} = \frac{V_{t}}{1 - {\beta}^t}\) (to correct the bias of the early estimates).
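The definition and the bias correction can be sketched in a few lines. The function name ewa and the constant test signal are made up for illustration.

```python
import numpy as np

def ewa(theta, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a 1-D sequence, starting from V_0 = 0."""
    v = 0.0
    out = []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

signal = np.full(20, 5.0)                  # constant data: the average should be 5
raw = ewa(signal, bias_correction=False)
corrected = ewa(signal, bias_correction=True)
print(raw[:3])                             # starts far below 5 because V_0 = 0
print(corrected[:3])                       # bias-corrected values
```

Without the correction the first few averages are pulled toward zero; dividing by \(1 - \beta^t\) removes that bias, and the correction fades out as t grows.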

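  Usage (gradient descent with momentum): consider gradient descent that oscillates up and down across a narrow valley of the cost function while making only slow progress along it.

Since the vertical movements average out to almost zero, you can apply EWA to the gradients to damp these oscillations and keep the descent from diverging, while the movement along the valley is preserved.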

  On iteration t:

    Compute dW, db on the current mini-batch

    \(v_{dW} = {\beta}v_{dW} + (1 - \beta)dW\)

    \(v_{db} = {\beta}v_{db} + (1 - \beta)db\)

    \(W = W - {\alpha}v_{dW}, b = b - {\alpha}v_{db}\)

    Hyperparameters: \(\alpha\), \({\beta}(=0.9)\)
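The update rule above can be sketched as a single step function. The name momentum_step and the toy cost \(J = \sum W^2 + \sum b^2\) used to drive it are assumptions for illustration; bias correction is omitted, as in the steps above.

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update; returns the new parameters and velocities."""
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db

# Toy problem (assumption): J = sum(W^2) + sum(b^2), so dW = 2W and db = 2b.
W, b = np.ones((2, 2)), np.ones((2, 1))
v_dW, v_db = np.zeros_like(W), np.zeros_like(b)
for _ in range(300):
    dW, db = 2 * W, 2 * b
    W, b, v_dW, v_db = momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.1)
print(np.abs(W).max(), np.abs(b).max())    # both shrink toward 0
```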
