Ho J., Jain A. and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Page E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.

Yerukala R. and Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.

A combination of diffusion models and the variational bound.

Several papers on adversarial robustness have already used DDPM-generated data for training, which speaks to the model's power.

Main content

Diffusion models

Reverse process

Starting from \(p(x_T) = \mathcal{N}(x_T; 0, I)\):

\[p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)),
\]

Note that in this process we fit both the mean \(\mu_{\theta}\) and the covariance matrix \(\Sigma_{\theta}\).

This process gradually 'recovers' the image (signal) \(x_0\) from noise.

Forward process

\[q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I).
\]

Here the \(\beta_t\) are either learnable parameters or manually specified hyperparameters.

This is the process of gradually adding noise to the image (signal).

Variational bound

For the parameters \(\theta\), it is natural to optimize by minimizing the negative log-likelihood:

\[\begin{array}{ll}
\mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg]
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{1:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{1:T} \bigg] \\
&=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\
&= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg] \\
\end{array}
\]

Note: \(q = q(x_{1:T}|x_0)p_{data}(x_0)\); in what follows, let \(q(x_0) := p_{data}(x_0)\).

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}]
&= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\
&= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\
&= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\
&= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg].
\end{array}
\]

\[\begin{array}{ll}
\mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}]
&=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\
&=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\
&=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg].
\end{array}
\]

Hence, finally:

\[\mathcal{L} := \mathbb{E}_q \bigg[
\underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} +
\sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}}
\underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0}
\bigg].
\]

Computing the losses

Since both the forward and the reverse process are Gaussian, each loss term above can be solved explicitly.

First, for \(x_t\) in the forward process:

\[\begin{array}{ll}
x_t
&= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\
&= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta_t} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta_t} \epsilon \\
&= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} +
\sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\
&= \cdots \\
&= (\prod_{s=1}^t \sqrt{1 - \beta_s}) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)} \epsilon,
\end{array}
\]

\[q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \quad \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \: \alpha_s := 1 - \beta_s.
\]
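
For concreteness, here is a minimal sketch of sampling \(x_t \sim q(x_t|x_0)\) in one step; the names (`q_sample`, `alphas_bar`) are mine, not from the official code:

```python
import torch

# Minimal sketch: sample x_t ~ q(x_t | x_0) in closed form.
# `alphas_bar` is a length-T tensor with alphas_bar[t-1] = prod_{s<=t} (1 - beta_s).
def q_sample(x0, t, alphas_bar, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t - 1].view(-1, 1, 1, 1)        # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```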

For the posterior \(q(x_{t-1}|x_t, x_0)\), we have

\[\begin{array}{ll}
q(x_{t-1}|x_t, x_0)
&= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\
&\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t} x_t^Tx_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t x_0^T x_{t-1} \bigg]\Bigg\} \\
\end{array}
\]

Therefore

\[q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I),
\]

where

\[\tilde{\mu}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t,
\]
\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t.
\]
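
A direct transcription of \(\tilde{\mu}_t\) and \(\tilde{\beta}_t\) might look as follows (a sketch with assumed names; we take \(\bar{\alpha}_0 := 1\) so that \(t = 1\) is handled):

```python
import torch

# Sketch: mean and variance of the posterior q(x_{t-1} | x_t, x_0).
# Assumes integer timesteps t >= 1 (as a LongTensor of shape (B,)).
def q_posterior(x0, xt, t, alphas, alphas_bar, betas):
    ones = torch.ones(1, device=x0.device)
    a_bar_t = alphas_bar[t - 1].view(-1, 1, 1, 1)
    a_bar_prev = torch.where(t > 1, alphas_bar[(t - 2).clamp(min=0)], ones)
    a_bar_prev = a_bar_prev.view(-1, 1, 1, 1)            # alpha_bar_{t-1}, with alpha_bar_0 = 1
    alpha_t = alphas[t - 1].view(-1, 1, 1, 1)
    beta_t = betas[t - 1].view(-1, 1, 1, 1)
    mean = (a_bar_prev.sqrt() * beta_t / (1 - a_bar_t)) * x0 \
         + (alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * xt
    var = (1 - a_bar_prev) / (1 - a_bar_t) * beta_t      # tilde beta_t
    return mean, var
```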

\(L_{t}\)

\(L_T\) does not depend on \(\theta\), so it is dropped.

The authors fix \(\Sigma_{\theta}(x_t, t) = \sigma_t^2 I\) as untrained constants, where

\[\sigma_t^2 = \beta_t \quad \text{or} \quad \sigma_t^2 = \tilde{\beta}_t,
\]

these being the choices that minimize the expected KL divergence when \(x_0 \sim \mathcal{N}(0, I)\) and when \(x_0\) is deterministic, respectively (the authors report that both work about equally well in experiments).

\[L_{t-1} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{\mu}_t(x_t, x_0)\|^2 + C, \quad t = 2,\cdots, T.
\]

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon.
\]

Therefore

\[\begin{array}{ll}
\mathbb{E}_q [L_{t-1} - C]
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma_t^2} \| \mu_{\theta}(x_t, t) - \tilde{\mu}_t\big( x_t, \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon \big)\|^2 \bigg\} \\
&= \mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{1}{2 \sigma^2_t} \bigg\| \mu_{\theta}(x_t, t) -
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \bigg\|^2 \bigg\} \\
\end{array}
\]

Note: in the expression above, \(x_t\) is determined by \(x_0\) and \(\epsilon\), i.e., \(x_t = x_t(x_0, \epsilon)\), so the expectation is effectively taken over \(x_t\).

Given this, we may as well parameterize \(\mu_{\theta}\) directly as

\[\mu_{\theta}(x_t, t):=
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big),
\]

i.e., we model the noise (residual) \(\epsilon\) directly.

The loss then simplifies to:

\[\mathbb{E}_{x_0, \epsilon} \bigg\{
\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2
\bigg\}
\]

This is in fact denoising score matching.
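
This yields a very short training step. A sketch under the same assumed names, using the unweighted version of the objective (the leading coefficient is dropped in practice, as noted later):

```python
import torch
import torch.nn.functional as F

# Sketch of one training step: model(x_t, t) plays the role of eps_theta(x_t, t).
def simple_loss(model, x0, alphas_bar):
    T = alphas_bar.shape[0]
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)  # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t - 1].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps              # x_t in one step
    return F.mse_loss(model(xt, t), eps)                           # ||eps_theta(x_t, t) - eps||^2
```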

Similarly, sampling from \(p_{\theta}(x_{t-1}|x_t)\) becomes:

\[x_{t-1} =
\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \: z \sim \mathcal{N}(0, I),
\]

which has the form of Langevin dynamics (with slightly different step sizes and weights).

Note: see here for more on this connection.
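
Running this update from \(t = T\) down to \(t = 1\) gives ancestral sampling (Algorithm 2 in the paper). A sketch, assuming \(\sigma_t^2 = \beta_t\) and adding no noise at the final step:

```python
import torch

# Sketch of the reverse (ancestral) sampling loop; model(x, t) = eps_theta(x, t).
@torch.no_grad()
def p_sample_loop(model, shape, alphas, alphas_bar, betas, device="cpu"):
    x = torch.randn(shape, device=device)                  # x_T ~ N(0, I)
    T = betas.shape[0]
    for t in range(T, 0, -1):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)
        mean = (x - betas[t - 1] / (1 - alphas_bar[t - 1]).sqrt() * eps) / alphas[t - 1].sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)  # z = 0 at t = 1
        x = mean + betas[t - 1].sqrt() * z                 # sigma_t^2 = beta_t
    return x
```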

\(L_0\)

Finally we handle \(L_0\). Here the authors give \(x_0|x_1\) a discrete distribution: pixel values lie in \(\{0, 1, 2, \cdots, 255\}\) and are linearly scaled to \([-1, 1]\). Assume

\[p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \, \mathrm{d}x,
\]
\[\delta_+(x) =
\left \{
\begin{array}{ll}
+\infty & \text{if } x = 1, \\
x + \frac{1}{255} & \text{if } x < 1,
\end{array}
\right .
\quad
\delta_-(x) =
\left \{
\begin{array}{ll}
-\infty & \text{if } x = -1, \\
x - \frac{1}{255} & \text{if } x > -1,
\end{array}
\right .
\]

where \(D\) is the data dimension (the number of pixels).

In effect this partitions the real line under the Gaussian into the bins:

\[(-\infty, -1 + 1/255], (-1 + 1 / 255, -1 + 3/255], \cdots, (1 - 3/255, 1 - 1/255], (1 - 1 / 255, +\infty)
\]

and each pixel value falls into exactly one of them.

When writing actual code, one runs into the problem of evaluating the Gaussian CDF \(\Phi\) (there is no closed form), and the authors use the following approximation:

\[\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi} \big(x + 0.044715 x^3\big) \bigg) \Bigg\}.
\]

This way gradients can flow back through \(L_0\).

Note: this approximation is due to Page.
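
A sketch of how this enters \(L_0\) (the names are assumptions; the bin handling follows the partition above):

```python
import math
import torch

# Differentiable approximation of the standard normal CDF (the formula above).
def approx_std_normal_cdf(x):
    return 0.5 * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Sketch: log p(x0 | x1) by integrating the Gaussian over each pixel's bin of width 2/255.
def discretized_gaussian_log_likelihood(x0, mean, log_std):
    inv_std = torch.exp(-log_std)
    cdf_plus = approx_std_normal_cdf(inv_std * (x0 - mean + 1.0 / 255))
    cdf_minus = approx_std_normal_cdf(inv_std * (x0 - mean - 1.0 / 255))
    # the edge bins extend to +inf / -inf
    cdf_plus = torch.where(x0 > 1.0 - 1.0 / 255, torch.ones_like(cdf_plus), cdf_plus)
    cdf_minus = torch.where(x0 < -1.0 + 1.0 / 255, torch.zeros_like(cdf_minus), cdf_minus)
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
```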

The final algorithm

Note: \(t=1\) corresponds to \(L_0\), and \(t=2,\cdots, T\) correspond to \(L_{1}, \cdots, L_{T-1}\).

Note: the authors drop the leading coefficient of \(L_t\); this is itself a form of reweighting (and empirically a beneficial one).

In practice the authors train on a randomly sampled loss term (one uniformly drawn \(t\) per example) rather than the full sum.

Details

Note that \(\epsilon_{\theta}(\cdot, t)\) explicitly depends on \(t\). In the experiments this is implemented via the positional encoding used in attention; letting the positional encoding be \(P\):

  1. \(t_{emb} = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))\), i.e., the time-step embedding time_steps is obtained via a two-layer MLP;
  2. the backbone is a U-Net, and in each residual block the embedding is injected via (see the sketch after this list):
\[x \mathrel{+}= \text{Linear}(\text{ACT}(t_{emb})).
\]
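
A sketch of both pieces (the hidden dimensions and the choice of `nn.SiLU` are my assumptions; the paper uses a swish-like activation):

```python
import math
import torch
import torch.nn as nn

# Sketch of item 1: sinusoidal encoding of t followed by a two-layer MLP.
class TimeEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, 4 * dim)
        )

    def forward(self, t):  # t: (B,) integer timesteps
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1)
        )
        args = t.float()[:, None] * freqs[None, :]
        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)
        return self.mlp(emb)

# Sketch of item 2: inside each residual block, project and add to the feature map:
#   h = h + nn.Linear(4 * dim, channels)(act(t_emb))[:, :, None, None]
```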
Parameters

\(T\): 1000
\(\beta_t\): increases linearly from 0.0001 to 0.02 over \(t = 1, 2, \cdots, T\)
backbone: U-Net
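
With these values, the schedule constants used throughout the formulas above take only a few lines (a sketch):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # beta_t grows linearly over t = 1, ..., T
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)   # alpha_bar_t = prod_{s<=t} alpha_s
```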

Note: the implementation also uses tricks such as an EMA (exponential moving average) of the weights.

Code

Official code

lucidrains-denoising-diffusion-pytorch
