RNN models for image generation

MARCH 3, 2017
 

Today we’re looking at the remaining papers from the unsupervised learning and generative networks section of the ‘top 100 awesome deep learning papers‘ collection. These are:

DRAW: A recurrent neural network for image generation

The networks we looked at yesterday generate a complete image all at once. This “one shot” approach is hard to scale to large images. If you ask a person to draw some visual scene, they will typically do so in a sequential iterative fashion, working from rough outlines to detail, first on one part, then on another.

The Deep Recurrent Attentive Writer (DRAW) architecture represents a shift towards a more natural form of image construction, in which parts of a scene are created independently from others, and approximate sketches are successively refined.

At its core, DRAW is a variational autoencoder (an autoencoder making use of variational inference techniques). The encoder and decoder however are both RNNs (LSTMs):

… a sequence of code samples is exchanged between them; moreover, the encoder is privy to the decoder’s previous outputs, allowing it to tailor the codes it sends according to the decoder’s behavior so far.

The decoder’s outputs are successively added to the distribution that will ultimately generate the data, as opposed to emitting it in a single step. This much takes care of the iterative part of human image construction. To model the phenomenon of working first on one part of the image, and then on another, an attention mechanism is used to restrict both the input region observed by the encoder, and the output region modified by the decoder.

In simple terms, the network decides at each time-step “where to read” and “where to write” as well as “what to write”.

The attention mechanism resembles the selective read and write operations of the Neural Turing Machine that we looked at last year, however it works in 2D. An array of 2D Guassian filters is applied to the image, which yields image patches of smoothly varying location and zoom.

As illustrated [below], the N ×N grid of Gaussian filters is positioned on the image by specifying the co-ordinates of the grid centre and the stride distance between adjacent filters. The stride controls the ‘zoom’ of the patch; that is, the larger the stride, the larger an area of the original image will be visible in the attention patch, but the lower the effective resolution of the patch will be.

Once a DRAW network has been trained, an image can be generated by iteratively picking latent samples and running the decoder to update the canvas matrix. Here we can see how images evolve when a trained DRAW network generates MNIST digits:

The final generated digits are pretty much indistinguishable from the originals.

Here’s another set of generated images, this time from a network trained on a multi-digit Street View House Numbers dataset:

Pixel recurrent neural networks

In contrast to DRAW, Pixel RNNs use a distinctly un-human approach: they model the probability of raw pixel values. The goal of the work is to be able to model natural images on a large scale, but the authors also evaluated Pixel RNNs on good old MNIST, and reported the best result so far (including against DRAW). There are two variations of the basic idea in the paper, a full-fat Pixel RNN architecture, and a simpler Pixel CNN one – both are substantial.

Pixel RNN network has up to twelve two-dimensional LSTM layers:

These layers use LSTM units in their state and adopt a convolution to compute at once all the states along one of the spatial dimensions of the data.

To generate pixel , Pixel conditions on all the previously generated pixels left and above it.

There are two types of these layers: Row LSTM layers apply the convolution along each row of pixels; Diagonal BiLSTM layers apply the convolution along the diagonals of the image (like a bishop moving on a chessboard).

Here’s an illustration of the Row LSTM convolution with a kernel size of 3 (3 pixels wide). Notice how it fails to reach pixels on the far sides of the image in rows close to .

In contrast, the Diagonal BiLSTM’s dependency field covers the entire available context in the image:

Pixel CNN network is fully convolution and has up to fifteen layers, “We observe that CNNs can also be used as a sequence model with a fixed dependency range, by using Masked convolutions.

With Masked Convolution the R,G,B channels for the current pixel are predicted separately. When predicting the R channel, only the pixels left and above can be used for conditioning. But when we predict the G channel, we can also use the predicted value for R at the current pixel. And when we predict the B value, we can use the predicted values for both R and G.

To restrict connections in the network to these dependencies, we apply a mask to the input-to-state convolutions and to other purely convolutional layers in a PixelRNN. We use two types of masks that we indicate with mask A and mask B, as shown [below]. Mask A is applied only to the first convolutional layer in a PixelRNN and restricts the connections to those neighboring pixels and to those colors in the current pixels that have already been predicted. On the other hand, mask B is applied to all the subsequent input-to-state convolutional transitions and relaxes the restrictions of mask A by also allowing the connection from a color to itself.

The Pixel CNN is the fastest architecture, whereas Pixel RNNs with Diagonal BiLSTM layers perform the best in terms of generating likely images. For generation of larger images, Multiscale Pixel RNNs do even better.

The Multi-Scale PixelRNN is composed of an unconditional PixelRNN and one or more conditional PixelRNNs. The unconditional network first generates a smaller s × s image just like a standard PixelRNN described above, where s is a smaller integer that divides n. The conditional network then takes the s × s image as an additional input and generates a larger n × n image.

One fun experiment that shows the power of the method is to occlude the lower part of an image, and then ask a PixelRNN to complete the image. Here are some example generated completions (with the original in the right-hand column for comparison):

Based on the samples and completions drawn from the models we can conclude that the PixelRNNs are able to model both spatially local and long-range correlations and are able to produce images that are sharp and coherent. Given that these models improve as we make them larger and that there is
practically unlimited data available to train on, more computation and larger models are likely to further improve the results.

Auto-encoding variational Bayes

The auto-encoding paper is short and dense, and I don’t feel able to add much value to it, so here’s a very short summary of what to expect if you go on to read it. The opening sentence sets the tone:

How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posterior distributions? [Oh, and we want to do it with large datasets].

I bet you were wondering exactly that in the shower this morning! Let’s unpack this sentence to see what it’s all about:

  • efficient approximate inference and learning’ simply says that we want a training approach which is not too costly to compute.
  • directed probabilistic models‘ are graph models where the probability at some node is conditioned on the probabilities of other nodes, with directed edges in the graph from cause nodes to affected nodes.
  • continuous latent variables‘ suggests that there are true hidden causes for the observed behaviour, which are not directly represented in our model (latent). These are real-valued (i.e., not discrete).
  • intractable posterior distributions‘ means that we can’t calculate an exact answer for the output probability distribution, so we’ll need to approximate it.
  • The use of large datasets further stresses the need for efficiency.

The authors demonstrate a simple and efficient differentiable (i.e., easily trainable) estimator that fits the bill, which they call Stochastic Gradient Variational Bayes (SGVB). Using this in an autoencoder leads to Auto-Encoding Variational Bayes (AEVB).

If you’re feeling brave and want to dig in (or your knowledge of variational Bayesian methods is better than mine!), then these wikipedia entries on Variational Bayesian methods and Kullback-Leibler divergence might come in handy.

(转) RNN models for image generation的更多相关文章

  1. RNN 入门教程 Part 2 – 使用 numpy 和 theano 分别实现RNN模型

    转载 - Recurrent Neural Networks Tutorial, Part 2 – Implementing a RNN with Python, Numpy and Theano 本 ...

  2. The Unreasonable Effectiveness of Recurrent Neural Networks (RNN)

    http://karpathy.github.io/2015/05/21/rnn-effectiveness/ There’s something magical about Recurrent Ne ...

  3. 论文翻译——Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

    Abstract Semantic word spaces have been very useful but cannot express the meaning of longer phrases ...

  4. (转)Awesome PyTorch List

    Awesome-Pytorch-list 2018-08-10 09:25:16 This blog is copied from: https://github.com/Epsilon-Lee/Aw ...

  5. [2017 - 2018 ACL] 对话系统论文研究点整理

    (论文编号及摘要见 [2017 ACL] 对话系统. [2018 ACL Long] 对话系统. 论文标题[]中最后的数字表示截止2019.1.21 google被引次数) 1. Domain Ada ...

  6. (转) Written Memories: Understanding, Deriving and Extending the LSTM

    R2RT   Written Memories: Understanding, Deriving and Extending the LSTM Tue 26 July 2016 When I was ...

  7. Matlab下多径衰落信道的仿真

    衰落信道参数包括多径扩展和多普勒扩展.时不变的多径扩展相当于一个延时抽头滤波器,而多普勒扩展要注意多普勒功率谱密度,通常使用Jakes功率谱.高斯.均匀功率谱. 多径衰落信道由单径信道叠加而成,而单径 ...

  8. Deep Learning for Information Retrieval

    最近关注了一些Deep Learning在Information Retrieval领域的应用,得益于Deep Model在对文本的表达上展现的优势(比如RNN和CNN),我相信在IR的领域引入Dee ...

  9. Paper Reading - Long-term Recurrent Convolutional Networks for Visual Recognition and Description ( CVPR 2015 )

    Link of the Paper: https://arxiv.org/abs/1411.4389 Main Points: A novel Recurrent Convolutional Arch ...

随机推荐

  1. java三大工厂结果总览

    2018-11-02 21:27:18 开始写 谢谢.Thank you.Salamat Do(撒拉玛特朵).あリがCám o*n(嘉蒙)とゥ(阿里嘎都).감사합니다 (勘三哈咪瘩).terima K ...

  2. Java多线程-----Thread常用方法

    1.public Thread(Runnable target,String name) 创建一个有名称的线程对象 package com.thread.mothed; public class Th ...

  3. Java并发编程:volatile关键字解析-转

    Java并发编程:volatile关键字解析 转自海子:https://www.cnblogs.com/dayanjing/p/9954562.html volatile这个关键字可能很多朋友都听说过 ...

  4. web安全防范之SQL注入攻击、攻击原理和防范措施

    SQL注入 攻击原理 在编写SQL语句时,如果直接将用户传入的数据作为参数使用字符串拼接的方式插入到SQL查询中,那么攻击者可以通过注入其他语句来执行攻击操作,这些攻击包括可以通过SQL语句做的任何事 ...

  5. loadRunner手动关联, web_reg_save_param_regexp()函数正则匹配字符,赋值给变量

    loadRunner写脚本实现登录机票网站,手动关联,获取页面源码中特定字符 手动关联,就是通过函数获取某个步骤生成的字符,赋值给一个变量,这个变量可以作为接下来某个步骤的输入, 以便这个脚本能够在存 ...

  6. (Review cs231n) Gradient Vectorized

    注意: 1.每次更新,都要进行一次完整的forward和backward,想要进行更新,需要梯度,所以你需要前馈样本,马上反向求导,得到梯度,然后根据求得的梯度进行权值微调,完成权值更新. 2.前馈得 ...

  7. 51nod 1130 N的阶乘的长度 V2(斯特林近似)

    输入N求N的阶乘的10进制表示的长度.例如6! = 720,长度为3.   Input 第1行:一个数T,表示后面用作输入测试的数的数量.(1 <= T <= 1000) 第2 - T + ...

  8. centos 文件新建、删除、移动、复制等命令

    创建目录 mkdir 文件名 mkdir /var/www/test cp复制命令 cp命令复制文件从一个位置到另一位置.如果目的地文件存在,将覆复写该文件: 如果目的地目录存在,文件将复制到该目录下 ...

  9. postgresql查询语句

    //查询表名称SELECT tablename FROM pg_tablesWHERE tablename NOT LIKE 'pg%'AND tablename NOT LIKE 'sql_%' O ...

  10. Oracle之数据库的增删改查和格式的修改

    Oracle修改数据 *update语句 格式: update table_name set column1=value1,…[where conditions] 例子: update userinf ...