








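Most GAN-based models for image generation (Radford et al., 2016; Salimans et al., 2016; Karras et al., 2018) are built from convolutional layers. A convolution processes information in a local neighborhood, so using convolutional layers alone is computationally inefficient for modeling long-range dependencies in images. In this section, we adapt the non-local model of (Wang et al., 2018) to introduce self-attention into the GAN framework, enabling both the generator and the discriminator to efficiently model relationships between widely separated spatial regions. We call the proposed method Self-Attention Generative Adversarial Networks (SAGAN) because of its self-attention module (see Figure 2).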

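The image features from the previous hidden layer, $x \in \mathbb{R}^{C \times N}$, are first transformed into two feature spaces $f$ and $g$ to calculate the attention, where $f(x) = W_f x$ and $g(x) = W_g x$:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad \text{where } s_{ij} = f(x_i)^T g(x_j),$$

and $\beta_{j,i}$ indicates the extent to which the model attends to the $i$-th location when synthesizing the $j$-th region. Here, $C$ is the number of channels and $N$ is the number of feature locations of the features from the previous hidden layer. The output of the attention layer is $o = (o_1, o_2, \ldots, o_j, \ldots, o_N) \in \mathbb{R}^{C \times N}$, where:

$$o_j = v\left(\sum_{i=1}^{N} \beta_{j,i}\, h(x_i)\right), \quad h(x_i) = W_h x_i, \quad v(x_i) = W_v x_i.$$
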
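In the above formulation, $W_g \in \mathbb{R}^{\bar{C} \times C}$, $W_f \in \mathbb{R}^{\bar{C} \times C}$, $W_h \in \mathbb{R}^{\bar{C} \times C}$ and $W_v \in \mathbb{R}^{C \times \bar{C}}$ are learned weight matrices, implemented as 1×1 convolutions. We did not notice any significant performance decrease when reducing the number of channels $\bar{C}$ to $C / k$, with $k = 1, 2, 4, 8$, after a few training epochs on ImageNet. For memory efficiency, we set $k = 8$ (i.e., $\bar{C} = C / 8$) in all our experiments.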
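As a rough illustration of why a larger k saves memory, the sketch below counts the weights of the four 1×1 convolutions for each k, using the matrix shapes given above. The function name `attention_param_count` and the choice C = 64 are illustrative assumptions, and bias terms are ignored.

```python
def attention_param_count(c, k):
    """Count the weights in W_f, W_g, W_h (each C_bar x C) and W_v (C x C_bar).

    Assumes C_bar = C / k and ignores bias terms (an illustrative simplification).
    """
    if c % k != 0:
        raise ValueError("k must divide the channel count C")
    c_bar = c // k
    return 3 * c_bar * c + c * c_bar


if __name__ == "__main__":
    # C = 64 is an arbitrary example value, not taken from the text.
    for k in (1, 2, 4, 8):
        print(f"k={k}: {attention_param_count(64, k)} weights")
```

With these shapes the weight count scales as 1/k, so k = 8 needs one eighth of the weights of k = 1.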

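To summarize, C×N = C×(W×H): the N feature locations are the W×H spatial positions flattened into a single axis. Multiplying the transpose of f(x) by g(x) gives an s matrix of size [N, N], which represents the relationship between every pair of pixels and can be viewed as a correlation matrix. h(x) is handled slightly differently: its output is [C, N].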

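Softmax is then applied to normalize the s matrix, giving the β matrix. βj,i indicates how much the model attends to the i-th location when synthesizing the j-th pixel, so β is an attention map.

The attention map is then applied to the feature map output by h(x). Each h(xi) that influences the j-th generated pixel is multiplied by its weight βj,i, and the products are summed, so pixel j is generated according to these weights. Passing this result through one more convolution layer gives o, the feature map with attention applied.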
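The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the SAGAN implementation: the 1×1 convolutions are written as plain matrix products on the flattened C×N feature map, the weights are random, and the names `self_attention`, `w_f`, `w_g`, `w_h` and `w_v` are chosen here for illustration. In the demo, h keeps all C channels, as in the summary above; if h maps to C/8 channels instead, only the shapes of `w_h` and `w_v` change.

```python
import numpy as np


def self_attention(x, w_f, w_g, w_h, w_v):
    """Self-attention over a flattened feature map x of shape (C, N)."""
    f = w_f @ x  # (C_bar, N)
    g = w_g @ x  # (C_bar, N)
    h = w_h @ x  # (C_h, N)
    s = f.T @ g  # (N, N), with s[i, j] = f(x_i)^T g(x_j)
    logits = s.T  # logits[j, i] = s[i, j]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    beta = e / e.sum(axis=1, keepdims=True)  # beta[j, i]; each row sums to 1
    # Column j of o is v(sum_i beta[j, i] * h(x_i)).
    o = w_v @ (h @ beta.T)  # (C, N)
    return o, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, width, height, k = 16, 4, 5, 8  # illustrative sizes
    n = width * height
    c_bar = c // k
    x = rng.standard_normal((c, n))
    w_f = rng.standard_normal((c_bar, c))
    w_g = rng.standard_normal((c_bar, c))
    w_h = rng.standard_normal((c, c))
    w_v = rng.standard_normal((c, c))
    o, beta = self_attention(x, w_f, w_g, w_h, w_v)
    print(o.shape, beta.shape)
```

Row j of `beta` is the attention map for output position j, which is why the softmax runs over the second axis.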






A patch discriminator is used.

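What is a patch discriminator, for example a 70*70 one? For a long time I misunderstood it: I thought the generated fake image and the real image had to be cut into 70*70 patches that were then fed to the discriminator one by one. That is in effect what happens, but the implementation is simpler. The input is still a whole image. For example, the 70*70 patch discriminator in pix2pix is defined as: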

import functools
from torch import nn


class NLayerDiscriminator(nn.Module):
    """Defines a PatchGAN discriminator"""

    def __init__(self, input_nc, ndf=64, n_layers=3, norm_layer=nn.BatchNorm2d):
        """Construct a PatchGAN discriminator

        Parameters:
            input_nc (int)  -- the number of channels in input images
            ndf (int)       -- the number of filters in the last conv layer
            n_layers (int)  -- the number of conv layers in the discriminator
            norm_layer      -- normalization layer
        """
        super(NLayerDiscriminator, self).__init__()
        if type(norm_layer) == functools.partial:  # no need to use bias as BatchNorm2d has affine parameters
            use_bias = norm_layer.func == nn.InstanceNorm2d
        else:
            use_bias = norm_layer == nn.InstanceNorm2d

        kw = 4
        padw = 1
        sequence = [nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw), nn.LeakyReLU(0.2, True)]
        nf_mult = 1
        nf_mult_prev = 1
        for n in range(1, n_layers):  # gradually increase the number of filters
            nf_mult_prev = nf_mult
            nf_mult = min(2 ** n, 8)
            sequence += [
                nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=2, padding=padw, bias=use_bias),
                norm_layer(ndf * nf_mult),
                nn.LeakyReLU(0.2, True)
            ]

        nf_mult_prev = nf_mult
        nf_mult = min(2 ** n_layers, 8)
        sequence += [
            nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=1, padding=padw, bias=use_bias),
            norm_layer(ndf * nf_mult),
            nn.LeakyReLU(0.2, True)
        ]

        # the final layer outputs a single channel
        sequence += [nn.Conv2d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)]  # output 1 channel prediction map
        self.model = nn.Sequential(*sequence)

    def forward(self, input):
        """Standard forward."""
        return self.model(input)


if __name__ == "__main__":
    d = NLayerDiscriminator(3)  # 3 input channels, i.e. an RGB image
    for module in d.children():
        print(module)


Sequential(
  (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
  (1): LeakyReLU(negative_slope=0.2, inplace=True)
  (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
  (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (4): LeakyReLU(negative_slope=0.2, inplace=True)
  (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
  (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (7): LeakyReLU(negative_slope=0.2, inplace=True)
  (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1), bias=False)
  (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (10): LeakyReLU(negative_slope=0.2, inplace=True)
  (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
)


layer    kernel size   stride   dilation   padding   input size   output size   receptive field
1        4             2        1          1         256          128           70
2        4             2        1          1         128          64            34
3        4             2        1          1         64           32            16
4        4             1        1          1         32           31            7
5        4             1        1          1         31           30            4
output                                               30                         1
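The input and output sizes in the table follow from the standard convolution output-size formula, out = floor((in + 2·padding − dilation·(kernel − 1) − 1) / stride) + 1. The helper below is a small sketch that reproduces the size column for a 256×256 input; the names `conv_output_size`, `trace_sizes` and `PATCHGAN_LAYERS` are made up for illustration.

```python
def conv_output_size(size, ksize, stride, padding, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (ksize - 1) - 1) // stride + 1


def trace_sizes(input_size, layers):
    """Return the input size followed by the output size of each layer."""
    sizes = [input_size]
    for ksize, stride, padding in layers:
        sizes.append(conv_output_size(sizes[-1], ksize, stride, padding))
    return sizes


# (kernel size, stride, padding) of the five conv layers listed in the table.
PATCHGAN_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

if __name__ == "__main__":
    print(trace_sizes(256, PATCHGAN_LAYERS))  # [256, 128, 64, 32, 31, 30]
```

Each element of the final 30×30 map scores one 70×70 patch of the input, which is why the whole image can be passed through the network in a single forward pass.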






def f(output_size, ksize, stride):
    """Input extent needed to produce output_size outputs of one conv layer."""
    return (output_size - 1) * stride + ksize


if __name__ == "__main__":
    # Start from a single output value and walk back through the layers.
    last_layer = f(output_size=1, ksize=4, stride=1)
    # Receptive field: 4
    fourth_layer = f(output_size=last_layer, ksize=4, stride=1)
    # Receptive field: 7
    third_layer = f(output_size=fourth_layer, ksize=4, stride=2)
    # Receptive field: 16
    second_layer = f(output_size=third_layer, ksize=4, stride=2)
    # Receptive field: 34
    first_layer = f(output_size=second_layer, ksize=4, stride=2)
    # Receptive field: 70
    print(first_layer)





import torch
from torch import nn


class GANLoss(nn.Module):
    """Define different GAN objectives.

    The GANLoss class abstracts away the need to create the target label tensor
    that has the same size as the input.
    """

    def __init__(self, gan_mode, target_real_label=1.0, target_fake_label=0.0):
        """Initialize the GANLoss class.

        Parameters:
            gan_mode (str) -- the type of GAN objective. It currently supports vanilla, lsgan, and wgangp.
            target_real_label (float) -- label for a real image
            target_fake_label (float) -- label of a fake image

        Note: Do not use sigmoid as the last layer of Discriminator.
        LSGAN needs no sigmoid. vanilla GANs will handle it with BCEWithLogitsLoss.
        """
        super(GANLoss, self).__init__()
        self.register_buffer('real_label', torch.tensor(target_real_label))
        self.register_buffer('fake_label', torch.tensor(target_fake_label))
        self.gan_mode = gan_mode
        if gan_mode == 'lsgan':  # least-squares loss
            self.loss = nn.MSELoss()
        elif gan_mode == 'vanilla':  # cross-entropy loss
            self.loss = nn.BCEWithLogitsLoss()
        elif gan_mode in ['wgangp']:  # gradient-penalty loss, see the cal_gradient_penalty function
            self.loss = None
        else:
            raise NotImplementedError('gan mode %s not implemented' % gan_mode)

    def get_target_tensor(self, prediction, target_is_real):
        """Create label tensors with the same size as the input.

        Parameters:
            prediction (tensor) -- typically the prediction from a discriminator; it determines the size of the target
            target_is_real (bool) -- if the ground truth label is for real images or fake images

        Returns:
            A label tensor filled with ground truth label, and with the size of the input
        """
        if target_is_real:  # the prediction should correspond to a real image, so the target is the real label
            target_tensor = self.real_label
        else:  # the prediction should correspond to a fake image, so the target is the fake label
            target_tensor = self.fake_label
        return target_tensor.expand_as(prediction)  # this is the key step

    def __call__(self, prediction, target_is_real):
        """Calculate loss given Discriminator's output and ground truth labels.

        Parameters:
            prediction (tensor) -- typically the prediction output from a discriminator
            target_is_real (bool) -- if the ground truth label is for real images or fake images

        Returns:
            the calculated loss.
        """
        if self.gan_mode in ['lsgan', 'vanilla']:
            target_tensor = self.get_target_tensor(prediction, target_is_real)
            loss = self.loss(prediction, target_tensor)  # compute the loss
        elif self.gan_mode == 'wgangp':
            if target_is_real:
                loss = -prediction.mean()
            else:
                loss = prediction.mean()
        return loss
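To make the role of `expand_as` concrete, here is a NumPy analogue of the 'lsgan' and 'vanilla' branches applied to a patch prediction map. It is an illustration only and not part of the pix2pix code: `np.broadcast_to` stands in for `expand_as`, the function names are invented here, and the 30×30 map size is taken from the 256×256 example earlier.

```python
import numpy as np


def target_like(prediction, target_is_real, real_label=1.0, fake_label=0.0):
    """Broadcast a scalar label to the shape of the prediction map."""
    label = real_label if target_is_real else fake_label
    return np.broadcast_to(np.float64(label), prediction.shape)


def lsgan_loss(prediction, target_is_real):
    """Mean squared error against the broadcast label (what MSELoss computes)."""
    target = target_like(prediction, target_is_real)
    return float(np.mean((prediction - target) ** 2))


def vanilla_loss(logits, target_is_real):
    """Numerically stable binary cross-entropy on raw logits, averaged over all patches."""
    target = target_like(logits, target_is_real)
    per_patch = np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(per_patch))


if __name__ == "__main__":
    prediction = np.zeros((1, 1, 30, 30))  # one score per 70x70 patch
    print(lsgan_loss(prediction, True))    # 1.0
    print(vanilla_loss(prediction, True))  # log(2), about 0.693
```

Because the label is broadcast to every position of the map, each patch score is pushed toward the same target, and the loss is the average over all patches.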


2) pix2pixHD

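This one uses three 70*70 patch discriminators. Their inputs are the original image, the image downsampled once (by a factor of 2), and the image downsampled twice (by a factor of 4).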
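A rough sketch of how the three inputs could be produced. As a stand-in for the downsampling operation, it uses plain 2×2 average pooling in NumPy, which is an assumption made here for simplicity and may differ from the pooling used in the pix2pixHD code. The names `downsample2x` and `image_pyramid` and the 512×512 image size are illustrative.

```python
import numpy as np


def downsample2x(image):
    """Halve the height and width of a (C, H, W) array by 2x2 average pooling."""
    c, h, w = image.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    return image.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def image_pyramid(image, num_scales=3):
    """Return the image followed by its repeatedly downsampled versions."""
    scales = [image]
    for _ in range(num_scales - 1):
        scales.append(downsample2x(scales[-1]))
    return scales


if __name__ == "__main__":
    image = np.random.default_rng(0).random((3, 512, 512))
    for level, scaled in enumerate(image_pyramid(image)):
        # With 2x2 pooling, a 70-pixel patch at this scale spans
        # 70 * 2**level pixels of the original image.
        print(level, scaled.shape, 70 * 2 ** level)
```

The same 70*70 discriminator therefore sees a larger part of the scene at the coarser scales, even though its architecture is unchanged.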

3) ours

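This uses 2, 3 or 4 patch discriminators. The patch discriminators have different depths and therefore different receptive fields, so the setup is no longer limited to the 70*70 case. The deepest discriminator must have a global receptive field so that it can capture global information. In addition, each sub-network shares the weights of its first few layers with the other sub-networks.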
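To see how depth changes the receptive field, the sketch below applies the backward recurrence from the receptive-field calculation earlier to the layer layout of the pix2pix `NLayerDiscriminator`: `n_layers` stride-2 convolutions followed by two stride-1 convolutions, all with 4×4 kernels. Treating the sub-networks here as having that layout is an assumption made for illustration, and the actual architectures may differ.

```python
def receptive_field(n_layers, ksize=4):
    """Receptive field of one output value for the assumed PatchGAN layout."""
    strides = [2] * n_layers + [1, 1]
    size = 1
    # Walk from the output back to the input, one layer at a time.
    for stride in reversed(strides):
        size = (size - 1) * stride + ksize
    return size


if __name__ == "__main__":
    for n_layers in range(1, 6):
        print(n_layers, receptive_field(n_layers))
    # 1 16
    # 2 34
    # 3 70
    # 4 142
    # 5 286
```

Under this assumption, each extra stride-2 layer roughly doubles the receptive field, so a deep enough sub-network covers the whole input image.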

