出处:2018 AAAI

SourceCode:https://github.com/salu133445/musegan

abstract

(写得不错 值得借鉴)重点阐述了生成音乐和生成图片,视频及语音的不同。首先音乐是基于时间序列的;其次音符在和弦、琶音(arpeggios)、旋律、复音等规则的控制之下的;同时一首歌曲是多track的。总之不能简单堆叠音符。本文基于GAN提出了三种模型来生成音乐:jamming model, the composer model and the hybrid model。作者从摇滚音乐中挑选出了10万个bar来进行训练,生成5个轨道的piano-rolls:bass, drums, guitar, piano and strings。同时作者使用了一些intra-track and inter-track objective metrics来衡量生成的音乐质量(?)。

Introduction:

GAN在文字,图片,视频上取得了巨大的成就,音乐方面也有些进展,但问题在于:

(1)音乐有自己的基于时间的架构,如下图所示:

(2)音乐是多轨道/多乐器的

现代管弦乐(orchestra)常常有4个部分:brass, strings, woodwinds and percussion,摇滚乐队常用的是bass, a drum set, guitars and possibly a vocal,音乐理论要求这些元素按时间展开后harmony并且counterpoint.

(3)musical notes are often grouped into chords,arpeggios or melodies.所以,单音(monophonic)的音乐和NLP的生成都不能直接被引入来生成复调(polyphonic)的音乐。

由于上述三个问题,许多已有工作做了一些简化的处理方式,生成单轨单音音乐,introducing a chronological ordering of notes for polyphonic music,组合单音音乐变成复音音乐等。作者的目标是摒弃这些简化手法,1) harmonic and rhythmic structure, 2) multi-track interdependency, and 3) temporal structure。该模型能够产生音乐from scratch (i.e. without human inputs),也能follow the underlying temporal structure of a track given a priori by human.作者提出了三种方式来处理track之间的交互

(1)每个track独立生成 one generates tracks independently by their private generators (one for each)

(2)所有track由一个生成器生成 another generates all tracks jointly with only one generator

(3)在(1)的基础上,每个track生成时有额外的input信息,以保证harmonious and coordinated

为了突出group的性质,作者关注bars(这点参考了[1]),而不是notes,并使用CNN来提取隐藏特征。

除了刚才提到的测量标准,最后居然找了144个路人来对生成音乐进行评测。

contribution:

接下来介绍了GAN 和WGAN,WGAN-GP并最终选用WGAN-GP。

Proposed Model:

这里再次强调了关注的是Bar[1],并列举了一些理由。

  • 数据表示

使用了multiple-track piano-roll representation表示方式,a piano-roll representation is a binary-valued, scoresheet-like matrix representing the presence of notes over different time steps, and a multiple-track piano-roll is defined as a set of piano-rolls of different tracks。一个有M个track,每个trank有R个time_step,候选bar数量为S的bar记录为X,其数据形式为$X^{RxSxM}$,T个bar则被表示为 $\{X^{t}\}_{t=1}^{T}$。因此每个X的矩阵大小是固定的,有利于CNN训练特征

  • 构建Tranck间的相关性(Interdependency)

提出了三种谱曲方式

Jamming Model--每一个Track拥有自己的一组G和D,及独立的隐空间变量Zi。

Composer Model -- 全局一组G和D,公用Z来生成所有的Track

Hybrid Model -- 混合上面两种模式,每个track一个Gi接受独立的Zi(intra-track random vector)及全局的Z(inter-track random vector)共同组合成的输入向量,同时公用一个D来生成track。与Composer Model相比,混合模式更加灵活,可以在G模型中使用不同的参数(如层数,卷积核大小等),将音轨的独立生成和全局和谐结合起来。

  • 构建时序相关性(Temporal Structure)

上面提到的结构目的在于怎样在不同音轨中生成单个的bar,bar与bar之间的时序关联需要其他的结构来补充生成。作者采用了两种方式:

Generation from Scratch -- 将G分为两个sub network:$G_{temp}$和$G_{bar}$,$G_{temp}$将z映射成一个隐空间向量的序列,作者希望它能承载一些时序信息,随后被送入$G_{bar}$,序列化地生成piano-rolls。

Track-conditional Generation--这种方式假定了各个track的n个bar已经被给定了,即为,这里添加了一个编码器E,负责将映射为(这个也是从[1]里参考得来的)

  • MuseGAN

模型的输入由4部分构成:

an inter-track time-dependent random vectors $z_t$        轨道间全局  时间相关 向量

an inter-track time-independent random vectors z            轨道间全局 时间无关 向量

an intra-track time-independent random vectors $z_i$      轨道内单独 时间无关 向量

an intra-track time-dependent random vectors $z_{i,t}$    轨道内单独  时间相关 向量

从该生成公式上可以清楚地看出,各轨道间的输入变量(分为时间相关和无关)和全局输入变量(分为时间相关和无关)如何结合起来,形成MuseGan生成系统

Dataset

MuseGAN的piano-roll训练数据是基于Lakh MIDI dataset (LMD)[3],原数据集噪声很大,使用了三步来做清理(如下图),midi解析使用了pretty midi[2]

要注意的是,(1)一些track上的note非常稀疏,这里作者对这种不平衡数据做了merge操作(merging tracks of similar instruments by summing their piano-rolls,具体可能需要看代码),对于非bass, drums, guitar, piano and strings 这5类的track统一归纳到string上去.[5,6]中对track类型进行了较好的数据预归类;(2)选取piano-roll时选取higher confidence score in matching的,rock标签,4/4拍;(3)piano-roll的segment采用了state_of_art的方式structural features[7],每4个小节为一个phrase。Notably, although we use our models to generate fixed-length segments only, the track-conditional model is able to generate music of any length according to the input.(4)音域使用C1到C8(钢琴最右键)。

最终输出一首歌的tensor为:4 (bar) × 96 (time step) × 84 (note) × 5 (track)

模型设置:

根据WGAN的理论,update G once every five updates of D and apply batch normalization only to G。其余略。

Objective Metrics for Evaluation:

使用了4个intra-track和1个inter-track(最后一个)度量标准

  • EB: ratio of empty bars (in %)
  • UPC: number of used pitch classes per bar (from 0 to 12)
  • QN: ratio of “qualified” notes (in %) 一个长度不少于3个time_step的音符被认为是qualified的。这个指标可以衡量是否生成的音乐是否过于碎片化。
  • DP, or drum pattern: ratio of notes in 8- or 16-beat patterns, common ones for Rock songs in 4/4 time (in %).
  • TD: or tonal distance [8]. It measures the hamornicity between a pair of tracks. Larger TD implies weaker inter-track harmonic relations.调式距离?

[9]:综述 [10]RNN生成music [11]生成chorales [12]:Song from PI [13]:C-RNN-GAN [14]:seqGAN(combined GANs and reinforcement learning to gen sequences of discrete tokens. It has been applied to generate monophonic music, using the note event representation) [15]:midi_net(convolutional GANs to generate melodies that follows a chord sequence given a priori, either from scratch or conditioned on the melody of previous bars)

[1]Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolicdomain music generation. In ISMIR.

[2]Raffel, C., and Ellis, D. P. W. 2014. Intuitive analysis, creation and manipulation of MIDI data with pretty midi. In ISMIR Late Breaking and Demo Papers.
[3]Raffel, C., and Ellis, D. P. W. 2016. Extracting ground truth information from MIDI files: A MIDIfesto. In ISMIR.
[4]Raffel, C. 2016. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Dissertation, Columbia University.

[5]Chu, H.; Urtasun, R.; and Fidler, S. 2017. Song from PI: A musically plausible network for pop music generation. In ICLR Workshop.

[6]Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In ISMIR.

[7]Serrà, J.; Mller, M.; Grosche, P.; and Arcos, J. L. 2012. Unsupervised detection of music boundaries by time series structure features. In AAAI.

[8]Harte, C.; Sandler, M.; and Gasser, M. 2006. Detecting harmonic change in musical audio. In ACM MM workshop on Audio and music computing multimedia.

[9]Briot, J.-P.; Hadjeres, G.; and Pachet, F. 2017. Deep learning techniques for music generation: A survey. arXiv preprint arXiv:1709.01620.

[10] Sturm, B. L.; Santos, J. F.; Ben-Tal, O.; and Korshunova, I. 2016. Music transcription modelling and composition using deep learning. In Conference on Computer Simulation of Musical Creativity.

[11]Hadjeres, G.; Pachet, F.; and Nielsen, F. 2017. DeepBach:A steerable model for Bach chorales generation. In ICML.

[12]Chu, H.; Urtasun, R.; and Fidler, S. 2017. Song from PI: A musically plausible network for pop music generation. In ICLR Workshop.

[13]Mogren, O. 2016. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. In NIPS Worshop on Constructive Machine Learning Workshop.

[14]Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.

[15]Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In ISMIR.

《MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment》论文阅读笔记的更多相关文章

  1. (转)Introductory guide to Generative Adversarial Networks (GANs) and their promise!

    Introductory guide to Generative Adversarial Networks (GANs) and their promise! Introduction Neural ...

  2. 生成对抗网络(Generative Adversarial Networks,GAN)初探

    1. 从纳什均衡(Nash equilibrium)说起 我们先来看看纳什均衡的经济学定义: 所谓纳什均衡,指的是参与人的这样一种策略组合,在该策略组合上,任何参与人单独改变策略都不会得到好处.换句话 ...

  3. 生成对抗网络(Generative Adversarial Networks, GAN)

      生成对抗网络(Generative Adversarial Networks, GAN)是一种深度学习模型,是近年来复杂分布上无监督学习最具前景的学习方法之一.   GAN 主要包括了两个部分,即 ...

  4. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks 论文笔记

    StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks  本文将利 ...

  5. 论文笔记之:Semi-Supervised Learning with Generative Adversarial Networks

    Semi-Supervised Learning with Generative Adversarial Networks 引言:本文将产生式对抗网络(GAN)拓展到半监督学习,通过强制判别器来输出类 ...

  6. 《Self-Attention Generative Adversarial Networks》里的注意力计算

    前天看了 criss-cross 里的注意力模型  仔细理解了  在: https://www.cnblogs.com/yjphhw/p/10750797.html 今天又看了一个注意力模型 < ...

  7. Paper Reading: Perceptual Generative Adversarial Networks for Small Object Detection

    Perceptual Generative Adversarial Networks for Small Object Detection 2017-07-11  19:47:46   CVPR 20 ...

  8. SalGAN: Visual saliency prediction with generative adversarial networks

    SalGAN: Visual saliency prediction with generative adversarial networks 2017-03-17 摘要:本文引入了对抗网络的对抗训练 ...

  9. Generative Adversarial Networks,gan论文的畅想

    前天看完Generative Adversarial Networks的论文,不知道有什么用处,总想着机器生成的数据会有机器的局限性,所以百度看了一些别人 的看法和观点,可能我是机器学习小白吧,看完之 ...

随机推荐

  1. POJ 1985 Cow Marathon【树的直径】

    题目大意:给你一棵树,要你求树的直径的长度 思路:随便找个点bfs出最长的点,那个点一定是一条直径的起点,再从那个点BFS出最长点即可 以下研究了半天才敢交,1.这题的输入格式遵照poj1984,其实 ...

  2. 1027 stl

    #include<stdio.h> #include<queue> using namespace std; int main() {  int i,n,m,j,k,a[100 ...

  3. Foundation框架的一些实用方法:替换字符串,去空格,反转

    //定义一个可变字符串, Format后面可以跟字符串类型,也可以传入C语言的字符串数组 NSMutableString *str = [NSMutableString stringWithForma ...

  4. php统计图类库JpGraph

    php统计图类库JpGraph JpGraph官网地址:https://jpgraph.net/. (1)下载类库: 下载地址:https://jpgraph.net/download/. 选择版本, ...

  5. 利用开源工具实现轻量级上网行为审计(来源ispublic.com)

    https://blog.csdn.net/cnbird2008/article/details/5875781

  6. CodeForces 596B Wilbur and Array

    简单题,一个一个操作,最后就是答案. #include<cstdio> #include<cstring> #include<cmath> #include< ...

  7. 10-JS的函数学习

    <html> <head> <title>js的函数学习</title> <meta charset="UTF-8"/> ...

  8. 从头开始学Android之(一)——— Android架构

    从事Android开发已经两年多了,最近项目上特别清闲,刚开始时在闲暇的时候都不知道干嘛,整天混日子.有一天突然有个以前同学找到我,说要我帮忙做一个Android的需求,就是在后台截屏(涉及到服务以及 ...

  9. centos6.5 (linux) 禁用模块 IPV6模块的方法

    装完centos后,默认开启了一些模块.可是有些模块并非我们必须的.比方眼下尚未在中国普及的IPV6 怎样关闭IPV6呢 以下介绍的方法,也能够在关闭其它模块的时候使用 第一步: 查找模块名称 使用命 ...

  10. [外文理解] DDD创始人Eric Vans:要实现DDD原始意图,必须CQRS+Event Sourcing架构。

    原文:http://www.infoq.com/interviews/Technology-Influences-DDD# 要实现DDD(domain drive  design 领域驱动设计)原始意 ...