Paper notes: "SONG FROM PI: A MUSICALLY PLAUSIBLE NETWORK FOR POP MUSIC GENERATION"

Source: ICLR 2017

Motivation

The paper proposes a general RNN-based model for pop music generation that encodes prior musical knowledge (prior knowledge about how pop music is composed) in a hierarchical structure: the bottom layers generate the melody, while higher layers generate drums, chords, and so on. In human listening tests it is rated above the model proposed by Google. The authors also build two small applications on top of it: neural dancing and karaoke, as well as neural story singing.

Introduction

The authors start from machine learning's expansion into the arts: there has already been progress in imitating Van Gogh's painting style, generating stories, producing Shakespeare-style text, and so on, and music generation is one branch of this. RNNs have clear strengths for natural-language text, which makes building music generation on top of them feasible, e.g. [1, 2, 3, 4]. However, these earlier works mostly generate a single track of notes; multi-track generation has been studied in [5] (polyphonic music). The authors instead want to generate melody, chords, drums, and other instrument tracks together, so as to form a pop song in the full sense. Their idea was inspired by a YouTube video of a piano piece played from the digit sequence of $\pi$ (https://youtu.be/OMq9he-5HUU). The rules used to turn this random, non-repeating digit sequence into a piece of music show "both the randomness and the regularity of music. On one hand, since any possible digit sequence is a subset of the $\pi$ digit sequence, this implies that pleasing music can be created even from a totally random base signal. On the other hand, the composer uses specific rules such as A Harmonic Minor scale and harmonies to convert the digit sequence into a music sheet. It is these rules that play the key role in converting randomness into music."

Related work

Broadly, automatic composition has moved from early machine learning combined with music theory [6], to neural-network approaches [1, 2, 3], and then to deep learning with RNNs [4, 7, 8], with explicit music theory increasingly de-emphasized.

Music Basics

What is a note? A note is the basic unit that music is composed of.

Twelve-tone equal temperament: music follows the 12-tone system, i.e., 12 is the cycle length of all notes. The 12 tones are: $C$, $C^\# = D^b$, $D$, $D^\# = E^b$, $E$, $F$, $F^\# = G^b$, $G$, $G^\# = A^b$, $A$, $A^\# = B^b$, $B$.

A bar is a short segment of time that corresponds to a specific number of beats (notes).

A scale is a subset of notes. The four most common scale types are Major (Minor), Harmonic Minor, Melodic Minor, and Blues. For example, the C Major scale starts from C: the subset of notes specified by C Major is C, D, E, F, G, A, and B (a subset of seven notes). All scale types have seven notes except for Blues, which has six. In total there are 48 unique scales, i.e., 4 scale types and 12 possible starting notes. Major and Minor are treated as one type, because for every Major scale there is a Minor scale with exactly the same set of notes; in music theory this is called the Relative Minor.
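To make the 4 scale types × 12 starting notes = 48 counting concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that builds each scale from its standard semitone-interval pattern:

```python
# Standard music-theory interval patterns; names and layout are my choice.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

SCALE_INTERVALS = {                              # semitone offsets from the starting note
    "Major":          [0, 2, 4, 5, 7, 9, 11],    # the Relative Minor shares these notes
    "Harmonic Minor": [0, 2, 3, 5, 7, 8, 11],
    "Melodic Minor":  [0, 2, 3, 5, 7, 9, 11],
    "Blues":          [0, 3, 5, 6, 7, 10],       # the only type with six notes
}

def scale_notes(start_note, scale_type):
    """Return the names of the notes belonging to a scale."""
    root = NOTE_NAMES.index(start_note)
    return [NOTE_NAMES[(root + offset) % 12] for offset in SCALE_INTERVALS[scale_type]]

print(scale_notes("C", "Major"))                 # ['C', 'D', 'E', 'F', 'G', 'A', 'B']
print(len(SCALE_INTERVALS) * len(NOTE_NAMES))    # 48 unique scales
```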

Chords

The Circle of Fifths

With the Circle of Fifths it is easy to pick chord movements that tend to sound strong (strong chord progressions), keeping the whole piece harmonious.
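As a quick illustration (mine, not the paper's): the circle can be generated by repeatedly stepping up a perfect fifth, i.e., 7 semitones modulo 12, which is why neighbouring chords on the circle fit together well:

```python
# Each step moves up a perfect fifth (7 semitones); after 12 steps we return to C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
circle_of_fifths = [NOTE_NAMES[(i * 7) % 12] for i in range(12)]
print(circle_of_fifths)  # ['C', 'G', 'D', 'A', 'E', 'B', 'F#', 'C#', 'G#', 'D#', 'A#', 'F']
```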

Model Structure

When generating music, the scale is used as a conditioning signal so that the model can choose notes. At each time step the melody is encoded as two random variables, a key layer and a press layer, representing which key is pressed and for how long (its duration). For chords and drums, the authors assume they are conditionally independent of each other given the melody: at every time step the melody is used as a condition to generate the chord layer and the drum layer.
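In my reading, and in my own notation rather than an equation copied from the paper (it also simplifies the timing details), the hierarchy corresponds roughly to the per-time-step factorization

$$
P(\text{melody}, \text{chord}, \text{drum} \mid \text{scale}) \approx \prod_t
P\big(y_{key}^{t} \mid y^{<t}, \text{scale}\big)\,
P\big(y_{prs}^{t} \mid y_{key}^{t}, y^{<t}, \text{scale}\big)\,
P\big(\text{chord}^{t} \mid \text{melody}\big)\,
P\big(\text{drum}^{t} \mid \text{melody}\big),
$$

i.e., the melody (key, then press) is generated conditioned on the scale, while the chord and drum layers are each conditioned on the melody but not on each other.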


For the experiments, the authors preprocess the scale condition. A given scale type normally uses only 7 of the 12 tones (6 for Blues). After sampling about 100 hours of pop-song material from the midi_man dataset, the authors normalize all notes by transposing each song so that its scale's starting note becomes C (all other notes are shifted by the same amount); this makes it possible to group every song into one of the 4 scale types.
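A minimal sketch of this normalization, assuming the song's scale root is already known (this mirrors the description above, not the authors' preprocessing code):

```python
def transpose_to_c(midi_notes, scale_root):
    """Shift a song so that its scale's starting note becomes C.

    midi_notes: list of MIDI pitch numbers; scale_root: pitch class of the
    song's scale (0 = C, ..., 11 = B). The octave choice is arbitrary here.
    """
    return [n - scale_root for n in midi_notes]

# Example: a fragment in D (pitch class 2) shifted down two semitones into C.
print(transpose_to_c([62, 66, 69], scale_root=2))  # [60, 64, 67] = C4, E4, G4
```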

Melody generation uses a two-layer RNN (LSTM). Conditioned on the chosen scale, the model generates notes; the first layer is the key layer and the second is the press layer.

One open question for me: since scales differ, do the parameters differ per scale, so that a separate model has to be trained for each one? The input/output note range is restricted to C3 to C6; output notes are encouraged, but not forced, to stay within the scale. This gives three octaves of 12 notes each, plus silence, for 37 possible output values. The press layer's output also goes through a softmax (it is not obvious to me why).
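One way to read the 37-value output space (3 octaves × 12 notes + silence); the index convention below is my own assumption, since the paper only fixes the C3–C6 range:

```python
# Common MIDI numbering with C4 = 60; other conventions shift these by 12.
C3, C6 = 48, 84

def key_to_class(midi_note):
    """Map a note in [C3, C6) to one of 36 pitch classes, or None (silence) to class 0."""
    if midi_note is None:
        return 0                          # silence
    assert C3 <= midi_note < C6           # 3 octaves * 12 semitones = 36 pitches
    return 1 + (midi_note - C3)

print(key_to_class(None), key_to_class(48), key_to_class(83))  # 0 1 36 -> 37 classes in total
```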

The LSTM's inputs include: (1) the note output at the previous time step, one-hot encoded; (2) lookback features (introduced by Google Magenta), which make it easier for the model to remember what it generated recently and repeat it later on; there are some data-structure details here, such as recording how the outputs one bar and two bars ago align with the current output, which really require reading the code to understand; (3) a melody profile, which captures the high-level music flow: "To get the profile for each song, we compute the local note histogram at each time step with width of two bars, and cluster all local histograms within the song into 10 clusters via k-means. We order the 10 clusters with mean note ordered from low to high as cluster 1 to 10, and apply moving averages on the cluster id sequence to encourage local smoothness. This results in a 10-dimensional one-hot vector representation of the cluster id for each time step. This additional information allows the user to set the melody's ups and downs of the song." My understanding is that this profile describes whether the melody is heading up or down.

Press duration is encoded as an increasing sequence 1, 2, 3, ...; the authors argue this is better than Magenta's single note-on message: "This is important, as Waite et al. has extremely unbalanced output distributions dominated by the repeat-of-holding event." The press $y_{prs}^{t}$ is represented as an 8-dimensional one-hot vector. The input to the press-layer LSTM is $y_{prs}^{t-1}$, concatenated with the 37-dimensional one-hot encoding of the melody key $y_{key}^{t}$.
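The quoted melody-profile construction can be sketched as follows (my re-implementation of the quoted description using numpy/scikit-learn; the window size, smoothing width, and rounding back to one-hot are my choices, not the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

def melody_profile(keys, window, n_clusters=10, smooth=4):
    """keys: (T,) int array of note classes (0..36); window: two bars measured in time steps."""
    keys = np.asarray(keys)
    T = len(keys)
    # Local note histogram of width `window`, centred at each time step.
    hists = np.stack([np.bincount(keys[max(0, t - window // 2): t + window // 2 + 1],
                                  minlength=37) for t in range(T)])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(hists)
    # Rank clusters by the mean note of their centroids: low -> cluster 1, high -> cluster 10.
    centroid_mean = (km.cluster_centers_ @ np.arange(37)) / km.cluster_centers_.sum(axis=1)
    rank = np.argsort(np.argsort(centroid_mean)) + 1
    ids = rank[km.labels_].astype(float)
    # Moving average over the cluster-id sequence to encourage local smoothness.
    ids = np.convolve(ids, np.ones(smooth) / smooth, mode="same")
    # Back to a 10-dimensional one-hot vector per time step.
    return np.eye(n_clusters)[np.clip(np.rint(ids).astype(int) - 1, 0, n_clusters - 1)]
```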

Chord layer

The authors find that 99.19% of the chords belong to one of 72 chord classes (6 types × 12 start notes), and that the chord is strongly correlated with the melody; the paper includes a statistics figure of the correspondence between the first melody note and the chord (not reproduced here).
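A quick sanity check of the 6 types × 12 start notes = 72 counting (the six chord types listed here are my assumption; the paper only states the counts):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
CHORD_TYPES = ["maj", "min", "dim", "aug", "maj7", "min7"]   # assumed set of 6 types

chord_classes = [f"{root}:{kind}" for root in NOTE_NAMES for kind in CHORD_TYPES]
print(len(chord_classes))   # 72
print(chord_classes[:3])    # ['C:maj', 'C:min', 'C:dim']
```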


[1] Jamshed J. Bharucha and Peter M. Todd. Modeling the perception of tonal structure with neural nets. Computer Music Journal, 13(4):44–53, 1989.

[2] Michael C. Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 6(2-3), 1996.

[3] Chun-Chi J. Chen and Risto Miikkulainen. Creating melodies with evolving recurrent neural networks. In International Joint Conference on Neural Networks, 2001.

[4] Douglas Eck and Juergen Schmidhuber. A first look at music composition using LSTM recurrent neural networks. 2002.

[5] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.

[6] Michael Chan, John Potter, and Emery Schubert. Improving algorithmic music composition with machine learning. In 9th International Conference on Music Perception and Cognition, 2006.

[7] Semin Kang, Soo-Yol Ok, and Young-Min Kang. Automatic Music Generation and Machine Learning Based Evaluation, pp. 436–443. Springer Berlin Heidelberg, 2012. (polyphonic, but the scale type is enforced)

[8] Allen Huang and Raymond Wu. Deep learning for music. arXiv preprint arXiv:1606.04930, 2016. (2-layer LSTM, able to create chords)

posted on 2018-03-17 16:08 体态的滑翔机