Pop-Diffuseq paper supplement and generated samples

        Welcome to the supplement page!
        POP-Diffuseq, the digital music content infilling in Midi format. The audio samples on this page are exported from Apple Logic build-in sounds.
        Implement code are available on Github:
        https://github.com/musicai-cakecake/pop-diffuseq
        The trained models, datasets and all unfiltered samples can be downloaded in Google Drive:
        https://drive.google.com/drive/folders/1OWI_sfYmYn2gB13KccO9kkXCP2x8Z-1x?usp=drive_link
        Our MIDI dataset PopBand3k are also in this link.

MIDI generation guided by musical terms.

Figure 1. Pop-Diffuseq: Conditional pop music accompaniment generation. POP-Diffuseq can create pop music scores
guided by musical terms(prompts) such as instrument and tempo, and is qualified to any-to-any instruments generation.

Method and pipeline model configurations

        We present POP-Diffuseq, a pop music infilling model based on classifier-free discrete diffusion. In this work, the midi content (such pitch,onset,duration etc.) is encoded in a syntax pattern of Structure-Aligned (SA) tensor, where rows represent note (chord) units and columns express the attributes of the notes (chords). The encoded tensors will be trained and inferred by the deep learning model. Of course, the music score (midi file) is decoded from inferred tensor via the Structure-Aligned (SA) grammar pattern. The pipeline is shown in Figure 2. This approach is called “representation learning” in textual and symbolic music research.

Pipeline.

Figure 2. Method and pipeline of training and inference.


        The greatest value of this infilling method is to support the human-intelligent interaction of Artificial Intelligence Generated Content (AIGC).
        Some typical scenarios are shown in Figure 3.

Interactive generation by Pop-diffuseq.

Figure 3. Interactive generation by Pop-diffuseq.

Piano and chord conditioned melody generation samples (accompaniment to melody)

        Generate melody based on a given piano track.

Data set: Pop909.

Samples Input AI Generation Human Ground truth
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7

Prompts and chord conditioned accompaniment generation samples (prompts and melody to accompaniment)

        Generate band accompaniment from specified instruments and tempo prompts.

Data set: PopBand3k

Samples Input prompts Input melody AI generation
Sample 1   Guitar, bass, ensemble, drum, percussive, Tempo-Andante  
Sample 2   Ethnic, bass, guitar, drum, synth-pad, Tempo-Andante  
Sample 3   Brass, bass, synth-pad, drum, ensemble, Tempo-Moderato  
Sample 4   Strings, bass, guitar, piano, Tempo-Andante  
Sample 5   Guitar, bass, piano, drum, organ, Tempo-Allegro  

Data set: LMD   Masking strategy: high yield

Samples Input prompts Input melody AI generation
Sample 1   Electric guitar, bass, drum, Tempo-Moderato  
Sample 2   Brass, bass, drum, percussive, Tempo-Andante  
Sample 3   Acoustic guitar, bass, drum, piano, Tempo-Allegro  
Sample 4   Ensemble, synth pad, percussive, Tempo-Moderato  
Sample 5   Ensemble, bass, drum, Tempo-Andante  
Sample 6   Acoustic guitar, electric guitar, bass, percussive, Tempo-Andante  
Sample 7   Brass, bass, percussive, Tempo-Allegro  
Sample 8   electric guitar, bass, percussive, Tempo-Allegro  

Contrast

Samples Input melody POPMAG Our Pop-Diffuseq
Sample 1  
Sample 2  
Sample 3  

Interactive generation

Original music Fill-in drum
Original music Original guitar and bass AI modified guitar and bass Modified music

Dataset and preprocessing

    As stated in the paper, We contribute a pop band MIDI dataset MidiPopBand3k and a modified LMD-matched(LMD) dataset, both of which were used for experiments.
    MidiPopBand3k is a well-chosen collection of 1000 pop MIDI music carefully selected by our musicians. All 1K raw MIDI files in the dataset are obtained from free Internet sources. We then split the music into shorter pieces, resulting in 3K MIDI files. Although the length of music in small clips is shortened, it is sufficient for various music AI tasks. The music collection is carefully selected based on a series of criteria:
    (1) Pop band music. The music must include a vocal melody track and at least three different instrumental accompaniments. In other words, music has at least four parts.
    (2) The entire music must be long enough to be divided into three segments, each of at least 500 notes.
    (3) The category of musical instrument are distinct enough to be identified in all tracks.
    (4) Musical styles range from classic pop, rock, jazz and country etc. Try to avoid rap and electronic music as they are not easy to study together. Oriental music (including Chinese pop, J-pop, K-pop and a small amount of Indian pop) accounts for a larger proportion.

    Our musicians manually annotated the instrument category on each track and proofread the tempo of all the music. MidiPopBand3k is suitable for various AI tasks, including music generation, understanding and recommendation. We make MidiPopBand3k freely available to help research in the field of music artificial intelligence. The design of instrument categories and tempo will be introduced in subsequent chapters.

    LMD-matched is a POP subset of Lakh MIDI dataset(colinraffel.com/projects/lmd), the subset have been matched to entries in the Million Song Dataset. The style of LMD-aligned include folk pop, metal pop, Rock and roll, jazz pop, traditional pop etc., most music belong to western pop music.
    The raw MIDI files are very coarse, have much of mistake in music content, such as no melody, superposition of notes, unaligned beat, empty track or merged track, no instrument label etc. A poor data set will seriously affect the training quality of the model. Some of other AI music works not released the processed midi samples or artificially trim the MIDI data, however, manually processing MIDI content on big data is impossible. For general AI tasks, We contribute a series of algorithms for automatically preprocessing and trimming MIDI content based on music theory. These implementation tools are suitable for most types of music and can easily expand big data.
    Following the same screening criteria and processing procedures as MidiPopBand3k, we finally filtered out a modified LMD dataset of 14K+ MIDI files.

Model configurations

        In order to fairly demonstrate the contributions of the representation, attention algorithm and framework proposed in this paper, the bidirectional encoder structures of our Pop-Diffuseq and Diffuseq comparison group both adopt the benchmark configuration ’bert-base-uncased’ of Hugging-face. Moreover, the hyperparameters and training data are also required to be exactly the same. We set the length of a sequence to 512, which contains 125 notes, 7 prompts and 5 special symbols. 4 sequences are spliced into one piece of music. The hidden size of encoder layers are 768, which is assigned to 12 attention heads. Four output linear layers are shared parameters(actually reuse one dense network in program coding). We employ many accelerated approaches for both the training and sampling stages, such as the fast ODE solver(DPM-solver) for diffusion model, FP16 for GPU acceleration and AI platform Pytorch2.0. The diffusion steps in training stage are 2000, 10 steps in decoding stage. Sqrt noise schedule are used in training stage. We train Pop-Diffuseq model and comparison group on 4 NVIDIA 4090 GPUs, with batch size of 40 sequence. It takes 40k steps for training until convergence.

Supplement for music representation

Instrument

    We adopt all 16 instrument categories according to the MIDI protocol, then divide the guitar into acoustic guitar and electric guitar, and finally the vocal melody track, a total of 18 categories.
All existing musical instruments and sounds are classified into these 18 categories:

Instrument Category     Timbre Cases (program number)     Instrument Category     Timbre Cases (program number)      
Melody grand piano(0) Chord progression grand piano(0)
Piano grand piano(0), Honky-tonk piano(3) Chromatic percussion marimba(12), xylophone(13)
Organ accordion(21), church organ(19) Bass acoustic bass(32), electric bass(33)
Acoustic guitar nylon(24), steel(25) Electric guitar jazz(26), overdriven(29)
Strings violin(40), string ensemble(48) Ensemble synth strings(51), string ensemble(48)
Brass Trumpet(56), brass section(61) Reed bassoon(70), clarinet(71)
Pipe flute(73), piccolo(72) Lead square(80)
pad choir(91) Effects echoes (102)
Ethnic sitar(104), shamisen(106) Percussion synth drum(118), taiko(116)
Drum drum kit(0, is_drum=True) Unknow grand piano(0)

    There are significant differences in the note distribution of multiple instruments. We recommend proactively telling the model which instruments need to be created, and during the training phase, a strong relationship between the instrument label and note events must be established.
    In our experiments with the official SymphonyNet implement, asking the model to grasp note distribution without condition in a multi-instrument dataset was awkward(such the characteristic of strings, ensemble and lead was ambiguous), and using the network to reclassify instruments on the generated tracks produced secondary deviation.
    Therefore, according to our extensive experiments, it is necessary to make the instrument labels appear repeatedly in the representation sequence during the model training process, and preferably as input conditions during the inference process.

Tempo

        Simply put, tempo determines how quickly the score is played, it is a significant global mark on score. The detailed explanation can be found at https://en.wikipedia.org/wiki/Tempo
        Tempo affects the distribution of note onset times and durations, thus having a powerful influence on the emotion and mood. Therefore, we recommend using tempo as input condition for generation rather than letting the AI model choose on its own. We divide the Tempo into 5 types according to bpm range:

Tempo Type     BPM Range
largo <66
andante 66-108
moderato 108-126
allegro 126-168
presto >168

Chord

        In Pop-Diffuseq, Chord progression is the timeline of entire musical scores. Chords include 12 root notes and 9 types:
        Major triad, Minor triad, Augmented triad, Diminished triad, Dominant seventh, Major seventh, Minor seventh, Half-diminished seventh.

Representation demo

        A multi-instrument score is shown in Figure 4, and its corresponding representation sequence is below the score.

score demo

Figure 4. Music score.

        Music content sequence Wm = [[chord, time0, chord-0-, reserve], [melody, time0, C5, dur2], [melody, time4, C5, dur2], [melody, time8, G5, dur2], [melody, time12, G5, dur2], [bass, time0, C4, dur8], [strings, time0, E4, dur2], [strings, time4, G4, dur2], [strings, time8, E5, dur2], [strings, time12, C5, dur2],
[chord, time16, chord-2-m, reserve], [melody, time0, A5, dur2], [melody, time4, A5, dur2], [melody, time8, G5, dur4], [melody, time16, F5, dur2], [melody, time20, F5, dur2], [melody, time24, E5, dur2], [melody, time28, E5, dur2], [bass, time0, F4, dur4], [bass, time8, C4, dur4], [bass, time16, F4, dur4], [bass, time24, C4, dur4], [strings, time0, D5, dur2], [strings, time4, D5, dur2], [strings, time8, C5, dur4], [strings, time16, D5, dur3], [strings, time22, D5, dur1], [strings, time24, C5, dur3], [strings, time30, C5, dur1],
[chord, time32, chord-3-m7, reserve], [melody, time0, D5, dur2], [melody, time4, D5, dur2], [bass, time0, G4, dur4], [strings, time0, D5, dur2], [strings, time4, E5, dur2],
[chord, time8, chord-0-, reserve], [melody, time0, C5, dur4], [bass, time0, C4, dur4], [bass, time0, E4, dur4], [bass, time0, G4, dur4], [strings, time0, C5, dur4]]

Supplement for Loss function

input embedding and output module

        The definitions of functions and parameters are the same as those of the classic text diffusion model(Diffuseq, Difformer). The parameters x and y are common variables, they can be different things in different build tasks. In the controllable filling and accompaniment generation task, the x and y are c (condition) and m (music), respectively, in no particular order.

Input content embedding and output module

         Some notable differences between our model and GPT-like textual models are the contextual (music content) embeddings and the output module.
         As shown in Figure 5 (left), Embedding is used to convert integer event tokens into floating-point embedding vectors. The size of 3D embedding tenser for a piece of music is [𝑖, 𝑗, 𝑑], where 𝑖, 𝑗, 𝑑 are the length of notes plus prompts, attributes and embedding vector.
        At the end of framework, we design an output module that combines multiple linear layers and classification masks for structurally aligned tensors. This module is used to map the encoder hidden state to a certain attribute category absolutely correctly. For example, the hidden vector belonging to the pitch-dimension(third column) in the tensor must be mapped to one of 72 categories of pitch, and not be classified into duration or other categories.
        The output module is shown in figure 5 (right). Linear layer (colored trapezoids utilize a fully connected network to convert a hidden vector into an probabilities sequence of the same length as the dictionary(vocab size), which reflects the probability scores of the tokens. We divide hidden state tensor into 4 blocks by column(attributes), and arrange 4 linear layers to predict their respective probability scores. The cross-modal information is concatenated with the first block (instrument attributes) and fed into the first linear layer. Obtained probabilities is then multiplied bitwise by a 0-1 masking sequence 𝑀𝑎𝑠𝑘 𝑗 . Tokens with probability scores set to 0 will not be selected. Finally, the predictions are concatenated based on the grammatical structure of the representation.

input embedding and output module

Figure 5. Input content embedding method and output module.

Objective evaluation supplement

        (A) Model perplexity (PPL). We calculate token-wise PPL on the Validation set (different with train set). The lower the PPL index, the better. In all the Diffuseq-based comparison sets, we employ the same hyperparameter(layers, heads number of attention, embedding shape, etc.) in different model settings. The total stage of PPL curves are present in Figure 6. Compared to the baseline, our Pop-Diffuseq (green curve) converges faster, has lower PPL in the stable phase, and overfits later. The PPL curves of all settings in the stable phase are shown in the Figure 7. Pop-Diffuseq maintains the steady(variance<0.237) lowest value and is 55.89% lower than the average value of the baseline DiffuSeq (purple curve) over 8 valid recording points. On the contrary, DiffuSeq is unstable(variance>13 in converge phase) and fluctuates greatly throughout the training process. The DiffuSeq with la-attention(red curve) are lower than baseline DiffuSeq, indicating that la-attention can improve the fitting of model but is still unstable. The multi-head output module and modified objective function are helpful for the Pop-Diffuseq. In fact, we first reproduced the decoder-based SymphonyNet with linear transformer (same layers, heads and embedding shape, etc.) and MMR method, but their Loss and PPL metrics could not converge to a suitable interval in accompaniment task.

ppl01

Figure 6. PPL curve for total stage. The lower the PPL index, the better.

ppl02

Figure 7. PPL curves of all settings in the stable phase. The lower the PPL index, the better.

        (B) Objective evaluation for symbolic music. Figure 8-17 shows the pitch class transition matrix (PCTM), pitch classes histogram, average duration and onset interval for generated and reference instruments on test data set. Reference instrument means the human created music in test data set. The result shows that our POP-diffuseq correctly produces musical content under the requirements of instrumental prompts, and the generated content has both the characteristics of instruments and the richness of AI.

objective_eval_01

Figure 8. Objective evaluation for generated and reference guitar.
The generated guitar are present by our model.


objective_eval_01

Figure 9. Objective evaluation for generated and reference guitar.
The generated guitar are present by AMT model.


objective_eval_02

Figure 10. Objective evaluation for generated and reference piano.
The generated piano are present by Our model.


objective_eval_02

Figure 11. Objective evaluation for generated and reference piano.
The generated piano are present by AMT model.


        Figures 8 to 11 show two groups of comparative tests. Two models are compared, one is our model (Pop-diffuseq) and the other is the latest SOTA model AMT which is published in the Transactions on Machine Learning Research, 2024. They are both trained and evaluated on the LMD dataset. The generated and reference samples of AMT are obtained by the official Colab Notebook: https://johnthickstun.com/anticipation/
        Figures 8 and 9 are a pair of guitar tests. Figure 10 and 11 are piano tests. Regarding PTCM, Pop-diffuseq is competent for more complex content, and the images of generated PTCM are richer for AMT, indicating the richness of our model. Both models can learn the characteristics of the instrument because the shape of the generated PTCMs are equivalent to the references. For pitch classes histogram, average duration and onset interval, the more similar the generated histogram is to the reference, the better. The Pop-diffuseq model handles the characteristics of musical instruments better.

objective_eval_03

Figure 12. Objective evaluation for generated and reference strings.


objective_eval_04

Figure 13. Objective evaluation for generated and reference organ.


objective_eval_05

Figure 14. Objective evaluation for generated and reference ensemble.


objective_eval_06

Figure 15. Objective evaluation for generated and reference reed.


objective_eval_07

Figure 16. The pitch class transition matrix for 3 sets of bass.


objective_eval_08

Figure 17. The pitch classes histogram, average duration and onset interval for 3 sets of bass.

        Figures 12 to 17 present various samples of musical instruments created by our model. They are randomly selected, not hand-picked, and you can reproduce them using our public code and platform. These images show that our model can well mimic the characteristics of human-composed musical scores, which is of great value for commercial applications.

In the end

        Thank you so much for reading this, and thank you for your time and patience!
        Wishing you all the best in 2025!
        Happy New Year!