yesterday I failed at making some drone music, just didn't turn out any good.

today I failed at training a neural network (trying to eventually do deep dream stuff for audio with a music vs speech discriminator - currently testing with sine wave vs white noise), just wouldn't work. possibly I had some matrix or other transposed or my architecture was fatally flawed or something else. the vanishing gradient problem bit me as part of this, output was effectively random as the earlier layers wouldn't get going.

wondering what I can fail at next week...

Energy per Octave per Rhythm via repeated Haar wavelet transform. Raised cosine window for energy per octave, then rectangular window for (energy per octave) per octave (rhythmically). Rhythm 0 is at DC, i.e. the average over all time.
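A minimal sketch of the first stage (energy per octave of one block), assuming an orthonormal Haar transform and a power-of-two block length. The function name `haar_octave_energies` is made up for illustration, and windowing is left out. Running the same transform again over the time series of each octave's energy would give the per-rhythm stage.

```python
import math

def haar_octave_energies(block):
    """RMS energy per octave of one block, finest octave first, DC last."""
    n = len(block)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    approx = list(block)
    energies = []
    s = math.sqrt(0.5)  # orthonormal Haar scaling (an assumption)
    while len(approx) > 1:
        sums, diffs = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            sums.append((a + b) * s)   # coarser approximation
            diffs.append((a - b) * s)  # detail coefficients of this octave
        energies.append(math.sqrt(sum(d * d for d in diffs) / len(diffs)))
        approx = sums
    energies.append(abs(approx[0]))  # the single remaining value is the DC term
    return energies
```

A block of 1024 samples gives 10 detail octaves plus DC, so 11 values.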

Getting closer to what I want. This time I used a self-organizing map (8x8 cells x11 octaves) to cluster the snippets of audio, then made a 1st-order Markov chain out of the SOM (so an 8x8x8x8 array of weights, the first 8x8 is the past and the last 8x8 is the future).
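A rough sketch of building the 1st-order Markov chain, assuming each audio snippet has already been assigned to its SOM cell. For brevity the 8x8 grid is flattened to 64 cell indices, so the 8x8x8x8 array becomes a 64x64 table. The name `markov_from_sequence` and the row normalization are assumptions, not taken from the original code.

```python
def markov_from_sequence(cells, n_cells):
    """Transition table: weights[past][future] from a sequence of cell indices."""
    weights = [[0.0] * n_cells for _ in range(n_cells)]
    # count each observed (past, future) pair of consecutive snippets
    for past, future in zip(cells, cells[1:]):
        weights[past][future] += 1.0
    # normalize each row to probabilities; rows never visited stay all zero
    for row in weights:
        total = sum(row)
        if total > 0.0:
            for j in range(n_cells):
                row[j] /= total
    return weights
```

For example, `markov_from_sequence([0, 1, 0, 1, 2], 3)` sends cell 0 always to cell 1, and cell 1 to cells 0 and 2 with equal weight.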

The attached audio is made by running the Markov chain at random, applying the SOM weights to white noise at each time step.
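Running the chain at random could look something like the sketch below, assuming a row-normalized transition table indexed as `transitions[past][future]`. The function name `run_chain` and the handling of dead ends (jump to a uniformly random cell) are assumptions. At each step the weight vector of the visited SOM cell would then serve as the per-octave gains applied to white noise.

```python
import random

def run_chain(transitions, start, steps, rng):
    """Random walk over SOM cells; returns the list of visited cell indices."""
    state = start
    path = [state]
    for _ in range(steps):
        row = transitions[state]
        if sum(row) <= 0.0:
            # dead end (cell never seen as a 'past'): assumption, restart anywhere
            state = rng.randrange(len(row))
        else:
            state = rng.choices(range(len(row)), weights=row)[0]
        path.append(state)
    return path

# usage: a seeded generator makes the walk repeatable
path = run_chain([[0.0, 1.0], [1.0, 0.0]], 0, 5, random.Random(1))
```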

Audio block size is 1024, overlap factor is 16, raised cosine window function. Each windowed block is converted to octaves via Haar wavelet transform, then each octave is either analysed for RMS energy (analysis pass) or amplified by a factor (synthesis pass). In the synthesis pass the amplified octaves are transformed back to audio, windowed again and overlapped/added with the other blocks.
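The windowing and overlap/add can be sketched as below, with the per-block processing (Haar transform, gains, inverse transform) abstracted into a hypothetical `process` callback. The normalization constant is an assumption: with a periodic raised cosine window applied twice and an overlap factor of 3 or more, the squared windows sum to `overlap * 0.375`, so dividing by that gives unity gain away from the edges.

```python
import math

def hann(n):
    """Periodic raised cosine window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def overlap_add(signal, block=1024, overlap=16, process=lambda frame: frame):
    hop = block // overlap          # 64 samples for the sizes above
    w = hann(block)
    norm = overlap * 0.375          # sum of squared windows (assumes overlap >= 3)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - block + 1, hop):
        frame = [signal[start + i] * w[i] for i in range(block)]  # analysis window
        frame = process(frame)                                    # e.g. per-octave gains
        for i in range(block):
            out[start + i] += frame[i] * w[i] / norm              # synthesis window, add
    return out
```

With the identity `process`, a constant input comes back unchanged in the fully overlapped region, which is a quick sanity check before adding the gains.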

The synthesis pass generates 5 minutes of audio in less than 5 seconds on one of my desktop cores, so it's looking promising for porting to Emscripten to run in the browser on low-power mobile devices or wherever.

Switched from Haar wavelets for energy per octave (11 bins), to Discrete Fourier Transform (via the fftw3 library) for energy spectrum (513 bins). Overlap factor 16, raised cosine window.
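The bin count comes from the real-input DFT: a block of 1024 real samples has 1024/2 + 1 = 513 non-redundant frequency bins. A naive sketch of the energy spectrum, standing in for the fftw3 call (it is O(n²), for illustration only, and the name `energy_spectrum` is made up):

```python
import cmath

def energy_spectrum(block):
    """Squared magnitude of DFT bins 0 .. n/2 of a real-valued block."""
    n = len(block)
    bins = n // 2 + 1  # 513 when n == 1024
    spectrum = []
    for k in range(bins):
        acc = 0j
        for t, x in enumerate(block):
            acc += x * cmath.exp(-2j * cmath.pi * k * t / n)
        spectrum.append(abs(acc) ** 2)
    return spectrum
```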

Enlarged the self-organizing map from 8x8 to 16x16, using Earth-Mover's Distance instead of Euclidean Distance when choosing the best matching unit to update the SOM.
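For 1D histograms such as spectra, the Earth-Mover's Distance reduces to a running sum, as described in the Wikipedia section listed in the refs. A sketch, assuming both spectra carry the same total energy (e.g. normalized beforehand) and that the SOM is stored as a flat list of weight vectors; both function names are made up:

```python
def emd_1d(a, b):
    """Earth-Mover's Distance between two equal-mass 1D histograms."""
    carried = 0.0  # 'earth' carried over to the next bin
    total = 0.0
    for x, y in zip(a, b):
        carried += x - y
        total += abs(carried)
    return total

def best_matching_unit(som, sample):
    """Index of the SOM cell whose weight vector is closest to the sample."""
    return min(range(len(som)), key=lambda i: emd_1d(som[i], sample))
```

Unlike Euclidean distance, this takes into account how far energy has moved: `emd_1d([1, 0, 0], [0, 1, 0])` is 1 while `emd_1d([1, 0, 0], [0, 0, 1])` is 2, whereas the Euclidean distances of the two pairs are equal.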

SOM weights are initialized via Cholesky decomposition of the covariance matrix to generate correlated Gaussian random variates (as before). Using GNU GSL to do the linear algebra and pseudo-random number generation.
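The idea, sketched in plain Python rather than GSL: factor the covariance matrix as L·Lᵀ, then multiply L by a vector of independent standard Gaussians and add the mean. The function names are made up, and the covariance matrix is assumed to be symmetric positive definite.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def correlated_gaussian(mean, l, rng):
    """One random vector with the given mean and covariance L * L^T."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]  # independent standard normals
    return [m + sum(l[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]

# usage: one initial weight vector for a 2-bin toy example
l = cholesky([[4.0, 2.0], [2.0, 3.0]])
weights = correlated_gaussian([0.0, 0.0], l, random.Random(0))
```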

Still using 1st-order Markov chain for the resynthesis.

Analysis pass takes 16mins per hour of input audio, single threaded. Thinking about parallelism as that's a long wait when experimenting.

Synthesis pass is very quick, less than a second per minute of output audio.

Refs:

http://www.fftw.org/fftw3_doc/The-Halfcomplex_002dformat-DFT.html

https://en.wikipedia.org/wiki/Earth_mover's_distance#Computing_the_EMD

https://en.wikipedia.org/wiki/Self-organizing_map#Algorithm

https://en.wikipedia.org/wiki/Cholesky_decomposition#Monte_Carlo_simulation

claude@mathr@post.lurk.org

Starting from the Energy Per Octave Per Rhythm table, I tried synthesizing speech-like noise by applying the template to white noise. But this didn't work at all well as the white noise had no rhythmic content to speak of, so amplifying it didn't do much (0 * gain = 0).

Feeding back the output to the input, so the noise becomes progressively more rhythmic, worked a lot better - takes a couple of minutes to escape from silence, and then there are about 5 sweet minutes until it goes all choppy with very loud peaks separated by silences. I tested with the feedback delay synchronous to the analysis windows, trying a desynchronized delay next.