yesterday I failed at making some drone music, just didn't turn out any good.
today I failed at training a neural network (trying to eventually do deep dream stuff for audio with a music vs speech discriminator; currently testing with sine wave vs white noise). it just wouldn't work. possibly I had some matrix or other transposed, or my architecture was fatally flawed, or something else. the vanishing gradient problem bit me as part of this: output was effectively random as the earlier layers wouldn't get going.
wondering what I can fail at next week...
Process: for each overlapped windowed segment of audio input, compute energy per octave via Haar wavelet transform. Accumulate count, sum, sum of squared values. At the end of the input, output statistics: mean and standard deviation for each octave (normalized to mean 1). Compare output for different input. Think about how to make a classifier based on the output. Think hard about how to make this process differentiable and propagate discrimination results back to changes in the input.
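A rough Python sketch of the analysis step, for illustration only. The names haar_bands, octave_rms and OctaveStats are made up here, the block length is assumed to be a power of two, and "normalized to mean 1" is read as scaling so that the octave means average to 1, which is a guess at the intent.

```python
import math

def haar_bands(block):
    """Orthonormal Haar transform of a power-of-two-length block.

    Returns the detail bands from the highest octave down, DC term last.
    """
    s = math.sqrt(0.5)
    bands, approx = [], list(block)
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        bands.append([(a - b) * s for a, b in zip(evens, odds)])  # detail for this octave
        approx = [(a + b) * s for a, b in zip(evens, odds)]       # smoothed half-rate signal
    bands.append(approx)  # single remaining value: the DC term
    return bands

def octave_rms(block):
    """RMS of each Haar band of one (already windowed) block."""
    return [math.sqrt(sum(x * x for x in band) / len(band)) for band in haar_bands(block)]

class OctaveStats:
    """Accumulates count, sum and sum of squares per octave."""

    def __init__(self, n_octaves):
        self.count = 0
        self.total = [0.0] * n_octaves
        self.total_sq = [0.0] * n_octaves

    def add(self, energies):
        self.count += 1
        for i, e in enumerate(energies):
            self.total[i] += e
            self.total_sq[i] += e * e

    def summary(self):
        means = [t / self.count for t in self.total]
        stds = [math.sqrt(max(q / self.count - m * m, 0.0))
                for q, m in zip(self.total_sq, means)]
        # Assumption: scale so the octave means average to 1 (silent input is left unscaled).
        overall = sum(means) / len(means) or 1.0
        return [m / overall for m in means], [s / overall for s in stds]
```

A block of 1024 samples gives 10 detail bands plus the DC term, so 11 values per block.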
So far I've implemented the timbre stamp algorithm:
c <- haar(control-input)
n <- haar(noise-input)
e <- calculate-energy-per-octave(c)
o <- amplify-octaves-by(n, e)
output <- unhaar(o)
(operating on windowed overlapped chunks)
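The same steps as a minimal Python sketch for a single block. It assumes power-of-two block lengths and uses RMS as the per-octave energy measure. The function names mirror the pseudocode but are otherwise invented for this example, and windowing and overlap are left out.

```python
import math

S = math.sqrt(0.5)

def haar(block):
    """Orthonormal Haar transform: detail bands from the top octave down, DC last."""
    bands, approx = [], list(block)
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        bands.append([(a - b) * S for a, b in zip(evens, odds)])
        approx = [(a + b) * S for a, b in zip(evens, odds)]
    bands.append(approx)
    return bands

def unhaar(bands):
    """Inverse of haar(): rebuild the block from its bands."""
    approx = list(bands[-1])
    for detail in reversed(bands[:-1]):
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt.append((a + d) * S)  # even sample
            rebuilt.append((a - d) * S)  # odd sample
        approx = rebuilt
    return approx

def energy_per_octave(bands):
    """RMS of each band (assumed energy measure)."""
    return [math.sqrt(sum(x * x for x in band) / len(band)) for band in bands]

def timbre_stamp(control_block, noise_block):
    c = haar(control_block)
    n = haar(noise_block)
    e = energy_per_octave(c)
    # Amplify each octave of the noise by the control's energy in that octave.
    o = [[x * gain for x in band] for band, gain in zip(n, e)]
    return unhaar(o)
```

Silent control input gives silent output whatever the noise is, since every gain is zero.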
The attached audio has a segment of The Archers (BBC Radio 4 serial) as control input, with white noise as noise input. The output is normalized afterwards, otherwise it is very quiet (I suspect because the white noise has little energy in the lower octaves to start with).
Starting from the Energy Per Octave Per Rhythm table, I tried synthesizing speech-like noise by applying the template to white noise. But this didn't work at all well as the white noise had no rhythmic content to speak of, so amplifying it didn't do much (0 * gain = 0).
Feeding back the output to the input, so the noise becomes progressively more rhythmic, worked a lot better - takes a couple of minutes to escape from silence, and then there are about 5 sweet minutes until it goes all choppy with very loud peaks separated by silences. I tested with the feedback delay synchronous to the analysis windows, trying a desynchronized delay next.
Feedback process was too hard to control, so I took a different approach: normalizing the non-DC part of the energy table (by RMS) gives good results in one pass. I suspect the reason it doesn't sound very much like speech is because there is no linkage between the different octaves, they each do their own thing independently.
Getting closer to what I want. This time I used a self-organizing map (8x8 cells x11 octaves) to cluster the snippets of audio, then made a 1st-order Markov chain out of the SOM (so an 8x8x8x8 array of weights, the first 8x8 is the past and the last 8x8 is the future).
The attached audio is made by running the Markov chain at random, applying the SOM weights to white noise at each time step.
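A sketch of the Markov chain part in Python, with the 8x8 grid flattened to 64 state indices so the 8x8x8x8 weight array becomes a 64x64 table of transition counts. The names build_chain and run_chain are hypothetical, and jumping to a random cell from a state with no recorded successors is an assumption, not something taken from the original code.

```python
import random

def build_chain(bmu_sequence, n_states):
    """Count transitions between consecutive best-matching-unit indices."""
    counts = [[0] * n_states for _ in range(n_states)]
    for past, future in zip(bmu_sequence, bmu_sequence[1:]):
        counts[past][future] += 1
    return counts

def run_chain(counts, start, steps, rng):
    """Random walk through the chain, returning the visited state indices."""
    state, path = start, [start]
    for _ in range(steps):
        row = counts[state]
        if sum(row) == 0:
            # Assumption: a state with no recorded successors jumps anywhere.
            state = rng.randrange(len(row))
        else:
            state = rng.choices(range(len(row)), weights=row)[0]
        path.append(state)
    return path

# Example: a 64-state chain (8x8 SOM cells) built from a toy sequence.
chain = build_chain([0, 9, 18, 9, 0, 9], 64)
walk = run_chain(chain, 0, 10, random.Random(1))
```

Each index in the walk would then select one SOM cell, whose 11 per-octave weights are applied to a block of white noise.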
Audio block size is 1024, overlap factor is 16, raised cosine window function. Each windowed block is converted to octaves via Haar wavelet transform, then each octave is either analysed for RMS energy (analysis pass) or amplified by a factor (synthesis pass). In the synthesis pass the amplified octaves are transformed back to audio, windowed again and overlapped/added with the other blocks.
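The windowing and overlap-add can be sketched in Python like this. The process callback stands in for the per-block Haar analysis or synthesis, and the final scaling by 1 / (0.375 * overlap) is an addition for this example: it makes an unmodified signal come back at unit gain once it has been windowed twice with a raised cosine. It holds for overlap factors of 3 or more that divide the block size.

```python
import math

def raised_cosine(n):
    """Periodic raised cosine (Hann) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def overlap_add(signal, block_size, overlap, process):
    """Window each block, process it, window again, and sum the overlapped results."""
    hop = block_size // overlap
    window = raised_cosine(block_size)
    scale = 1.0 / (0.375 * overlap)  # sum of squared windows across the overlap
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - block_size + 1, hop):
        block = [signal[start + i] * window[i] for i in range(block_size)]
        block = process(block)
        for i in range(block_size):
            out[start + i] += block[i] * window[i] * scale
    return out

# With the sizes from the post: block size 1024, overlap 16, so the hop is 64 samples.
# out = overlap_add(samples, 1024, 16, lambda block: block)
```

The first and last block_size samples are covered by fewer blocks, so they fade in and out.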
The synthesis pass generates 5mins of audio in less than 5 seconds on one of my desktop cores, so looking promising for porting to Emscripten to run in the browser on low-power mobile devices or wherever.
Switched from Haar wavelets for energy per octave (11 bins), to Discrete Fourier Transform (via the fftw3 library) for energy spectrum (513 bins). Overlap factor 16, raised cosine window.
Enlarged the self-organizing map from 8x8 to 16x16, using Earth-Mover's Distance instead of Euclidean Distance when choosing the best matching unit to update the SOM.
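For one-dimensional spectra, Earth-Mover's Distance reduces to summing the absolute differences of the cumulative distributions. A minimal Python sketch of best-matching-unit selection on that basis follows. It assumes each spectrum is non-negative with a non-zero sum and is normalized to unit total mass before comparison, which may not match the original code. emd_1d and best_matching_unit are names invented for this example.

```python
def emd_1d(a, b):
    """Earth-Mover's Distance between two 1-D spectra, each normalized to unit mass."""
    total_a, total_b = sum(a), sum(b)
    carried = 0.0   # mass that still has to be moved past the current bin
    distance = 0.0
    for x, y in zip(a, b):
        carried += x / total_a - y / total_b
        distance += abs(carried)
    return distance

def best_matching_unit(som_weights, sample):
    """Index of the SOM cell whose weight vector is closest to the sample."""
    return min(range(len(som_weights)), key=lambda i: emd_1d(som_weights[i], sample))
```

Unlike Euclidean distance, this takes account of how far apart the bins are: a peak in bin 0 is at distance 1 from a peak in bin 1 and distance 2 from a peak in bin 2, whereas the Euclidean distance is the same in both cases.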
SOM weights are initialized via Cholesky decomposition of the covariance matrix to generate correlated Gaussian random variates (as before). Using GNU GSL to do the linear algebra and pseudo random number generation.
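The idea behind the initialization, as a pure-Python stand-in (this is not the GSL API, and the function names are made up): factor the covariance matrix as L times its transpose, then multiply L by a vector of independent standard normal variates and add the mean. The covariance matrix is assumed to be symmetric positive definite.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov symmetric positive definite)."""
    n = len(cov)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(cov[i][i] - partial)
            else:
                lower[i][j] = (cov[i][j] - partial) / lower[j][j]
    return lower

def correlated_gaussian(mean, lower, rng):
    """One sample with the given mean and covariance lower * lower^T."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]  # independent standard normals
    return [m + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]

# Example: one initial weight vector for a 2-bin toy spectrum.
L = cholesky([[4.0, 2.0], [2.0, 3.0]])
weights = correlated_gaussian([1.0, 1.0], L, random.Random(42))
```

Gaussian variates can be negative, so weights used as gains may need clamping. Whether the original code does that is not stated.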
Still using 1st-order Markov chain for the resynthesis.
Analysis pass takes 16mins per hour of input audio, single threaded. Thinking about parallelism as that's a long wait when experimenting.
Synthesis pass is very quick, less than a second per minute of output audio.