yesterday I failed at making some drone music, just didn't turn out any good.
today I failed at training a neural network (trying to eventually do deep dream stuff for audio with a music vs speech discriminator - currently testing with sine wave vs white noise), just wouldn't work. possibly I had some matrix or other transposed or my architecture was fatally flawed or something else. the vanishing gradient problem bit me as part of this, output was effectively random as the earlier layers wouldn't get going.
wondering what I can fail at next week...
Process: for each overlapped windowed segment of audio input, compute energy/octave via Haar wavelet transform. Accumulate count, sum, sum of squared values. At the end of the input, ouput statistics: mean and standard deviation for each octave (normalized to mean 1). Compare output for different input. Think about how to make a classifier based on the output. Think hard about how to make this process differentiable and propagate discrimination results back to changes in the input.
So far I've implemented the timbre stamp algorithm:
c <- haar(control-input)
n <- haar(noise-input)
e <- calculate-energy-per-octave(c)
o <- amplify-octaves-by(n, e)
output <- unhaar(o)
(operating on windowed overlapped chunks)
Attached has a segment of The Archers (BBC Radio 4 serial) as control input, with white noise as noise input. The output is normalized afterwards, otherwise it is very quiet (I suspect because the white noise has little energy in the lower octaves to start with).
Starting from the Energy Per Octave Per Rhythm table, I tried synthesizing speech-like noise by applying the template to white noise. But this didn't work at all well as the white noise had no rhythmic content to speak of, so amplifying it didn't do much (0 * gain = 0).
Feeding back the output to the input, so the noise becomes progressively more rhythmic, worked a lot better - takes a couple of minutes to escape from silence, and then there are about 5 sweet minutes until it goes all choppy with very loud peaks separated by silences. I tested with the feedback delay synchronous to the analysis windows, trying a desynchronized delay next.
Feedback process was too hard to control, so I took a different approach: normalizing the non-DC part of the energy table (by RMS) gives good results in one pass. I suspect the reason it doesn't sound very much like speech is because there is no linkage between the different octaves, they each do their own thing independently.
modulating a drone (my Harmonic Protocol thingy) instead of white noise
Switched from Haar wavelets for energy per octave (11 bins), to Discrete Fourier Transform (via the fftw3 library) for energy spectrum (513 bins). Overlap factor 16, raised cosine window.
Enlarged the self-organizing map from 8x8 to 16x16, using Earth-Mover's Distance instead of Euclidean Distance when chosing the best matching unit to update the SOM.
Initial SOM weights initialized via Cholesky decomposition of covariance matrix to generate correlated Gaussian random variates (as before). Using GNU GSL to do the linear algebra and pseudo random number generation.
Still using 1st-order Markov chain for the resynthesis.
Analysis pass takes 16mins per hour of input audio, single threaded. Thinking about parallelism as that's a long wait when experimenting.
Synthesis pass is very quick, less than a second per minute of output audio.
Hometown is adapted from Mastodon, a decentralized social network with no ads, no corporate surveillance, and ethical design.