Maybe it's not Linear after all, because linear systems can't add frequencies that are not already there, while matrix transformations operating across FFT bins can certainly do that.

Almost time for the start of the

some of the visuals are a bit stroboscopic at the moment

But I think this whole system I've been working on today is , so, not too much scope for interesting behaviour... e.g. with two control inputs, the output is simply a mix of two impulse responses, which isn't so amazing.

I think the power of other algorithms like neural networks comes mainly from the non-linearities.

Just realized the input X and output Y need not be equally dimensioned. So maybe I could make something that's more like a synthesizer than an audio filter...

First successful test with small P, Q. (Follows about 25 unsuccessful tests.)

Audio attached has 3 parts, the first is the target sound (Y), the second is the input sound (X), the third is the VARMA model applied to X (normalized in volume) which should hopefully sound like Y.

P = 0 (one matrix for input mapping, no real moving average here...)
Q = 1 (one matrix for auto-regressive feedback)
D = 129 (number of dimensions for FFT vectors)
T = 18103 (number of input vectors)

Computed of the feedback matrix was 0.9144670749524257, less than 1 so it's stable (good).

The peak level of the output audio was 0.148676, significantly quieter than I expected (most of my previous tests using Python statsmodels VAR had gains around 1.0e+7 or so...).

Now thinking of using software MIDI synthesizer to generate $(x,y)$ pairs: X could be a series of notes played with a piano and Y the same played with a guitar, hard-panned left and right so they are synchronized.

The instrument list is at
midi.org/specifications-old/it

is a pure feedback model, similar to poles in the pole-zero representation of z-transform filters. To add zeros to the model is quite simple:

$$y_t = \sum_{i=1}^p A_i y_{t - i} + \sum_{i=0}^q B_i x_{t-i}$$

and then add the $x$ vectors and $B$ matrices to the least squares stuff. I think the jargon for this is , for Vector Auto-Regressive Moving Average.

I worked it through and coded up most of it before realizing a : I need the $x$ input data corresponding to the $y$ output data to do the regression to estimate the $A$ and $B$ parameter matrices, while so far I've just been using WAV files as $y$ series, which means I don't have any $x$.

The stuff I've been doing can be summarized as
$$y_t = \sum_{i=1}^p A_i y_{t - i}$$
where each $y_t$ is a $D$-vector and each $A_i$ is a $D \times D$ matrix.

Given an input series of $y$ values, the $A_i$ can be calculated by minimization:

$$Y = [ y_p ; y_1; \ldots ; y_{T - 1} ] \\ X = [ y_{p-1}, y_{p-2}, \ldots, y_0 ; y_p, y_{p-1}, \ldots, y_1 ; \ldots ; y_{T-2}, y_{T-3}, \ldots, y_{T-1 - p} ] \\ A = \left(X^T X\right)^{-1} X^T Y$$
where the matrices have these initial dimensions:
Y : (T - p) × D
X : (T - p) × (p × D)
A : (p × D) × D
then reshape $A$ to $p × (D × D)$

In my case all values are complex (and ^T is conjugate transpose) because each $y$ is an FFT of a block of input data - I'm using FFT size $256$ (making $D = 129 = 256/2+1$ unique bins for real input) overlapped 4x.

More research on got me digging into which is the magnitude of the largest-magnitude . This is analogous to the pole radius in regular single variable representation: if it's less than 1 all should be fine, bigger than 1 and it becomes unstable.

So I'm now normalizing all the feedback coefficient matrices by the largest spectral radius among them and a bit more, so that the new largest radius is less than 1, and it seems stable even in the presence of morphing.

The attached is heavily dynamics compressed, as it was a bit peaky otherwise.

Abandoned interpolation/morphing. Doesn't work.

Made the current working part stereo.

Ouch. seems to do hard of audio to +/-1.0 even when using float wav files? I may have to rethink my reliance on it for recording my live sets if this is confirmed by further testing.

Also nearly blew my ears off when my script made hard clipped noise instead of normalizing gain from 10,000,000 to 1. seems to work correctly in my brief tests (just using it for converting stereo to dual mono and back).

With normalization, it doesn't explode, but it gets stuck in bad-sounding tones without much change. Took over 100mins to render less than 100secs of audio, too...

Still not working properly. Brief tests shows the interpolation is probably fine (and the result doesn't blow up for $s = 0$ or $s = 1$), but as soon as $s$ is even slightly in between the filter becomes incredibly unstable and explodes, overflowing to infinity then NaN after about 0.2 seconds of output audio.

The interpolated matrices all have Frobenius norms smaller than the endpoints, but I don't know if that means anything.

It's also very slow, needing to do 5.5k matrix exponentials of 129x129 complex matrices for every second of output audio.

I think I got interpolation between feedback coefficient matrices working (thanks to SciPy, which has polar decomposition, log and exp for complex matrices) but it blew up almost immediately, so I'm re-adding the normalization code in the hopes of getting some output that's more than a brief click followed by NaNs.

Trying to make morphing drones modulated by a third signal or something, not sure what it'll turn into...

audio corresponding to previous screenshot

More experiments with vector autoregression, this time on FFTs.

Had to patch a local copy of Python statsmodels library to get it to handle complex data without going completely wrong (just changed some .T to .conj().T and hoped for the best). Seems to work ok.

This screenshot is of a model fitted to "aum" spoken by flite, timestretched in audacity to 100x its length. The model is being stimulated by impulses at ascending frequency bins, you can see that the impulse responses across the frequency bands are correlated in interesting ways. Indeed when listening to it, it sounds more complex than a simple ringing filter.

But that's not too surprising, given that the model has $129 * (129 * 8 + 1) = 133257$ complex parameters, which is about the same amount of data as 3 seconds of audio. The 8 here is the number of delayed feedback taps, the 1 is for a constant trend term, the 129 is the number of frequency bins output from rfft~ at block size 256.

There's another $129 * 129$ complex matrix of the covariance of the residuals, this can be used to simulate the whole process by generating correlated noise. But I'm not using that here.

two posts because "error cannot attach a video to a post that already contains images"

ported the vector autoregression simulation part from numpy to C so it can work in a streaming fashion. not yet hooked it up to JACK driver but that is the plan. need to add a gain control because the f32 audio files are averaging +30dB.

Python code for vector auto-regression


import pandas as pd
from statsmodels.tsa.api import VAR

input_data.index = pd.to_datetime(input_data.index, unit='ms')
count = 100000
model = VAR(input_data)
for i in range(10):
print(i + 1)
result = model.fit(i + 1)
fc = result.simulate_var(steps=count)
output_data = pd.DataFrame(fc, index=None, columns=input_data.columns)
output_data.to_csv(str(i + 1) + ".csv", header=False, index=False, sep=' ')


based on this thorough article, but I just noticed my adblocker reports 30+ items so it's probably nightmarish without protection...
machinelearningplus.com/time-s

Playing around with different input sources. This one used a 30sec sine log sweep from 20Hz to 20kHz, with a history lag size of 5. Lag 6 is a bit more choppy. Lag 8 sounds much more sweep-like, but it becomes unstable and the volume rises exponentially. Lag 7 and 9 exploded to inf.