Meanwhile my new code (which generalizes tens of formulas that were previously hardcoded) started out 87% slower than the hardcoded versions, and is now only 30% slower for the common cases of simple formulas with `-march=native`, and only 10-20% slower on generic `x86_64` (no architecture options). (Generic is slower than native in absolute terms, so the smaller gap there is not a win.)
Trying to figure out exactly what is slowing it down is not so easy without profiling tools. (Valgrind fails with SIGILL, so that's no use either.)
Oh dear. My benchmarks were flawed (the iteration count was too small relative to the overhead from other parts of the code). It's actually around 5x slower, which is pretty terrible...
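For illustration, here is one way to keep fixed overhead from skewing a benchmark like this: time the same workload at two iteration counts and take the slope, so any per-run constant cancels out. This is a minimal sketch in Python, not the code from these posts; `per_iteration_cost`, `naive_cost`, `run`, and the `clock` parameter are hypothetical names made up for the example.

```python
import time

def naive_cost(run, n, clock=time.perf_counter):
    """Total time divided by n: fixed overhead inflates this when n is small."""
    start = clock()
    run(n)
    return (clock() - start) / n

def per_iteration_cost(run, n_small, n_large, clock=time.perf_counter):
    """Estimate seconds per iteration of run(n) from two measurements.

    Assumes total time is roughly overhead + n * cost, so subtracting
    the two timings cancels the fixed overhead term.
    """
    def timed(n):
        start = clock()
        run(n)
        return clock() - start

    t_small = timed(n_small)
    t_large = timed(n_large)
    return (t_large - t_small) / (n_large - n_small)
```

If the two estimates disagree noticeably for a given `n`, that suggests the iteration count is too small for the naive figure to be trusted. Real timings are noisy, so in practice each measurement would be repeated and the minimum or median taken.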