benchmarking fractals + OpenCL 

Implemented guessing in the OpenCL path. Probably renders the edge pixels incorrectly if the image size is not a multiple of 2 in both dimensions.

Benchmark:

Mandelbrot power 2
center 0+0i zoom 4
1920x1080 200000 iterations

```
(seconds)     guessing
              no      yes
CPU/native    67.7    17.4
CPU/OpenCL    86.3    22.8
GPU/OpenCL    25.8    10.3
```

I suspect the OpenCL is worse than it could be due to pointless work in the inner loop preparing for 2x2 matrix derivatives, when only 2x1 complex analytic derivatives are used, and this isn't optimized away.

Something like this but with 4 additional current+next variables for the Jacobian that are never actually updated by the formula or needed after the loop:

```
double current = input;
for (int i = 0; i < maxiters; ++i)
{
  double next = 0;
  // formula copy/pasted here
  // uses current, updates next
  current = next;
}
output = current;
```
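To illustrate why a complex analytic derivative needs less state than a 2x2 Jacobian, here is a minimal sketch for Mandelbrot power 2. It is plain (non-perturbation) iteration, not the generated OpenCL; the names `OrbitResult` and `iterate_with_derivative` are hypothetical. The derivative is a single complex number (2 reals) updated as dz = 2 z dz + 1, where a general Jacobian would carry 4 reals.

```cpp
#include <complex>

// Hypothetical result type for this sketch.
struct OrbitResult {
    int iterations;           // iterations performed before escape (or maxiters)
    std::complex<double> z;   // final orbit value
    std::complex<double> dz;  // final derivative dz/dc
};

// Iterate z -> z^2 + c, tracking the complex derivative dz -> 2 z dz + 1.
// Only one complex derivative variable is needed; a 2x2 Jacobian would need four reals.
inline OrbitResult iterate_with_derivative(std::complex<double> c, int maxiters) {
    std::complex<double> z = 0.0, dz = 0.0;
    int i = 0;
    for (; i < maxiters && std::norm(z) <= 4.0; ++i) {
        dz = 2.0 * z * dz + 1.0;  // update the derivative first, using the old z
        z = z * z + c;
    }
    return {i, z, dz};
}
```

If the compiler cannot prove the extra Jacobian variables are dead, they cost registers and copies on every iteration, which would be consistent with the slowdown suspected above.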


benchmarking fractals + OpenCL 

- I haven't yet ported glitch analysis to OpenCL, so for each reference it copies all the raw iteration data to the device and back again, instead of keeping it on the device until rendering is complete and transferring only the new reference information. As it stands, this will be very slow for large images with lots of references

- OpenCL does not support x87 long double (used for larger exponent range, rather than higher precision), so zooms between ~e300 and ~e4900 need to use floatexp (a double in 0.5..1.0 with separate exponent int), which is very very much slower (the original CPU implementation for these zoom levels will almost always be way faster)

- I got crashes when running OpenCL in a background thread, even with a mutex protecting it, so at the moment it runs on the main WIN32 user interface thread which inhibits interactive responsiveness (in Wine it just doesn't update the GUI until it is done, I don't know how Microsoft Windows behaves)

- the OpenCL needs double precision (fp64) support, which is hard to get in consumer GPUs (whose fp32 performance is much greater). But fp32 exponent range is much smaller, only usable to ~e30 or so, so floatexp techniques would be needed sooner.
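A minimal sketch of the floatexp idea mentioned above (a mantissa in 0.5..1.0 plus a separate integer exponent), assuming a hypothetical `FloatExp` struct and `fe_*` helpers; this is not the actual implementation. It shows why the technique is slower than native floating point: every operation has to renormalize the mantissa and adjust the exponent separately.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical minimal floatexp: value = mantissa * 2^exponent,
// with |mantissa| in [0.5, 1) (or exactly 0), as produced by std::frexp.
struct FloatExp {
    double mantissa;
    std::int64_t exponent;
};

// Renormalize so the mantissa is back in [0.5, 1).
inline FloatExp fe_normalize(double m, std::int64_t e) {
    if (m == 0.0) return {0.0, 0};
    int shift = 0;
    double n = std::frexp(m, &shift);
    return {n, e + shift};
}

inline FloatExp fe_from_double(double x) { return fe_normalize(x, 0); }

// Multiply: mantissas multiply, exponents add, then renormalize.
inline FloatExp fe_mul(FloatExp a, FloatExp b) {
    return fe_normalize(a.mantissa * b.mantissa, a.exponent + b.exponent);
}

// Add: align the smaller operand to the larger exponent first.
inline FloatExp fe_add(FloatExp a, FloatExp b) {
    if (a.mantissa == 0.0) return b;
    if (b.mantissa == 0.0) return a;
    if (a.exponent < b.exponent) { FloatExp t = a; a = b; b = t; }
    std::int64_t d = a.exponent - b.exponent;
    if (d > 64) return a;  // b is too small to affect a double mantissa
    double bm = std::ldexp(b.mantissa, -static_cast<int>(d));
    return fe_normalize(a.mantissa + bm, a.exponent);
}
```

For example, `fe_mul` of two values near 1e-300 gives a result near 1e-600, far below what a plain double can represent, at the cost of an extra `frexp` per operation.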


benchmarking fractals + OpenCL 

Conclusions from some benchmarking (using the Buffalo power 2 fractal formula, 1920x1080, 100k iterations, unzoomed, with both the builtin/hardcoded version and a reimplementation using the hybrid formula designer):

- CPU (without OpenCL) does well when there is a high iteration count and lots of interior: guessing (if a pixel is between two neighbouring interior pixels, assume it is also interior) cuts the time taken to as little as 1/4 (the OpenCL version does not yet support guessing; added to TODO list)

- with guessing disabled, CPU/OpenCL (via POCL) is only better than regular CPU for hybrids (I think this is because POCL's load balancing is poor for irregular workloads like fractal images, so towards the end many cores are idle)

- GPU/OpenCL (via ROCm) does very well (AMD RX 580 GPU seems about 4x faster than AMD Ryzen 2700X CPU)

- overhead of hybrid (vs builtin) is 3x-4x for CPU (interpreted), falling to 1.5x-2.3x for OpenCL (compiled) (hopefully OpenCL hybrid overhead can be reduced further in future)

- overhead of derivatives (vs no derivatives) is 3x for CPU, falling to 2x (builtin) or 1.5x-2.7x (hybrid) for OpenCL

- the first time the formula is compiled for an OpenCL device gives a ~5s additional overhead, but later images are unaffected due to caching
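The guessing rule described in the list above can be sketched for a single row of pixels. This is a simplified illustration, not the real renderer: `kUnknown`, `guess_row`, and the convention that a count equal to `maxiters` marks an interior pixel are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sentinel for a pixel that has not been computed yet.
constexpr int kUnknown = -1;

// Fill in odd-indexed pixels of one row by guessing.
// `iters` holds computed iteration counts at even indices and kUnknown at odd ones;
// a count equal to `maxiters` marks an interior pixel (it never escaped).
// Returns the number of pixels guessed; any remaining kUnknown pixels
// still have to be computed the slow way.
inline std::size_t guess_row(std::vector<int>& iters, int maxiters) {
    std::size_t guessed = 0;
    for (std::size_t x = 1; x + 1 < iters.size(); x += 2) {
        if (iters[x] == kUnknown &&
            iters[x - 1] == maxiters && iters[x + 1] == maxiters) {
            iters[x] = maxiters;  // both neighbours are interior, so assume this one is too
            ++guessed;
        }
    }
    return guessed;
}
```

Each guessed pixel skips a full run to `maxiters`, which is why the saving is largest for images with a high iteration count and lots of interior. A final pixel with no right-hand neighbour is never guessed in this sketch.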

claude boosted

Does anyone here have experience teaching old age pensioners how to make audio recordings (of themselves singing) with their phones?

Trying to do a shield-friendly choir project.

Turns out I just needed to `make SIMD=0`, with no code changes necessary. Now a build from clean takes ~15mins, which is still long, but most rebuilds during development hopefully don't need to recompile all of it.


A build from clean takes ~45mins wall-clock time on my ancient laptop. A large chunk of it is in the `-DPASSA` pass of `formula.cpp`, which corresponds to "perturbation with SIMD and derivatives". Will rip out all the SIMD things now and see how much it improves.


`exrsubsample` implemented in ~30mins, mostly by copy/paste of `exrtactile`:
code.mathr.co.uk/exrtact/commi
(it crashed at my first attempt, had left some factor multipliers in the wrong places, but it's all ok now.)


I spilled ~15ml (about half of the 30ml bottle I opened for the first time today) of purple fountain pen ink on my desk. Desk cleaned up ok but my fingers are a bit stained. Luckily the spill was relatively small in area and nothing was damaged.

Pen piston cartridge is successfully filled (after washing the last remains of the previous brown ink out of the nib and cartridge with lots of warm water) and it seems to write smoothly. This is my old simple fountain pen, my newer fancy one is using green ink.

Project idea: a subsampler for EXR image files, so I can render high resolution source keyframes, subsample them all for more comfortable interactive video sequencing, then swap back to the high resolution directory for the final render.

There is an exrmaketiled tool, but I don't know if reading tiled mipmapped EXR will be as efficient. Almost certainly much more complicated. To be investigated...
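A rough sketch of what subsampling one channel could look like, assuming a box filter over `factor` x `factor` blocks of a row-major float buffer. The function name `subsample` and the requirement that the dimensions are exact multiples of `factor` are assumptions for illustration; whether the real tool should average or pick one sample per block is left open here, and no EXR file I/O is shown.

```cpp
#include <cstddef>
#include <vector>

// Box-filter subsample of one float channel stored row-major.
// Assumes width and height are exact multiples of factor (a simplification).
inline std::vector<float> subsample(const std::vector<float>& src,
                                    std::size_t width, std::size_t height,
                                    std::size_t factor) {
    const std::size_t w = width / factor, h = height / factor;
    std::vector<float> dst(w * h, 0.0f);
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            float sum = 0.0f;
            // Average the factor x factor block of source pixels.
            for (std::size_t j = 0; j < factor; ++j)
                for (std::size_t i = 0; i < factor; ++i)
                    sum += src[(y * factor + j) * width + (x * factor + i)];
            dst[y * w + x] = sum / static_cast<float>(factor * factor);
        }
    }
    return dst;
}
```

Note that the source index multiplies by `factor` and by the source `width`, while the destination index uses the reduced width `w`; mixing those up is an easy way to read out of bounds.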

A generic C++ build on Intel Core2Duo runs at about half the speed of the POCL CPU OpenCL implementation. Maybe I can drop my attempts at (which benefit strongly from non-portable -march=native, as I haven't figured out runtime CPU detection and compiling multiple versions) and punt that to the runtime compiler(s). A fallback in case of no OpenCL might still be handy though...
