I had trouble with one PDF. Its double-page spreads were stored as JPEGs, but `pdfimages` extracted each one twice, which doubled the file size (I guess the PDF references them with cropping or something). On top of that, the images were rotated, and they were saved as colour files even though the content is monochrome. These incantations solved it for me, albeit doing twice as much work as necessary:
pdfimages -all input.pdf tmp
exiftran -i -2 tmp-???.jpg
for i in tmp-???.jpg ; do jpegtran -grayscale -copy none -crop 877x1240+0+0 "$i" > "$i-l.jpg" ; done
for i in tmp-???.jpg ; do jpegtran -grayscale -copy none -crop 877x1240+876+0 "$i" > "$i-r.jpg" ; done
ls tmp-*.jpg-?.jpg > tmp.list
tesseract tmp.list output pdf
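One way to avoid the doubled work might be to drop the repeated extractions before rotating and cropping. This is only a sketch: `dedupe_identical` is a made-up helper name, and it assumes the duplicate files are byte-for-byte identical, which I have not checked for this PDF. It compares SHA-256 checksums and deletes any file whose content has already been seen.

```shell
# Hypothetical helper: delete files whose bytes duplicate an earlier file.
# Assumption: the repeated extractions are byte-identical (not verified).
dedupe_identical() {
    seen=""
    for f in "$@"; do
        [ -f "$f" ] || continue                  # skip unmatched globs
        sum=$(sha256sum < "$f" | cut -d' ' -f1)  # checksum of the content only
        case " $seen " in
            *" $sum "*) rm -- "$f" ;;            # same content already kept
            *) seen="$seen $sum" ;;              # first time seeing this content
        esac
    done
}

# Usage sketch, right after the pdfimages step:
# dedupe_identical tmp-???.jpg
```

Since this deletes files, it is probably worth trying it on a copy of the extracted images first.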
`jpegtran` comes from `libjpeg-progs` on Debian-based Linux distributions. I'm not sure which package provides `exiftran`; I already had it installed. It's better to use `jpegtran` than ImageMagick's `convert` or similar tools, because it transforms the JPEG without decoding and re-encoding it, and recompression is a lossy operation.