I need some help scraping/downloading a fairly complex HTML manual --> the Manual of ; a great audio editing FLOSS project.
Someone I work with has an offline computer for editing.

-- wkhtmltopdf for example does not want to work.
-- Neither does html2ps

Thanks for boosting, or even better, replying!

@wendy Ask @frankiezafe. I can't remember exactly what he uses, but he does that a lot.

@xuv @wendy the best of the best is httrack ( - if you are at hacktiris, just pass by :)

@frankiezafe @xuv
Meh, I only got the first page. It will have to wait until a bit later. Thanks! Funny piece of software, with a very 90s Windows look ;-)

@wendy @xuv you need to set the "depth" of the search (how many levels of the site tree you allow it to visit, and whether you let it follow external links)
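
For anyone doing this from the command line, the two settings mentioned above (mirror depth and external links) could look roughly like the sketch below. The URL, output directory and depth value are placeholders and not the ones used in this thread. The `run` helper is hypothetical and only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder values, not taken from the thread.
URL="https://example.org/manual/"
OUT="./manual-mirror"
DEPTH=4   # how many link levels httrack may descend

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# --depth limits how deep the mirror goes;
# --ext-depth=0 keeps httrack from following links to other sites.
run httrack "$URL" -O "$OUT" --depth="$DEPTH" --ext-depth=0
```

A depth that is too low gives only the first page or two; a depth that is too high, combined with external links, can pull in far more than the manual.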

@xuv @wendy you have to tinker with it a bit (see "option" in the 2nd panel). I'll make you a zip once it's done :)

@xuv @wendy it was already at >235 files by the time I posted this comment!

@frankiezafe @xuv My head is a bit fried, everything taaakes tiiiime

@frankiezafe @xuv what do I need to tick? I read everything, but didn't 'see' it..

@wendy @xuv phew, I just checked: httrack had fetched >2.9 GB of data, the executables and everything. I'm reconfiguring and relaunching :)

@wendy @xuv it's cloning again. You need to launch ./webhttrack, create a project, add the URL (screenshot 1) and configure the download (screenshot 2)
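
Since the screenshots are not visible here, one possible way to avoid fetching the executables is to pass exclusion scan rules to httrack. This is only a sketch: the URL, output directory and list of extensions are assumptions, and the actual configuration used in the thread is not known. The `run` helper is hypothetical and prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder values, not taken from the thread.
URL="https://example.org/manual/"
OUT="./manual-mirror"

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Scan rules: a leading "-" excludes matching URLs, a leading "+" includes them.
# The extensions below are guesses at what the large downloads were.
run httrack "$URL" -O "$OUT" "-*.exe" "-*.dmg" "-*.zip" "-*.tar.gz"
```

In webhttrack the same patterns would presumably go in the scan rules field of the options dialog.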

@frankiezafe @xuv ahaaa, that description is very cryptic. But I understand now. I was afk..

@wendy can’t tell if it’ll be any good for you, but I use

wget --recursive --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-parent --tries=5 --waitretry=1 --read-timeout=5 --timeout=10 --domains $DOMAIN $URL
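
The command above expects `$DOMAIN` and `$URL` to be set first. A small wrapper, assuming a placeholder URL, might look like the sketch below; it derives the domain from the URL and keeps the same flags. The `run` helper is hypothetical and prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder URL, not taken from the thread.
URL="https://example.org/manual/"

# Derive the host name: strip the scheme, then everything after the first slash.
DOMAIN=${URL#*://}
DOMAIN=${DOMAIN%%/*}

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# --recursive / --no-parent : follow links, but never climb above the start URL
# --page-requisites         : also fetch the CSS, images and scripts pages need
# --html-extension          : save pages with an .html suffix
#                             (newer wget names this --adjust-extension)
# --convert-links           : rewrite links so the copy works offline
# --restrict-file-names     : escape characters that Windows file systems reject
# --domains                 : stay on the listed domain
run wget --recursive --page-requisites --html-extension --convert-links \
  --restrict-file-names=windows --no-parent \
  --tries=5 --waitretry=1 --read-timeout=5 --timeout=10 \
  --domains "$DOMAIN" "$URL"
```

`--no-parent` together with `--domains` is what keeps the download limited to the manual rather than the whole site.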

@piggo woo looks funky and nice - will try it now!

@wendy Did you try using this page: seems like a single html file, should work really nicely.

@rra hey, yes, it works, scraping/saving everything, but wkhtmltopdf gives me a 654 megabyte PDF :unacceptable: . I can probably make it lighter, but I do not get the TOC. Meh!!

and thanks, nice find..
In the meantime @frankiezafe gave me a hand with httrack: 25 megabytes (this manual is more than 600 pages..)

@wendy If you need to render to PDF I have had good luck lately with weasyprint¹ but good (bad?) old chromium will also let you print PDF directly to a file (and this works from the CLI in headless mode).
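
As a rough sketch of both options, assuming a placeholder page URL and output file names: the chromium binary may be called `chromium-browser` or `google-chrome` depending on the system. The `run` helper is hypothetical and prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder page, not taken from the thread.
PAGE="https://example.org/manual/index.html"

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# weasyprint takes an input (file or URL) and an output PDF path.
run weasyprint "$PAGE" manual-weasyprint.pdf

# Headless chromium prints the page straight to a PDF file.
run chromium --headless --print-to-pdf=manual-chromium.pdf "$PAGE"
```

Comparing the size of the two PDFs on a single page first is a cheap way to see which renderer is lighter before converting the whole manual.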

