I need some help scraping/downloading a more complex HTML manual --> the manual of Ardour, a great audio editing FLOSS project.
manual.ardour.org/toc/
Someone I work with has an offline computer for editing.

-- wkhtmltopdf for example does not want to work.
-- Neither does html2ps

Thanks for boosting - even better, replying !

@wendy Ask @frankiezafe. I can't remember what he is using exactly, but he does that a lot.

@xuv @wendy the best of the best is httrack (http://www.httrack.com/) - if you are at hacktiris, just pass by :)

@frankiezafe @xuv
Meh, I only got the first page - it'll have to wait until a bit later - thanks! Funny piece of software, very 90s Windows look ;-)

@wendy @xuv you need to set the search "depth" (how many levels of the link tree you allow it to visit + whether you let it follow external links)

@xuv @wendy you have to tinker with it a bit (see "option" in the 2nd panel) - I'll make you a zip once it's done :)

@xuv @wendy it got to >235 files in the time it took me to post this comment!

@frankiezafe @xuv My head is a bit fried - everything taaakes sooo muuuch tiiime

@frankiezafe @xuv what do I need to tick? I read everything, but didn't 'see' it..

@wendy @xuv phew, I just checked, httrack fetched >2.9 GB of data, the executables and everything, I'm reconfiguring and relaunching :)

@wendy @xuv it's cloning again - you need to launch ./webhttrack, create a project, add the URL (screenshot 1) and configure the download (screenshot 2)
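
For anyone who prefers a terminal to the webhttrack panels, the two settings discussed above (link depth and external links) can probably be set on the httrack command line too. This is only a sketch, not what @frankiezafe actually ran: the https:// scheme, the output folder, the depth of 6 and the filters are assumptions, and the `run` helper is made up for illustration. The options -O, -rN and -%eN are from the httrack manual; check `httrack --help` on your version.

```shell
#!/bin/sh
# Sketch: mirror the manual with httrack from the command line.
# Assumed values (not from the thread): scheme, OUT_DIR, DEPTH, filters.
URL="https://manual.ardour.org/toc/"
OUT_DIR="./ardour-manual"
DEPTH=6

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# -O      where to write the mirror
# -rN     maximum link depth to follow
# -%e0    do not follow links to external sites
# "+..."  filter: stay on the manual's host
# "-..."  filters: skip large binaries such as installers
run httrack "$URL" -O "$OUT_DIR" "-r$DEPTH" "-%e0" \
    "+manual.ardour.org/*" "-*.exe" "-*.dmg" "-*.zip"
```

As written it only prints the command; `DRY_RUN=0 sh script.sh` would start the mirror. The exclusion filters are there because of the >2.9 GB run mentioned above.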

@frankiezafe @xuv ahaaa très cryptique cette description. Mais je comprends maintenant. I was afk..

@wendy can’t tell if it’ll be any good for you, but I use

wget --recursive --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-parent --tries=5 --waitretry=1 --read-timeout=5 --timeout=10 --domains $DOMAIN $URL
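
A sketch of how $DOMAIN and $URL might be filled in for this manual. The values are assumptions, not something @piggo stated. Because --no-parent stops wget from climbing above the start directory, the sketch starts from the site root rather than /toc/, assuming the manual pages live next to /toc/ and not under it. --adjust-extension is the current name of --html-extension. The `run` helper is hypothetical and only prints the command unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: the wget mirror command above with its variables filled in.
# Assumed values (not from the thread): scheme and start URL.
DOMAIN="manual.ardour.org"
URL="https://manual.ardour.org/"

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# --recursive / --no-parent   follow links, but never above the start directory
# --page-requisites           also fetch the CSS, images and scripts pages need
# --adjust-extension          save HTML pages with an .html suffix
# --convert-links             rewrite links so the copy works offline
# --domains                   stay on the listed host(s)
run wget --recursive --page-requisites --adjust-extension --convert-links \
    --restrict-file-names=windows --no-parent \
    --tries=5 --waitretry=1 --read-timeout=5 --timeout=10 \
    --domains "$DOMAIN" "$URL"
```

wget should write the mirror into a folder named after the host, which can then be copied to the offline machine.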

@piggo woo looks funky and nice - will try it now!

@wendy Did you try using this page: manual.ardour.org/ardourmanual seems like a single html file, should work really nicely.

@rra hey, yes, it works, scraping/saving all - but I have a 654 megabyte pdf with wkhtmltopdf :unacceptable: . I can probably make it lighter - but I do not have the TOC - meh!!

and thanks, nice find..
In the meantime @frankiezafe gave a hand with httrack. 25 megabytes (more than 600 pages, this manual..)

@wendy If you need to render to PDF I have had good luck lately with weasyprint¹ but good (bad?) old chromium will also let you print PDF directly to a file (and this works from the CLI in headless mode).

1. https://weasyprint.org/
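
A sketch of what those two PDF routes could look like, pointed at the single-page version of the manual mentioned earlier in the thread. The https:// scheme and the output file names are assumptions, the Chromium binary may be called chromium-browser or google-chrome on some systems, and the `run` helper is hypothetical (it only prints the commands unless DRY_RUN=0).

```shell
#!/bin/sh
# Sketch: render the single-page manual to PDF in two ways.
# Assumed values (not from the thread): scheme and output file names.
URL="https://manual.ardour.org/ardourmanual"

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# WeasyPrint: takes an input (URL or file) and an output path.
run weasyprint "$URL" ardour-manual-weasyprint.pdf

# Chromium in headless mode: print the page straight to a PDF file.
run chromium --headless --print-to-pdf=ardour-manual-chromium.pdf "$URL"
```

Neither command is guaranteed to give a smaller file or a working table of contents; comparing the two outputs on this manual is the only way to find out.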
