Skip to content
larsp.de
Go back

Archiving CMS websites to static files with httrack

Using httrack to archive a CMS website to keep only as static site.

When to Archive CMS Sites

When a website built with a content management system like Drupal or WordPress is no longer updated with content or a campaign has ended, the webpages sometimes need to be archived for reference or remain online without further changes. It’s not always feasible to upgrade all CMS components along the way. A major version change might require expensive custom module upgrades for a site no longer in production use. Archiving to static files is a practical solution.

Using the httrack tool to archive a website

There are several options for archiving websites (see Awesome Web Archiving List). The httrack command line tool is a preferred option. On macOS using Homebrew, install it with:

brew install httrack

These optimal httrack options for mirroring:

httrack http://SITE_TO_ARCHIVE -O DESTINATION_DIR \
  -N "%h%p/%n/index%[page].%t" \
  -WqQ%v --robots=0 --footer ''

The tool will prompt you if external links should be followed.

Post-Processing Steps

Relative links can be rewritten afterwards (e.g., “about.html” to “about”). This is optional but useful if you want to preserve URL paths for inbound links.

find . -name "*.html" -type f -print0 \
  | xargs -0 perl -i -pe "s/\/index.html/\//g"

Copy the homepage index from index/index.html to the site root and change include paths and links in it (remove ”../” everywhere).

If the source site uses HTTP authentication, provide username and password as part of the URL: username:password@your.url

Hosting Archived Sites

The resulting files can be served from inexpensive static web hosting like Netlify or GitHub Pages.

References


Share this post on:

Previous Post
Syncthing
Next Post
Barcode PDFs with Ruby on Rails