larsp.de

Archiving CMS websites to static files with httrack

Using httrack to archive a CMS website to keep it online as a static site.

When to Archive CMS Sites

When a website built with a content management system such as Drupal or WordPress no longer receives new content, or the campaign it supported has ended, its pages often still need to stay online unchanged or be kept for reference. Keeping every CMS component up to date for such a site is not always feasible: a major version upgrade might require expensive custom module work for a site that is no longer in production use. Archiving the site to static files is a practical solution.

Using the httrack tool to archive a website

There are several tools for archiving websites (see the Awesome Web Archiving List). The httrack command-line tool is the one used here. On macOS with Homebrew, install it with:

brew install httrack

These are good options for mirroring:

httrack http://SITE_TO_ARCHIVE -O DESTINATION_DIR \
  -N "%h%p/%n/index%[page].%t" \
  -WQ%v --robots=0 --footer ''

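What the flags do:

-O DESTINATION_DIR sets the output directory for the mirror.
-N "%h%p/%n/index%[page].%t" defines the local file layout: host name (%h), path (%p), and file name (%n) become directories, and each page is saved as an index file inside its own directory, with the value of the page query parameter (%[page]) appended and the file extension (%t) at the end.
-W runs httrack in semi-automatic mode, in which it asks questions during the crawl.
-Q turns off the log files.
%v prints the names of downloaded files on screen.
--robots=0 tells httrack not to follow robots.txt and meta robots rules.
--footer '' leaves out the comment footer that httrack otherwise adds to each HTML file.

The tool will prompt you if external links should be followed.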

To avoid hammering the source server, throttle the crawl with --max-rate=25000 (bytes per second) or limit concurrent connections with -c2.
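
Put together, a throttled run might look like the sketch below. It wraps the command from this post in a small shell function; mirror_site is a hypothetical helper name, and the two positional arguments stand in for SITE_TO_ARCHIVE and DESTINATION_DIR.

```shell
#!/bin/sh
# Sketch: the mirroring command from this post plus both throttling options.
# mirror_site is a hypothetical helper name, not part of httrack.
mirror_site() {
  site_url=$1   # e.g. http://SITE_TO_ARCHIVE
  dest_dir=$2   # e.g. DESTINATION_DIR

  # --max-rate caps bandwidth at 25000 bytes per second,
  # -c2 limits the crawl to two simultaneous connections.
  httrack "$site_url" -O "$dest_dir" \
    -N "%h%p/%n/index%[page].%t" \
    -WQ%v --robots=0 --footer '' \
    --max-rate=25000 -c2
}
```

After sourcing the file, call it as mirror_site http://SITE_TO_ARCHIVE DESTINATION_DIR. Using both limits at once is a conservative choice; either one alone may be enough for a small site.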

What this approach can’t capture

Anything that depends on a live backend will not work in the static mirror: logged-in areas, search forms, comment submission, contact forms, and most JavaScript-driven dynamic content. Plan to either remove or replace those before going live.
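
To get a rough list of pages that need attention, one option is to search the mirror for form tags. This is only a sketch: list_backend_forms is a hypothetical helper name, and a plain text search will not find forms or widgets that are injected by JavaScript.

```shell
#!/bin/sh
# Sketch: list mirrored HTML files that contain a <form> tag, since those
# pages most likely depended on the live CMS backend.
# list_backend_forms is a hypothetical helper name.
list_backend_forms() {
  mirror_dir=${1:-.}   # defaults to the current directory

  # -r recurse, -l print file names only, -i match case-insensitively
  grep -rli --include='*.html' '<form' "$mirror_dir"
}
```

Running list_backend_forms DESTINATION_DIR prints one file path per line; an empty result means no form tags were found in the static HTML.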

Post-Processing Steps

Links that point at a directory's index file can be rewritten afterwards to end at the directory itself (e.g., “about/index.html” becomes “about/”). This is optional but useful if you want to preserve URL paths for inbound links.

find . -name "*.html" -type f -print0 \
  | xargs -0 perl -i -pe 's{/index\.html}{/}g'

Because of the -N template, the homepage ends up as index/index.html rather than at the root. Move it up and strip the now-incorrect ../ prefixes from include paths and links:

cp index/index.html index.html
perl -i -pe 's/\.\.\///g' index.html
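
After moving the homepage it is worth checking that its local links still resolve. The following is a rough sketch under a few assumptions: check_homepage_links is a hypothetical helper name, attributes use double quotes, and URL-encoded characters in paths are not decoded.

```shell
#!/bin/sh
# Sketch: report local href/src targets in the root index.html that do not
# exist in the mirror. check_homepage_links is a hypothetical helper name.
check_homepage_links() {
  mirror_dir=${1:-.}

  # Extract href="..." and src="..." values, cut off at any # or ? part.
  grep -oE '(href|src)="[^"#?]+' "$mirror_dir/index.html" \
    | sed -E 's/^(href|src)="//' \
    | grep -vE '^([a-z][a-z0-9+.-]*:|//)' \
    | while IFS= read -r target; do
        # Resolve both relative and root-relative targets against the mirror.
        path="$mirror_dir/${target#/}"
        [ -e "$path" ] || printf 'missing: %s\n' "$target"
      done
}
```

The second grep drops absolute URLs such as https: or mailto: links, so only files that should be part of the mirror are checked. No output means every local target was found.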

If the source site uses HTTP Basic Auth, provide username and password as part of the URL: username:password@your.url

Hosting Archived Sites

The resulting files can be served from inexpensive static web hosting like Netlify, Cloudflare Pages, or GitHub Pages.

Alternative: wget --mirror

For very simple sites, wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com can be enough and avoids the post-processing dance. httrack tends to handle messy CMS output (Drupal, WordPress) better, but wget is worth trying first if the site is small.
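
Spelled out over several lines, the wget variant might look like the sketch below. The --wait and --limit-rate options are additions that are not part of the command above; they are included on the assumption that the same politeness toward the source server is wanted here. mirror_with_wget is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: the wget mirror command with optional throttling added.
# mirror_with_wget is a hypothetical helper name.
mirror_with_wget() {
  site_url=$1   # e.g. https://example.com

  # --wait=1 pauses one second between requests,
  # --limit-rate=25k caps the download speed at about 25 KB/s.
  wget --mirror --convert-links --adjust-extension \
    --page-requisites --no-parent \
    --wait=1 --limit-rate=25k \
    "$site_url"
}
```

Call it as mirror_with_wget https://example.com; wget writes the mirror into a directory named after the host.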
