Scraping websites with wget

There are many tools out there to download/scrape websites, e.g. curl, HTTrack, SiteSucker, DeepVacuum (which is actually a GUI wrapper for wget) and probably more.

I find wget to be one of the most usable tools for grabbing an entire website. Make sure to use the option --convert-links, which converts any links between the subpages into local relative URLs; otherwise the links would still point to the original site, making local browsing impossible. Also use --restrict-file-names=windows to ensure safe file names for your respective OS. In a nutshell, these are the arguments I use with wget to make a local copy of an entire website:

wget -H -r --level=5 --restrict-file-names=windows --convert-links -e robots=off http://example.org
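The flags not mentioned above: -H (--span-hosts) lets the recursion follow links onto other hosts, -r turns on recursive retrieval, --level=5 caps the recursion depth at five, and -e robots=off tells wget to ignore robots.txt. As a sketch of a slightly politer variant (the -D domain list below is just a placeholder), you can throttle requests and keep host spanning limited to domains you name:

wget -H -D example.org,cdn.example.org -r --level=5 --wait=1 --random-wait --restrict-file-names=windows --convert-links -e robots=off http://example.org

Here --wait=1 and --random-wait space out the requests, and -D restricts the -H spanning to the listed domains.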

or

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org

--mirror – Makes (among other things) the download recursive.
--convert-links – Converts all the links (also to things like CSS stylesheets) to relative URLs, so the copy is suitable for offline viewing.
--adjust-extension – Adds suitable extensions to filenames (html or css) depending on their content type.
--page-requisites – Downloads things like CSS stylesheets and images required to properly display the page offline.
--no-parent – When recursing, do not ascend to the parent directory. This is useful for restricting the download to only a portion of the site.
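For example, pointing the same command at a subdirectory (the /docs/ path here is just a placeholder) keeps the mirror confined to that part of the site, because --no-parent stops wget from climbing back up towards the root:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org/docs/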

Alternatively, the command above may be shortened:

wget -mkEpnp http://example.org
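Here -m, -k, -E, -p and -np are simply the short forms of --mirror, --convert-links, --adjust-extension, --page-requisites and --no-parent, so this is the same command as above. Once it finishes, wget puts the copy into a directory named after the host, so with these defaults the local entry point should be something like example.org/index.html, which you can open directly in a browser:

xdg-open example.org/index.html

(or open example.org/index.html on macOS).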