wwwsave

Save a local copy of a single Web page or a complete site!

Still a work in progress.

Usage

Usage: ./wwwsave [options]

Options:

--agent ua     User agent to load pages as
                   (default: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 SlimerJS/0.10.0)
-f             Force appending data to existing output directory
                   (will add new files but not update existing files)
-h             Show this message
-o dir         Directory to save pages to
                   (default: "./wwwsave-<site>")
-p pwd         Password for login
-r             Resume interrupted save
-s site        Enable login & personal content discovery
                   (see below for supported sites)
-u name        Username for login
--url url      Single page to save
                   (use -s to save an entire site)
-v             Run verbosely
                   (default: false)
--version      Show version
--view size    Browser viewport resolution in pixels (format: wxh)
                   (default: 1280x1024)

To save a single public page:

$ ./wwwsave --url http://www.example.com
$ ./wwwsave --url http://www.example.com/path/to/page.html

To save all personal content on a site requiring login:

$ ./wwwsave -s site -u myname -p '$3cr3t'

To save a single page on a site requiring login:

$ ./wwwsave -s site -u myname -p '$3cr3t' --url http://myname.example.com
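
Options can be combined; for example, to resume an interrupted site save and write verbose output to a specific directory:

$ ./wwwsave -s site -u myname -p '$3cr3t' -o ./mysave -r -v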

The following sites are supported for use with the -s option:

  • livejournal
  • pinterest
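
For example, to save all personal content from a LiveJournal account (the username and password shown are placeholders):

$ ./wwwsave -s livejournal -u myname -p '$3cr3t'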

To view the downloaded content:

  • Load <output directory>/index.html in your browser

  • Start a local web server in <output directory> and load its default URL in your browser, e.g.

    $ cd <output directory>

    $ python -m SimpleHTTPServer 8000

    (Load http://localhost:8000 in your browser.)
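
    With Python 3, the equivalent built-in module is http.server:

    $ python3 -m http.server 8000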

Because browsers restrict what pages loaded from local files may access (same-origin rules are stricter for file:// URLs), the downloaded content may not display correctly even though it lives on your own machine. The second option above gives the best results and requires no changes to your browser settings.

Adding authentication for other sites

Copy one of the existing config/*.json files and provide values for the site you're interested in. See the Site Config File Explained Wiki page for more information.
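
As a purely hypothetical illustration (these key names are invented; the actual schema is defined by the existing config/*.json files and documented in the Wiki), such a config describes where to log in and which form fields to fill:

{
  "login_url": "https://www.example.com/login",
  "username_selector": "#user",
  "password_selector": "#password",
  "submit_selector": "input[type=submit]"
}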

Framework choice

Finding the right scraping framework wasn't easy. I initially wanted this to be a Ruby project, so Mechanize seemed like the logical choice. But many modern web sites alter their HTML dynamically using JavaScript, and Mechanize fell short because it does not execute JavaScript.

Then I realized that a browser testing framework might work better. Because it drives an actual browser, Watir captures exactly what the user sees, and I even improved performance by using Typhoeus to download the in-page resources. In the end, though, Watir cannot be instructed to save an image on a page. A hybrid approach, in which Watir saves only the page HTML and Mechanize/Typhoeus saves all assets (JS, CSS, images, etc.), didn't work either: HttpOnly cookies are (rightly) not exposed outside of Watir's internals, so they cannot be replayed by another HTTP client. Sites like LiveJournal require HttpOnly cookies to access certain assets, such as scrapbook images, so that was a show-stopper.

Then I realized my approach was inefficient: the browser already downloaded all assets, so why download them again programmatically? Looking at headless browsers with full JavaScript support, PhantomJS seemed promising, but it does not give access to the response body (apparently not even in v2.2). Luckily, SlimerJS has added support for accessing the response body, so the Ruby code was ported to JavaScript.
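
As a minimal sketch of that SlimerJS feature (not the actual wwwsave code; the URL and content-type patterns are just examples), webpage.captureContent lists regexes for the responses to keep, and onResourceReceived then exposes each matching body:

var webpage = require('webpage').create();

// Keep the raw bytes of responses whose Content-Type matches these patterns,
// so they can be written to disk instead of being downloaded a second time.
webpage.captureContent = [/image\/.*/, /text\/css/];

webpage.onResourceReceived = function (response) {
    // Each resource fires twice; the body is only present at the "end" stage.
    if (response.stage === 'end' && response.body) {
        console.log(response.url + ': captured ' + response.body.length + ' bytes');
    }
};

webpage.open('http://www.example.com', function () {
    slimer.exit();
});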

Unfortunately, I ran into a SlimerJS issue in which certain requests (LiveJournal scrapbook images) were aborted, yielding a non-zero Content-Length but a 0-length response body. There is no workaround, and with PhantomJS now abandoned, I am looking into using headless Firefox or Chrome instead.
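
Both browsers now ship command-line headless modes (wwwsave does not drive them yet); for example:

$ firefox --headless --screenshot shot.png http://www.example.com
$ google-chrome --headless --dump-dom http://www.example.com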

Developer setup

  1. Install Firefox: http://getfirefox.com/

  2. Install SlimerJS: http://slimerjs.org/

  3. Do any additional SlimerJS setup, if needed: http://docs.slimerjs.org/current/installation.html#setup

  4. Show the usage:

    $ ./wwwsave -h