grey-land/warc-browser

warc-browser

A CLI toolkit for working with web archives.

warc-browser uses the DevTools protocol to automate compatible web browsers, captures all content for a given web page (HTML, CSS, JS, images, videos, PDFs, ...), and stores the results in a .warc file. It grew out of the need to archive web pages quickly in a scriptable manner.

Installation

make build
./warc-browser --help

Usage

Archive a URL with the browser running in headless mode.

warc-browser --output-dir /tmp/archives browser --headless archive --url http://example.com
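
Because archiving is a single command, it is straightforward to script. The following is a minimal sketch of archiving several pages in a loop, using only the flags from the single-URL example above. The function name `archive_urls`, the `WARC_BROWSER` variable, and the `urls.txt` file are hypothetical names introduced for illustration; they are not part of warc-browser.

```shell
#!/usr/bin/env bash
# Sketch: archive every URL listed in a text file, one URL per line.
# archive_urls and WARC_BROWSER are illustrative names, not part of warc-browser.
archive_urls() {
  local url_file="$1"
  local out_dir="$2"
  # Allow overriding the binary path, e.g. WARC_BROWSER=./warc-browser.
  local bin="${WARC_BROWSER:-warc-browser}"
  local url
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with '#'.
    case "$url" in
      ''|'#'*) continue ;;
    esac
    # Same invocation as the single-URL example; stdin is redirected so the
    # command cannot consume the remaining lines of the URL file.
    "$bin" --output-dir "$out_dir" browser --headless archive --url "$url" </dev/null \
      || echo "failed to archive: $url" >&2
  done < "$url_file"
}
```

With the function loaded, `archive_urls urls.txt /tmp/archives` would archive each listed page in turn and report any URL whose run exits with an error.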

Attach to a running browser, list the available tabs, then capture a specific tab.

# Start Chromium with remote debugging enabled (the URL is quoted so the shell does not treat '?' as a glob)
chromium --remote-debugging-port=9222 --url 'https://duckduckgo.com/?q=web+archive'
# List tabs of chromium
warc-browser browser -a
# Archive first tab
warc-browser browser -a archive -t 0
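
If attaching fails, it can help to confirm that the browser is actually exposing its debugging port. A minimal sketch, assuming `curl` is installed and the browser was started with `--remote-debugging-port=9222` as above; `devtools_tabs` is a hypothetical helper name, not part of warc-browser. It queries the `/json/list` HTTP endpoint that Chromium serves on the debugging port.

```shell
# Sketch: print the tabs exposed by a browser started with --remote-debugging-port.
# devtools_tabs is an illustrative helper, not part of warc-browser.
devtools_tabs() {
  # Default to the port used in the example above.
  local port="${1:-9222}"
  # /json/list returns the open targets (tabs) as JSON; --fail makes curl
  # exit non-zero on HTTP errors, and a refused connection also fails.
  curl --silent --fail "http://127.0.0.1:${port}/json/list"
}
```

Running `devtools_tabs` (or `devtools_tabs 9333` for another port) prints a JSON array of targets when the port is reachable and exits with a non-zero status otherwise.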

Start a web server that serves a simple UI for browsing the collected archives.

warc-browser ui

Then open your browser at http://localhost:8080.

Software used

  1. github.com/go-rod/rod, a web automation framework, for driving the browser
  2. github.com/nlnwa/gowarc for composing WARC records
  3. github.com/webrecorder/replayweb.page for visualizing records in the web UI
coverage: 61.2% of statements