A CLI toolkit for working with web archives.
warc-browser uses the DevTools protocol to automate compatible web browsers, captures all content of a given web page (HTML, CSS, JS, images, videos, PDFs, ...), and stores the results in a .warc file. It grew out of the need to archive web pages quickly in a scriptable manner.
make build
./warc-browser --help
Archive a URL, running the browser in headless mode.
warc-browser --output-dir /tmp/archives browser --headless archive --url http://example.com
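Because the tool is meant to be scripted, the single-URL command above can be wrapped in a loop. The following is a minimal sketch, not part of warc-browser itself: the function name `archive_urls` and the variable `WARC_BROWSER_BIN` are hypothetical, and it assumes the command exits with a non-zero status on failure.

```shell
#!/usr/bin/env bash
# Hypothetical override so the binary path can be changed (or stubbed in tests).
WARC_BROWSER_BIN="${WARC_BROWSER_BIN:-warc-browser}"

# archive_urls URL_FILE OUTPUT_DIR
# Archives every URL listed in URL_FILE (one per line) into OUTPUT_DIR.
# Returns 1 if at least one URL could not be archived, 0 otherwise.
archive_urls() {
  url_file="$1"
  output_dir="$2"
  failed=0
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with '#'.
    case "$url" in ''|'#'*) continue ;; esac
    # Same invocation as the single-URL example; stdin is redirected so the
    # command cannot consume the remaining lines of the URL list.
    if ! "$WARC_BROWSER_BIN" --output-dir "$output_dir" browser --headless archive --url "$url" </dev/null; then
      echo "failed to archive: $url" >&2
      failed=1
    fi
  done < "$url_file"
  return "$failed"
}
```

After sourcing the script, calling `archive_urls urls.txt /tmp/archives` would process a hypothetical `urls.txt` line by line and continue past individual failures.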
Attach to a running browser, list the available tabs, then capture a specific tab.
# Start Chromium with remote debugging enabled
chromium --remote-debugging-port=9222 --url 'https://duckduckgo.com/?q=web+archive'
# List the open Chromium tabs
warc-browser browser -a
# Archive the first tab
warc-browser browser -a archive -t 0
Start a web server that serves a simple UI for visualizing collected archives.
warc-browser ui
Open your browser at http://localhost:8080.
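For a quick check without the UI, the captured URLs can be read straight from the archive, since each WARC record header carries a `WARC-Target-URI` field. This is a rough sketch that assumes an uncompressed .warc file; the function name `list_warc_uris` is hypothetical, and a payload line that happens to start with the same field name would also be matched.

```shell
#!/usr/bin/env bash
# list_warc_uris WARC_FILE
# Prints the unique target URIs found in the record headers of WARC_FILE.
list_warc_uris() {
  # -a makes grep treat the file as text even when payloads are binary.
  # WARC header lines end in CRLF, so the carriage return is stripped.
  grep -a '^WARC-Target-URI:' "$1" \
    | tr -d '\r' \
    | sed 's/^WARC-Target-URI:[[:space:]]*//' \
    | sort -u
}
```

Calling `list_warc_uris` with the path of a file written to the output directory prints one URI per line, which is enough to confirm that a page and its assets were captured.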
Software used:
- github.com/go-rod/rod for browser automation
- github.com/nlnwa/gowarc for composing WARC records
- github.com/webrecorder/replayweb.page for visualizing records in the web UI
Test coverage: 61.2% of statements