A collection of scripts to help with various digital archiving tasks.
web crawling/wget_log_reader.py
: Script for reading and analyzing wget log files

web crawling/web_archive_validator.py
: Validates web archive files

web crawling/crt-scraper.py
: Web scraping utility

web crawling/extract_qa.py
: QA extraction utility for web archives
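
For readers unfamiliar with wget logs, the following is a minimal sketch of the kind of analysis a log reader can do. It is not the code in `wget_log_reader.py`; the function names `parse_wget_log` and `summarize` are hypothetical, and it assumes wget's default verbose log layout, where each request starts with a `--<timestamp>--  <url>` line and the status appears after `awaiting response...`.

```python
import re
from collections import Counter

# Assumed layout of wget's default (verbose) log output.
REQUEST_RE = re.compile(r"^--(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})--\s+(\S+)")
STATUS_RE = re.compile(r"awaiting response\.\.\. (\d{3})")


def parse_wget_log(text):
    """Return a list of (url, status_code) pairs found in the log text."""
    results = []
    current_url = None
    for line in text.splitlines():
        request = REQUEST_RE.match(line)
        if request:
            # A new request block begins; remember its URL.
            current_url = request.group(2)
            continue
        status = STATUS_RE.search(line)
        if status and current_url is not None:
            results.append((current_url, int(status.group(1))))
    return results


def summarize(results):
    """Count how many responses were seen for each HTTP status code."""
    return Counter(status for _, status in results)


if __name__ == "__main__":
    sample = (
        "--2024-01-01 10:00:00--  https://example.org/\n"
        "HTTP request sent, awaiting response... 200 OK\n"
        "--2024-01-01 10:00:01--  https://example.org/missing\n"
        "HTTP request sent, awaiting response... 404 Not Found\n"
    )
    for code, count in sorted(summarize(parse_wget_log(sample)).items()):
        print(code, count)
```

A summary by status code is often the quickest way to spot failed captures (for example, a spike in 404 or 5xx responses) before validating the resulting archive.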
pypreservica scripts/
: Contains scripts for interacting with Preservica's API:

a_get_metadata.py
: Retrieves metadata from Preservica

b_delete_metadata.py
: Deletes metadata from Preservica assets

c_add_metadata_from_csv.py
: Adds metadata from CSV files

d_update_xip_from_csv.py
: Updates XIP metadata from CSV

download_preservica_assets.py
: Downloads assets from Preservica

semaphore-helper.py
: Uses Semaphore's CLSClient to auto-classify documents and sorts by topic score
Contains older or archived versions of scripts
Contains scripts and configurations for browsertrix-crawler
Contains scripts for downloading and processing content from the Internet Archive
Contains scripts for managing and organizing files
Contains tools for working with sitemaps, including:
- Script to produce a plain list of URLs from an XML sitemap (outputs to .txt, .html, or terminal)
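
As a rough illustration of the sitemap-to-URL-list idea, here is a minimal sketch using only the Python standard library. It is not the script in this repository; the names `extract_urls` and `write_txt` are hypothetical, and it assumes the sitemap uses the standard `http://www.sitemaps.org/schemas/sitemap/0.9` namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(sitemap_xml):
    """Return every non-empty <loc> value in the sitemap, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


def write_txt(urls, path):
    """Write one URL per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.org/a</loc></url>"
        "<url><loc>https://example.org/b</loc></url>"
        "</urlset>"
    )
    # Print to the terminal; call write_txt(urls, "urls.txt") to save instead.
    for url in extract_urls(sample):
        print(url)
```

Because `root.iter` walks the whole tree, the same function also picks up `<loc>` entries in a sitemap index file, which can then be fetched and parsed in turn.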