digital-archiving-scripts

A collection of scripts to help with various digital archiving tasks.

Main Scripts

Web Crawling and Validation

web crawling/wget_log_reader.py: Script for reading and analyzing wget log files
web crawling/web_archive_validator.py: Validates web archive files
web crawling/crt-scraper.py: Web scraping utility
web crawling/extract_qa.py: QA extraction utility for web archives
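
As a rough illustration of the kind of analysis a wget log reader can perform, the sketch below counts HTTP status codes in a log. It is not the code in wget_log_reader.py. It assumes wget's default verbose English output, where each request is followed by a line such as "HTTP request sent, awaiting response... 200 OK"; the function name count_statuses is hypothetical.

```python
import re
from collections import Counter

# Matches the status line wget writes after each request, e.g.
# "HTTP request sent, awaiting response... 404 Not Found"
# (assumes wget's default verbose English output).
STATUS_RE = re.compile(r"awaiting response\.\.\. (\d{3})")


def count_statuses(log_text):
    """Return a Counter mapping each HTTP status code (as a string) to its count."""
    return Counter(STATUS_RE.findall(log_text))


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = (
        "HTTP request sent, awaiting response... 200 OK\n"
        "HTTP request sent, awaiting response... 404 Not Found\n"
        "HTTP request sent, awaiting response... 200 OK\n"
    )
    for status, count in sorted(count_statuses(sample).items()):
        print(status, count)
```

Counting by status code is a quick way to spot failed or redirected requests before looking at individual URLs.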

Preservica Integration

pypreservica scripts/: Contains scripts for interacting with Preservica's API:
- a_get_metadata.py: Retrieves metadata from Preservica
- b_delete_metadata.py: Deletes metadata from Preservica assets
- c_add_metadata_from_csv.py: Adds metadata from CSV files
- d_update_xip_from_csv.py: Updates XIP metadata from CSV
- download_preservica_assets.py: Downloads assets from Preservica
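
To give a sense of the CSV-to-metadata step, here is a minimal sketch that turns each CSV row into an XML fragment using only the Python standard library. It is not the code in c_add_metadata_from_csv.py and makes no Preservica API calls. The assetId column name, the use of Dublin Core element names as column headers, and both function names are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumption: metadata is expressed as simple Dublin Core inside an oai_dc wrapper.
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def row_to_dublin_core(row, ref_column="assetId"):
    """Build an XML string from one CSV row; empty cells are skipped."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for column, value in row.items():
        # The reference column identifies the asset and is not metadata itself.
        if column == ref_column or not value:
            continue
        child = ET.SubElement(root, f"{{{DC_NS}}}{column}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


def csv_to_fragments(csv_text, ref_column="assetId"):
    """Return a dict mapping each asset reference to its XML fragment."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[ref_column]: row_to_dublin_core(row, ref_column) for row in reader}


if __name__ == "__main__":
    # Hypothetical CSV content, for illustration only.
    sample = "assetId,title,creator\nabc-123,Annual Report,\n"
    for ref, xml in csv_to_fragments(sample).items():
        print(ref, xml)
```

Each fragment would then be attached to the matching asset through the Preservica API; that upload step is left out here.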

Other Utilities

semaphore-helper.py: Uses Semaphore's CLSClient to auto-classify documents and sorts by topic score

Directory Structure

archived scripts/

Contains older or archived versions of scripts

browsertrix-crawler files and scripts/

Contains scripts and configurations for browsertrix-crawler

downloading items from internet archive/

Contains scripts for downloading and processing content from the Internet Archive

file-management/

Contains scripts for managing and organizing files

sitemap tools/

Contains tools for working with sitemaps, including:
- A script that produces a plain list of URLs from an XML sitemap (output to .txt, .html, or the terminal)
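
Extracting URLs from an XML sitemap can be sketched as follows. This is an illustration using the standard library, not the script in this directory. It assumes the sitemap uses the standard sitemaps.org 0.9 namespace; the function name sitemap_urls is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    # Hypothetical sitemap, for illustration only.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.org/</loc></url>"
        "<url><loc>https://example.org/about</loc></url>"
        "</urlset>"
    )
    # Print one URL per line; redirect to a file to get a .txt list.
    print("\n".join(sitemap_urls(sample)))
```

Because the search matches every loc element, the same function also lists the child sitemap URLs of a sitemap index file.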
