The video OCR processor for Richmond Sunlight.
Richmond Sunlight’s standalone pipeline for finding Virginia General Assembly video, downloading it, generating screenshots, extracting transcripts/bill metadata/speaker metadata, and uploading finalized assets to the Internet Archive.
The worker stack mirrors the main richmondsunlight.com repo: PHP 8.x, Composer-managed dependencies (vendor dir lives under includes/), and the shared Log/Database helpers. Core modules:
- Scraper (
bin/scrape.php) — collects House/Senate metadata (floor + committee) from Granicus and YouTube, and persists JSON snapshots understorage/scraper/. - Sync + fetchers (
bin/fetch_videos.php,bin/generate_screenshots.php) — reconcile scraped data against thefilestable, download MP4s to S3, and create screenshot manifests for downstream analysis. - Analysis workers (
bin/generate_transcripts.php,bin/detect_bills.php,bin/detect_speakers.php) — populatevideo_transcriptandvideo_indexby parsing captions, OCRing chyrons, and mapping speakers. Each script understands both an enqueue mode (for the lightweight control plane) and a worker mode (for the GPU instance). Speaker detection uses AWS Transcribe for floor videos (House and Senate floor sessions) but skips diarization for committee videos due to cost constraints. - Raw text resolution (
bin/resolve_raw_text.php) — resolves OCR-extracted text invideo_indexto database references using fuzzy matching for legislators and strict validation for bills. Runs after speaker/bill detection to link raw text entries topeople.idandbills.id. - Archival (
bin/upload_archive.php) — pushes S3-hosted assets and captions to the Internet Archive and updatesfiles.pathwith the IA URL.
Job orchestration is handled via JobDispatcher. In production the dispatcher speaks to the FIFO queue rs-video-harvester.fifo (SQS); in Docker/tests it falls back to an in-memory queue so the full pipeline can run locally without AWS credentials.
When deployed, this will not run any video analysis unless /home/ubuntu/video-processor.txt is present (the contents of the file are immaterial). This is to allow it to run on a dual-server configuration, with the scraper etc. running on Machine, reserving video analysis for a more powerful instance.
Sample MP4 fixtures are pulled from video.richmondsunlight.com/fixtures (see bin/fetch_test_fixtures.php) so integration tests exercise real video. Most CLI tests synthesize temporary MP4s via ffmpeg when needed.
- PHP 8.1+ with
ext-pdo,ext-json,ext-simplexml. - Composer (for installing/updating dependencies under
includes/vendor). - ffmpeg and curl for screenshot/transcription stages.
- Docker (optional) for the provided test environment (
docker-run.sh,docker-tests.sh). - AWS credentials (staging/production) for S3, SQS, and Transcribe (for speaker diarization of floor videos).
Environment constants mirror the main site and are declared in includes/settings.inc.php. At minimum define:
PDO_DSN,PDO_USERNAME,PDO_PASSWORD(point at staging DB for local work).AWS_ACCESS_KEY,AWS_SECRET_KEY,AWS_REGION,SQS_QUEUE_URL.OPENAI_KEY(for transcript generation fallback),IA_ACCESS_KEY,IA_SECRET_KEY, optionalSLACK_WEBHOOK.
For Docker development, override those values via environment variables or mount a tailored includes/settings.inc.php.
-
To stand up the Docker environment with all tooling (PHP, ffmpeg, OpenAI CLI deps, etc.):
./docker-run.sh # builds/starts the container ./docker-tests.sh # runs the test suite inside Docker
Use
./docker-stop.shto shut the container down when finished. Inside Docker the dispatcher automatically uses the in-memory queue, and no writes go to production resources as long as you pointPDO_*/AWS_*at staging.If you have an OpenAPI key that you want to use, specify that in
includes/settings.inc.php(which will be copied over automatically fromdeploy/settings-docker.inc.phpon first run, or you can copy it manually and add the OpenAPI key at the same time.)
-
Native host:
./includes/vendor/bin/phpunit
Certain suites require ffmpeg and the downloaded MP4 fixtures; the tests self-skip if prerequisites are missing.
-
Docker (recommended for parity with CI):
./docker-tests.sh
That script ensures the container is running, verifies dependencies, and launches PHPUnit inside the container. A full run currently executes 40+ unit/integration tests (see tests/Integration/* for queue smoke tests).
bin/scrape.php— gather fresh metadata from the public House/Senate sources (Granicus and YouTube).bin/fetch_videos.php --limit=N— download any missing videos to S3 (video.richmondsunlight.com).bin/generate_screenshots.php --enqueue|--limit=N— enqueue screenshot jobs (control plane) or run worker mode (analysis box).bin/generate_transcripts.php,bin/detect_bills.php,bin/detect_speakers.php— same enqueue/worker pattern for analysis.bin/resolve_raw_text.php --file-id=N|--limit=N|--dry-run— resolve OCR text to database references (legislators, bills).bin/upload_archive.php --limit=N— upload S3-hosted files + captions to the Internet Archive, updating thefilestable.
All scripts bootstrap via bin/bootstrap.php, which wires up the shared Log, PDO connection, and JobDispatcher. Use the --enqueue flag when running on the lightweight instance so work is pushed to SQS; omit it on the worker to poll/process jobs directly.
To speed up video processing on GPU-capable instances, parallel worker scripts are available for each stage. These scripts launch multiple workers that pull jobs from the SQS queue simultaneously, dramatically reducing processing time.
Run the entire pipeline with parallel workers at each stage:
./bin/pipeline_parallel.shThis script:
- Scrapes and imports new video metadata
- Downloads videos to S3 (3 parallel workers)
- Generates screenshots (4 parallel workers)
- Processes transcripts, bills, and speakers simultaneously (9 total workers)
- Uploads to Internet Archive (2 parallel workers)
Configure worker counts via environment variables:
SCREENSHOT_WORKERS=8 TRANSCRIPT_WORKERS=5 JOBS_PER_WORKER=10 ./bin/pipeline_parallel.shFor catching up on specific stages or debugging, use the individual parallel scripts:
# Screenshots (default: 4 workers, 5 jobs each)
./bin/generate_screenshots_parallel.sh [workers] [jobs_per_worker]
# Transcripts (default: 3 workers)
./bin/generate_transcripts_parallel.sh [workers] [jobs_per_worker]
# Bill Detection (default: 3 workers)
./bin/detect_bills_parallel.sh [workers] [jobs_per_worker]
# Speaker Detection (default: 3 workers)
./bin/detect_speakers_parallel.sh [workers] [jobs_per_worker]
# Internet Archive Uploads (default: 2 workers)
./bin/upload_archive_parallel.sh [workers] [jobs_per_worker]Examples:
# Catch up on screenshot backlog with 8 workers processing 10 jobs each
./bin/generate_screenshots_parallel.sh 8 10
# Run transcripts and bill detection simultaneously
./bin/generate_transcripts_parallel.sh 3 5 &
./bin/detect_bills_parallel.sh 3 5 &
waitNote: Parallel processing requires SQS in production. The in-memory queue (used in Docker/local development) does not support safe parallel processing, as it lacks message locking to prevent duplicate work.
- Manually install on Ubuntu EC2 instance in
~/video-processor/ - Run
deploy/deploy.sh - Configure the Archive.org setup on the server with
ia configure - Recommended: Set up a CloudWatch alarm to shut down the instance if it's idle (it could get expensive fast)
Detailed feature documentation is available in docs/features/:
- YouTube Scraping - Setup and usage of the YouTube video harvester for Virginia Senate videos
- Raw Text Resolution - OCR text matching to database references (legislators and bills)
- Database Migrations - SQL schema changes and migration guide
For LLM development assistance, see llm-instructions.md.