Changelog

All notable changes to the Statement gem will be documented in this file.

[2.3.0] - 2024-11-17

Added

Error Handling & Reliability

  • Comprehensive logging system with configurable log levels (DEBUG, INFO, WARN, ERROR, FATAL)
  • Automatic retry logic with exponential backoff for network errors (max 2 retries)
  • Graceful error handling - scrapers return empty arrays instead of crashing
  • User-agent headers to reduce 403 Forbidden errors
  • Timeout handling (30 second default) for unresponsive sites
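The retry behavior described above can be sketched as follows. This is an illustrative pattern, not the gem's actual implementation: the real logic lives inside ScraperBase#open_html, and the method name and parameters here are assumptions.

```ruby
require 'timeout'

# Sketch of retry-with-exponential-backoff: retry transient network
# failures up to max_retries times, doubling the delay each attempt.
def fetch_with_retries(max_retries: 2, base_delay: 1)
  attempts = 0
  begin
    yield
  rescue IOError, Timeout::Error => e
    attempts += 1
    raise e if attempts > max_retries   # give up after max_retries
    sleep(base_delay * (2**(attempts - 1)))  # 1s, then 2s
    retry
  end
end
```

With the defaults above, a request that fails twice and succeeds on the third attempt adds roughly 3 seconds of backoff, consistent with the retry overhead noted under Performance Impact.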

Command-Line Interface

  • New statement CLI executable for running scrapers from the command line
  • Support for specifying Congress number (defaults to 119th Congress)
  • Filter by type: members, committees, or all
  • Run individual scrapers by name: --scraper shaheen
  • Single URL scraping mode: --url <url>
  • Multiple output formats: JSON, CSV, and pretty-printed text
  • Performance metrics tracking with --performance flag
  • Configurable log levels via --log-level flag
  • List all available scrapers with --list-scrapers
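A few possible invocations combining the flags listed above. Only --scraper, --url, --performance, --log-level, and --list-scrapers are named explicitly in this changelog; the --congress and --type spellings are assumptions, so check the executable's help output for the authoritative syntax.

```shell
statement --list-scrapers                 # enumerate available scrapers
statement --scraper shaheen               # run a single member scraper
statement --type committees --congress 119 --performance --log-level DEBUG
```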

Testing & Performance Framework

  • Automated testing suite (bin/test_scrapers) for all 185 scrapers
  • Performance metrics collection (execution time, success rate, error details)
  • Multiple report formats: text, JSON, CSV, and HTML
  • Identifies slow scrapers (>5 seconds) for optimization
  • Tracks scraper status: success, warning, failed, error
  • Sample results preview in test output

Code Organization

  • New ScraperBase class providing common functionality:
    • open_html(url, retries): Robust HTML fetching with retries
    • safe_scrape(name, &block): Error-wrapped scraper execution
    • parse_date(date_string, format): Safe date parsing with error handling
    • build_result(...): Standardized result hash construction
  • Logger module for consistent logging across all components
  • ScraperRegistry module for future scraper metadata management
  • ScraperTester class for comprehensive scraper testing
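The safe_scrape and parse_date behaviors can be sketched as below. Method names mirror the changelog entries, but the bodies are illustrative re-implementations; the gem's actual signatures and logging calls may differ.

```ruby
require 'date'

# Error-wrapped scraper execution: a failing scraper logs the error
# and yields an empty array instead of crashing the whole run.
def safe_scrape(name)
  yield
rescue StandardError => e
  warn "[#{name}] scrape failed: #{e.class}: #{e.message}"
  []  # graceful degradation: callers always receive an array
end

# Safe date parsing: unparseable input becomes nil rather than raising.
def parse_date(date_string, format = '%m/%d/%Y')
  Date.strptime(date_string, format)
rescue ArgumentError, TypeError
  nil
end
```

Because every scraper returns an array on failure, a single blocked or broken site no longer aborts a batch run, which is the "Stability" improvement described below.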

Documentation

  • New MAINTAINER_GUIDE.md with comprehensive maintenance instructions
  • Performance report generation for tracking scraper health
  • Troubleshooting guide for common issues
  • Scraper pattern documentation for adding new members

Changed

  • Scraper class now inherits from ScraperBase
  • All scraper methods now use improved error handling
  • open_html method now includes:
    • User-agent headers to reduce blocking
    • Automatic retry logic
    • Comprehensive error logging
    • Timeout protection
  • Version bumped to 2.3 (the previous release was implicitly 2.2)

Improved

  • Stability: Gem no longer crashes on individual scraper failures
  • Observability: All errors are logged with full context
  • Performance: Automatic identification of slow scrapers
  • Maintainability: Modular architecture for easier updates
  • Debugging: Debug log level available for troubleshooting

Fixed

  • Scrapers no longer crash the entire process on errors
  • Better handling of 403 Forbidden responses
  • More robust date parsing with error recovery
  • Network timeout handling

Technical Details

New Files

  • lib/statement/logger.rb: Logging system (42 lines)
  • lib/statement/scraper_base.rb: Base class with error handling (107 lines)
  • lib/statement/scraper_registry.rb: Scraper registry (77 lines)
  • lib/statement/scraper_tester.rb: Testing framework (432 lines)
  • bin/statement: CLI executable (321 lines)
  • bin/test_scrapers: Test runner (33 lines)
  • MAINTAINER_GUIDE.md: Comprehensive maintenance documentation
  • CHANGELOG.md: This file

Modified Files

  • lib/statement.rb: Added new module requires
  • lib/statement/scraper.rb: Inherits from ScraperBase, improved open_html
  • statement.gemspec: Updated to include new executables

Known Issues

  • Many congressional websites now block automated scrapers (403 Forbidden errors)
    • This is expected behavior and handled gracefully
    • Consider using RSS feeds where available
  • 2 scrapers not yet implemented: emmer, porter
    • Listed in member_methods but methods not defined
    • Will be added in future update

Migration Guide

For users upgrading from version 2.2 or earlier:

  1. No breaking changes - All existing code continues to work
  2. New features are additive - Logging is enabled automatically but fully configurable
  3. CLI is optional - Can still use gem programmatically as before

Optional: Configure logging in your application:

# Set custom log level
Statement::Logger.setup(log_level: Logger::INFO)

# Or disable logging output
Statement::Logger.setup(output: File.open('/dev/null', 'w'))

Performance Impact

  • Average scraper execution time: ~3.4 seconds (includes retries)
  • Retry logic adds ~2-4 seconds for failed requests
  • Logging overhead: negligible (<10ms per scraper)
  • Memory usage: unchanged

Testing

All changes tested with:

  • Ruby 3.3.6
  • 185 scraper methods (129 members, 56 committees)
  • Multiple site structures and response codes
  • Various error conditions (timeout, 403, 404, network errors)

Credits

Updates and improvements by Derek Willis with assistance from Claude (Anthropic).

Recommendations for Future Versions

  1. Move to RSS feeds primarily: More reliable than HTML scraping
  2. Add member metadata: Include name, party, state, district
  3. Content extraction: Pull full text and summaries
  4. Caching layer: Reduce repeated requests
  5. Rate limiting: Respect congressional website servers
  6. OAuth integration: For protected APIs
  7. Database backend: Store historical data
  8. API mode: Provide JSON API for scraped data

[2.2.0] and Earlier

See git history for changes in previous versions.