Statement Gem - Maintainer's Guide

Overview

This guide provides information for maintaining and extending the Statement gem for parsing congressional press releases and statements.

Version 2.3 Updates

Major Improvements

  1. Error Handling & Logging

    • Added a comprehensive logging system; failures are recorded rather than allowed to crash a run
    • All errors are logged with timestamps and severity levels
    • Scrapers return empty arrays instead of crashing on failures
    • Automatic retry logic with exponential backoff for network errors
  2. Command-Line Interface

    • New statement CLI for running scrapers
    • Support for filtering by congress number and type (members/committees)
    • Multiple output formats (JSON, CSV, pretty print)
    • Performance metrics tracking
  3. Testing & Performance Framework

    • Automated testing suite for all scrapers
    • Performance metrics collection
    • Multiple report formats (text, JSON, CSV, HTML)
    • Identifies slow and broken scrapers
  4. Code Organization

    • Modular architecture with ScraperBase class
    • Improved maintainability
    • Better separation of concerns
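The retry-with-exponential-backoff behavior described in item 1 can be sketched roughly like this (the helper name and parameters are illustrative, not the gem's actual internals):

```ruby
# Illustrative retry helper; ScraperBase's real logic may differ.
def with_retries(max_retries: 3, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    if attempts <= max_retries
      sleep(base_delay * 2**(attempts - 1)) # backoff: 1s, 2s, 4s, ...
      retry
    end
    nil # give up quietly: log the error and let the caller continue
  end
end
```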

Architecture

Key Components

lib/statement/
├── logger.rb                 # Logging system
├── scraper_base.rb          # Base class with error handling
├── scraper_registry.rb      # Scraper registry (future use)
├── scraper_tester.rb        # Testing and performance framework
├── scraper.rb               # All scraper implementations (3217 lines)
├── feed.rb                  # RSS feed parsing
├── utils.rb                 # Utility functions
├── facebook.rb              # Facebook integration
├── tweets.rb                # Twitter integration
└── version.rb               # Version number

bin/
├── statement                # CLI for running scrapers
└── test_scrapers           # Testing and performance tool

Scraper Base Class

All scrapers now inherit from ScraperBase which provides:

  • open_html(url, retries = 0): Fetches and parses HTML with automatic retries
  • safe_scrape(name, &block): Wraps scraper execution with error handling
  • parse_date(date_string, format = nil): Safe date parsing
  • build_result(...): Helper for building result hashes
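A minimal sketch of how safe_scrape and build_result might fit together (hypothetical implementations; the real methods live in scraper_base.rb):

```ruby
require 'date'

# Hypothetical sketches of the two helpers described above.
def safe_scrape(name)
  yield
rescue StandardError => e
  # Statement::Logger.error("#{name} failed: #{e.message}")
  [] # a failed scraper returns an empty array instead of crashing
end

def build_result(source:, url:, title:, date:, domain:)
  { source: source, url: url, title: title, date: date, domain: domain }
end
```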

Logging

The logging system provides different severity levels:

Statement::Logger.debug("Detailed debugging info")
Statement::Logger.info("General information")
Statement::Logger.warn("Warning messages")
Statement::Logger.error("Error messages")
Statement::Logger.fatal("Fatal errors")

Configure log level:

Statement::Logger.setup(log_level: Logger::INFO, output: $stdout)

Using the CLI

Basic Usage

# Run all member scrapers (default: 119th Congress)
statement --type members

# Run all committee scrapers
statement --type committees

# Run everything
statement --type all

# Run a specific scraper
statement --scraper shaheen

# Scrape a single URL
statement --url "https://example.com/rss"

# Save to file with performance metrics
statement --type members --output results.json --performance

# Change output format
statement --type members --format pretty
statement --type members --format csv --output results.csv

# List all available scrapers
statement --list-scrapers

# Adjust logging level
statement --type members --log-level debug

Testing Scrapers

Running Tests

# Test all scrapers and generate report
ruby -I lib bin/test_scrapers

# Test only members or committees
ruby -I lib bin/test_scrapers --members
ruby -I lib bin/test_scrapers --committees

# Generate different report formats
ruby -I lib bin/test_scrapers --json --output=report.json
ruby -I lib bin/test_scrapers --csv --output=report.csv
ruby -I lib bin/test_scrapers --html --output=report.html

Interpreting Performance Reports

The performance report shows:

  • Success Rate: Percentage of working scrapers
  • Response Time: Average, min, max scraper execution times
  • Error Details: Specific errors for each failed scraper
  • Recommendations: Scrapers needing attention

Example output:

SUMMARY
Total scrapers tested:         185
Successful:                    145 (78.4%)
Warnings (empty results):      15 (8.1%)
Failed:                        23 (12.4%)
Errors:                        2 (1.1%)

Common Issues

403 Forbidden Errors

Many congressional websites now block automated scrapers with 403 errors. This is expected behavior.

Solutions:

  1. Blocked sites usually return an error page that explains the refusal; check the logged response for details
  2. The gem now logs these and continues instead of crashing
  3. Consider using RSS feeds where available (see Statement::Feed.from_rss)
  4. Some sites may work with different user agents (already configured in ScraperBase)

Missing Scrapers

Some members listed in member_methods may not have corresponding scraper implementations (e.g., emmer, porter).

To add a missing scraper:

  1. Find the member's press release page
  2. Determine the page structure (inspect HTML)
  3. Add scraper method to scraper.rb:
def self.member_name(page=1)
  results = []
  url = "https://membername.house.gov/press-releases?page=#{page}"
  doc = open_html(url)
  return [] if doc.nil?

  doc.css('.press-release-item').each do |row|
    results << {
      source: url,
      url: row.at_css('a')['href'],
      title: row.at_css('.title').text.strip,
      date: parse_date(row.at_css('.date').text),
      domain: 'membername.house.gov'
    }
  end

  results
end
  4. Add to member_methods array (line ~40)
  5. Add to member_scrapers method call list (line ~65)

Slow Scrapers

Scrapers taking >5 seconds may need optimization:

  1. Check if page has pagination (reduce items per page)
  2. Verify URL is correct and loads quickly in browser
  3. Check if site is blocking/throttling requests
  4. Consider caching for frequently accessed pages
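For the caching suggestion in step 4, even a tiny in-memory TTL cache helps; this sketch is illustrative (the gem does not ship a cache):

```ruby
# Minimal in-memory cache with a time-to-live, keyed by URL.
class PageCache
  def initialize(ttl: 300)
    @ttl = ttl
    @store = {}
  end

  # Returns the cached body if fresh; otherwise runs the block and caches it.
  def fetch(url)
    entry = @store[url]
    return entry[:body] if entry && Time.now - entry[:at] < @ttl

    body = yield
    @store[url] = { body: body, at: Time.now }
    body
  end
end
```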

Date Parsing Errors

If dates aren't parsing correctly:

# Use specific date format
date = parse_date(date_string, "%m/%d/%Y")

# Or try American date format (already included via american_date gem)
date = Date.parse("12/25/2024") # Works as MM/DD/YYYY
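A nil-safe parse_date along the lines the base class is described as providing might look like this (a sketch, not the gem's actual code):

```ruby
require 'date'

# Try the explicit format first, fall back to Date.parse,
# and return nil instead of raising on bad input.
def parse_date(date_string, format = nil)
  return nil if date_string.nil? || date_string.strip.empty?

  if format
    Date.strptime(date_string.strip, format)
  else
    Date.parse(date_string.strip)
  end
rescue ArgumentError
  nil
end
```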

Adding New Scrapers

Step-by-Step Process

  1. Identify the member/committee

    • Get their official website
    • Find their press releases page
    • Note if they have an RSS feed
  2. If RSS feed available:

    # No need for a custom scraper, use:
    Statement::Feed.from_rss('https://member.gov/rss.xml')
  3. If custom scraper needed:

    a. Inspect the HTML structure
    b. Identify patterns (CSS selectors for title, URL, date)
    c. Write the scraper method
    d. Test it
    e. Add to member_methods/committee_methods array

  4. Test the scraper:

    ruby -I lib -e "require 'statement'; puts Statement::Scraper.new_method.inspect"

Scraper Patterns

The gem uses several common patterns for similar website structures:

  • document_query_new: House members using DocumentQuery.aspx
  • media_body: Sites with .media-body CSS class
  • article_block: Sites with .ArticleBlock structure
  • senate_drupal: Senate sites using Drupal CMS
  • react: Sites built with React (slower, need JS rendering)

When adding a member, check if they use one of these patterns first.

Performance Optimization

Best Practices

  1. Use pagination wisely: Don't scrape more pages than needed
  2. Implement caching: For frequently accessed data
  3. Batch operations: Use Statement::Feed.batch(urls) for multiple RSS feeds
  4. Error handling: Always use the base class methods
  5. Logging: Use appropriate log levels (INFO for normal, DEBUG for development)
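The batching idea in item 3 amounts to fanning URLs out over a small thread pool; here is a generic sketch (Statement::Feed.batch's real signature and internals may differ):

```ruby
# Fetch each URL on a worker thread, returning { url => result }.
def batch_fetch(urls, threads: 5)
  queue = Queue.new
  urls.each { |u| queue << u }
  results = {}
  mutex = Mutex.new

  workers = Array.new([threads, urls.size].min) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        value = yield(url)
        mutex.synchronize { results[url] = value }
      end
    end
  end

  workers.each(&:join)
  results
end
```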

Monitoring

Run regular performance tests:

# Weekly performance check
ruby -I lib bin/test_scrapers --output=reports/weekly-$(date +%Y%m%d).txt

# Compare performance over time
diff reports/weekly-20241110.txt reports/weekly-20241117.txt

Output Format

Standard Result Format

Each scraper returns an array of hashes:

[
  {
    source: "https://member.gov/press",     # Source URL
    url: "https://member.gov/press/123",    # Press release URL
    title: "Member Announces New Bill",     # Title
    date: #<Date: 2024-11-17>,             # Date object
    domain: "member.gov"                    # Domain name
  },
  # ... more results
]
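A small validator for this format can catch malformed scraper output during development (a maintainer convenience, not part of the gem):

```ruby
require 'date'

REQUIRED_KEYS = %i[source url title date domain].freeze

# True if a hash carries every standard field and a real Date object.
def valid_result?(h)
  h.is_a?(Hash) &&
    REQUIRED_KEYS.all? { |k| h.key?(k) } &&
    h[:date].is_a?(Date)
end
```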

Suggested Improvements

Consider adding:

  1. Member metadata:

    {
      # ... existing fields ...
      member_name: "John Doe",
      state: "CA",
      district: "12",
      party: "D",
      chamber: "House"
    }
  2. Content extraction:

    {
      # ... existing fields ...
      summary: "First paragraph or excerpt",
      full_text: "Complete press release text",
      topics: ["healthcare", "education"]
    }
  3. Social media links:

    {
      # ... existing fields ...
      twitter_url: "https://twitter.com/member/status/123",
      facebook_url: "https://facebook.com/member/posts/123"
    }

Troubleshooting

Common Problems

Problem: "LoadError: cannot load such file"

# Solution: Run with proper load path
ruby -I lib script.rb
# Or install gem first
bundle install

Problem: "All scrapers returning nil"

# Check if sites are accessible
curl -I https://member.gov/press

# Try with user agent
curl -H "User-Agent: Mozilla/5.0" https://member.gov/press

Problem: "Date parsing errors"

# Use american_date gem (already included)
require 'american_date'
Date.parse("11/17/2024") # Works as MM/DD/YYYY

Maintenance Schedule

Weekly

  • Run performance tests
  • Check for broken scrapers
  • Update any scrapers with website changes

Monthly

  • Review error logs
  • Update member list against official House/Senate rosters
  • Check for new members needing scrapers

Quarterly

  • Full scraper audit
  • Performance optimization review
  • Update dependencies

Annually

  • Major version updates
  • Architecture review
  • Documentation updates

Current Status (as of 2024-11-17)

Based on the latest performance report:

  • Total scrapers: 185
  • Working: 0 (most sites blocking automated access)
  • Need attention: 185 (mostly due to 403 errors)
  • Missing: 2 (emmer, porter - not implemented)

Recommendations

  1. Prioritize RSS feeds: More reliable than HTML scraping
  2. Focus on high-value members: Leadership, committee chairs
  3. Partner with congressional offices: Get official data access
  4. Consider alternate approaches: APIs, official data feeds

Contributing

When adding or updating scrapers:

  1. Test thoroughly with the test suite
  2. Update member_methods/committee_methods arrays
  3. Document any special considerations
  4. Run performance tests to ensure no degradation
  5. Update this guide if adding new patterns

Support

For issues or questions:

  1. Check existing GitHub issues
  2. Review this guide and documentation
  3. Run performance tests to identify specific problems
  4. Create new GitHub issue with details

Last Updated: 2024-11-17
Version: 2.3
Maintainer: Derek Willis