Scrape English RCV lists #1125

Closed
wants to merge 1 commit into from

Conversation

tillprochaska
Collaborator

@tillprochaska tillprochaska commented Mar 16, 2025

  • Set up worker
  • Remove suffix for procedure stage (e.g. "***I")
  • Ensure that the language of the returned document is English, raise DataNotAvailableException otherwise
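
The suffix removal from the list above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `strip_procedure_stage`, the regular expression, and the sample titles are all assumptions. It assumes the marker is one to three asterisks, optionally followed by a Roman numeral for the reading, at the very end of the title.

```python
import re

# Assumed pattern: one to three asterisks, optionally followed by a Roman
# numeral (I, II, III), at the very end of the title.
_STAGE_SUFFIX = re.compile(r"\s*\*{1,3}(?:I{1,3})?\s*$")


def strip_procedure_stage(title: str) -> str:
    """Return the title without a trailing procedure-stage marker."""
    return _STAGE_SUFFIX.sub("", title).strip()


# Made-up sample titles, for illustration only.
print(strip_procedure_stage("Example regulation ***I"))
print(strip_procedure_stage("Example agreement ***"))
print(strip_procedure_stage("Example report"))
```

Anchoring the pattern at the end of the string means asterisks elsewhere in a title are left untouched, and titles without a marker pass through unchanged.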

@tillprochaska tillprochaska force-pushed the rcv-list-english-titles branch 3 times, most recently from aa1de2f to def7cb9 Compare March 23, 2025 08:58
Sometime around 2024, the Parliament stopped including multilingual (French/English/German) titles in the RCV lists. Instead, the French version now includes only the French title. As a result, in most cases the title displayed on HowTheyVote.eu now comes from the OEIL procedure page rather than from the RCV list. That works well in general, as the procedure info is usually available before the vote takes place and most votes have a corresponding OEIL procedure page.

However, in some cases it doesn’t work. This commit fixes that by scraping the English version of the RCV list. We had a scraper for these some time ago, but we were no longer using it, and it doesn’t work with the current structure of the RCV lists. This is basically a new scraper that just reuses the old name.
@tillprochaska tillprochaska force-pushed the rcv-list-english-titles branch from def7cb9 to 7aef87e Compare March 23, 2025 09:30
@tillprochaska
Collaborator Author

tillprochaska commented Apr 4, 2025

Closing this for now. Even when requesting the English version of the RCV lists, the French version is returned if the English translations aren’t yet available. While the root XML tag has a lang attribute, it doesn’t relate to the actual language of the document contents, and there doesn’t seem to be another way to check whether the contents are in English or French (besides running language detection).
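
For reference, the language-detection fallback mentioned above could look something like this crude stopword heuristic. It is only a sketch under stated assumptions, not something this PR implements: the function name `looks_english`, the hand-picked word lists, and the sample titles are all hypothetical, and a real language-detection library would be more reliable.

```python
import re

# Small, hand-picked stopword sets; an assumption for illustration only.
ENGLISH_HINTS = {"the", "of", "and", "on", "for", "in", "to", "with"}
FRENCH_HINTS = {"le", "la", "les", "des", "du", "de", "et", "sur", "pour", "dans"}


def looks_english(titles: list[str]) -> bool:
    """Guess whether a list of titles is English by counting stopwords."""
    words = re.findall(r"[a-zà-ÿ]+", " ".join(titles).lower())
    english = sum(word in ENGLISH_HINTS for word in words)
    french = sum(word in FRENCH_HINTS for word in words)
    # Ties (including empty input) are treated as "not English".
    return english > french


# Made-up sample titles, for illustration only.
print(looks_english(["Report on the implementation of the budget"]))
print(looks_english(["Rapport sur la mise en œuvre du budget"]))
```

A scraper could raise DataNotAvailableException when such a check fails, but short title lists with few stopwords would make the result unreliable, which is part of why this approach is unattractive.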

We’ll soon start scraping the VOT tables, which also include the English titles (among other data). The VOT tables seem to be published with a bit more delay than the English RCV lists, but that seems like a fair trade-off to me.
