List of Data Sources batch generator #63

Open
josh-chamberlain opened this issue Apr 2, 2024 · 0 comments

josh-chamberlain commented Apr 2, 2024

Context

This is an alternative to using common_crawler to find a batch of URLs.

Related to #54.

In addition to agency homepages, we could use data sources with record_type="List of Data Sources" to generate candidate URLs, though this is probably more complicated.

Requirements

  • use our API to get data sources with record_type="List of Data Sources"
  • select the most promising-seeming ones, or
  • create an all-purpose crawler that is good at taking a website like that and collecting every URL on the page that might be a data source
    • it can be aggressive and return false positives, because the end goal is to run the results through the identification pipeline
  • generate a batch in Hugging Face
  • run this crawler periodically, maybe monthly
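
To make the "aggressive" crawler idea concrete, here is a minimal sketch of the link-collection step only, using just the Python standard library. It assumes the page HTML has already been fetched; `LinkCollector` and `extract_candidate_urls` are hypothetical names, not existing project code, and nothing here touches our API or Hugging Face.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_candidate_urls(html, page_url):
    """Return absolute http(s) URLs linked from the page, in page order, without repeats."""
    parser = LinkCollector()
    parser.feed(html)
    seen = set()
    candidates = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, etc.
        if absolute == page_url or absolute in seen:
            continue  # skip self-links and repeats within the page
        seen.add(absolute)
        candidates.append(absolute)
    return candidates
```

This deliberately keeps every remaining link, on the assumption stated above that false positives are acceptable because the identification pipeline filters them later.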

Docs

  • What docs should be updated? Link to related docs changes in the PR.

Open questions

We should be mindful of duplicates! After the first time this runs, we're going to get some. In general, when using things like Common Crawl, we should likely avoid running duplicate URLs through the identification pipeline.
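
One possible way to handle duplicates, sketched under assumptions: compare URLs on a normalized key and keep a set of keys from earlier runs. `normalize_url` and `filter_new_urls` are hypothetical names, the normalization rules are a simplification we would need to agree on, and how the set of already-processed URLs is stored between runs is left open.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Build a comparison key: lowercase scheme and host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def filter_new_urls(candidates, already_seen):
    """Return candidates whose key is not in already_seen, adding new keys to already_seen."""
    fresh = []
    for url in candidates:
        key = normalize_url(url)
        if key in already_seen:
            continue  # processed in an earlier run, or repeated in this batch
        already_seen.add(key)
        fresh.append(url)
    return fresh
```

For example, with `seen = {normalize_url("https://example.gov/records")}`, calling `filter_new_urls(["https://Example.gov/records/", "https://example.gov/crime"], seen)` would keep only the second URL.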
