Scrape from OpenReview #2567

Open
fnielsen opened this issue Dec 4, 2024 · 1 comment
Labels: enhancement (some suggestions to improve Scholia)

Comments

@fnielsen
Collaborator

fnielsen commented Dec 4, 2024

Is your feature request related to a problem? Please describe.
Scrape paper data from OpenReview.net, e.g., https://openreview.net/forum?id=aVh9KRZdRk

Describe the solution you'd like
I would like a new module called openreview.py that works both as a module and as a script. It should get metainformation from the OpenReview.net site into a paper dictionary that can be sent to the qs.py module for formatting in the QuickStatements format. The title, the authors, the date of publication, the OpenReview submission ID, the PDF link, and the license should be extracted if possible. The functionality should be separated: a download function returns the HTML, and the HTML is passed to a function that returns a dictionary. lxml.etree has been used before for the extraction, and requests has been used for downloading. The HTTP User-Agent is set up via the config module in Scholia.
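A minimal sketch of the download/parse separation described above. It is not the Scholia implementation. To stay dependency-free it swaps in the standard library (`urllib.request` and `html.parser`) for requests and lxml.etree. The names `download_html`, `html_to_paper`, `USER_AGENT`, and the dictionary keys are hypothetical, and it assumes the page exposes Google Scholar style `citation_*` meta tags, which this issue does not confirm.

```python
"""Sketch: separate download and parsing for an OpenReview.net scraper."""
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical placeholder; Scholia would take this from its config module.
USER_AGENT = "scholia-sketch/0.0"


def download_html(url, user_agent=USER_AGENT):
    """Return the HTML of the page at `url` as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class _MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.pairs.append((name, content))


def html_to_paper(html):
    """Return a paper dictionary built from citation_* meta tags."""
    collector = _MetaCollector()
    collector.feed(html)
    paper = {"authors": []}
    for name, content in collector.pairs:
        if name == "citation_title":
            paper["title"] = content
        elif name == "citation_author":
            paper["authors"].append(content)
        elif name in ("citation_online_date", "citation_publication_date"):
            paper["date"] = content
        elif name == "citation_pdf_url":
            # Key name is an assumption, not taken from qs.py.
            paper["full_text_url"] = content
    return paper
```

Because the parser takes only a string, it can be tested on saved HTML without network access; a script would call `html_to_paper(download_html(url))`. The license and the submission ID are not covered by this sketch.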

It is not quite clear if the venue (proceedings) can be matched. For instance, "NeurIPS 2024 oral" could be matched to Wikidata's Q130601483 item, but making this general would be hard.

For the script version the input could be the URL to an OpenReview.net paper or a paper submission ID.
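One way the script could accept either form of input is to normalize it to a submission ID first. This is only a sketch: the function names are hypothetical, and it assumes the ID is carried in the `id` query parameter, as in the example URL in this issue.

```python
"""Sketch: accept an OpenReview.net URL or a bare submission ID."""
from urllib.parse import parse_qs, urlencode, urlparse


def to_submission_id(url_or_id):
    """Return the submission ID from a forum URL or a bare ID."""
    text = url_or_id.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        # Assumption: the ID is the 'id' query parameter of the URL.
        values = parse_qs(parsed.query).get("id")
        if not values:
            raise ValueError("No 'id' query parameter in URL: " + text)
        return values[0]
    # Anything that is not an http(s) URL is treated as a bare ID.
    return text


def to_forum_url(submission_id):
    """Return the forum URL for a submission ID."""
    return "https://openreview.net/forum?" + urlencode({"id": submission_id})
```

With this, `to_submission_id("https://openreview.net/forum?id=aVh9KRZdRk")` and `to_submission_id("aVh9KRZdRk")` give the same ID, so the rest of the script only has to handle one case.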

I would like the module and each function to be documented in the numpy docstring style, so that the module can be validated with the pydocstyle program. In Scholia, we have Parameters, Returns, and Examples descriptions in the docstring. We test the docstrings with tox.
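For illustration, a function documented in that style might look like the sketch below. The helper `submission_id_to_pdf_url` is hypothetical, and the PDF URL pattern is an assumption about OpenReview.net, not something stated in this issue.

```python
"""Sketch of a numpy-style docstring for an OpenReview helper."""


def submission_id_to_pdf_url(submission_id):
    """Return the PDF URL for an OpenReview submission.

    Parameters
    ----------
    submission_id : str
        OpenReview.net submission identifier.

    Returns
    -------
    url : str
        URL of the PDF for the submission.

    Examples
    --------
    >>> submission_id_to_pdf_url('aVh9KRZdRk')
    'https://openreview.net/pdf?id=aVh9KRZdRk'

    """
    # Assumed URL pattern; adjust if the site uses a different one.
    return "https://openreview.net/pdf?id=" + submission_id
```

The Examples section doubles as a doctest, so it needs no network access to run.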

Describe alternatives you've considered
None

Additional context
There is already a scraping module for, e.g., OJS; see https://github.com/WDscholia/scholia/blob/master/scholia/scrape/ojs.py. I would like a similar approach for OpenReview.net. Wikidata already has the property OpenReview.net submission ID at https://www.wikidata.org/wiki/Property:P8968. This property should be added.
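As a rough illustration of what adding P8968 could mean for the output, the sketch below formats one statement in the tab-separated QuickStatements V1 syntax, where string values are double-quoted. The function `openreview_id_statement` is hypothetical and does not reproduce how qs.py builds its output.

```python
"""Sketch: a QuickStatements V1 line for the OpenReview.net submission ID."""


def openreview_id_statement(submission_id, subject="LAST"):
    """Return a QuickStatements V1 line setting P8968 on `subject`.

    `subject` is either a Q identifier or LAST, which refers to the
    item created by a preceding CREATE command.
    """
    return '{}\tP8968\t"{}"'.format(subject, submission_id)


if __name__ == "__main__":
    # Create a new item and attach the submission ID to it.
    print("CREATE")
    print(openreview_id_statement("aVh9KRZdRk"))
```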

fnielsen added the enhancement label on Dec 4, 2024
fnielsen self-assigned this on Dec 4, 2024
@fnielsen
Collaborator Author

fnielsen commented Dec 4, 2024

A problem with scraping https://openreview.net/forum?id=0g0X4H8yN4I

fnielsen added a commit that referenced this issue Dec 4, 2024
Style check
Quickstatement fix