Scrape from OpenReview #2567

fnielsen · 2024-12-04T17:57:43Z

Is your feature request related to a problem? Please describe.
Scrape paper data from OpenReview.net, e.g., https://openreview.net/forum?id=aVh9KRZdRk

Describe the solution you'd like
I would like a new module called openreview.py that works both as a module and as a script for getting metadata from the OpenReview.net site into a paper dictionary that can be sent to the qs.py module for formatting in the QuickStatements format. The title, the authors, the date of publication, the OpenReview submission ID, the PDF link, and the license should be extracted if possible. The functionality should be separated: a download function returns the HTML, and the HTML is passed to a second function that returns a dictionary. lxml.etree has been used before for the extraction, and requests for downloading. The HTTP User-Agent is set up via the config module in Scholia.
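A minimal sketch of the download/extract split described above. It assumes the forum page exposes its metadata in `citation_*` meta tags, which should be checked against a real page. The function names, the dictionary keys, and the User-Agent string are hypothetical; in Scholia the User-Agent would come from the config module.

```python
import requests
from lxml import etree

# Assumption: placeholder value; Scholia sets its User-Agent via its config module.
USER_AGENT = "openreview-scrape-sketch/0.1"

# Assumption: hypothetical mapping from paper dictionary keys to meta tag names.
SINGLE_VALUED_FIELDS = [
    ("title", "citation_title"),
    ("date", "citation_online_date"),
    ("full_text_url", "citation_pdf_url"),
]


def paper_url_to_html(url, user_agent=USER_AGENT):
    """Download the HTML of an OpenReview.net forum page."""
    response = requests.get(
        url, headers={"User-Agent": user_agent}, timeout=30)
    response.raise_for_status()
    return response.text


def html_to_paper(html):
    """Extract paper metadata from HTML into a dictionary."""
    tree = etree.HTML(html)

    def meta(name):
        # Return the content of every <meta name="..."> tag with this name.
        return tree.xpath("//meta[@name=$name]/@content", name=name)

    paper = {"authors": [str(author) for author in meta("citation_author")]}
    for key, name in SINGLE_VALUED_FIELDS:
        values = meta(name)
        if values:
            # Only set a field when the page actually provides it.
            paper[key] = str(values[0])
    return paper
```

Keeping the parser free of network calls means it can be tested on a saved HTML string, and fields that are missing from the page are simply left out of the dictionary.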

It is not quite clear if the venue (proceedings) can be matched. For instance "NeurIPS 2024 oral" could be matched to Wikidata's Q130601483 item. But to make this general would be hard.

For the script version the input could be the URL to an OpenReview.net paper or a paper submission ID.
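One way the script could accept either form of input is sketched below. The function name `to_submission_id` is hypothetical, and the sketch assumes the submission ID is carried in the `id` query parameter, as in the example URL above.

```python
from urllib.parse import parse_qs, urlparse


def to_submission_id(url_or_id):
    """Return the submission ID from a forum URL or a bare submission ID."""
    text = url_or_id.strip()
    parsed = urlparse(text)
    if parsed.netloc:
        # Input looks like a URL: read the "id" query parameter.
        ids = parse_qs(parsed.query).get("id")
        if not ids:
            raise ValueError("No 'id' query parameter in {!r}".format(text))
        return ids[0]
    # Otherwise treat the input as a bare submission ID.
    return text
```

For example, both `to_submission_id("https://openreview.net/forum?id=aVh9KRZdRk")` and `to_submission_id("aVh9KRZdRk")` give the same ID, so the rest of the script only has to deal with one form.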

I would like the module and each function to be documented in the numpy docstring style, so that the module can be validated with the pydocstyle program. In Scholia, we have Parameters, Returns, and Examples descriptions in the docstrings. We test the docstrings with tox.
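As an illustration of the requested docstring layout, here is a small sketch with Parameters, Returns, and Examples sections. The function `submission_id_to_url` is hypothetical; the URL pattern follows the example forum link given earlier in this issue.

```python
def submission_id_to_url(submission_id):
    """Return the OpenReview.net forum URL for a submission ID.

    Parameters
    ----------
    submission_id : str
        OpenReview.net submission identifier.

    Returns
    -------
    url : str
        URL of the forum page for the submission.

    Examples
    --------
    >>> submission_id_to_url("aVh9KRZdRk")
    'https://openreview.net/forum?id=aVh9KRZdRk'

    """
    return "https://openreview.net/forum?id=" + submission_id
```

The Examples section is written as a doctest, so it can be executed as part of the test run rather than only read.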

Describe alternatives you've considered
None

Additional context
There is already a scraping module for, e.g., OJS, see https://github.com/WDscholia/scholia/blob/master/scholia/scrape/ojs.py I would like a similar approach for OpenReview.net. Wikidata already has the property OpenReview.net submission ID at https://www.wikidata.org/wiki/Property:P8968. This should be added.
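To show where P8968 would end up, below is a standalone sketch that writes a paper dictionary as QuickStatements (version 1, tab-separated) lines. This is not Scholia's qs.py, whose interface is not reproduced here. The function name and the `title` and `openreview_id` keys are hypothetical, the title is assumed to be in English, and quoting of titles that themselves contain double quotes is not handled.

```python
def paper_to_quickstatements_lines(paper):
    """Return QuickStatements lines for a new item describing the paper."""
    lines = ["CREATE"]
    if "title" in paper:
        # P1476 is "title"; monolingual text, language assumed to be English.
        lines.append('LAST\tP1476\ten:"{}"'.format(paper["title"]))
    if "openreview_id" in paper:
        # P8968 is "OpenReview.net submission ID"; external IDs are strings.
        lines.append('LAST\tP8968\t"{}"'.format(paper["openreview_id"]))
    return lines
```

A call such as `"\n".join(paper_to_quickstatements_lines(paper))` would then give text that can be pasted into the QuickStatements tool.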


fnielsen · 2024-12-04T19:42:34Z

A problem with scraping https://openreview.net/forum?id=0g0X4H8yN4I

Style check Quickstatement fix

fnielsen added the "enhancement" label (some suggestions to improve Scholia) Dec 4, 2024

fnielsen self-assigned this Dec 4, 2024

fnielsen added a commit that referenced this issue Dec 4, 2024

Further fix for #2567

be8bdf7

Style check Quickstatement fix
