Is your feature request related to a problem? Please describe.
Scrape paper data from OpenReview.net, e.g., https://openreview.net/forum?id=aVh9KRZdRk
Describe the solution you'd like
I would like a new module called openreview.py that works both as a module and as a script, getting meta-information from the OpenReview.net site into a paper dictionary that can be sent to the qs.py module for formatting in the QuickStatements format. The title, the authors, the date of publication, the OpenReview submission ID, the PDF link, and the license should be extracted if possible. The functionality should be separated, so that there is a download function that returns HTML, and the HTML is sent to a second function that returns a dictionary. lxml.etree has been used before for the extraction, and requests has been used for downloading information. The HTTP User-Agent is set up via the config module in Scholia.

It is not quite clear whether the venue (proceedings) can be matched. For instance, "NeurIPS 2024 oral" could be matched to Wikidata's Q130601483 item, but making this general would be hard.
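The requested separation (one function downloads and returns HTML, a second function turns the HTML into a paper dictionary) could be sketched as below. This is only a sketch: the function names, the citation_* metatag names, and the hard-coded User-Agent are my assumptions, not Scholia's actual API.

```python
"""openreview - sketch of a scraper for OpenReview.net paper metadata."""

import requests
from lxml import etree

# In Scholia, the User-Agent would come from the config module instead.
USER_AGENT = 'Scholia'


def download_paper_html(submission_id):
    """Download the HTML of an OpenReview.net paper page."""
    url = 'https://openreview.net/forum?id=' + submission_id
    response = requests.get(url, headers={'User-Agent': USER_AGENT})
    return response.text


def html_to_paper(html):
    """Convert OpenReview paper HTML to a paper dictionary."""
    tree = etree.HTML(html)
    paper = {}
    # OpenReview pages carry bibliographic <meta> tags; the exact tag
    # names used here (citation_title, etc.) are an assumption and
    # should be checked against a real page. License extraction would
    # need an additional, page-specific XPath.
    for meta in tree.xpath('//meta'):
        name = meta.get('name')
        content = meta.get('content')
        if name == 'citation_title':
            paper['title'] = content
        elif name == 'citation_author':
            paper.setdefault('authors', []).append(content)
        elif name == 'citation_online_date':
            paper['date'] = content
        elif name == 'citation_pdf_url':
            paper['full_text_url'] = content
    return paper
```

The resulting dictionary could then be handed to qs.py for QuickStatements formatting.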
For the script version the input could be the URL to an OpenReview.net paper or a paper submission ID.
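Accepting either a URL or a bare submission ID could be handled by a small helper like the one below; the function name and the assumed ID pattern (alphanumeric plus hyphen and underscore) are my guesses, not a confirmed OpenReview ID format.

```python
import re


def string_to_submission_id(string):
    """Extract an OpenReview submission ID from a URL, or pass an ID through."""
    # Look for an 'id=' query parameter, e.g. in
    # https://openreview.net/forum?id=aVh9KRZdRk
    match = re.search(r'[?&]id=([A-Za-z0-9_-]+)', string)
    if match:
        return match.group(1)
    # Otherwise assume the input is already a submission ID.
    return string
```

With this, the script entry point can treat both input forms uniformly, e.g. `string_to_submission_id('https://openreview.net/forum?id=aVh9KRZdRk')` and `string_to_submission_id('aVh9KRZdRk')` both yield `'aVh9KRZdRk'`.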
I would like the module and each function to be documented in the numpy docstring style, so that the module can be validated with the pydocstyle program. In Scholia, we have Parameters, Returns, and Examples sections in the docstrings. We test the docstrings with tox.
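As an illustration of the numpy docstring style with the three sections mentioned, a function might look like this (the function itself is a made-up example, not proposed API):

```python
def submission_id_to_url(submission_id):
    """Convert OpenReview submission ID to a forum URL.

    Parameters
    ----------
    submission_id : str
        OpenReview.net submission identifier.

    Returns
    -------
    url : str
        URL to the OpenReview.net forum page for the paper.

    Examples
    --------
    >>> submission_id_to_url('aVh9KRZdRk')
    'https://openreview.net/forum?id=aVh9KRZdRk'

    """
    return 'https://openreview.net/forum?id=' + submission_id
```

The Examples section doubles as a doctest, which fits the tox-driven docstring testing.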
Describe alternatives you've considered
None
Additional context
There is already a scraping module for, e.g., OJS, see https://github.com/WDscholia/scholia/blob/master/scholia/scrape/ojs.py. I would like a similar approach for OpenReview.net. Wikidata already has the property "OpenReview.net submission ID" at https://www.wikidata.org/wiki/Property:P8968. This property should be added to the generated statements.