Importing and parsing arXiv metadata, full texts, and source texts.
The metadata will be imported into the MaRDI Portal knowledge graph in Wikibase and/or to MediaWiki pages.
Full texts and source texts will be parsed, and formulae will be extracted into the KG or for MathSearch indexing.
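The formula-extraction step could start from something as simple as scanning LaTeX sources for math delimiters. The sketch below is a minimal, hedged illustration (a real pipeline would need a proper TeX parser); the function name and regex coverage are assumptions, not part of the epic.

```python
import re

# Hedged sketch: catches only the common delimiters ($...$, \[...\],
# equation/align environments). Nested or exotic TeX is out of scope.
FORMULA_RE = re.compile(
    r"\\begin\{(equation\*?|align\*?)\}(?P<env>.*?)\\end\{\1\}"
    r"|\\\[(?P<disp>.*?)\\\]"
    r"|\$(?P<inline>[^$]+)\$",
    re.DOTALL,
)

def extract_formulae(tex: str) -> list[str]:
    """Return the raw formula bodies found in a LaTeX string, in order."""
    out = []
    for m in FORMULA_RE.finditer(tex):
        body = m.group("env") or m.group("disp") or m.group("inline")
        out.append(body.strip())
    return out
```

Extracted bodies could then be fed either to the KG import or to the MathSearch indexer.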
The current plan is to harvest metadata with OAI-PMH and obtain the full texts and source texts through S3 buckets: https://arxiv.org/help/bulk_data
This is mostly defined; some steps are still drafts.
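For the OAI-PMH side, the harvester's core job is parsing the XML responses from arXiv's endpoint. A stdlib-only sketch, assuming the standard `oai_dc` metadata format (the field selection here is illustrative, not our final data model):

```python
import xml.etree.ElementTree as ET

# Namespaces used by arXiv's OAI-PMH endpoint (http://export.arxiv.org/oai2)
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_records(xml_text: str) -> list[dict]:
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        meta = rec.find(".//oai_dc:dc", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "title": meta.findtext("dc:title", namespaces=NS),
            "creators": [c.text for c in meta.iterfind("dc:creator", NS)],
        })
    return records

# Abridged sample response, shaped like arXiv's oai_dc output:
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:arXiv.org:2301.00001</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Example</dc:title>
     <dc:creator>Doe, J.</dc:creator>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>"""
recs = parse_records(sample)
```

The prototype would fetch pages via `resumptionToken` until the endpoint stops returning one; that loop is omitted here.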
Epic issues:
Create an OAI-PMH prototype in Python that harvests arXiv metadata (Johannes)
Create an S3 client in Python on Mardi0X that can obtain full-text and source-text data. This involves synchronization with the ZIB admins. (Eloi)
Check the metadata, full-text, and source-text data to define our data model and decide whether caching is necessary (Eloi, Johannes)
(if caching necessary) Create a cache-like database and deploy it to our ecosystem (Johannes)
(if caching necessary) Upgrade the OAI-PMH component to write to the cache (Johannes)
(if caching necessary) Upgrade the full-/source-text component to write to the cache (Eloi)
(if caching necessary) Write a reader component for the cache in Wikibase-Integrator (Eloi)
(to be defined more accurately) Write a document parser for full texts and source texts in Python (Johannes)
(to be defined more accurately) Write an importer component for arXiv metadata to Wikibase (Eloi)
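For the S3 client task above: per the arXiv bulk-data docs, the data sits in the requester-pays bucket `arxiv`, with monthly source tarballs under keys like `src/arXiv_src_YYMM_NNN.tar`. A hedged sketch (the helper names are ours; the download would run on Mardi0X with boto3 and AWS credentials):

```python
def source_tarball_key(yymm: str, seq: int) -> str:
    """Build the S3 key for one monthly source chunk, e.g. ('2301', 1)."""
    return f"src/arXiv_src_{yymm}_{seq:03d}.tar"

def download_chunk(key: str, dest: str) -> None:
    """Download one chunk; needs boto3 and credentials (requester pays)."""
    import boto3  # imported lazily so the key helper stays stdlib-only
    s3 = boto3.client("s3")
    s3.download_file(
        "arxiv", key, dest,
        ExtraArgs={"RequestPayer": "requester"},  # we pay the transfer cost
    )
```

Since the bucket is requester-pays, the cost side is one of the points to clear with the ZIB admins.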
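If the caching steps are confirmed, the cache need not be elaborate: a single SQLite table keyed by arXiv ID, with a flag the Wikibase-Integrator reader can poll, would cover the writer/reader split above. The schema below is purely an assumption for discussion, not a decided data model:

```python
import sqlite3

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database. Schema is a sketch only."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS records (
        arxiv_id TEXT PRIMARY KEY,
        title    TEXT,
        imported INTEGER DEFAULT 0)""")  # set to 1 once written to Wikibase
    return con

def cache_record(con: sqlite3.Connection, arxiv_id: str, title: str) -> None:
    """Writer side: what the OAI-PMH / full-text components would call."""
    con.execute(
        "INSERT OR REPLACE INTO records (arxiv_id, title) VALUES (?, ?)",
        (arxiv_id, title))
    con.commit()

def pending(con: sqlite3.Connection) -> list[tuple]:
    """Reader side: records the Wikibase importer has not yet consumed."""
    return con.execute(
        "SELECT arxiv_id, title FROM records WHERE imported = 0").fetchall()
```

Whether SQLite suffices, or a shared service database is needed in our ecosystem, is exactly the question the "decide if caching is necessary" step should answer.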
Initial questions
Are there any existing mappings or technologies for retrieving and interpreting arXiv data?
Additional Info:
Corresponding Milestones:
Related bugs:
Epic acceptance criteria:
Checklist for this epic: