Importing and parsing arXiv metadata, full texts, and source texts.
The metadata will be imported into the MaRDI Portal knowledge graph in Wikibase and/or to MediaWiki pages.
Full texts and source texts will be parsed, and formulae will be extracted into the KG or for MathSearch indexing.
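The formula-extraction step could start from something as simple as scanning LaTeX sources for math delimiters. The sketch below is a minimal, hedged illustration (a real pipeline would need a proper TeX parser); the function name and regex coverage are assumptions, not part of the epic.

```python
import re

# Hedged sketch: catches only the common delimiters ($...$, \[...\],
# equation/align environments). Nested or exotic TeX is out of scope.
FORMULA_RE = re.compile(
    r"\\begin\{(equation\*?|align\*?)\}(?P<env>.*?)\\end\{\1\}"
    r"|\\\[(?P<disp>.*?)\\\]"
    r"|\$(?P<inline>[^$]+)\$",
    re.DOTALL,
)

def extract_formulae(tex: str) -> list[str]:
    """Return the raw formula bodies found in a LaTeX string, in order."""
    out = []
    for m in FORMULA_RE.finditer(tex):
        body = m.group("env") or m.group("disp") or m.group("inline")
        out.append(body.strip())
    return out
```

Extracted bodies could then be fed either to the KG import or to the MathSearch indexer.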
The current plan is to harvest metadata with OAI-PMH and obtain the full texts and source texts through S3 buckets: https://arxiv.org/help/bulk_data
This is mostly defined; some steps are still drafts.
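For the OAI-PMH side, the harvester's core job is parsing the XML responses from arXiv's endpoint. A stdlib-only sketch, assuming the standard `oai_dc` metadata format (the field selection here is illustrative, not our final data model):

```python
import xml.etree.ElementTree as ET

# Namespaces used by arXiv's OAI-PMH endpoint (http://export.arxiv.org/oai2)
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_records(xml_text: str) -> list[dict]:
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        meta = rec.find(".//oai_dc:dc", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "title": meta.findtext("dc:title", namespaces=NS),
            "creators": [c.text for c in meta.iterfind("dc:creator", NS)],
        })
    return records

# Abridged sample response, shaped like arXiv's oai_dc output:
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:arXiv.org:2301.00001</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Example</dc:title>
     <dc:creator>Doe, J.</dc:creator>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>"""
recs = parse_records(sample)
```

The prototype would fetch pages via `resumptionToken` until the endpoint stops returning one; that loop is omitted here.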
Epic issues:
Create an OAI-PMH prototype in Python that harvests arXiv metadata (Johannes)
Create an S3 client in Python on Mardi0X that can obtain full-text and source-text data. This involves synchronization with the ZIB admins. (Eloi)
Check the metadata, full-text, and source-text data to define our data model and decide whether caching is necessary (Eloi, Johannes)
(if caching necessary) Create a cache-like database and deploy it to our ecosystem (Johannes)
(if caching necessary) Upgrade the OAI-PMH component to write to the cache (Johannes)
(if caching necessary) Upgrade the full-/source-text component to write to the cache (Eloi)
(if caching necessary) Write a reader component for the cache in Wikibase-Integrator (Eloi)
(to be defined more accurately) Write a document parser for full texts and source texts in Python (Johannes)
(to be defined more accurately) Write an importer component for arXiv metadata to Wikibase (Eloi)
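For the S3 client task above: per the arXiv bulk-data docs, the data sits in the requester-pays bucket `arxiv`, with monthly source tarballs under keys like `src/arXiv_src_YYMM_NNN.tar`. A hedged sketch (the helper names are ours; the download would run on Mardi0X with boto3 and AWS credentials):

```python
def source_tarball_key(yymm: str, seq: int) -> str:
    """Build the S3 key for one monthly source chunk, e.g. ('2301', 1)."""
    return f"src/arXiv_src_{yymm}_{seq:03d}.tar"

def download_chunk(key: str, dest: str) -> None:
    """Download one chunk; needs boto3 and credentials (requester pays)."""
    import boto3  # imported lazily so the key helper stays stdlib-only
    s3 = boto3.client("s3")
    s3.download_file(
        "arxiv", key, dest,
        ExtraArgs={"RequestPayer": "requester"},  # we pay the transfer cost
    )
```

Since the bucket is requester-pays, the cost side is one of the points to clear with the ZIB admins.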
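If the caching steps are confirmed, the cache need not be elaborate: a single SQLite table keyed by arXiv ID, with a flag the Wikibase-Integrator reader can poll, would cover the writer/reader split above. The schema below is purely an assumption for discussion, not a decided data model:

```python
import sqlite3

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database. Schema is a sketch only."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS records (
        arxiv_id TEXT PRIMARY KEY,
        title    TEXT,
        imported INTEGER DEFAULT 0)""")  # set to 1 once written to Wikibase
    return con

def cache_record(con: sqlite3.Connection, arxiv_id: str, title: str) -> None:
    """Writer side: what the OAI-PMH / full-text components would call."""
    con.execute(
        "INSERT OR REPLACE INTO records (arxiv_id, title) VALUES (?, ?)",
        (arxiv_id, title))
    con.commit()

def pending(con: sqlite3.Connection) -> list[tuple]:
    """Reader side: records the Wikibase importer has not yet consumed."""
    return con.execute(
        "SELECT arxiv_id, title FROM records WHERE imported = 0").fetchall()
```

Whether SQLite suffices, or a shared service database is needed in our ecosystem, is exactly the question the "decide if caching is necessary" step should answer.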
Initial questions
Are there any existing mappings or technologies for retrieving and interpreting arXiv data?
Additional Info:
Corresponding Milestones:
Related bugs:
Epic acceptance criteria:
Checklist for this epic: