Paperscape data
This dataset contains reference link data for the arXiv online repository of scientific papers (http://arxiv.org) as extracted by the Paperscape project (http://paperscape.org). It spans from the beginning of the arXiv (1991) to the end of 2013.
The file is CSV-style text delimited by semicolons rather than commas: one line per entry, with 7 fields in the following order:
arxiv-id;comma-separated-arxiv-categories;num-found-refs;num-total-refs;comma-separated-refs;comma-separated-authors;title
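A minimal sketch of parsing one line of this format in Python. The `Entry` type, the `parse_line` helper, and the sample line are hypothetical and not part of the dataset. The sketch assumes that only the title, being the last field, may itself contain semicolons, and that empty list fields should become empty lists.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One record of the data file (field names are illustrative)."""
    arxiv_id: str
    categories: list[str]
    num_found_refs: int
    num_total_refs: int
    refs: list[str]
    authors: list[str]
    title: str


def _split_list(field: str) -> list[str]:
    # An empty field yields an empty list rather than [""].
    return [item.strip() for item in field.split(",") if item.strip()]


def parse_line(line: str) -> Entry:
    # Split on the first 6 semicolons only, so that any semicolon in the
    # title (assumed to be the only free-text field) is preserved.
    parts = line.rstrip("\r\n").split(";", 6)
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    arxiv_id, cats, found, total, refs, authors, title = parts
    return Entry(
        arxiv_id=arxiv_id,
        categories=_split_list(cats),
        num_found_refs=int(found),
        num_total_refs=int(total),
        refs=_split_list(refs),
        authors=_split_list(authors),
        title=title,
    )


# Made-up sample line, for illustration only.
sample = "1301.0001;hep-th,gr-qc;2;5;1201.0002,1101.0003;A. Author,B. Writer;A title; with a semicolon"
entry = parse_line(sample)
print(entry.arxiv_id, entry.categories, entry.title)
```

Author names that themselves contain commas would be split incorrectly by this sketch; check the actual data before relying on it.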
If you use this data please give credit to: Damien P. George and Rob Knegjens, Paperscape, and link to the DOI 10.5281/zenodo.10052. A suggested BibTeX entry:
@misc{paperscape,
author = {Damien P. George and Robert Knegjens},
title = {Paperscape},
howpublished = {\url{http://paperscape.org}},
doi = {10.5281/zenodo.10052},
}
Reference extraction
There are 903,346 entries with a total of 27,361,907 references, of which we were able to match 11,131,443 (40.68%) with an arXiv id. Note that not every reference has a corresponding arXiv id, nor have we managed to identify all references that do. For references that do not cite an arXiv id explicitly, we attempt a journal reference or DOI look-up. The following table summarizes the extraction success per category:
Category | Papers | ArXiv refs | Total refs | Avg. arXiv | Avg. total | Ratio |
---|---|---|---|---|---|---|
hep-lat | 12030 | 246914 | 303572 | 20.52 | 25.23 | 81.34% |
hep-ph | 81847 | 2653420 | 3309781 | 32.42 | 40.44 | 80.17% |
hep-th | 65870 | 1839214 | 2310299 | 27.92 | 35.07 | 79.61% |
hep-ex | 14002 | 256484 | 385942 | 18.32 | 27.56 | 66.46% |
nucl-th | 20379 | 459222 | 718589 | 22.53 | 35.26 | 63.91% |
gr-qc | 30671 | 620316 | 983748 | 20.22 | 32.07 | 63.06% |
nucl-ex | 6625 | 115093 | 203433 | 17.37 | 30.71 | 56.58% |
astro-ph | 153681 | 2654078 | 6803464 | 17.27 | 44.27 | 39.01% |
quant-ph | 44456 | 459549 | 1249300 | 10.34 | 28.10 | 36.78% |
q-alg | 1177 | 5033 | 18699 | 4.28 | 15.89 | 26.92% |
cond-mat | 154699 | 1192443 | 4592914 | 7.71 | 29.69 | 25.96% |
math-ph | 16990 | 109999 | 431454 | 6.47 | 25.39 | 25.49% |
solv-int | 844 | 3754 | 16288 | 4.45 | 19.30 | 23.05% |
atom-ph | 68 | 221 | 1484 | 3.25 | 21.82 | 14.89% |
physics | 51412 | 169195 | 1154258 | 3.29 | 22.45 | 14.66% |
nlin | 10602 | 39028 | 285058 | 3.68 | 26.89 | 13.69% |
acc-phys | 47 | 68 | 516 | 1.45 | 10.98 | 13.18% |
chao-dyn | 1770 | 4282 | 39003 | 2.42 | 22.04 | 10.98% |
math | 161842 | 259361 | 3185027 | 1.60 | 19.68 | 8.14% |
adap-org | 306 | 367 | 4703 | 1.20 | 15.37 | 7.80% |
funct-an | 320 | 325 | 4528 | 1.02 | 14.15 | 7.18% |
alg-geom | 1209 | 899 | 12634 | 0.74 | 10.45 | 7.12% |
dg-ga | 562 | 510 | 7791 | 0.91 | 13.86 | 6.55% |
q-fin | 2815 | 4357 | 66920 | 1.55 | 23.77 | 6.51% |
supr-con | 69 | 202 | 3140 | 2.93 | 45.51 | 6.43% |
patt-sol | 452 | 632 | 9960 | 1.40 | 22.04 | 6.35% |
comp-gas | 140 | 129 | 2057 | 0.92 | 14.69 | 6.27% |
chem-ph | 129 | 204 | 3503 | 1.58 | 27.16 | 5.82% |
mtrl-th | 165 | 217 | 4729 | 1.32 | 28.66 | 4.59% |
plasm-ph | 28 | 13 | 327 | 0.46 | 11.68 | 3.98% |
q-bio | 8014 | 8123 | 214569 | 1.01 | 26.77 | 3.79% |
cs | 53519 | 24195 | 886831 | 0.45 | 16.57 | 2.73% |
stat | 5688 | 3517 | 137536 | 0.62 | 24.18 | 2.56% |
bayes-an | 11 | 2 | 91 | 0.18 | 8.27 | 2.20% |
cmp-lg | 894 | 75 | 9467 | 0.08 | 10.59 | 0.79% |
ao-sci | 13 | 2 | 292 | 0.15 | 22.46 | 0.68% |
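The derived columns in the table appear to follow from the three counts: Avg. arXiv = ArXiv refs / Papers, Avg. total = Total refs / Papers, and Ratio = ArXiv refs / Total refs. The Papers column sums to 903,346, so each entry seems to be counted under exactly one category. The sketch below recomputes such a summary from per-paper counts. The function names are hypothetical, and how a paper is assigned to a single category (for example, by its first listed category) is an assumption left to the caller.

```python
from collections import defaultdict


def derived_columns(papers: int, arxiv_refs: int, total_refs: int):
    """Return (avg_arxiv, avg_total, ratio_percent), rounded to 2 decimals."""
    avg_arxiv = arxiv_refs / papers
    avg_total = total_refs / papers
    # Guard against categories with no references at all.
    ratio = 100 * arxiv_refs / total_refs if total_refs else 0.0
    return round(avg_arxiv, 2), round(avg_total, 2), round(ratio, 2)


def summarize(rows):
    """Aggregate (category, num_found_refs, num_total_refs) tuples.

    Each tuple stands for one paper that the caller has already
    assigned to a single category (an assumption of this sketch).
    """
    counts = defaultdict(lambda: [0, 0, 0])  # papers, arxiv refs, total refs
    for category, found, total in rows:
        acc = counts[category]
        acc[0] += 1
        acc[1] += found
        acc[2] += total
    return {
        category: (papers, found, total, *derived_columns(papers, found, total))
        for category, (papers, found, total) in counts.items()
    }


# Checking the formulas against the hep-lat row of the table.
print(derived_columns(12030, 246914, 303572))
```

Applied to the hep-lat counts, this reproduces the 20.52, 25.23, and 81.34% shown in the table.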
License
This Paperscape data is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/