Scholarly Content Retrieval System (SCRS)

A Python-based scholarly content retrieval system that takes JSON RPC-style requests to search for and retrieve academic papers, prioritizing PDF links, structured text, and metadata.

Overview

SCRS makes it easy to access scholarly literature programmatically from Python projects. It provides a standardized interface that searches multiple scholarly databases (Google Scholar via SerpAPI, PubMed/NCBI, arXiv, and OpenAIRE) and returns response objects in one consistent format. PDF links are the primary goal; when no PDF is available, a result falls back to structured content or whatever metadata exists.

Key Features

- Search academic publications using natural language queries across multiple sources:
  - Google Scholar (via SerpAPI)
  - PubMed/NCBI E-utilities
  - arXiv API
  - OpenAIRE API
- Automatically resolve PDF links using the Unpaywall API
- Filter by source, date range, journal, and more
- Prioritize results with available PDF links (PDF-first ordering)
- Retrieve detailed document information by ID or DOI
- JSON RPC-style interface for easy integration
- Standardized response format across all sources
- Comprehensive error handling with retry logic
- Rate limiting protection
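
The PDF-first prioritization listed above can be pictured as a stable sort over normalized results. The following is a minimal sketch, not the project's actual implementation: `rank_results` is a hypothetical helper, and it assumes each result carries the `pdf_available` and `full_text_available` fields described under Search Result Format.

```python
def rank_results(results):
    """Order results: PDF first, then full text, then metadata only.

    Hypothetical helper for illustration. sorted() is stable, so results
    in the same tier keep the order in which the source returned them.
    """
    def tier(result):
        if result.get("pdf_available"):
            return 0
        if result.get("full_text_available"):
            return 1
        return 2

    return sorted(results, key=tier)


sample = [
    {"title": "A", "pdf_available": False, "full_text_available": False},
    {"title": "B", "pdf_available": True, "full_text_available": False},
    {"title": "C", "pdf_available": False, "full_text_available": True},
]
print([r["title"] for r in rank_results(sample)])  # ['B', 'C', 'A']
```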

Installation

# Clone the repository
git clone https://github.com/Jamoxidase/SCRS_OaPDF
cd SCRS_OaPDF

# Install dependencies
pip install requests tenacity

Configuration

Some of the services require an API key or a contact email. Set them as environment variables:

# Required for Google Scholar search (via SerpAPI)
export SERP_API_KEY="your_serp_api_key_here"

# Required for PubMed/NCBI E-utilities and Unpaywall
export PUBMED_EMAIL="your_email@example.com"

# Optional for higher rate limits with PubMed/NCBI
export PUBMED_API_KEY="your_pubmed_api_key_here"

# Optional if different from PUBMED_EMAIL
export UNPAYWALL_EMAIL="your_email@example.com"

Where to obtain credentials:

- SerpApi: API key for Google Scholar access
- NCBI: optional PubMed API key (an email is required even without a key)
- Unpaywall: only an email is required
- arXiv and OpenAIRE: no API key is required
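
To make the required/optional split concrete, here is a minimal sketch of how the variables above could be read at startup. `load_config` and the layout of the dictionary it returns are assumptions made for illustration; only the environment variable names come from this README.

```python
import os


def load_config(env=None):
    """Collect SCRS settings from environment variables.

    Hypothetical helper: returns (config, missing), where `missing` lists
    the required variables that are unset or empty.
    """
    env = os.environ if env is None else env
    pubmed_email = env.get("PUBMED_EMAIL")
    config = {
        "serp_api_key": env.get("SERP_API_KEY"),
        "pubmed_email": pubmed_email,
        "pubmed_api_key": env.get("PUBMED_API_KEY"),  # optional
        # Unpaywall falls back to the PubMed email when not set separately.
        "unpaywall_email": env.get("UNPAYWALL_EMAIL") or pubmed_email,
    }
    required = (("SERP_API_KEY", "serp_api_key"), ("PUBMED_EMAIL", "pubmed_email"))
    missing = [name for name, key in required if not config[key]]
    return config, missing
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to test; for example, `load_config({"PUBMED_EMAIL": "me@example.org"})` reports `SERP_API_KEY` as missing and reuses the PubMed email for Unpaywall.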

Usage

Multi-Source Search with PDF Resolution

import os
import json
from scholarly_retrieval import process_scholarly_request

# Set your API keys
os.environ["SERP_API_KEY"] = "your_serp_api_key_here"
os.environ["PUBMED_EMAIL"] = "your_email@example.com"
os.environ["PUBMED_API_KEY"] = "your_pubmed_api_key_here"  # Optional

# Search across multiple sources
search_request = {
    "method": "search",
    "params": {
        "query": "quantum computing",
        "sources": ["google_scholar", "arxiv", "pubmed", "openaire"],
        "year_from": 2020,
        "limit": 5,
        "pdf_only": True,
        "resolve_pdfs": True  # Use Unpaywall to find PDF links
    },
    "id": 1
}

# Process request
result = process_scholarly_request(search_request)
print(json.dumps(result, indent=2))

Source-Specific Search

# Search only in arXiv
arxiv_request = {
    "method": "search",
    "params": {
        "query": "machine learning",
        "sources": ["arxiv"],
        "limit": 5
    },
    "id": 2
}

arxiv_result = process_scholarly_request(arxiv_request)
print(json.dumps(arxiv_result, indent=2))

Get Document by DOI

# Get document details by DOI with PDF resolution
document_request = {
    "method": "get_document",
    "params": {
        "doi": "10.1038/s41746-019-0191-0",  # Example DOI
        "resolve_pdf": True
    },
    "id": 3
}

document_result = process_scholarly_request(document_request)
print(json.dumps(document_result, indent=2))
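
For readers curious about what DOI-based PDF resolution involves, the sketch below shows one plausible shape for the Unpaywall step. It is an illustration, not the code in `scholarly_retrieval.py`: the function names are hypothetical, and the endpoint and the `best_oa_location`, `oa_locations`, and `url_for_pdf` fields reflect Unpaywall's public v2 API as commonly documented, so verify them against the current Unpaywall documentation.

```python
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for a DOI (hypothetical helper)."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?{urlencode({'email': email})}"


def extract_pdf_url(record):
    """Return the first direct PDF link in an Unpaywall record, or None."""
    # Prefer the location Unpaywall marks as best, if it has a PDF link.
    best = record.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    # Otherwise scan the remaining open-access locations in order.
    for location in record.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None
```

The record itself could be fetched with `requests.get(unpaywall_url(doi, email), timeout=10).json()`; the network call is left out of the sketch so that the parsing logic can be tested offline.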

Get Document by ID and Source

# Get document details by result ID
document_request = {
    "method": "get_document",
    "params": {
        "result_id": "result_id_from_search",
        "source": "arxiv"  # Helps route to the correct API
    },
    "id": 4
}

document_result = process_scholarly_request(document_request)
print(json.dumps(document_result, indent=2))

API Reference

JSON RPC Request Format

All requests follow this format:

{
  "method": "search|get_document",
  "params": {
    // Method-specific parameters
  },
  "id": 1 // Optional request ID
}

Search Method

Parameters:

Parameter	Type	Required	Description
query	string	Yes	The search query text
sources	array	No	List of acceptable sources (default: ["google_scholar", "arxiv", "pubmed", "openaire"])
year_from	integer	No	Start year for publication filter
year_to	integer	No	End year for publication filter
journal	string	No	Filter by journal name
limit	integer	No	Number of results (default: 10)
offset	integer	No	Results offset for pagination (default: 0)
pdf_only	boolean	No	Return only results with PDF links (default: false)
full_text_only	boolean	No	Return only results with full text (default: false)
resolve_pdfs	boolean	No	Attempt to resolve PDF links using Unpaywall (default: true)

Example:

{
  "method": "search",
  "params": {
    "query": "machine learning",
    "sources": ["google_scholar"],
    "year_from": 2020,
    "year_to": 2023,
    "journal": "Nature",
    "limit": 10,
    "offset": 0,
    "pdf_only": true,
    "full_text_only": false
  },
  "id": 1
}
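
When requests are built in Python rather than written by hand, a small helper can catch misspelled parameter names before they reach the service. This is a sketch under assumptions: `build_search_request` is a hypothetical convenience function, and the set of allowed names simply mirrors the parameter table above.

```python
import itertools

# Hypothetical module-level counter that hands out request IDs 1, 2, 3, ...
_request_ids = itertools.count(1)

ALLOWED_SEARCH_PARAMS = {
    "sources", "year_from", "year_to", "journal", "limit",
    "offset", "pdf_only", "full_text_only", "resolve_pdfs",
}


def build_search_request(query, **filters):
    """Assemble a search request, sending only the filters that were given."""
    unknown = set(filters) - ALLOWED_SEARCH_PARAMS
    if unknown:
        raise ValueError(f"Unknown search parameters: {sorted(unknown)}")
    params = {"query": query}
    # Omitting None values lets the service apply its documented defaults.
    params.update({k: v for k, v in filters.items() if v is not None})
    return {"method": "search", "params": params, "id": next(_request_ids)}
```

The resulting dictionary can be passed straight to `process_scholarly_request`.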

Get Document Method

Parameters:

Parameter	Type	Required	Description
result_id	string	*	The unique identifier for the result (required if no DOI)
doi	string	*	DOI of the document (can be used instead of result_id)
source	string	No	Source name to help route the request (e.g., "arxiv", "pubmed")
resolve_pdf	boolean	No	Attempt to resolve PDF link using Unpaywall (default: true)

*Either result_id or doi must be provided
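
The "either `result_id` or `doi`" rule, together with the required `query` for searches, lends itself to a pre-flight check on the client side. The sketch below is an assumption-laden illustration: `validate_request` is a hypothetical helper, and although the codes come from the Error Codes table, which code the service actually returns in each case is not confirmed by this README.

```python
KNOWN_METHODS = {"search", "get_document"}


def validate_request(request):
    """Return None if the request looks valid, else a (code, message) tuple."""
    if not isinstance(request, dict) or "method" not in request:
        return (-32600, "Invalid Request")
    method = request["method"]
    if not isinstance(method, str) or method not in KNOWN_METHODS:
        return (-32601, "Method not found")
    params = request.get("params")
    if not isinstance(params, dict):
        return (-32602, "Invalid params")
    if method == "search" and not params.get("query"):
        return (-32602, "Invalid params: 'query' is required")
    if method == "get_document" and not (params.get("result_id") or params.get("doi")):
        return (-32602, "Invalid params: provide 'result_id' or 'doi'")
    return None
```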

Example with Result ID:

{
  "method": "get_document",
  "params": {
    "result_id": "abc123defg",
    "source": "google_scholar"
  },
  "id": 2
}

Example with DOI:

{
  "method": "get_document",
  "params": {
    "doi": "10.1038/s41746-019-0191-0"
  },
  "id": 3
}

Response Format

All responses follow this JSON RPC format:

{
  "jsonrpc": "2.0",
  "result": {
    // Method-specific result
  },
  "id": 1 // Same as request ID
}

For errors:

{
  "jsonrpc": "2.0",
  "error": {
    "code": -32000,
    "message": "Error message"
  },
  "id": 1 // Same as request ID
}
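
Because a response carries either a `result` or an `error` member, callers usually want one place that separates the two. Here is a minimal sketch; `ScholarlyError` and `unwrap_response` are hypothetical names introduced for illustration and are not part of the documented interface.

```python
class ScholarlyError(Exception):
    """Raised when a response carries an "error" member (hypothetical name)."""

    def __init__(self, code, message, request_id=None):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message
        self.request_id = request_id


def unwrap_response(response):
    """Return the "result" payload, or raise ScholarlyError on an error response."""
    error = response.get("error")
    if error is not None:
        raise ScholarlyError(error.get("code"), error.get("message"), response.get("id"))
    return response["result"]
```

With this in place, `unwrap_response(process_scholarly_request(request))` either yields the method-specific result or raises an exception that exposes the error code.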

Search Result Format

{
  "query": "string",
  "total_results": integer,
  "results": [
    {
      "title": "string",
      "authors": ["string"],
      "publication_date": "string",
      "journal": "string",
      "snippet": "string",
      "doi": "string",
      "pdf_available": boolean,
      "pdf_url": "string",
      "full_text_available": boolean,
      "full_text": "string",
      "abstract": "string",
      "citation_count": integer,
      "source": "string",
      "source_url": "string",
      "result_id": "string"
    }
  ],
  "pagination": {
    "current_page": integer,
    "total_pages": integer,
    "has_next": boolean,
    "has_previous": boolean
  }
}
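
The `pagination` block can be related to the `limit` and `offset` request parameters with a little arithmetic. The sketch below shows one consistent way to derive it; the 1-based page numbering and the exact formulas are assumptions, since this README does not state how the service computes these fields.

```python
import math


def pagination_info(total_results, limit=10, offset=0):
    """Derive pagination fields from the total count, page size, and offset."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return {
        "current_page": offset // limit + 1,  # assumed 1-based
        "total_pages": math.ceil(total_results / limit),
        "has_next": offset + limit < total_results,
        "has_previous": offset > 0,
    }


print(pagination_info(25, limit=10, offset=10))
# {'current_page': 2, 'total_pages': 3, 'has_next': True, 'has_previous': True}
```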

Document Result Format

{
  "title": "string",
  "authors": ["string"],
  "publication_date": "string",
  "journal": "string",
  "abstract": "string",
  "doi": "string",
  "pdf_available": boolean,
  "pdf_url": "string",
  "full_text_available": boolean,
  "full_text": "string",
  "citation_count": integer,
  "references": [
    {
      "title": "string",
      "authors": ["string"],
      "publication_date": "string",
      "journal": "string"
    }
  ],
  "source": "string",
  "source_url": "string"
}

Error Codes

Code	Description
-32600	Invalid Request
-32601	Method not found
-32602	Invalid params
-32603	Internal error (configuration)
-32000	Generic server error
-32001	API error (external services)
-32002	Resource not found
-32003	Rate limit exceeded
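
On the client side, the table above can be turned into a lookup that also records which failures are worth retrying. This is a sketch: `classify_error` is a hypothetical helper, and treating only `-32001` and `-32003` as retryable is an assumption, not documented behavior.

```python
ERROR_NAMES = {
    -32600: "Invalid Request",
    -32601: "Method not found",
    -32602: "Invalid params",
    -32603: "Internal error (configuration)",
    -32000: "Generic server error",
    -32001: "API error (external services)",
    -32002: "Resource not found",
    -32003: "Rate limit exceeded",
}

# Assumption: only upstream failures and rate limiting are worth retrying;
# malformed requests and missing resources would fail the same way again.
RETRYABLE_CODES = {-32001, -32003}


def classify_error(response):
    """Return (name, retryable) for an error response, or None on success."""
    error = response.get("error")
    if error is None:
        return None
    code = error.get("code")
    return ERROR_NAMES.get(code, "Unknown error"), code in RETRYABLE_CODES
```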

Current Features

- Multiple scholarly sources:
  - Google Scholar (via SerpAPI)
  - PubMed/NCBI E-utilities
  - arXiv API
  - OpenAIRE API
- PDF resolution via Unpaywall API
- Flexible filtering options
- Result normalization across all sources
- Rate limiting protection with exponential backoff
- LRU caching for API responses
- Comprehensive error handling
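
Exponential backoff and LRU caching combine naturally: retries absorb transient failures, and the cache keeps repeated lookups from touching the network at all. The sketch below illustrates the idea with the standard library only. The installation step lists `tenacity`, which presumably provides the project's real retry logic, so treat `retry_with_backoff` and `flaky_lookup` as hypothetical stand-ins rather than the actual implementation.

```python
import functools
import random
import time


def retry_with_backoff(max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped call on ConnectionError, doubling the wait each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the failure
                    # base_delay * 1, 2, 4, ... plus a little jitter so that
                    # many clients do not retry in lockstep.
                    sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
        return wrapper
    return decorator


calls = {"count": 0}


@functools.lru_cache(maxsize=128)
@retry_with_backoff(max_attempts=3, base_delay=0.0, sleep=lambda seconds: None)
def flaky_lookup(doi):
    """Stand-in for an API call: fails on the first attempt, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    return {"doi": doi}
```

The cache wraps the retrying function, so only successful responses are stored; `lru_cache` does not cache exceptions. Cached dictionaries are shared between callers, so a real implementation would want to return copies or immutable values.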

Future Enhancements

- Add more scholarly sources (CORE, Semantic Scholar, etc.)
- Add persistent caching with Redis or similar
- Implement asynchronous processing for parallel API calls
- Implement citation networks

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
