Harden arXiv retrieval against batch API failures in calculate-and-send #268

Open
reoLantern wants to merge 4 commits into TideDra:main from reoLantern:main


@reoLantern

Contributor

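When the arxiv Python library requested https://export.arxiv.org/api/query?...id_list=..., the API returned HTTP 406, which made the job exit immediately.

This PR implements rate limiting, batching, and retries inside ArxivRetriever, and changes the 406 case to degrade gracefully instead of exiting:

  • Batching: request the arXiv API in batches of 20 paper IDs.
  • Rate limiting: sleep(3) between batches; after degrading to per-paper requests, sleep(1) between papers.
  • Retries:
    • arxiv.Client(num_retries=10, delay_seconds=10) keeps the library's built-in retries;
    • HTTP 429 gets an additional batch-level retry (up to 5 attempts, 30s linear backoff).
  • 406 and other HTTP errors: when a batch request fails, fall back to per-paper requests; IDs that still fail individually are only logged as a warning and skipped, so the workflow no longer fails outright.

I tested this and normal runs are unaffected.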
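The batching and per-paper fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FetchError`, `chunked`, and `fetch_with_fallback` are hypothetical names, `fetch` stands in for whatever call actually hits the arXiv API, and `sleep` is injectable only so the sketch can be exercised without real delays.

```python
import logging
import time
from typing import Callable, Iterator, List, Tuple

logger = logging.getLogger(__name__)


class FetchError(Exception):
    """Hypothetical stand-in for an HTTP error raised by the API client."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_with_fallback(
    ids: List[str],
    fetch: Callable[[List[str]], List[str]],
    batch_size: int = 20,
    batch_pause: float = 3.0,
    single_pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Fetch IDs in batches; on a batch error, retry each ID on its own."""
    results: List[str] = []
    skipped: List[str] = []
    batches = list(chunked(ids, batch_size))
    for batch_index, batch_ids in enumerate(batches):
        try:
            results.extend(fetch(batch_ids))
        except FetchError:
            # Degrade to one request per paper instead of aborting the run.
            for index, paper_id in enumerate(batch_ids):
                try:
                    results.extend(fetch([paper_id]))
                except FetchError as exc:
                    logger.warning("Skipping %s (status %s)", paper_id, exc.status)
                    skipped.append(paper_id)
                if index + 1 < len(batch_ids):
                    sleep(single_pause)
        # Pause between batches, but not after the last one.
        if batch_index + 1 < len(batches):
            sleep(batch_pause)
    return results, skipped
```

Returning the skipped IDs alongside the results is a choice made for this sketch; it lets a caller or a test check which papers were dropped without parsing log output.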

Copilot AI review requested due to automatic review settings June 30, 2026 02:41

Copilot AI left a comment

Contributor


Pull request overview

This PR hardens ArxivRetriever against arXiv batch API failures (notably HTTP 406 / 429) by batching ID lookups, rate-limiting between requests, retrying 429s at the batch level, and degrading to per-paper requests when a batch request fails—skipping unretrievable papers with warnings rather than failing the workflow.

Changes:

  • Batch arXiv API queries in chunks of 20 paper IDs, with inter-batch sleep.
  • Add batch-level retry/backoff for HTTP 429 and per-paper fallback for other HTTP errors (including 406), skipping IDs that still fail.
  • Add a pytest case to validate the batch-error → per-paper fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/zotero_arxiv_daily/retriever/arxiv_retriever.py | Adds batching, throttling, 429 retry/backoff, and per-paper fallback/skip logic for resilient arXiv retrieval. |
| tests/retriever/test_arxiv_retriever.py | Adds a test to ensure batch HTTP errors trigger per-paper fallback and skipped IDs are warned/omitted. |

Comment on lines 142 to 145
try:
    batch = list(client.results(search))
    bar.update(len(batch))
    raw_papers.extend(batch)
Comment on lines +166 to +178
batch = []
for index, paper_id in enumerate(batch_ids):
    try:
        batch.extend(list(client.results(arxiv.Search(id_list=[paper_id]))))
    except arxiv.HTTPError as paper_exc:
        logger.warning(
            f"Skipping arXiv paper {paper_id} due to API error status {paper_exc.status}"
        )
    if index + 1 < len(batch_ids):
        sleep(1)
bar.update(len(batch))
raw_papers.extend(batch)
break
Comment on lines +148 to +160
if exc.status == 429:
    if attempt < max_batch_retries - 1:
        wait = batch_retry_delay * (attempt + 1)
        logger.warning(
            f"arXiv API 429 on batch {i // 20}, "
            f"retry {attempt + 1}/{max_batch_retries} in {wait}s"
        )
        sleep(wait)
        continue
    logger.warning(
        f"arXiv API 429 on batch {i // 20} after {max_batch_retries} retries. "
        "Falling back to per-paper requests."
    )

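The linear backoff in the 429 branch above waits `batch_retry_delay * (attempt + 1)` before each retry. A small sketch of that schedule, assuming the 5 attempts and 30 s delay stated in the PR description; `RateLimited`, `linear_backoff_delays`, and `call_with_retry` are hypothetical names for illustration and do not appear in the PR:

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def linear_backoff_delays(max_retries: int, base_delay: float) -> List[float]:
    """Waits taken before retries; the final attempt is not followed by a wait."""
    return [base_delay * (attempt + 1) for attempt in range(max_retries - 1)]


def call_with_retry(
    request: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `request` on RateLimited, waiting longer after each failure."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimited:
            if attempt < max_retries - 1:
                sleep(base_delay * (attempt + 1))
                continue
            # Out of retries: re-raise so the caller can choose a fallback.
            raise
    raise RuntimeError("max_retries must be at least 1")
```

With 5 attempts and a 30 s base delay the waits are 30, 60, 90, and 120 seconds, so a batch that keeps returning 429 spends 300 seconds waiting before it gives up. In this sketch the final failure is re-raised; the snippet above instead logs a warning and falls back to per-paper requests.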

def test_arxiv_retriever_falls_back_to_per_paper_on_batch_http_error(config, mock_feedparser, monkeypatch):
    monkeypatch.setattr("zotero_arxiv_daily.retriever.base.sleep", lambda _: None)