Harden arXiv retrieval against batch API failures in `calculate-and-send` #268
Open
reoLantern wants to merge 4 commits into
Pull request overview
This PR hardens ArxivRetriever against arXiv batch API failures (notably HTTP 406 / 429) by batching ID lookups, rate-limiting between requests, retrying 429s at the batch level, and degrading to per-paper requests when a batch request fails—skipping unretrievable papers with warnings rather than failing the workflow.
Changes:
- Batch arXiv API queries in chunks of 20 paper IDs, with inter-batch sleep.
- Add batch-level retry/backoff for HTTP 429 and per-paper fallback for other HTTP errors (including 406), skipping IDs that still fail.
- Add a pytest case to validate the batch-error → per-paper fallback behavior.
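The batching step in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `chunk_ids`, `fetch_all`, and the `fetch_batch` callable are hypothetical names, and the 3-second pause is the inter-batch value given in the PR description.

```python
from time import sleep
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 20  # chunk size stated in the PR overview


def chunk_ids(paper_ids: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(paper_ids), size):
        yield list(paper_ids[start:start + size])


def fetch_all(
    paper_ids: Sequence[str],
    fetch_batch: Callable[[List[str]], list],  # hypothetical: one API call per batch
    pause_seconds: float = 3.0,
    sleep_fn: Callable[[float], None] = sleep,
) -> list:
    """Fetch every batch in order, pausing between batches but not after the last."""
    results = []
    batches = list(chunk_ids(paper_ids))
    for index, batch_ids in enumerate(batches):
        results.extend(fetch_batch(batch_ids))
        if index + 1 < len(batches):
            sleep_fn(pause_seconds)
    return results
```

Passing `sleep_fn` as a parameter is only a convenience for testing without real delays; the PR's test patches `sleep` with `monkeypatch` instead.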
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/zotero_arxiv_daily/retriever/arxiv_retriever.py | Adds batching, throttling, 429 retry/backoff, and per-paper fallback/skip logic for resilient arXiv retrieval. |
| tests/retriever/test_arxiv_retriever.py | Adds a test to ensure batch HTTP errors trigger per-paper fallback and skipped IDs are warned/omitted. |
Comment on lines 142 to 145

    try:
        batch = list(client.results(search))
        bar.update(len(batch))
        raw_papers.extend(batch)

Comment on lines +166 to +178

    batch = []
    for index, paper_id in enumerate(batch_ids):
        try:
            batch.extend(list(client.results(arxiv.Search(id_list=[paper_id]))))
        except arxiv.HTTPError as paper_exc:
            logger.warning(
                f"Skipping arXiv paper {paper_id} due to API error status {paper_exc.status}"
            )
        if index + 1 < len(batch_ids):
            sleep(1)
    bar.update(len(batch))
    raw_papers.extend(batch)
    break

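The per-paper fallback in this excerpt can be exercised in isolation with a sketch like the one below. It is an illustration under assumptions: `StubHTTPError` is a hypothetical stand-in for `arxiv.HTTPError` (only its `status` attribute is used), and `fetch_per_paper` and `fetch_one` are illustrative names that do not appear in the PR.

```python
import logging
from time import sleep
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger(__name__)


class StubHTTPError(Exception):
    """Hypothetical stand-in for arxiv.HTTPError; carries only a status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_per_paper(
    batch_ids: Sequence[str],
    fetch_one: Callable[[str], list],  # hypothetical: one request per paper ID
    sleep_fn: Callable[[float], None] = sleep,
) -> Tuple[list, List[str]]:
    """Request each ID on its own; log and skip IDs that still fail."""
    papers, skipped = [], []
    for index, paper_id in enumerate(batch_ids):
        try:
            papers.extend(fetch_one(paper_id))
        except StubHTTPError as exc:
            logger.warning(
                "Skipping arXiv paper %s due to API error status %s", paper_id, exc.status
            )
            skipped.append(paper_id)
        # Throttle between papers, but not after the last one.
        if index + 1 < len(batch_ids):
            sleep_fn(1)
    return papers, skipped
```

Returning the skipped IDs alongside the results is an addition for this sketch; the excerpt above only logs a warning for them.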
Comment on lines +148 to +160

    if exc.status == 429:
        if attempt < max_batch_retries - 1:
            wait = batch_retry_delay * (attempt + 1)
            logger.warning(
                f"arXiv API 429 on batch {i // 20}, "
                f"retry {attempt + 1}/{max_batch_retries} in {wait}s"
            )
            sleep(wait)
            continue
        logger.warning(
            f"arXiv API 429 on batch {i // 20} after {max_batch_retries} retries. "
            "Falling back to per-paper requests."
        )

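To make the linear backoff concrete, the waits implied by `batch_retry_delay * (attempt + 1)` can be listed with a small helper. `backoff_schedule` is a hypothetical name. The defaults of 5 attempts and a 30-second delay are the values given in the PR description. The sketch also assumes the surrounding loop is `for attempt in range(max_batch_retries)`, which the excerpt does not show.

```python
from typing import List


def backoff_schedule(max_batch_retries: int = 5, batch_retry_delay: int = 30) -> List[int]:
    """Waits (in seconds) slept before each retry; the final attempt never sleeps."""
    return [
        batch_retry_delay * (attempt + 1)
        for attempt in range(max_batch_retries)
        if attempt < max_batch_retries - 1  # same guard as in the excerpt
    ]


print(backoff_schedule())       # [30, 60, 90, 120]
print(sum(backoff_schedule()))  # 300
```

Under these assumptions, a batch that keeps returning 429 is attempted five times and sleeps four times, for 300 seconds in total, before falling back to per-paper requests.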
    def test_arxiv_retriever_falls_back_to_per_paper_on_batch_http_error(config, mock_feedparser, monkeypatch):
        monkeypatch.setattr("zotero_arxiv_daily.retriever.base.sleep", lambda _: None)
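The `arxiv` Python library received HTTP 406 when requesting https://export.arxiv.org/api/query?...id_list=..., which made the job exit immediately.
This PR implements "rate limiting + batching + retries" inside `ArxivRetriever` and turns the 406 case into a degradation instead of an exit:
- Request the arXiv API in batches of 20 paper IDs.
- `sleep(3)` between batches; when degraded to per-paper requests, `sleep(1)` between papers.
- `arxiv.Client(num_retries=10, delay_seconds=10)` keeps the library's built-in retries; 429 gets additional batch-level retries (at most 5 attempts, 30 s linear backoff).
Tested locally; normal runs are unaffected.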
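The decision this description implies (retry the batch on 429, degrade on anything else or once retries run out) could look roughly like the sketch below. All names are hypothetical: `ApiError` stands in for `arxiv.HTTPError`, and `fetch_batch` and `fallback` are injected callables rather than the retriever's real methods.

```python
from time import sleep
from typing import Callable, List, Sequence


class ApiError(Exception):
    """Hypothetical stand-in for arxiv.HTTPError; only .status is used."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def retrieve_batch(
    batch_ids: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], list],  # one request for the whole batch
    fallback: Callable[[Sequence[str]], list],     # degraded per-paper path
    max_batch_retries: int = 5,
    batch_retry_delay: int = 30,
    sleep_fn: Callable[[float], None] = sleep,
) -> List:
    """Try the batch request; retry 429 with linear backoff, otherwise degrade."""
    for attempt in range(max_batch_retries):
        try:
            return list(fetch_batch(batch_ids))
        except ApiError as exc:
            if exc.status == 429 and attempt < max_batch_retries - 1:
                sleep_fn(batch_retry_delay * (attempt + 1))
                continue
            break  # non-429 error (e.g. 406), or 429 retries exhausted
    return list(fallback(batch_ids))
```

Keeping the fallback as a separate callable means a 406 on the batch never raises out of this function, which matches the "degrade rather than exit" goal stated above.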