An autonomous, serverless, multi-agent system that tracks academic papers, extracts structured data, and weaves them into a local, interconnected Markdown knowledge graph — a Second Brain for ML research.
Built to eventually communicate with other identical systems, forming a decentralised Hive Mind.
┌────────────────────────────────────────────┐
│                  Triggers                  │
└─────────────────────┬──────────────────────┘
                      │
         ┌────────────▼────────────┐
         │    Federation Agent     │ ← consumes external public_feed.json feeds
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │         Watcher         │ ← queries ArXiv API by keyword
         └────────────┬────────────┘
                      │ RawPaper[]
         ┌────────────▼────────────┐
         │      Router (Skill      │ ← routes each paper to a domain skill
         │        Registry)        │   (NLP, Vision, TimeSeries, …)
         └────────────┬────────────┘
                      │ Skill
         ┌────────────▼────────────┐
         │         Analyst         │ ← pydantic-ai structured extraction
         │      (pydantic-ai)      │   with taxonomy injection
         └────────────┬────────────┘
                      │ PaperAnalysis
         ┌────────────▼────────────┐
         │      Vault Writer       │ ← writes .md to tmp_vault/
         │                         │   generates concept stubs
         │                         │   updates public_feed.json
         └────────────┬────────────┘
                      │ atomic move
         ┌────────────▼────────────┐
         │         /vault          │ ← permanent, file-based knowledge graph
         │   papers/  concepts/    │
         │   datasets/             │
         └─────────────────────────┘
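The `tmp_vault → vault` staging step in the diagram can be sketched in a few lines (a hypothetical helper for illustration, not the actual `vault_manager.py`):

```python
import os
from pathlib import Path

def publish(tmp_vault: str, vault: str) -> None:
    """Move staged .md files from tmp_vault into the permanent vault.

    os.replace is an atomic rename within a single filesystem, so a
    reader of the vault never observes a half-written note.
    """
    tmp, dst = Path(tmp_vault), Path(vault)
    for staged in tmp.rglob("*.md"):
        target = dst / staged.relative_to(tmp)
        target.parent.mkdir(parents=True, exist_ok=True)
        os.replace(staged, target)  # atomic move, same filesystem
```

The point of the pattern is that all slow, failure-prone work (LLM calls, PDF downloads) happens against `tmp_vault/`; only fully written notes are renamed into `vault/`.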
research-cruise/
├── .github/
│   └── workflows/
│       └── autonomous-tracker.yml   # CI/CD pipeline
├── vault/
│   ├── papers/                      # One .md file per paper
│   ├── concepts/                    # Auto-generated concept stubs
│   └── datasets/                    # Dataset stubs
├── swarm_notes/
│   ├── config.py                    # Configuration & env vars
│   ├── vault_manager.py             # Staging pattern (tmp_vault → vault)
│   ├── watcher.py                   # Configurable paper-source watcher
│   ├── router.py                    # Skill registry router
│   ├── analyst.py                   # pydantic-ai extraction agent
│   ├── vault_writer.py              # Markdown writer + public_feed.json
│   ├── federation.py                # Hive Mind federation agent
│   └── main.py                      # Pipeline orchestrator
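Given this module layout, `main.py` presumably chains the agents linearly. A hedged sketch of that orchestration (the dataclasses and function names here are illustrative, not the project's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class RawPaper:          # shape implied by the pipeline diagram
    title: str
    abstract: str

@dataclass
class PaperAnalysis:     # structured output of the analyst step
    title: str
    summary: str
    tags: list[str] = field(default_factory=list)

def run_pipeline(fetch, route, analyse, write) -> int:
    """Chain the agents: watcher -> router -> analyst -> vault writer."""
    processed = 0
    for paper in fetch():                 # watcher: RawPaper[]
        skill = route(paper)              # router: pick a domain skill
        analysis = analyse(paper, skill)  # analyst: structured extraction
        write(analysis)                   # vault writer: emit Markdown
        processed += 1
    return processed
```

Passing the stages in as callables keeps each agent independently testable, which matches the one-module-per-stage layout above.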
- Python 3.11+
- An LLM API key
# Install dependencies
uv sync
# Set your API key in .env file
export LLM_API_KEY="sk-..."
export PAPER_SOURCE="semantic_scholar"
export SEMANTIC_SCHOLAR_API_KEY="..."
# prepare configs in configs/ folder
...
# Run the pipeline
python -m swarm_notes.main

Use the example in the configs/ folder to create your own version.
Since May 2025 biorxiv.org is protected by Cloudflare, which blocks direct PDF
downloads. When paper_source is biorxiv or medrxiv and enable_domain_expert
is true, the pipeline uses the
paperscraper library to fall back to the
biorxiv TDM (Text & Data Mining) API, which serves PDFs from an AWS S3 bucket.
This requires an AWS IAM key with read-only S3 access:
- Log in to the AWS IAM console.
- Create a new user (or access key for an existing user).
- Attach the `AmazonS3ReadOnlyAccess` managed policy.
- Generate an Access Key ID and Secret Access Key.
- Add them to your `.env` file (see `.env.example`):

AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...

Without these credentials the domain-expert full-text step will be silently skipped for biorxiv/medrxiv papers.
For CI/CD, add AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as repository secrets
(same place as LLM_API_KEY).
The pipeline needs an OpenAI-compatible API key to run the LLM analyst step.
- Open your forked repository on GitHub.
- Go to Settings → Secrets and variables → Actions.
- Click New repository secret.
- Set Name to `LLM_API_KEY` and Secret to your API key (e.g. `sk-...`).
- Click Add secret.
Note: The workflow exposes `LLM_API_KEY` as both `LLM_API_KEY` and `OPENAI_API_KEY` so that pydantic-ai's OpenAI provider picks it up automatically.
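In `autonomous-tracker.yml` that mapping might look like the following (a hypothetical snippet, not the shipped workflow):

```yaml
env:
  LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.LLM_API_KEY }}  # same secret, second name for pydantic-ai
```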
Every successful run updates public_feed.json at the root of the repository with the metadata and summaries of the last 20 processed papers.
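The feed schema is defined by `vault_writer.py`; a plausible shape, with every field name here an assumption for illustration:

```json
{
  "agent": "alice",
  "updated_at": "2024-01-15T06:00:00Z",
  "papers": [
    {
      "title": "Attention Is All You Need",
      "arxiv_id": "1706.03762",
      "url": "https://arxiv.org/abs/1706.03762",
      "summary": "Introduces the Transformer architecture."
    }
  ]
}
```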
To subscribe to another agent's feed, pass their raw public_feed.json URL:
export FEDERATION_FEEDS="https://raw.githubusercontent.com/alice/research-cruise/main/public_feed.json,https://raw.githubusercontent.com/bob/research-cruise/main/public_feed.json"
python -m swarm_notes.main

Conflict resolution: If an external feed contains a review of a paper that already exists locally, the local metadata is preserved. The external summary is appended under a `### External Perspectives` section:
### External Perspectives
> "Transformers are over-engineered for this dataset." - @Agent_alice
> *(Retrieved 2024-01-15)*

Each paper note uses hybrid YAML frontmatter (CSL-compatible fields + custom fields):
---
# CSL-compatible fields
title: "Attention Is All You Need"
author:
  - literal: "Ashish Vaswani"
issued:
  date-parts:
    - [2017, 6, 12]
url: "https://arxiv.org/abs/1706.03762"
# Custom fields
arxiv_id: "1706.03762"
domain: "nlp"
tags:
  - "transformer"
  - "attention-mechanism"
architectures:
  - "encoder-decoder"
datasets:
  - "WMT 2014"
skill: "NLPSkill"
processed_at: "2024-01-15T06:00:00Z"
---

Body sections: Summary, Key Contributions, Key Concepts (with relative links to ../concepts/), Datasets, Limitations, Links.
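Because each note's frontmatter sits between `---` fences, downstream tools can separate metadata from body with stdlib Python alone (a hedged sketch, not part of the project):

```python
def split_frontmatter(note: str) -> tuple[str, str]:
    """Split a vault note into (yaml_frontmatter, markdown_body).

    Notes open with a '---' line and close the frontmatter with
    another '---' line, as in the example above.
    """
    if not note.startswith("---\n"):
        return "", note                    # no frontmatter present
    head, _, body = note[4:].partition("\n---")
    return head, body.lstrip("\n")
```

The returned YAML string can then be handed to any YAML parser, or to a CSL-aware tool for the bibliographic fields.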
`taxonomy.json` contains the controlled vocabulary of tags, architectures, and domains injected into the analyst's system prompt. This curbs LLM hallucination of novel terms and keeps metadata consistent across notes. Edit `taxonomy.json` to add new terms.
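A minimal example of the shape such a file might take (the keys here are assumptions based on the fields mentioned above; check the shipped `taxonomy.json` for the real schema):

```json
{
  "domains": ["nlp", "vision", "time_series"],
  "tags": ["transformer", "attention-mechanism", "diffusion"],
  "architectures": ["encoder-decoder", "decoder-only", "unet"]
}
```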
MIT — see LICENSE.