MÍMIR is a generative biology framework designed to "dream" novel peptide binders for specific human proteins. We are not just predicting properties; we are generating de novo biological matter using ESM-3.
- The Engine: We fine-tune ESM-3 using LoRA. We do not train from scratch.
- The Paradigm: Use Masked Language Modeling, not Causal LM.
- Wrong: "Predict the next amino acid."
- Right: "Sculpt the sequence from noise (Parallel Iterative Decoding)."
- The Anchor: Generation is Target-Conditioned. Every sequence starts with a `<TARGET_ID>` token (UniProt Accession), acting as the prompt that steers the model's latent space.
- Package Manager: Strict usage of `uv`.
- Execution: Always run via `uv run scripts/...`.
- Scripts: We prefer standalone scripts in `scripts/` over complex monolithic package logic.
- Data flow: `datasets/` generation -> `data/` (csv) -> `train.py` -> `checkpoints/` -> `sample_peptides.py` -> Generation
- Type Hints: Mandatory on all functions (params + return).
- Docstrings: Google-style, on every public function/class.
- Simplicity: Prefer readable, explicit code over clever abstractions.
- Module: Always use Python's `logging` module, never `print()`.
- Destination: Log to `sys.stdout` via `logging.StreamHandler(sys.stdout)`.
- Format: `"%(asctime)s - %(levelname)s - %(message)s"`.
- Setup: Call `logging.basicConfig(...)` inside `main()` after parsing args, never at module level (prevents third-party library noise at import time).
- Verbose flag: Pass `level=logging.INFO if args.verbose else logging.WARNING` directly to `basicConfig`. Do not use `logging.disable()`.
- Levels: Use `logger.error()` for failures, `logger.warning()` for recoverable issues, `logger.info()` for progress. Never duplicate the same message at two levels.
- Noisy libraries: Silence with `logging.getLogger("httpx").setLevel(logging.WARNING)` etc. at module level.
All scripts follow a consistent pattern for argument handling:
1. Centralized Config
All dataset paths are managed through a single config.json file located at the root of each run directory (e.g., data/run78-v2/config.json). This eliminates repetitive path arguments across scripts.
- Every run directory contains exactly one `config.json` with all dataset paths.
- All scripts that consume or produce dataset paths must accept `--config` instead of individual path arguments.
- Use `load_config()` from `mimir.config`: `from mimir.config import load_config`, then `config = load_config(args.config)`. Access paths via `config.features_fingerprints`, `config.binders_merged`, etc.
- Forbidden: Individual path arguments like `--fingerprints-lmdb`, `--binders-lmdb`, `--associations-csv` are not allowed.
- Exceptions: Training outputs (checkpoints, logs) and test fixtures remain CLI args: `--checkpoint-dir`, `--resume-from`, `-o`.
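The real `load_config()` lives in `mimir.config` and is not reproduced here. As a rough illustration of the idea only, the stand-in below (the hypothetical `load_config_sketch`) reads a `config.json` and exposes each entry as a `Path` attribute. The two key names come from the text above; the path values and the flat JSON layout are assumptions.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_config_sketch(config_path: Path) -> SimpleNamespace:
    """Read config.json and expose each entry as a Path attribute (assumed flat layout)."""
    with config_path.open(encoding="utf-8") as handle:
        raw = json.load(handle)
    return SimpleNamespace(**{key: Path(value) for key, value in raw.items()})


def demo() -> SimpleNamespace:
    """Write a hypothetical config.json into a temporary run directory and load it."""
    with tempfile.TemporaryDirectory() as run_dir:
        config_path = Path(run_dir) / "config.json"
        config_path.write_text(
            json.dumps(
                {
                    # Hypothetical values; only the key names appear in the guidelines.
                    "features_fingerprints": "features/fingerprints.lmdb",
                    "binders_merged": "binders/merged.lmdb",
                }
            ),
            encoding="utf-8",
        )
        return load_config_sketch(config_path)
```

With one file per run directory, a script needs a single `--config` argument and reads every dataset path from the loaded object.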
2. Argument Conventions
- Naming: kebab-case (`--min-length`, `--num-workers`).
- Input/Output: Always `required=True`, never default paths. Use `-o` shorthand for `--output`.
- Verbose: `-v`/`--verbose`, `action="store_true"`.
- Defaults: Document in help string (e.g. `"default: 4"`).
3. Function Design
- Execution pattern: `main()` contains only argparse, `basicConfig`, input validation, and a call to the business logic function.
- Required params first: `output: Path` before optional params like `min_len: int = 4`.
- Match CLI names: CLI `--min-length` → function param `min_len`.
- String encoding: Use `utf-8` for all `.encode()` calls (LMDB keys, hashing, etc.).
- 3-Layer Pattern: Async scripts must follow a 3-layer pattern: `async def _run(...)` for async logic, a public sync wrapper calling `asyncio.run(_run(...))`, and `main()` for CLI parsing.
- LMDB: Key encoding must be `utf-8`. Define map size as a module-level constant `LMDB_MAP_SIZE`. Read-only opens use `readonly=True, lock=False`.
- Input validation: Scripts with file path inputs must validate existence early in `main()` using `logger.error()` + `sys.exit(1)`.
- Multiprocessing: Use `spawn` context. Worker globals in `_UPPER_SNAKE`. Set `torch.set_num_threads(1)` in worker init.
- Warnings: Suppress third-party warnings at module level with `warnings.filterwarnings()`, never inline inside functions.
- Order: stdlib → third-party → local (`from mimir...`, `from scripts...`).
- No inline imports: All imports at module top level.
- Style: Prefer `from X import Y` over `import X as Y`.
- Section markers: Use `# ---...---` dividers with section names (e.g. `# Constants`, `# Main`).
- No redundant comments: Don't comment what the code already says.
- No old-style type comments: Use proper type hints, not `# type: dict`.
- Trailing whitespace: No trailing blank lines at end of file.