mozzarellm

Analyze gene clusters using Large Language Models (LLMs) to identify biological pathways and prioritize novel genes.

Overview

Mozzarellm is a Python package that leverages Large Language Models to analyze gene clusters for pathway identification and novel gene discovery. It provides a streamlined way to process gene cluster data, classify genes by their pathway involvement, and prioritize candidates for experimental validation.

Features

🧬 Pathway Identification: Automatically identify the dominant biological process or pathway in gene clusters
🔍 Gene Classification: Categorize genes as established pathway members, uncharacterized, or having novel potential roles
🏆 Prioritization: Assign importance scores to novel gene candidates to guide experimental follow-up
🚀 Multi-Provider Support: Use OpenAI (GPT-4/o), Anthropic (Claude), or Google (Gemini) models
📊 Structured Output: Generate JSON and CSV files with detailed analysis at gene and cluster levels

Installation

# From PyPI (once published)
pip install mozzarellm

# From GitHub
pip install git+https://github.com/cheeseman-lab/mozzarellm.git

# For development
git clone https://github.com/cheeseman-lab/mozzarellm.git
cd mozzarellm
pip install -e .

Create a .env file with your API keys:

OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GOOGLE_API_KEY=your_google_key_here

Quick Start

The easiest way to get started is to run the example notebook:

# Navigate to the examples directory
cd examples

# Launch Jupyter notebook
jupyter notebook example_notebook.ipynb

Basic Python Usage

import os
import pandas as pd
from mozzarellm import analyze_gene_clusters, reshape_to_clusters
from mozzarellm.prompts import ROBUST_SCREEN_CONTEXT, ROBUST_CLUSTER_PROMPT
from mozzarellm.configs import DEFAULT_OPENAI_REASONING_CONFIG

# Load sample data
sample_data = pd.read_csv("sample_data.csv")

# Process data into cluster format
cluster_df, gene_features = reshape_to_clusters(
    input_df=sample_data, 
    uniprot_col="uniprot_function", 
    verbose=True
)

# Run analysis
results = analyze_gene_clusters(
    # Input data
    input_df=cluster_df,
    # Model configuration
    model_name="o4-mini",  # or other supported models
    config_dict=DEFAULT_OPENAI_REASONING_CONFIG,
    # Analysis context and prompts
    screen_context=ROBUST_SCREEN_CONTEXT,
    cluster_analysis_prompt=ROBUST_CLUSTER_PROMPT,
    # Gene annotations
    gene_annotations_df=gene_features,
    # Options
    batch_size=1,
    save_outputs=True,  # Set to False to avoid writing files
    outputs_to_generate=["json", "clusters", "flagged_genes"]
)

# Explore results
print(results.keys())
cluster_analysis = results["cluster_df"]
gene_analysis = results["gene_df"]

Command Line (deprecated)

Command line usage is currently deprecated as we migrate to a more flexible API-first approach.

Input Format

Your input should be a CSV/TSV file with gene clusters:

cluster_id	genes	uniprot_function
1	BRCA1;TP53;PTEN;MLH1	DNA repair; tumor suppressor; metabolism
2	MYC;MAX;MYCN;E2F1;RB1	transcription factors; cell cycle control

Required columns:

A column with cluster identifiers (default: cluster_id)
A column with gene symbols (default: genes)
Optionally, a column with UniProt function descriptions (helps with analysis)

Configuration

Mozzarellm comes with default configurations for supported LLM providers:

from mozzarellm.configs import (
    DEFAULT_OPENAI_REASONING_CONFIG,  # For OpenAI models with reasoning
    DEFAULT_OPENAI_CONFIG,            # For standard OpenAI models
    DEFAULT_ANTHROPIC_CONFIG,         # For Anthropic Claude models
    DEFAULT_GEMINI_CONFIG             # For Google Gemini models
)

Available models include:

OpenAI: o4-mini, o3-mini, gpt-4.1, gpt-4o
Anthropic: claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022
Google: gemini-2.5-pro-preview-03-25, gemini-2.5-flash-preview-04-17

Python API

Analyze Gene Clusters

from mozzarellm import analyze_gene_clusters
from mozzarellm.configs import DEFAULT_OPENAI_REASONING_CONFIG

results = analyze_gene_clusters(
    # Input options
    input_df=cluster_df,               # DataFrame with cluster data
    # Model configuration
    model_name="o4-mini",              # Model to use
    config_dict=DEFAULT_OPENAI_REASONING_CONFIG,  # Configuration
    # Analysis context 
    screen_context=SCREEN_CONTEXT,     # Context about the experiment
    cluster_analysis_prompt=PROMPT,    # Custom analysis prompt
    # Gene annotations
    gene_annotations_df=gene_features, # Gene annotations DataFrame
    # Processing options
    batch_size=1,                      # Clusters per API call
    # Output options
    output_file="results/analysis",    # Output path prefix (optional)
    save_outputs=True,                 # Whether to write files
    outputs_to_generate=["json", "clusters", "flagged_genes"]  # Output types
)

Process Gene-Level Data

from mozzarellm import reshape_to_clusters

cluster_df, gene_features = reshape_to_clusters(
    input_df=gene_level_data,          # Gene-level input DataFrame
    gene_col="gene_symbol",            # Gene ID column
    cluster_col="cluster",             # Cluster assignment column
    uniprot_col="uniprot_function",    # Column with gene annotations
    verbose=True                       # Show processing information
)

Output Structure

When save_outputs=True, the analysis produces three types of output files:

JSON (*_clusters.json): Complete analysis with raw LLM responses
Cluster CSV (*_clusters.csv): One row per cluster with pathway assignments and statistics
Gene CSV (*_flagged_genes.csv): One row per gene with detailed rationales and scores

When save_outputs=False, the same data is returned in a dictionary:

{
    'clusters_dict': {...},    # Raw analysis results
    'json_data': {...},        # Processed JSON with metadata
    'cluster_df': pandas.DataFrame(...),  # Cluster-level analysis
    'gene_df': pandas.DataFrame(...)      # Gene-level analysis
}

Examples

See the examples/example_notebook.ipynb file for a complete walkthrough of:

Loading and processing gene data
Running analysis with different models
Exploring and visualizing results

License

MIT License - See LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
data		data
examples		examples
mozzarellm		mozzarellm
results		results
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

mozzarellm

Overview

Features

Installation

Quick Start

Basic Python Usage

Command Line (deprecated)

Input Format

Configuration

Python API

Analyze Gene Clusters

Process Gene-Level Data

Output Structure

Examples

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

cheeseman-lab/mozzarellm

Folders and files

Latest commit

History

Repository files navigation

mozzarellm

Overview

Features

Installation

Quick Start

Basic Python Usage

Command Line (deprecated)

Input Format

Configuration

Python API

Analyze Gene Clusters

Process Gene-Level Data

Output Structure

Examples

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages