This repository implements an ETHOS-like transformer model for Electronic Health Record (EHR) data, based on the paper "Zero shot health trajectory prediction using transformer" by Renc et al. The implementation provides a complete pipeline for processing OMOP format EHR data, training a transformer model, and performing zero-shot inference.
ETHOS (Enhanced Transformer for Health Outcome Simulation) is a novel application of transformer architecture for analyzing high-dimensional, heterogeneous, and episodic health data. The model processes Patient Health Timelines (PHTs) - detailed, tokenized records of health events - to predict future health trajectories using zero-shot learning.
- Data Processing: Convert OMOP format EHR data to tokenized Patient Health Timelines
- Large Dataset Optimization: Memory management and chunked processing for datasets several GB in size
- Transformer Model: Implementation of ETHOS architecture with learnable positional encodings
- Training Pipeline: Complete training script with validation, checkpointing, and visualization
- Zero-shot Inference: Predict mortality, readmission, SOFA scores, and length of stay without task-specific training
- Timeline Generation: Generate future patient health trajectories
- Comprehensive Analysis: Analyze patient timelines and generate insights
- Clone the repository:
```bash
git clone <repository-url>
cd cursor_transformer
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Create necessary directories:

```bash
mkdir -p processed_data models logs plots
```

Place your OMOP data in a directory with the following structure:
```
your_omop_data/
├── person/
│   ├── part_0.parquet
│   ├── part_1.parquet
│   └── ...
├── visit_occurrence/
│   ├── part_0.parquet
│   └── ...
├── condition_occurrence/
├── drug_exposure/
├── procedure_occurrence/
├── measurement/
├── observation/
└── death/
```
Note: The code expects parquet files organized in subdirectories by table name. Each table subdirectory should contain one or more parquet files.
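A small helper can verify this layout before processing. The sketch below is illustrative only — `OMOP_TABLES` and `find_parquet_parts` are not part of the repository's code:

```python
import tempfile
from pathlib import Path

# OMOP tables the pipeline expects, one subdirectory each (assumed list)
OMOP_TABLES = ["person", "visit_occurrence", "condition_occurrence", "drug_exposure",
               "procedure_occurrence", "measurement", "observation", "death"]

def find_parquet_parts(data_path) -> dict:
    """Map each OMOP table name to its sorted list of parquet part files.

    A missing table directory simply yields an empty list."""
    root = Path(data_path)
    return {table: sorted((root / table).glob("*.parquet")) for table in OMOP_TABLES}

# Example against a throwaway directory containing one 'person' part file
root = Path(tempfile.mkdtemp())
(root / "person").mkdir()
(root / "person" / "part_0.parquet").touch()
parts = find_parquet_parts(root)
```

Checking the returned dictionary for empty lists is an easy way to spot missing or misnamed table subdirectories before a long processing run.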
First, process your OMOP data to create tokenized Patient Health Timelines:
```bash
# Use default path (omop_data/)
python data_processor.py

# Specify custom OMOP data path
python data_processor.py --data_path /path/to/your/omop_data

# Use a dataset tag for isolation (recommended for multiple datasets)
python data_processor.py --data_path /path/to/omop_data --tag aou_2023

# Specify custom output directory
python data_processor.py --data_path /path/to/omop_data --output_dir /path/to/output

# Adjust memory limit for large datasets
python data_processor.py --data_path /path/to/omop_data --memory_limit 16.0

# Force reprocessing (useful for debugging)
python data_processor.py --data_path /path/to/omop_data --tag aou_2023 --force_reprocess
```

Command line options:

- `--data_path`: Path to OMOP data directory (default: `omop_data/`)
- `--tag`: Dataset tag for isolating different datasets (e.g., `aou_2023`, `mimic_iv`, `eicu`)
- `--output_dir`: Output directory for processed data (default: `processed_data/` or `processed_data_{tag}/`)
- `--memory_limit`: Memory limit in GB (default: 8.0)
- `--force_reprocess`: Force reprocessing even if data exists
Dataset Isolation with Tags: The tag system allows you to work with multiple datasets simultaneously:
```bash
# Process All of Us 2023 data
python data_processor.py --data_path ~/omop_data_2023 --tag aou_2023

# Process MIMIC-IV data
python data_processor.py --data_path ~/mimic_iv --tag mimic_iv

# Process eICU data
python data_processor.py --data_path ~/eicu --tag eicu
```

This creates separate directories:

- `processed_data_aou_2023/` - All of Us 2023 processed data
- `processed_data_mimic_iv/` - MIMIC-IV processed data
- `processed_data_eicu/` - eICU processed data
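The directory-naming convention above can be captured in a few lines. This resolver is a sketch of the documented defaults (an explicit `--output_dir` wins, then the tagged directory, then the plain default), not the repository's actual code:

```python
from typing import Optional

def resolve_output_dir(tag: Optional[str] = None, output_dir: Optional[str] = None) -> str:
    """Resolve the processed-data directory from the documented precedence rules."""
    if output_dir:          # explicit --output_dir always wins
        return output_dir
    if tag:                 # tagged datasets are isolated per tag
        return f"processed_data_{tag}"
    return "processed_data" # untagged default
```

For example, `resolve_output_dir(tag="aou_2023")` yields `processed_data_aou_2023`.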
Train the ETHOS transformer model (memory-friendly, multi-GPU enabled):
```bash
# Optional: select multiple GPUs and improve allocator behavior
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python train.py \
    --data_dir processed_data \
    --batch_size 8 \
    --grad_accum_steps 4 \
    --max_seq_len 1024 \
    --use_amp \
    --max_epochs 100 \
    --learning_rate 3e-4 \
    --device cuda

# Tagged dataset
python train.py --tag aou_2023 --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --device cuda

# Custom data directory
python train.py --data_dir processed_data_aou_2023 --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --device cuda
```

Training options:

- `--tag`: Dataset tag to use (automatically finds `processed_data_{tag}/`)
- `--data_dir`: Directory containing processed data (default: `processed_data/`)
- `--batch_size`: Training batch size (default: 32)
- `--grad_accum_steps`: Gradient accumulation steps (default: 1)
- `--max_seq_len`: Max sequence length per sample (default: from config)
- `--use_amp`: Enable mixed precision training
- `--max_epochs`: Maximum training epochs (default: 100)
- `--learning_rate`: Learning rate (default: 3e-4)
- `--device`: Device to use (auto/cuda/cpu, default: auto)
- `--resume`: Resume from checkpoint
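Gradient accumulation is what lets a small `--batch_size` emulate a larger effective batch (here, 8 × 4 = 32). A CPU-friendly sketch of the idea — the tiny model and MSE loss are stand-ins, not the repository's actual training loop:

```python
import torch
import torch.nn as nn

def accumulate_train_step(model, micro_batches, optimizer, grad_accum_steps):
    """One optimizer update accumulated over several micro-batches, so the
    effective batch size is batch_size * grad_accum_steps with no extra memory."""
    optimizer.zero_grad(set_to_none=True)
    total = 0.0
    for x, y in micro_batches:
        # Divide so the accumulated gradient averages over micro-batches
        loss = nn.functional.mse_loss(model(x), y) / grad_accum_steps
        loss.backward()  # gradients accumulate in .grad across calls
        total += loss.item()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()  # a single update after all micro-batches
    return total

torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
loss_val = accumulate_train_step(model, batches, opt, grad_accum_steps=4)
```

With `--use_amp`, the same loop would additionally wrap the forward pass in autocast and scale the loss before `backward()`.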
Tag-based Training:
```bash
# Train on All of Us 2023 data
python train.py --tag aou_2023 --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --max_epochs 200

# Train on MIMIC-IV data
python train.py --tag mimic_iv --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --max_epochs 100

# Models are saved to separate directories:
# - models/aou_2023/
# - models/mimic_iv/
```

Run inference with the trained model:
```bash
# Basic inference
python inference.py --model_path models/best_checkpoint.pth --patient_id 12345

# Inference with tagged dataset
python inference.py --tag aou_2023 --model_path models/aou_2023/best_checkpoint.pth

# Inference with custom data directory
python inference.py --model_path models/best_checkpoint.pth --data_dir processed_data_aou_2023
```

Inference options:

- `--tag`: Dataset tag to use (automatically finds `processed_data_{tag}/`)
- `--model_path`: Path to trained model checkpoint
- `--data_dir`: Directory containing processed data (default: `processed_data/`)
- `--patient_id`: Specific patient ID to analyze (optional)
- `--output_dir`: Directory for inference results (default: `inference_results/` or `inference_results_{tag}/`)
Tag-based Inference:
```bash
# Analyze All of Us 2023 patients
python inference.py --tag aou_2023 --model_path models/aou_2023/best_checkpoint.pth

# Analyze MIMIC-IV patients
python inference.py --tag mimic_iv --model_path models/mimic_iv/best_checkpoint.pth

# Results are saved to separate directories:
# - inference_results_aou_2023/
# - inference_results_mimic_iv/
```

Run the complete workflow example:
```bash
# Basic workflow
python example_workflow.py --data_path ~/omop_data

# Workflow with dataset tag
python example_workflow.py --data_path ~/omop_data_2023 --tag aou_2023

# Workflow with custom memory limit
python example_workflow.py --data_path ~/omop_data --tag aou_2023 --memory_limit 16.0
```

The ETHOS transformer implements:
- Decoder-only architecture with causal masking
- Learnable positional encodings instead of fixed sinusoidal
- Multi-head self-attention with configurable dimensions
- Feed-forward networks with residual connections
- Layer normalization and dropout for regularization
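The learnable positional encodings can be sketched in a few lines of PyTorch. `LearnedPositionalEncoding` below is an illustrative module, not the repository's actual class:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """One trainable vector per position, added to the token embeddings.

    Unlike fixed sinusoidal encodings, these weights are updated by the optimizer."""
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_seq_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)  # broadcast over the batch dimension

# Example: add positions to a batch of token embeddings
emb = torch.zeros(2, 16, 768)
enc = LearnedPositionalEncoding(max_seq_len=2048, d_model=768)
out = enc(emb)
```

Learned positions let the model discover whatever temporal structure the tokenized timelines actually have, rather than assuming the sinusoidal prior.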
Model parameters can be adjusted in `config.py`:

```python
@dataclass
class ModelConfig:
    d_model: int = 768        # Model dimension
    n_heads: int = 12         # Number of attention heads
    n_layers: int = 12        # Number of transformer layers
    d_ff: int = 3072          # Feed-forward dimension
    max_seq_len: int = 2048   # Maximum sequence length
    dropout: float = 0.1      # Dropout rate

@dataclass
class DataConfig:
    chunk_size: int = 10000              # Process data in chunks
    max_patients_per_chunk: int = 5000   # Max patients in memory
    memory_limit_gb: float = 8.0         # Memory limit for processing
```

The implementation uses a sophisticated tokenization approach:
- Event Type Tokens: ADM (admission), DIS (discharge), COND (condition), etc.
- Concept Tokens: Specific medical concepts (ICD codes, ATC codes, etc.)
- Quantile Tokens: Numerical values converted to quantiles (Q1-Q10)
- Time Interval Tokens: Temporal gaps between events (5m, 15m, 1h, 1d, etc.)
- Static Tokens: Patient demographics, age intervals, birth year
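Quantile tokens can be derived by binning each numeric value against a reference distribution. The sketch below assumes a decile scheme matching the Q1-Q10 description above; the function name and reference population are illustrative:

```python
import numpy as np

def value_to_quantile_token(value: float, reference: np.ndarray, n_bins: int = 10) -> str:
    """Map a numeric measurement to a quantile token (Q1..Q10) against a
    reference distribution of the same measurement."""
    # Interior bin edges at the deciles of the reference population (9 edges for 10 bins)
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = int(np.searchsorted(edges, value, side="right"))  # 0..n_bins-1
    return f"Q{bin_idx + 1}"

# Example: a hypothetical reference population of creatinine-like values
rng = np.random.default_rng(0)
reference = rng.normal(1.0, 0.3, size=10_000)
token = value_to_quantile_token(1.0, reference)
```

Binning into quantiles keeps the vocabulary small and makes heterogeneous lab values comparable, at the cost of within-bin resolution.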
The code is specifically optimized for large OMOP datasets:
- Chunked Processing: Data is processed in manageable chunks to control memory usage
- Memory Monitoring: Real-time memory usage tracking with configurable limits
- Garbage Collection: Automatic memory cleanup between processing steps
- Parallel Processing: Support for multiprocessing when available
- Streaming: Processes parquet files without loading entire tables into memory
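The chunked-processing pattern above boils down to iterating over bounded slices of patients and releasing memory between slices. A stdlib-only sketch (helper names hypothetical, the per-chunk work elided):

```python
import gc
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: List[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield fixed-size slices so only one chunk of patients is live at a time."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def process_in_chunks(patient_ids: List[int], chunk_size: int = 5000) -> int:
    """Process patients chunk by chunk, forcing garbage collection between chunks."""
    processed = 0
    for chunk in chunked(patient_ids, chunk_size):
        # ... build timelines for this chunk and write results to disk ...
        processed += len(chunk)
        gc.collect()  # release the chunk's memory before loading the next one
    return processed

total = process_in_chunks(list(range(12_345)), chunk_size=5000)
```

The same idea extends to the parquet layer: reading each table's part files in row batches keeps peak memory proportional to `chunk_size`, not to the table size.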
The training follows the ETHOS methodology:
- Data Preparation: Convert OMOP data to chronological patient timelines
- Tokenization: Transform events into token sequences
- Sequence Modeling: Train transformer to predict next tokens
- Zero-shot Learning: Model learns to generate future health trajectories
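Step 3 is standard causal language modeling: each position predicts the next token in the timeline. A sketch of the loss (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Causal LM loss: the prediction at position t is scored against token t+1."""
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    shifted_logits = logits[:, :-1, :]   # predictions made at positions 0..T-2
    targets = tokens[:, 1:]              # the tokens those positions should predict
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Example with random logits over a 100-token vocabulary
logits = torch.randn(4, 32, 100)
tokens = torch.randint(0, 100, (4, 32))
loss = next_token_loss(logits, tokens)
```

Because the objective is purely next-token prediction, no task labels are needed at training time — which is what makes the zero-shot inference below possible.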
The trained model can perform zero-shot predictions:
- Mortality Prediction: Estimate patient mortality probability
- Readmission Risk: Predict readmission within specified timeframes
- SOFA Score Estimation: Predict Sequential Organ Failure Assessment scores
- Length of Stay: Estimate hospital/ICU length of stay
- Timeline Generation: Generate future patient health trajectories
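In the ETHOS methodology, such probabilities are obtained by sampling many future trajectories and counting the ones containing the outcome token. The sketch below uses a stub generator in place of the trained model; `DEATH_TOKEN` and both function names are hypothetical:

```python
import random

DEATH_TOKEN = "DEATH"  # hypothetical outcome token name

def estimate_mortality(generate_trajectory, timeline, n_samples: int = 20) -> float:
    """Monte Carlo zero-shot estimate: the fraction of sampled future
    trajectories in which a death token appears."""
    deaths = sum(
        DEATH_TOKEN in generate_trajectory(timeline)
        for _ in range(n_samples)
    )
    return deaths / n_samples

def fake_generator(timeline):
    """Stub standing in for the model; real code would sample tokens autoregressively."""
    return ["ADM", "COND", DEATH_TOKEN] if random.random() < 0.3 else ["ADM", "DIS"]

random.seed(0)
prob = estimate_mortality(fake_generator, ["ADM"], n_samples=1000)
```

Readmission risk and length-of-stay estimates follow the same recipe, counting readmission tokens or accumulating time-interval tokens in the sampled trajectories.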
```python
from data_processor import OMOPDataProcessor
from model import create_ethos_model
from inference import ETHOSInference

# 1. Process data with custom path
processor = OMOPDataProcessor(data_path="/path/to/omop_data")
tokenized_timelines, vocab = processor.process_all_data()

# 2. Create and train model
model = create_ethos_model(len(vocab))
# ... training code ...

# 3. Run inference
inference = ETHOSInference('models/best_checkpoint.pth', 'processed_data/vocabulary.pkl')
analysis = inference.analyze_patient_timeline(patient_timeline)
future_timeline = inference.generate_future_timeline(patient_timeline)
```

The pipeline generates several output files:

- `processed_data/`: Tokenized timelines, vocabulary, and mappings
- `models/`: Model checkpoints and weights
- `logs/`: Training logs and metrics
- `plots/`: Training curves and visualizations
- `inference_results/`: Inference results and timeline visualizations
- Memory: Large models may require significant GPU memory
- Batch Size: Adjust based on available memory
- Sequence Length: Longer sequences require more memory and computation
- Data Size: Larger datasets improve model performance but increase training time
- Chunk Size: Adjust chunk size based on available RAM
- Out of Memory: Reduce batch size, sequence length, or chunk size
- Data Loading Errors: Check file paths and parquet file integrity
- Training Divergence: Reduce learning rate or increase gradient clipping
- Slow Training: Use GPU acceleration and optimize data loading
- Use SSD storage for faster data loading
- Enable mixed precision training for faster GPU training
- Use multiple workers for data loading
- Monitor GPU memory usage during training
- Adjust memory limits based on your system
- Start with smaller chunks and increase gradually
- Monitor memory usage during processing
- Use `--memory_limit` to set appropriate limits for your system
- Process data on machines with sufficient RAM
If you use this implementation, please cite the original ETHOS paper:
```bibtex
@article{renc2024zero,
  title={Zero shot health trajectory prediction using transformer},
  author={Renc, Pawel and Jia, Yugang and Samir, Anthony E and others},
  journal={npj Digital Medicine},
  year={2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions and support, please open an issue on the repository or contact the maintainers.
This implementation is based on the ETHOS paper and builds upon the transformer architecture introduced in "Attention Is All You Need" by Vaswani et al.