This repository implements an ETHOS-like transformer model for Electronic Health Record (EHR) data, based on the paper "Zero shot health trajectory prediction using transformer" by Renc et al. The implementation provides a complete pipeline for processing OMOP format EHR data, training a transformer model, and performing zero-shot inference.
ETHOS (Enhanced Transformer for Health Outcome Simulation) is a novel application of transformer architecture for analyzing high-dimensional, heterogeneous, and episodic health data. The model processes Patient Health Timelines (PHTs) - detailed, tokenized records of health events - to predict future health trajectories using zero-shot learning.
- Data Processing: Convert OMOP format EHR data to tokenized Patient Health Timelines
- Large Dataset Optimization: Memory management and chunked processing for datasets several GB in size
- Transformer Model: Implementation of ETHOS architecture with learnable positional encodings
- Training Pipeline: Complete training script with validation, checkpointing, and visualization
- Zero-shot Inference: Predict mortality, readmission, SOFA scores, and length of stay without task-specific training
- Timeline Generation: Generate future patient health trajectories
- Comprehensive Analysis: Analyze patient timelines and generate insights
- Clone the repository:
```bash
git clone <repository-url>
cd cursor_transformer
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Create necessary directories:

```bash
mkdir -p processed_data models logs plots
```

Place your OMOP data in a directory with the following structure:
```
your_omop_data/
├── person/
│   ├── part_0.parquet
│   ├── part_1.parquet
│   └── ...
├── visit_occurrence/
│   ├── part_0.parquet
│   └── ...
├── condition_occurrence/
├── drug_exposure/
├── procedure_occurrence/
├── measurement/
├── observation/
└── death/
```
Note: The code expects parquet files organized in subdirectories by table name. Each table subdirectory should contain one or more parquet files.
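A small helper can verify this layout before processing. The sketch below is illustrative only — `OMOP_TABLES` and `find_parquet_parts` are not part of the repository's code:

```python
import tempfile
from pathlib import Path

# OMOP tables the pipeline expects, one subdirectory each (assumed list)
OMOP_TABLES = ["person", "visit_occurrence", "condition_occurrence", "drug_exposure",
               "procedure_occurrence", "measurement", "observation", "death"]

def find_parquet_parts(data_path) -> dict:
    """Map each OMOP table name to its sorted list of parquet part files.

    A missing table directory simply yields an empty list."""
    root = Path(data_path)
    return {table: sorted((root / table).glob("*.parquet")) for table in OMOP_TABLES}

# Example against a throwaway directory containing one 'person' part file
root = Path(tempfile.mkdtemp())
(root / "person").mkdir()
(root / "person" / "part_0.parquet").touch()
parts = find_parquet_parts(root)
```

Checking the returned dictionary for empty lists is an easy way to spot missing or misnamed table subdirectories before a long processing run.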
First, process your OMOP data to create tokenized Patient Health Timelines:
```bash
# Use default path (omop_data/)
python data_processor.py

# Specify custom OMOP data path
python data_processor.py --data_path /path/to/your/omop_data

# Use a dataset tag for isolation (recommended for multiple datasets)
python data_processor.py --data_path /path/to/omop_data --tag aou_2023

# Specify custom output directory
python data_processor.py --data_path /path/to/omop_data --output_dir /path/to/output

# Adjust memory limit for large datasets
python data_processor.py --data_path /path/to/omop_data --memory_limit 16.0

# Force reprocessing (useful for debugging)
python data_processor.py --data_path /path/to/omop_data --tag aou_2023 --force_reprocess
```

Command line options:

- `--data_path`: Path to OMOP data directory (default: `omop_data/`)
- `--tag`: Dataset tag for isolating different datasets (e.g., `aou_2023`, `mimic_iv`, `eicu`)
- `--output_dir`: Output directory for processed data (default: `processed_data/` or `processed_data_{tag}/`)
- `--memory_limit`: Memory limit in GB (default: 8.0)
- `--force_reprocess`: Force reprocessing even if data exists
Dataset Isolation with Tags: The tag system allows you to work with multiple datasets simultaneously:
```bash
# Process All of Us 2023 data
python data_processor.py --data_path ~/omop_data_2023 --tag aou_2023

# Process MIMIC-IV data
python data_processor.py --data_path ~/mimic_iv --tag mimic_iv

# Process eICU data
python data_processor.py --data_path ~/eicu --tag eicu
```

This creates separate directories:

- `processed_data_aou_2023/` - All of Us 2023 processed data
- `processed_data_mimic_iv/` - MIMIC-IV processed data
- `processed_data_eicu/` - eICU processed data
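The directory-naming convention above can be captured in a few lines. This resolver is a sketch of the documented defaults (an explicit `--output_dir` wins, then the tagged directory, then the plain default), not the repository's actual code:

```python
from typing import Optional

def resolve_output_dir(tag: Optional[str] = None, output_dir: Optional[str] = None) -> str:
    """Resolve the processed-data directory from the documented precedence rules."""
    if output_dir:          # explicit --output_dir always wins
        return output_dir
    if tag:                 # tagged datasets are isolated per tag
        return f"processed_data_{tag}"
    return "processed_data" # untagged default
```

For example, `resolve_output_dir(tag="aou_2023")` yields `processed_data_aou_2023`.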
Train the ETHOS transformer model (memory-friendly, multi-GPU enabled):
```bash
# Optional: select multiple GPUs and improve allocator behavior
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python train.py \
    --data_dir processed_data \
    --batch_size 8 \
    --grad_accum_steps 4 \
    --max_seq_len 1024 \
    --use_amp \
    --max_epochs 100 \
    --learning_rate 3e-4 \
    --device cuda

# Tagged dataset
python train.py --tag aou_2023 --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --device cuda

# Custom data directory
python train.py --data_dir processed_data_aou_2023 --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --device cuda
```

Training options:

- `--tag`: Dataset tag to use (automatically finds `processed_data_{tag}/`)
- `--data_dir`: Directory containing processed data (default: `processed_data/`)
- `--batch_size`: Training batch size (default: 32)
- `--grad_accum_steps`: Gradient accumulation steps (default: 1)
- `--max_seq_len`: Max sequence length per sample (default: from config)
- `--use_amp`: Enable mixed precision training
- `--max_epochs`: Maximum training epochs (default: 100)
- `--learning_rate`: Learning rate (default: 3e-4)
- `--device`: Device to use (auto/cuda/cpu, default: auto)
- `--resume`: Resume from checkpoint
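Gradient accumulation is what lets a small `--batch_size` emulate a larger effective batch (here, 8 × 4 = 32). A CPU-friendly sketch of the idea — the tiny model and MSE loss are stand-ins, not the repository's actual training loop:

```python
import torch
import torch.nn as nn

def accumulate_train_step(model, micro_batches, optimizer, grad_accum_steps):
    """One optimizer update accumulated over several micro-batches, so the
    effective batch size is batch_size * grad_accum_steps with no extra memory."""
    optimizer.zero_grad(set_to_none=True)
    total = 0.0
    for x, y in micro_batches:
        # Divide so the accumulated gradient averages over micro-batches
        loss = nn.functional.mse_loss(model(x), y) / grad_accum_steps
        loss.backward()  # gradients accumulate in .grad across calls
        total += loss.item()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()  # a single update after all micro-batches
    return total

torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
loss_val = accumulate_train_step(model, batches, opt, grad_accum_steps=4)
```

With `--use_amp`, the same loop would additionally wrap the forward pass in autocast and scale the loss before `backward()`.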
Tag-based Training:
```bash
# Train on All of Us 2023 data
python train.py --tag aou_2023 --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --max_epochs 200

# Train on MIMIC-IV data
python train.py --tag mimic_iv --batch_size 8 --grad_accum_steps 4 --max_seq_len 1024 --use_amp --max_epochs 100

# Models are saved to separate directories:
# - models/aou_2023/
# - models/mimic_iv/
```

Run inference with the trained model:
```bash
# Basic inference
python inference.py --model_path models/best_checkpoint.pth --patient_id 12345

# Inference with tagged dataset
python inference.py --tag aou_2023 --model_path models/aou_2023/best_checkpoint.pth

# Inference with custom data directory
python inference.py --model_path models/best_checkpoint.pth --data_dir processed_data_aou_2023
```

Inference options:

- `--tag`: Dataset tag to use (automatically finds `processed_data_{tag}/`)
- `--model_path`: Path to trained model checkpoint
- `--data_dir`: Directory containing processed data (default: `processed_data/`)
- `--patient_id`: Specific patient ID to analyze (optional)
- `--output_dir`: Directory for inference results (default: `inference_results/` or `inference_results_{tag}/`)
Tag-based Inference:
```bash
# Analyze All of Us 2023 patients
python inference.py --tag aou_2023 --model_path models/aou_2023/best_checkpoint.pth

# Analyze MIMIC-IV patients
python inference.py --tag mimic_iv --model_path models/mimic_iv/best_checkpoint.pth

# Results are saved to separate directories:
# - inference_results_aou_2023/
# - inference_results_mimic_iv/
```

Run the complete workflow example:
```bash
# Basic workflow
python example_workflow.py --data_path ~/omop_data

# Workflow with dataset tag
python example_workflow.py --data_path ~/omop_data_2023 --tag aou_2023

# Workflow with custom memory limit
python example_workflow.py --data_path ~/omop_data --tag aou_2023 --memory_limit 16.0
```

The ETHOS transformer implements:
- Decoder-only architecture with causal masking
- Learnable positional encodings instead of fixed sinusoidal
- Multi-head self-attention with configurable dimensions
- Feed-forward networks with residual connections
- Layer normalization and dropout for regularization
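The learnable positional encodings can be sketched in a few lines of PyTorch. `LearnedPositionalEncoding` below is an illustrative module, not the repository's actual class:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """One trainable vector per position, added to the token embeddings.

    Unlike fixed sinusoidal encodings, these weights are updated by the optimizer."""
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_seq_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)  # broadcast over the batch dimension

# Example: add positions to a batch of token embeddings
emb = torch.zeros(2, 16, 768)
enc = LearnedPositionalEncoding(max_seq_len=2048, d_model=768)
out = enc(emb)
```

Learned positions let the model discover whatever temporal structure the tokenized timelines actually have, rather than assuming the sinusoidal prior.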
Model parameters can be adjusted in `config.py`:

```python
@dataclass
class ModelConfig:
    d_model: int = 768        # Model dimension
    n_heads: int = 12         # Number of attention heads
    n_layers: int = 12        # Number of transformer layers
    d_ff: int = 3072          # Feed-forward dimension
    max_seq_len: int = 2048   # Maximum sequence length
    dropout: float = 0.1      # Dropout rate

@dataclass
class DataConfig:
    chunk_size: int = 10000              # Process data in chunks
    max_patients_per_chunk: int = 5000   # Max patients in memory
    memory_limit_gb: float = 8.0         # Memory limit for processing
```

The implementation uses a sophisticated tokenization approach:
- Event Type Tokens: ADM (admission), DIS (discharge), COND (condition), etc.
- Concept Tokens: Specific medical concepts (ICD codes, ATC codes, etc.)
- Quantile Tokens: Numerical values converted to quantiles (Q1-Q10)
- Time Interval Tokens: Temporal gaps between events (5m, 15m, 1h, 1d, etc.)
- Static Tokens: Patient demographics, age intervals, birth year
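Quantile tokens can be derived by binning each numeric value against a reference distribution. The sketch below assumes a decile scheme matching the Q1-Q10 description above; the function name and reference population are illustrative:

```python
import numpy as np

def value_to_quantile_token(value: float, reference: np.ndarray, n_bins: int = 10) -> str:
    """Map a numeric measurement to a quantile token (Q1..Q10) against a
    reference distribution of the same measurement."""
    # Interior bin edges at the deciles of the reference population (9 edges for 10 bins)
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = int(np.searchsorted(edges, value, side="right"))  # 0..n_bins-1
    return f"Q{bin_idx + 1}"

# Example: a hypothetical reference population of creatinine-like values
rng = np.random.default_rng(0)
reference = rng.normal(1.0, 0.3, size=10_000)
token = value_to_quantile_token(1.0, reference)
```

Binning into quantiles keeps the vocabulary small and makes heterogeneous lab values comparable, at the cost of within-bin resolution.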
The code is specifically optimized for large OMOP datasets:
- Chunked Processing: Data is processed in manageable chunks to control memory usage
- Memory Monitoring: Real-time memory usage tracking with configurable limits
- Garbage Collection: Automatic memory cleanup between processing steps
- Parallel Processing: Support for multiprocessing when available
- Streaming: Processes parquet files without loading entire tables into memory
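The chunked-processing pattern above boils down to iterating over bounded slices of patients and releasing memory between slices. A stdlib-only sketch (helper names hypothetical, the per-chunk work elided):

```python
import gc
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: List[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield fixed-size slices so only one chunk of patients is live at a time."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def process_in_chunks(patient_ids: List[int], chunk_size: int = 5000) -> int:
    """Process patients chunk by chunk, forcing garbage collection between chunks."""
    processed = 0
    for chunk in chunked(patient_ids, chunk_size):
        # ... build timelines for this chunk and write results to disk ...
        processed += len(chunk)
        gc.collect()  # release the chunk's memory before loading the next one
    return processed

total = process_in_chunks(list(range(12_345)), chunk_size=5000)
```

The same idea extends to the parquet layer: reading each table's part files in row batches keeps peak memory proportional to `chunk_size`, not to the table size.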
The training follows the ETHOS methodology:
- Data Preparation: Convert OMOP data to chronological patient timelines
- Tokenization: Transform events into token sequences
- Sequence Modeling: Train transformer to predict next tokens
- Zero-shot Learning: Model learns to generate future health trajectories
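Step 3 is standard causal language modeling: each position predicts the next token in the timeline. A sketch of the loss (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Causal LM loss: the prediction at position t is scored against token t+1."""
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    shifted_logits = logits[:, :-1, :]   # predictions made at positions 0..T-2
    targets = tokens[:, 1:]              # the tokens those positions should predict
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Example with random logits over a 100-token vocabulary
logits = torch.randn(4, 32, 100)
tokens = torch.randint(0, 100, (4, 32))
loss = next_token_loss(logits, tokens)
```

Because the objective is purely next-token prediction, no task labels are needed at training time — which is what makes the zero-shot inference below possible.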
The trained model can perform zero-shot predictions:
- Mortality Prediction: Estimate patient mortality probability
- Readmission Risk: Predict readmission within specified timeframes
- SOFA Score Estimation: Predict Sequential Organ Failure Assessment scores
- Length of Stay: Estimate hospital/ICU length of stay
- Timeline Generation: Generate future patient health trajectories
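In the ETHOS methodology, such probabilities are obtained by sampling many future trajectories and counting the ones containing the outcome token. The sketch below uses a stub generator in place of the trained model; `DEATH_TOKEN` and both function names are hypothetical:

```python
import random

DEATH_TOKEN = "DEATH"  # hypothetical outcome token name

def estimate_mortality(generate_trajectory, timeline, n_samples: int = 20) -> float:
    """Monte Carlo zero-shot estimate: the fraction of sampled future
    trajectories in which a death token appears."""
    deaths = sum(
        DEATH_TOKEN in generate_trajectory(timeline)
        for _ in range(n_samples)
    )
    return deaths / n_samples

def fake_generator(timeline):
    """Stub standing in for the model; real code would sample tokens autoregressively."""
    return ["ADM", "COND", DEATH_TOKEN] if random.random() < 0.3 else ["ADM", "DIS"]

random.seed(0)
prob = estimate_mortality(fake_generator, ["ADM"], n_samples=1000)
```

Readmission risk and length-of-stay estimates follow the same recipe, counting readmission tokens or accumulating time-interval tokens in the sampled trajectories.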
```python
from data_processor import OMOPDataProcessor
from model import create_ethos_model
from inference import ETHOSInference

# 1. Process data with custom path
processor = OMOPDataProcessor(data_path="/path/to/omop_data")
tokenized_timelines, vocab = processor.process_all_data()

# 2. Create and train model
model = create_ethos_model(len(vocab))
# ... training code ...

# 3. Run inference
inference = ETHOSInference('models/best_checkpoint.pth', 'processed_data/vocabulary.pkl')
analysis = inference.analyze_patient_timeline(patient_timeline)
future_timeline = inference.generate_future_timeline(patient_timeline)
```

The pipeline generates several output files:

- `processed_data/`: Tokenized timelines, vocabulary, and mappings
- `models/`: Model checkpoints and weights
- `logs/`: Training logs and metrics
- `plots/`: Training curves and visualizations
- `inference_results/`: Inference results and timeline visualizations
- Memory: Large models may require significant GPU memory
- Batch Size: Adjust based on available memory
- Sequence Length: Longer sequences require more memory and computation
- Data Size: Larger datasets improve model performance but increase training time
- Chunk Size: Adjust chunk size based on available RAM
- Out of Memory: Reduce batch size, sequence length, or chunk size
- Data Loading Errors: Check file paths and parquet file integrity
- Training Divergence: Reduce learning rate or increase gradient clipping
- Slow Training: Use GPU acceleration and optimize data loading
- Use SSD storage for faster data loading
- Enable mixed precision training for faster GPU training
- Use multiple workers for data loading
- Monitor GPU memory usage during training
- Adjust memory limits based on your system
- Start with smaller chunks and increase gradually
- Monitor memory usage during processing
- Use `--memory_limit` to set appropriate limits for your system
- Process data on machines with sufficient RAM
If you use this implementation, please cite the original ETHOS paper:
```bibtex
@article{renc2024zero,
  title={Zero shot health trajectory prediction using transformer},
  author={Renc, Pawel and Jia, Yugang and Samir, Anthony E and others},
  journal={npj Digital Medicine},
  year={2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions and support, please open an issue on the repository or contact the maintainers.
This implementation is based on the ETHOS paper and builds upon the transformer architecture introduced in "Attention Is All You Need" by Vaswani et al.