Michael Aerni*, Joshua Swanson*, Kristina Nikolić, Florian Tramèr
Official repository for the paper Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
This repository includes all code, training/inference scripts, and results for the paper.
We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating harmful images.
Install the full set of dependencies, including flash-attention, via
uv sync --all-groups
uv pip install --no-build-isolation flash-attn==2.3.4
To exclude dependencies for model training (e.g., to only recreate plots), simply run
uv sync
Running Geneval requires an additional setup step; see Geneval setup.
Certain scripts require environment variables to be set (or additional CLI arguments).
The easiest way is to copy template.env to .env and set all entries.
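As a minimal sketch, assuming DATA_ROOT is the main variable you need (template.env lists the full set of entries):
```bash
# Copy the template and fill in the entries it lists
cp template.env .env
# then edit .env; for example, point DATA_ROOT to where generated datasets should live
```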
This repository is organized as follows:
- configs/: Contains the configuration files and training/inference/grading scripts.
- modal_aphasia/: Root Python package of the project.
- misc/: Contains additional required files and results.
- harmon/: Original Harmon codebase with minor modifications (diffloss.py and mar.py) to make imports work and to control randomness.
- janus/: Original Janus codebase.
We use the original Harmon and Janus code as-is.
We implement custom training and inference scripts,
as well as architecture/model modifications directly in our module
(modal_aphasia.harmon and modal_aphasia.janus).
Results: The real-world experiment results are in misc/real_world_data/posters-*.json; they contain the graded rubric for each poster. These results were verified manually.
Data: Textual poster descriptions generated by GPT-5 are in misc/real_world_data/posters-*.json, and GPT-5-generated poster images are in misc/real_world_data/generated_posters/gpt5/. We do not include the original poster images due to copyright.
Running the scripts: To reproduce the grading pipeline, run the scripts from modal_aphasia/real_world/ in the following order:
1. grade_real_world.py: use an LLM judge to grade the replication/description in an open-ended format.
2. create_rubric.py: create a rubric based on the open-ended grading.
3. deduplicate_rubric.py: merge the text-description rubric and the image-generation rubric into a universal rubric.
4. check_rubric_negative.py: grade the replication/description with respect to the universal rubric.
5. Manually check the gradings.
6. fix_rubric_counts_missing.py: recalculate the scores after the manual check.
Run the scripts as
uv run -m modal_aphasia.real_world.$script_name
Results: The graded results are available in misc/results_faces.zip and misc/results_concepts.zip.
The JSONL files contain the base64-encoded images where applicable.
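As a minimal sketch for inspecting these results (results.jsonl below is a placeholder for one of the unzipped files, and the key name image is an assumption; check the actual field name), one image can be extracted with jq:
```bash
# Hypothetical: decode the image from the first line of a results file.
# The key ".image" is an assumption; inspect the JSONL to find the real field name.
head -n 1 results.jsonl | jq -r '.image' | base64 -d > sample.png
```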
Dataset: To reproduce our results, first generate the datasets.
The faces dataset requires additional steps.
If the DATA_ROOT environment variable is set, simply run the commands below;
otherwise, check the arguments in the scripts.
# Synthetic images
## Janus
uv run -m modal_aphasia.data.generate_synthetic_dataset
## Harmon
uv run -m modal_aphasia.data.generate_synthetic_dataset_hd
# Auxiliary data
uv run -m modal_aphasia.data.generate_aux_t2i_blip_dataset
Running the scripts: The training, inference, and grading scripts are located in configs/$dataset_$model,
where $dataset is faces or concepts and $model is janus or harmon.
Each directory contains three scripts;
all results can be reproduced by running the scripts in the following order:
1. train.sh: Trains models on the dataset.
2. inference.sh: Performs all inference tasks of the trained models.
3. grading.sh: Grades inference results where necessary.
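For example, for the faces dataset with the Janus model (assuming the scripts can be executed directly as shell scripts from their config directory; adapt this if your setup submits them to a scheduler):
```bash
# Hypothetical invocation for one dataset/model combination
cd configs/faces_janus
bash train.sh      # train the models
bash inference.sh  # run all inference tasks
bash grading.sh    # grade the inference results
```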
We use Gemini 2.5 Flash Image (Nano Banana) to generate face images with controlled attributes (hair color, eye color, hair style, and accessories).
Dataset: The generated raw face images are in misc/faces_raw.zip with metadata in misc/face_metadata.json.
To generate the HF dataset, unzip misc/faces_raw.zip into $DATA_ROOT/faces_cache (assuming $DATA_ROOT is set in .env), and run
uv run -m modal_aphasia.data.generate_faces_dataset
This creates the dataset we used in the experiments.
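Concretely, assuming DATA_ROOT is exported in your shell, these two steps could look like (you may need to adjust the unzip target depending on the archive's internal layout):
```bash
# Unpack the raw face images into the expected cache directory, then build the HF dataset
unzip misc/faces_raw.zip -d "${DATA_ROOT}/faces_cache"
uv run -m modal_aphasia.data.generate_faces_dataset
```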
If you want to generate new face images, you can use modal_aphasia/faces/generate_faces.py.
We rely on the following data sources for name and surname generation:
- The names are from the US Social Security Administration's list of baby names: https://www.ssa.gov/OACT/babynames/limits.html
- The surnames are derived from the 2010 census list of surnames that occur at least 100 times: https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
The data is in modal_aphasia/faces/data/first_names/ and modal_aphasia/faces/names/data/surnames.csv. To generate a new set of names and surnames, run
uv run -m modal_aphasia.faces.names.code.sample_names
The final list of names and surnames we used in our dataset is in modal_aphasia/faces/names/output/.
The script modal_aphasia/faces/add_names_to_metadata.py adds these names to the metadata generated by modal_aphasia/faces/generate_faces.py.
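As a rough sketch of the full regeneration pipeline (module-style invocation and argument-free calls are assumptions; check each script for its actual CLI arguments):
```bash
# Hypothetical end-to-end order for regenerating the faces data; arguments omitted
uv run -m modal_aphasia.faces.generate_faces           # generate new face images and metadata
uv run -m modal_aphasia.faces.names.code.sample_names  # sample a new set of names and surnames
uv run -m modal_aphasia.faces.add_names_to_metadata    # attach the sampled names to the metadata
```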
Results: The graded results are available in misc/results_safety.zip,
including the base64-encoded images in the JSONL files.
Dataset: The raw "unsafe" images need to be downloaded manually.
The file misc/safety_images_meta.jsonl contains one row per image with the URL, caption, filename, and SHA-256 hash.
Assuming $DATA_ROOT is set in .env,
download each image to $DATA_ROOT/safety_images_cache/$filename
where $filename is the filename in the misc/safety_images_meta.jsonl file.
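A minimal download-and-verify sketch, assuming the JSONL keys are literally url, filename, and sha256 (check the file for the exact names) and that curl, jq, and sha256sum are available:
```bash
mkdir -p "${DATA_ROOT}/safety_images_cache"
while read -r line; do
  url=$(echo "$line" | jq -r '.url')            # image URL (assumed key name)
  filename=$(echo "$line" | jq -r '.filename')  # target filename (assumed key name)
  expected=$(echo "$line" | jq -r '.sha256')    # expected SHA-256 hash (assumed key name)
  target="${DATA_ROOT}/safety_images_cache/${filename}"
  curl -sSL -o "$target" "$url"
  actual=$(sha256sum "$target" | cut -d' ' -f1)
  [ "$actual" = "$expected" ] || echo "Checksum mismatch: $filename"
done < misc/safety_images_meta.jsonl
```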
Then, run
uv run -m modal_aphasia.data.generate_safety_dataset
which creates an HF dataset in $DATA_ROOT/safety_images.
Running the scripts: Similar to the controlled synthetic experiments,
the training, inference, and grading scripts are located in configs/safety,
and all results can be reproduced by running the scripts in the following order:
1. train.sh: Trains an unsafe and aligned model for three seeds.
2. inference.sh: Performs all inference tasks of the aligned models.
3. grading.sh: Grades inference results where necessary.
Geneval requires very specific dependencies; hence, it must be run in its own virtual environment. The following describes how to set it up.
Let GENEVAL_ROOT be the root directory of the geneval repo
(outside of this repo!).
Run the following commands in GENEVAL_ROOT.
This guide is only for Hopper and newer GPUs!
# Set up repo and virtualenv (uv is only used for python version)
git clone https://github.com/djghosh13/geneval.git .
uv init --bare
vim pyproject.toml # change python version to `requires-python = "==3.8.10"`
# Make sure python is installed in the geneval repo, not the user home
export UV_PYTHON_INSTALL_DIR="${GENEVAL_ROOT}/python_dist/"
uv sync
uv add pip
source .venv/bin/activate
# Install dependencies in a known order
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install open-clip-torch==2.26.1
pip install clip-benchmark
pip install -U openmim
pip install einops
pip install lightning
pip install diffusers transformers
pip install tomli
pip install platformdirs
# Need setuptools AND wheel to build the later packages
# This might yield errors; just ignore them as long as the packages are ultimately installed
pip install -U setuptools wheel
# Install the packages that require manual compilation; this might take a while
## mmcv (and mmengine)
git clone https://github.com/open-mmlab/mmcv.git
pushd mmcv
git checkout 1.x
MMCV_WITH_OPS=1 MMCV_CUDA_ARGS="-arch=sm_90" pip install -v -e .
popd
## mmdet
git clone https://github.com/open-mmlab/mmdetection.git
pushd mmdetection
git checkout 2.x
MMCV_CUDA_ARGS="-arch=sm_90" pip install -v -e . --no-build-isolation
popd
# Download model weights
./evaluation/download_models.sh "./model_weights"
Now, running the geneval grading script is done via
UV_PYTHON_INSTALL_DIR="${GENEVAL_ROOT}/python_dist/" uv run --project "${GENEVAL_ROOT}" \
-m modal_aphasia.evals.grade_geneval \
--input /path/to/input.jsonl \
--output /path/to/output.jsonl