Michael Aerni*, Joshua Swanson*, Kristina Nikolić, Florian Tramèr
Official repository for the paper Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
This repository includes all code, training/inference scripts, and results for the paper.
We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating harmful images.
Install the full set of dependencies, including flash-attention, via
uv sync --all-groups
uv pip install --no-build-isolation flash-attn==2.3.4
To exclude dependencies for model training (e.g., to only recreate plots), simply run
uv sync
Running Geneval requires an additional setup step; see Geneval setup.
Certain scripts require environment variables to be set (or additional CLI arguments).
The easiest way is to copy template.env to .env and set all entries.
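As a minimal sketch, assuming DATA_ROOT is the main variable you need (template.env lists the full set of entries):
```bash
# Copy the template and fill in the entries it lists
cp template.env .env
# then edit .env; for example, point DATA_ROOT to where generated datasets should live
```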
This repository is organized as follows:
- configs/: Contains the configuration files and training/inference/grading scripts.
- modal_aphasia/: Root Python package of the project.
- misc/: Contains additional required files and results.
- harmon/: Original Harmon codebase with minor modifications (diffloss.py and mar.py) to make imports work and to control randomness.
- janus/: Original Janus codebase.
We use the original Harmon and Janus code as-is.
We implement custom training and inference scripts,
as well as architecture/model modifications directly in our module
(modal_aphasia.harmon and modal_aphasia.janus).
Results: The real-world experiment results are in misc/real_world_data/posters-*.json; they contain the graded rubric for each poster. These results were verified manually.
Data: Textual poster descriptions generated by GPT-5 are in misc/real_world_data/posters-*.json, and GPT-5-generated poster images are in misc/real_world_data/generated_posters/gpt5/. We do not include the original poster images due to copyright.
Running the scripts: To reproduce the grading pipeline, run the scripts from modal_aphasia/real_world/ in the following order:
1. grade_real_world.py: use an LLM judge to grade the replication/description in an open-ended format.
2. create_rubric.py: create a rubric based on the open-ended grading.
3. deduplicate_rubric.py: merge the text-description rubric and the image-generation rubric into a universal rubric.
4. check_rubric_negative.py: grade the replication/description with respect to the universal rubric.
5. Manually check the gradings.
6. fix_rubric_counts_missing.py: recalculate the scores after the manual check.
Run the scripts as
uv run -m modal_aphasia.real_world.$script_name
Results: The graded results are available in misc/results_faces.zip and misc/results_concepts.zip.
The JSONL files contain the base64-encoded images where applicable.
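As a minimal sketch for inspecting these results (results.jsonl below is a placeholder for one of the unzipped files, and the key name image is an assumption; check the actual field name), one image can be extracted with jq:
```bash
# Hypothetical: decode the image from the first line of a results file.
# The key ".image" is an assumption; inspect the JSONL to find the real field name.
head -n 1 results.jsonl | jq -r '.image' | base64 -d > sample.png
```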
Dataset: To reproduce our results, first generate the datasets.
The faces dataset requires additional steps.
If the DATA_ROOT environment variable is set, simply run the commands below;
otherwise, check the arguments in the scripts.
# Synthetic images
## Janus
uv run -m modal_aphasia.data.generate_synthetic_dataset
## Harmon
uv run -m modal_aphasia.data.generate_synthetic_dataset_hd
# Auxiliary data
uv run -m modal_aphasia.data.generate_aux_t2i_blip_dataset
Running the scripts: The training, inference, and grading scripts are located in configs/$dataset_$model,
where $dataset is faces or concepts and $model is janus or harmon.
Each directory contains three scripts;
all results can be reproduced by running the scripts in the following order:
1. train.sh: Trains models on the dataset.
2. inference.sh: Performs all inference tasks of the trained models.
3. grading.sh: Grades inference results where necessary.
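For example, for the faces dataset with the Janus model (assuming the scripts can be executed directly as shell scripts from their config directory; adapt this if your setup submits them to a scheduler):
```bash
# Hypothetical invocation for one dataset/model combination
cd configs/faces_janus
bash train.sh      # train the models
bash inference.sh  # run all inference tasks
bash grading.sh    # grade the inference results
```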
We use Gemini 2.5 Flash Image (Nano Banana) to generate face images with controlled attributes (hair color, eye color, hair style, and accessories).
Dataset: The generated raw face images are in misc/faces_raw.zip with metadata in misc/face_metadata.json.
To generate the HF dataset, unzip misc/faces_raw.zip into $DATA_ROOT/faces_cache (assuming $DATA_ROOT is set in .env), and run
uv run -m modal_aphasia.data.generate_faces_dataset
This creates the dataset we used in the experiments.
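Concretely, assuming DATA_ROOT is exported in your shell, these two steps could look like (you may need to adjust the unzip target depending on the archive's internal layout):
```bash
# Unpack the raw face images into the expected cache directory, then build the HF dataset
unzip misc/faces_raw.zip -d "${DATA_ROOT}/faces_cache"
uv run -m modal_aphasia.data.generate_faces_dataset
```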
If you want to generate new face images, you can use modal_aphasia/faces/generate_faces.py.
We rely on the following data sources for name and surname generation:
- The names are from the US Social Security Administration's list of baby names: https://www.ssa.gov/OACT/babynames/limits.html
- The surnames are derived from the 2010 census list of surnames that occur at least 100 times: https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
The data is in modal_aphasia/faces/data/first_names/ and modal_aphasia/faces/names/data/surnames.csv. To generate a new set of names and surnames, run
uv run -m modal_aphasia.faces.names.code.sample_names
The final list of names and surnames we used in our dataset is in modal_aphasia/faces/names/output/.
The script modal_aphasia/faces/add_names_to_metadata.py adds these names to the metadata generated by modal_aphasia/faces/generate_faces.py.
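As a rough sketch of the full regeneration pipeline (module-style invocation and argument-free calls are assumptions; check each script for its actual CLI arguments):
```bash
# Hypothetical end-to-end order for regenerating the faces data; arguments omitted
uv run -m modal_aphasia.faces.generate_faces           # generate new face images and metadata
uv run -m modal_aphasia.faces.names.code.sample_names  # sample a new set of names and surnames
uv run -m modal_aphasia.faces.add_names_to_metadata    # attach the sampled names to the metadata
```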
Results: The graded results are available in misc/results_safety.zip,
including the base64-encoded images in the JSONL files.
Dataset: The raw "unsafe" images need to be downloaded manually.
The file misc/safety_images_meta.jsonl contains one row per image with the URL, caption, filename, and SHA-256 hash.
Assuming $DATA_ROOT is set in .env,
download each image to $DATA_ROOT/safety_images_cache/$filename
where $filename is the filename in the misc/safety_images_meta.jsonl file.
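A minimal download-and-verify sketch, assuming the JSONL keys are literally url, filename, and sha256 (check the file for the exact names) and that curl, jq, and sha256sum are available:
```bash
mkdir -p "${DATA_ROOT}/safety_images_cache"
while read -r line; do
  url=$(echo "$line" | jq -r '.url')            # image URL (assumed key name)
  filename=$(echo "$line" | jq -r '.filename')  # target filename (assumed key name)
  expected=$(echo "$line" | jq -r '.sha256')    # expected SHA-256 hash (assumed key name)
  target="${DATA_ROOT}/safety_images_cache/${filename}"
  curl -sSL -o "$target" "$url"
  actual=$(sha256sum "$target" | cut -d' ' -f1)
  [ "$actual" = "$expected" ] || echo "Checksum mismatch: $filename"
done < misc/safety_images_meta.jsonl
```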
Then, run
uv run -m modal_aphasia.data.generate_safety_dataset
which creates an HF dataset in $DATA_ROOT/safety_images.
Running the scripts: Similar to the controlled synthetic experiments,
the training, inference, and grading scripts are located in configs/safety,
and all results can be reproduced by running the scripts in the following order:
1. train.sh: Trains an unsafe and aligned model for three seeds.
2. inference.sh: Performs all inference tasks of the aligned models.
3. grading.sh: Grades inference results where necessary.
Geneval requires very specific dependencies; hence, it must be run in its own virtual environment. The following describes how to set it up.
Let GENEVAL_ROOT be the root directory of the geneval repo
(outside of this repo!).
Run the following commands in GENEVAL_ROOT.
This guide is only for Hopper and newer GPUs!
# Set up repo and virtualenv (uv is only used for python version)
git clone https://github.com/djghosh13/geneval.git .
uv init --bare
vim pyproject.toml # change python version to `requires-python = "==3.8.10"`
# Make sure python is installed in the geneval repo, not the user home
export UV_PYTHON_INSTALL_DIR="${GENEVAL_ROOT}/python_dist/"
uv sync
uv add pip
source .venv/bin/activate
# Install dependencies in a known order
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install open-clip-torch==2.26.1
pip install clip-benchmark
pip install -U openmim
pip install einops
pip install lightning
pip install diffusers transformers
pip install tomli
pip install platformdirs
# Need setuptools AND wheel to build the later packages
# This might yield errors; just ignore them as long as the packages are ultimately installed
pip install -U setuptools wheel
# Install the packages that require manual compilation; this might take a while
## mmcv (and mmengine)
git clone https://github.com/open-mmlab/mmcv.git
pushd mmcv
git checkout 1.x
MMCV_WITH_OPS=1 MMCV_CUDA_ARGS="-arch=sm_90" pip install -v -e .
popd
## mmdet
git clone https://github.com/open-mmlab/mmdetection.git
pushd mmdetection
git checkout 2.x
MMCV_CUDA_ARGS="-arch=sm_90" pip install -v -e . --no-build-isolation
popd
# Download model weights
./evaluation/download_models.sh "./model_weights"
Now, running the geneval grading script is done via
UV_PYTHON_INSTALL_DIR="${GENEVAL_ROOT}/python_dist/" uv run --project "${GENEVAL_ROOT}" \
-m modal_aphasia.evals.grade_geneval \
--input /path/to/input.jsonl \
--output /path/to/output.jsonl