NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation
Serkan Ozturk1, Samet Hicsonmez2, Pinar Duygulu1
1Department of Computer Engineering, Hacettepe University,
2Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg,
Current text-conditioned image generation methods output realistic-looking images, but they fail to capture specific styles. Simply finetuning them on target-style datasets still struggles to grasp the style features. In this work, we present a novel contrastive learning framework to improve the stylization capability of large text-to-image diffusion models. Motivated by the rapid advances in image generation models that have made synthetic data an intrinsic part of model training in various computer vision tasks, we exploit synthetic image generation in our approach. Usually, the generated synthetic data is task-dependent, and most of the time it is used to enlarge the available real training dataset. With NSYNC, alternatively, we focus on generating negative synthetic sets to be used in a novel contrastive training scheme along with real positive images. In our proposed training setup, we forward negative data along with positive data and obtain negative and positive gradients, respectively. We then refine the positive gradient by subtracting its projection onto the negative gradient, and update the parameters with the resulting orthogonal component. This orthogonal component eliminates the trivial attributes that are present in both positive and negative data and directs the model towards capturing a more unique style. Experiments on various styles of painters and illustrators show that our approach improves over the baseline methods both quantitatively and qualitatively. Our code is available at https://github.com/giddyyupp/NSYNC.
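The gradient refinement described above can be sketched as follows. This is a minimal NumPy illustration of the projection step only, not the repository's actual training code; the function and variable names are ours:

```python
import numpy as np

def orthogonal_gradient(g_pos, g_neg, eps=1e-12):
    # Projection of the positive gradient onto the negative gradient.
    coeff = np.dot(g_pos, g_neg) / (np.dot(g_neg, g_neg) + eps)
    # Subtracting it leaves the component orthogonal to g_neg,
    # which is then used for the parameter update.
    return g_pos - coeff * g_neg

g_pos = np.array([1.0, 2.0, 3.0])
g_neg = np.array([0.0, 1.0, 0.0])
g_orth = orthogonal_gradient(g_pos, g_neg)
# g_orth is orthogonal to g_neg: np.dot(g_orth, g_neg) == 0
```

In the actual training loop the same operation would be applied per-parameter (or on flattened gradient vectors) after separate forward/backward passes on the positive and negative batches.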
git clone https://github.com/giddyyupp/NSYNC
cd nsync
conda create -n nsync python=3.10
conda activate nsync
conda install pytorch==2.1.0 torchvision==0.16.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
Download and organize the dataset as follows (e.g., monet2photo, vangogh2photo). The Monet and Van Gogh datasets can be downloaded from this link.
|-- ./datasets
    |-- monet2photo
        |-- testA
            |-- 000_train.png
        |-- trainA
            |-- 100_test.png
    |-- vangogh2photo
        |-- testA
            |-- 001.png
        |-- trainA
            |-- 002.png
    |-- other_datasets
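As a sketch, the expected folder layout above can be created as follows (folder names taken from the tree; populate the folders with the downloaded images afterwards):

```shell
# Create the dataset skeleton expected by the training scripts.
mkdir -p datasets/monet2photo/trainA datasets/monet2photo/testA
mkdir -p datasets/vangogh2photo/trainA datasets/vangogh2photo/testA
```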
We use a central config file to adjust parameters for the different stages of dataset preparation and training.
First, to extract descriptions from the images, set caption_target_name in the config file and run the following command:
python img2txt_internvl.py
Next, to generate a negative set to be used during training, adjust the parameters under the generate NEG set: section in the config file and run the following command:
python gen_generic_dataset.py
Now we are ready to start training NSYNC. Update target_name to your dataset name and initializer_token to the content of your dataset (e.g., painting or illustration), and run the following command:
python contrastive_training.py
You can also experiment with different SD versions by setting the sd_version parameter.
To generate images with the trained model, first adjust the parameters under the Inference: gen_imgs.py section, then run the following command:
python gen_imgs.py
You can also train the baseline Textual Inversion model for comparison. Update baseline_target_name to your dataset name and initializer_token to the content of your dataset (e.g., painting or illustration), and run the following command:
python sd_textual_inversion_training.py
You can also experiment with different SD versions by setting the sd_version parameter.
CSD:
We share the required files to use with CSD repo in the metrics folder.
After setting up the CSD repo, first update the parameters (paths to the real and generated images, etc.) in calculate_csd.py, then run the following command:
python calculate_csd.py --dataset nsync --model_path ./pretrainedmodels/pytorch_model.bin --gpu 0
CMMD:
For CMMD, we use the PyTorch implementation.
FID and KID:
FID and KID metrics are calculated using this repo.
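For intuition, the kernel-based MMD statistic underlying CMMD and KID can be sketched as below. This is a simplified biased estimator over precomputed embeddings with a Gaussian kernel; the actual metrics use CLIP/Inception features and their own kernel and estimator choices:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    # Biased MMD^2 estimate: zero when X and Y are the same sample set,
    # large when the two embedding distributions differ.
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())
```

Here X and Y would hold embeddings of the real style images and the generated images, respectively.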
If you find this work useful, please cite:
@article{ozturk2025nsync,
title={NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation},
author={Serkan Ozturk and Samet Hicsonmez and Pinar Duygulu},
journal={arXiv preprint arXiv:2511.01517},
year={2025}
}
This repository is licensed under the Apache License. See the LICENSE file for details.
This repo builds upon open-source contributions from:
