How to prepare the image prompt given only text-image pairs?

Hi, the training data contains only image-text pairs. How should I prepare the image condition so that each sample becomes an (image, image condition, text) triplet? Thanks!