Hi, the training data contains only image-text pairs. How should I prepare the image condition to form the image-image_condition-text triplets? Thanks!