Dish images play a crucial role in the digital era: as the food industry and e-commerce digitize, the demand for culturally distinctive dish images keeps growing. Existing text-to-image generation models generally excel at producing high-quality images, yet they struggle to capture the diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline and build the largest dish dataset to date. We further introduce a recaption strategy and a coarse-to-fine training scheme that help the model learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Finally, to extend our model to dish editing, we propose Concept-Enhanced P2P, build a dish editing dataset based on this approach, and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
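The abstract names the inference-time prompt enhancement only at a high level. As a rough illustration of how such a step could work, the sketch below pairs embedding-based retrieval over the caption library with a generic LLM rewrite. The encoder choice, the `enhance_prompt` helper, and the `llm_rewrite` callable are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of inference-time prompt enhancement, assuming
# embedding-based retrieval over the caption library and a generic
# chat-LLM rewrite. All names and the encoder choice are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def enhance_prompt(user_prompt: str, caption_library: list[str],
                   llm_rewrite, k: int = 3) -> str:
    """Retrieve the k library captions most similar to the user's prompt,
    then ask an LLM to rewrite the prompt with the retrieved details."""
    # Embed the prompt and the library; in practice the library embeddings
    # would be precomputed and cached rather than encoded per query.
    prompt_emb = encoder.encode(user_prompt, convert_to_tensor=True)
    library_emb = encoder.encode(caption_library, convert_to_tensor=True)

    # Cosine similarity against every library caption; keep the top-k.
    scores = util.cos_sim(prompt_emb, library_emb)[0]
    top_k = scores.topk(k).indices.tolist()
    references = "\n".join(caption_library[i] for i in top_k)

    # llm_rewrite stands in for any chat-LLM call that returns a string.
    instruction = (
        "Rewrite the following dish request into a detailed, photorealistic "
        "image-generation prompt, borrowing faithful culinary details from "
        f"the reference captions.\nRequest: {user_prompt}\n"
        f"References:\n{references}"
    )
    return llm_rewrite(instruction)
```

The design intent, as the abstract describes it, is that the caption library supplies faithful culinary detail while the LLM adapts that detail to the user's request; the retrieval/rewrite split above is one plausible way to realize that.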
Our main contributions can be summarized as:
- We propose Omni-Dish, the first text-to-image generation model tailored for Chinese dishes, together with a comprehensive dish curation pipeline that yields the largest dish dataset to date.
- We introduce a recaption strategy and a coarse-to-fine training scheme that help the model learn fine-grained culinary nuances.
- We design an inference-time enhancement that combines a pre-constructed high-quality caption library with a large language model to enable more photorealistic and faithful image generation.
- We propose Concept-Enhanced P2P (see the sketch after this list), build a dish editing dataset based on it, and train a specialized dish editing model.
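Concept-Enhanced P2P is only named here. Prompt-to-Prompt editing in general preserves layout by reusing the source branch's cross-attention maps in the target branch while letting edited tokens attend freely; a concept-aware variant might additionally amplify attention on dish-concept tokens. The sketch below shows only that map-mixing step; the `blend_attention` helper, the `concept_scale` factor, and the token-index arguments are assumptions for illustration, not the paper's method.

```python
# Sketch of one Prompt-to-Prompt style cross-attention mixing step with a
# hypothetical concept-enhancement term. Shapes and scaling are assumptions.
import torch

def blend_attention(attn_src: torch.Tensor, attn_tgt: torch.Tensor,
                    edited_tokens: list[int], concept_tokens: list[int],
                    concept_scale: float = 1.5) -> torch.Tensor:
    """Mix source/target cross-attention maps at one layer and timestep.

    attn_src / attn_tgt: (heads, pixels, tokens) attention maps from the
    source and target denoising branches.
    """
    # Start from the source maps so unedited content stays spatially aligned.
    out = attn_src.clone()
    # Edited prompt tokens keep their freshly computed target attention.
    out[..., edited_tokens] = attn_tgt[..., edited_tokens]
    # Hypothetical concept enhancement: amplify attention on dish-concept
    # tokens, then renormalize over the token axis.
    out[..., concept_tokens] *= concept_scale
    return out / out.sum(dim=-1, keepdim=True)
```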
@article{liu2025omni,
  title={Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes},
  author={Liu, Huijie and Wang, Bingcan and Hu, Jie and Wei, Xiaoming and Kang, Guoliang},
  journal={arXiv preprint arXiv:2504.09948},
  year={2025}
}