Dish images play a crucial role in the digital era: as the food industry and e-commerce digitize, the demand for culturally distinctive dish images keeps growing. Existing text-to-image generation models generally excel at producing high-quality images, yet they struggle to capture the diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline and build the largest dish dataset to date. We further introduce a recaption strategy and a coarse-to-fine training scheme that help the model learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Finally, to extend our model to dish editing, we propose Concept-Enhanced P2P, build a dish editing dataset based on this approach, and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
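The abstract names the inference-time prompt enhancement only at a high level. As a rough illustration of how such a step could work, the sketch below pairs embedding-based retrieval over the caption library with a generic LLM rewrite. The encoder choice, the `enhance_prompt` helper, and the `llm_rewrite` callable are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of inference-time prompt enhancement, assuming
# embedding-based retrieval over the caption library and a generic
# chat-LLM rewrite. All names and the encoder choice are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def enhance_prompt(user_prompt: str, caption_library: list[str],
                   llm_rewrite, k: int = 3) -> str:
    """Retrieve the k library captions most similar to the user's prompt,
    then ask an LLM to rewrite the prompt with the retrieved details."""
    # Embed the prompt and the library; in practice the library embeddings
    # would be precomputed and cached rather than encoded per query.
    prompt_emb = encoder.encode(user_prompt, convert_to_tensor=True)
    library_emb = encoder.encode(caption_library, convert_to_tensor=True)

    # Cosine similarity against every library caption; keep the top-k.
    scores = util.cos_sim(prompt_emb, library_emb)[0]
    top_k = scores.topk(k).indices.tolist()
    references = "\n".join(caption_library[i] for i in top_k)

    # llm_rewrite stands in for any chat-LLM call that returns a string.
    instruction = (
        "Rewrite the following dish request into a detailed, photorealistic "
        "image-generation prompt, borrowing faithful culinary details from "
        f"the reference captions.\nRequest: {user_prompt}\n"
        f"References:\n{references}"
    )
    return llm_rewrite(instruction)
```

The design intent, as the abstract describes it, is that the caption library supplies faithful culinary detail while the LLM adapts that detail to the user's request; the retrieval/rewrite split above is one plausible way to realize that.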
Our main contributions can be summarized as:
- We propose Omni-Dish, the first text-to-image generation model tailored for Chinese dishes, together with a comprehensive dish curation pipeline that yields the largest dish dataset to date.
- We introduce a recaption strategy and a coarse-to-fine training scheme that help the model learn fine-grained culinary nuances.
- We design an inference-time enhancement that combines a pre-constructed high-quality caption library with a large language model to enable more photorealistic and faithful image generation.
- We propose Concept-Enhanced P2P (see the sketch after this list), build a dish editing dataset based on it, and train a specialized dish editing model.
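Concept-Enhanced P2P is only named here. Prompt-to-Prompt editing in general preserves layout by reusing the source branch's cross-attention maps in the target branch while letting edited tokens attend freely; a concept-aware variant might additionally amplify attention on dish-concept tokens. The sketch below shows only that map-mixing step; the `blend_attention` helper, the `concept_scale` factor, and the token-index arguments are assumptions for illustration, not the paper's method.

```python
# Sketch of one Prompt-to-Prompt style cross-attention mixing step with a
# hypothetical concept-enhancement term. Shapes and scaling are assumptions.
import torch

def blend_attention(attn_src: torch.Tensor, attn_tgt: torch.Tensor,
                    edited_tokens: list[int], concept_tokens: list[int],
                    concept_scale: float = 1.5) -> torch.Tensor:
    """Mix source/target cross-attention maps at one layer and timestep.

    attn_src / attn_tgt: (heads, pixels, tokens) attention maps from the
    source and target denoising branches.
    """
    # Start from the source maps so unedited content stays spatially aligned.
    out = attn_src.clone()
    # Edited prompt tokens keep their freshly computed target attention.
    out[..., edited_tokens] = attn_tgt[..., edited_tokens]
    # Hypothetical concept enhancement: amplify attention on dish-concept
    # tokens, then renormalize over the token axis.
    out[..., concept_tokens] *= concept_scale
    return out / out.sum(dim=-1, keepdim=True)
```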
@article{liu2025omni,
  title={Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes},
  author={Liu, Huijie and Wang, Bingcan and Hu, Jie and Wei, Xiaoming and Kang, Guoliang},
  journal={arXiv preprint arXiv:2504.09948},
  year={2025}
}