Gate News message, April 23 — Google researchers, including He Kaiming and Xie Saining, published a paper introducing Vision Banana, a general-purpose vision understanding model created through lightweight instruction fine-tuning of the company's Nano Banana Pro (Gemini 3 Pro Image) image generation model. The key innovation unifies outputs of all vision tasks as RGB images, enabling segmentation, depth estimation, and surface normal prediction through image generation without task-specific architectures or loss functions.
In semantic segmentation, Vision Banana outperformed the specialized model SAM 3 by 4.7 percentage points on Cityscapes; in referring expression segmentation, it surpassed SAM 3 Agent. However, it lagged behind SAM 3 in instance segmentation. For 3D tasks, metric depth estimation achieved 0.929 average accuracy across four standard datasets, exceeding Depth Anything V3's 0.918, using only synthetic data without real depth information or camera parameters at inference. Surface normal estimation achieved state-of-the-art results on three indoor benchmarks.
Fine-tuning involved minimal vision task data mixed into original image generation training, preserving the model's generation capabilities—performance matched the original Nano Banana Pro in generation quality tests. The paper proposes that image generation pretraining in vision parallels text generation pretraining in language: models learn the internal representations needed for image understanding during generation, with instruction fine-tuning merely releasing this capability.