Google's Vision Banana: A Unified Vision Model Outperforms Task-Specific Models in Segmentation and 3D Geometry

Gate News, April 23: Google researchers, including Kaiming He and Saining Xie, have published a paper introducing Vision Banana, a general-purpose visual understanding model built by lightweight instruction fine-tuning of the company's Nano Banana Pro (Gemini 3 Pro Image) image generation model. Its key idea is to represent the output of every vision task as an RGB image, so that segmentation, depth estimation, and surface normal prediction are all carried out through image generation, without task-specific architectures or loss functions.
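To make the "every output is an RGB image" idea concrete, here is a minimal sketch of how two kinds of prediction could be packed into and read back out of pixel colors. The article does not describe the paper's actual encodings, so everything here is an assumption for illustration: the normal mapping is the widely used convention of rescaling each component from [-1, 1] to [0, 255], and `PALETTE`, `normal_to_rgb`, `rgb_to_normal`, and `rgb_to_label` are hypothetical names.

```python
import math

# Hypothetical class palette, for illustration only; the article does not
# say which colors (if any fixed ones) Vision Banana uses.
PALETTE = {
    "road": (128, 64, 128),
    "car": (0, 0, 142),
    "person": (220, 20, 60),
}


def normal_to_rgb(normal):
    """Map a unit surface normal (components in [-1, 1]) to an 8-bit RGB triple."""
    return tuple(round((c + 1.0) / 2.0 * 255) for c in normal)


def rgb_to_normal(rgb):
    """Invert normal_to_rgb and re-normalize to undo 8-bit quantization drift."""
    v = [c / 255.0 * 2.0 - 1.0 for c in rgb]
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def rgb_to_label(rgb):
    """Decode a generated pixel to the class whose palette color is nearest."""
    def sq_dist(name):
        return sum((a - b) ** 2 for a, b in zip(PALETTE[name], rgb))
    return min(PALETTE, key=sq_dist)


if __name__ == "__main__":
    pixel = normal_to_rgb((0.0, 0.0, 1.0))   # a normal facing the camera
    print(pixel, rgb_to_normal(pixel))
    print(rgb_to_label((130, 60, 125)))      # slightly off-palette pixel
```

Nearest-color decoding is used because a generative model is unlikely to reproduce palette colors exactly; a real pipeline might need something more robust.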

In semantic segmentation, Vision Banana outperformed the specialized model SAM 3 by 4.7 percentage points on Cityscapes, and in referring expression segmentation it surpassed SAM 3 Agent. It still trailed SAM 3 in instance segmentation. On 3D tasks, the model reached an average accuracy of 0.929 in metric depth estimation across four standard datasets, ahead of Depth Anything V3's 0.918, while using only synthetic data, with no real depth information and no camera parameters at inference. In surface normal estimation, it achieved state-of-the-art results on three indoor benchmarks.
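The article does not say which metric the 0.929 and 0.918 depth figures refer to. A common choice in depth estimation is threshold accuracy (often written δ1): the fraction of pixels whose predicted-to-true depth ratio is within a factor of 1.25. The sketch below assumes that metric purely for illustration; `delta1_accuracy` is a hypothetical helper, not the paper's evaluation code.

```python
def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    pred and gt are flat sequences of depths in the same metric unit.
    Pixels with non-positive ground truth are ignored; a non-positive
    prediction on a valid pixel counts as a miss.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no pixels with valid ground-truth depth")
    hits = sum(
        1 for p, g in valid
        if p > 0 and max(p / g, g / p) < threshold
    )
    return hits / len(valid)


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 10.0]
    ground_truth = [1.1, 2.0, 4.0, 10.5]
    # Three of the four ratios fall inside the 1.25 band.
    print(delta1_accuracy(predicted, ground_truth))
```

Under this assumption, an "average across four datasets" would simply be the mean of the four per-dataset scores.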

Fine-tuning mixed a small amount of vision task data into the original image generation training data, which preserved the model's generation capabilities: in generation quality tests, its performance matched that of the original Nano Banana Pro. The paper argues that image generation pretraining plays the same role in vision that text generation pretraining plays in language. Models learn the internal representations needed for image understanding while learning to generate, and instruction fine-tuning merely unlocks this capability.
