Gabeur, Long, Peng, Voigtlaender, Sun, Bao, Truong, Wang, Zhou, Barron, Genova, Kannen, Ben, Li, Guo, Yogin, Gu, Chen, Wang, Xie, Zhou, He, Funkhouser, Alayrac, Soricut — Google DeepMind — arXiv, April 2026
For decades, computer vision has had two separate tracks: generative models that create images, and discriminative models that analyze them.
These two camps have mostly stayed separate. Generators make pretty pictures; analyzers do the “real work” of segmentation, depth estimation, etc. Vision Banana argues this split was a mistake.
Think about what happened in language. Before GPT, NLP had separate specialist models for translation, summarization, sentiment analysis, etc. Then GPT showed that a model trained only to predict the next word learns representations general enough to be adapted to nearly all of those tasks.
Vision Banana makes the exact same argument for images:
Language: Next-token prediction → general language understanding → instruction-tune for any task
Vision: Image generation pretraining → general visual understanding → instruction-tune for any vision task
A model that can generate realistic images must understand geometry, occlusion, materials, lighting, and semantics. That’s everything you need for downstream vision tasks.
The recipe is surprisingly simple:
This is the cleverest part. Instead of designing task-specific output formats, Vision Banana encodes every vision task output as a standard RGB image:
| Task | What it predicts | RGB encoding |
|---|---|---|
| Segmentation | Which pixels belong to which object | Each object class gets a specific color defined in the text prompt (e.g., “cat = red, dog = blue”) |
| Depth | How far away each pixel is | Continuous depth values mapped to RGB via a 3D Hilbert curve (a space-filling curve that preserves proximity) |
| Surface normals | Which direction each surface faces | X, Y, Z normal components mapped directly to R, G, B channels |
Because the model already knows how to generate RGB images, it just needs to learn which RGB image to generate for each task. The generation machinery itself doesn’t change.
Vision Banana beats purpose-built specialist models that were specifically designed for each task:
All with a single model, zero-shot (no task-specific training data from the target domain). The specialist models each took years of focused development.
A crucial detail: after instruction tuning for vision tasks, Vision Banana retains its image generation capability. A low mixing ratio of generation data during fine-tuning ensures the model doesn’t “forget” how to create images. This is like a PhD student who learns medical imaging analysis but doesn’t forget how to draw.
Vision Banana shows that the generative vs. discriminative split in computer vision may be a false dichotomy. A model trained to generate images develops such powerful internal representations that, with lightweight instruction tuning, it outperforms specialist models on analysis tasks — mirroring how LLMs unified NLP.
The central engineering insight of Vision Banana is parameterizing all vision outputs as RGB images. This keeps the model in its native output space. Let’s break down each encoding:
Depth is a continuous scalar per pixel. Naively mapping it to grayscale (0–255) gives only 8 bits of resolution. Vision Banana instead uses a 3D Hilbert curve — a space-filling curve that maps a 1D value onto a 3D path through the RGB cube.
1. Apply a power transform to the raw depth to compress the dynamic range: d’ = d^γ
2. Quantize d’ into one of 2^24 ≈ 16.7 million discrete levels (matching the full RGB gamut)
3. Map the quantized index to an (R, G, B) triplet via the 3D Hilbert curve
The key property: nearby depth values map to nearby colors. This “locality preservation” means small depth differences produce small color differences — which is critical because the image generator’s loss function operates in pixel space. Without locality preservation, the model would be penalized equally for being “one step off” and “completely wrong.”
Compare: grayscale gives 256 levels of depth resolution. The Hilbert encoding gives 16.7 million — a 65,536× increase in precision, all within a standard RGB image.
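The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the index-to-color mapping uses Skilling's transpose algorithm, one common way to compute Hilbert curves, and the normalization range `d_max` and exponent `gamma` are hypothetical values chosen only for the example.

```python
def hilbert_index_to_rgb(index, bits=8):
    """Map a Hilbert-curve index in [0, 2**(3*bits)) to an (R, G, B) cell."""
    n = 3
    x = [0] * n
    # De-interleave the index bits into one word per axis ("transpose" form).
    for i in range(bits * n):
        bit = (index >> (bits * n - 1 - i)) & 1
        x[i % n] |= bit << (bits - 1 - i // n)
    # Gray decode.
    t = x[n - 1] >> 1
    for i in range(n - 1, 0, -1):
        x[i] ^= x[i - 1]
    x[0] ^= t
    # Undo the per-level reflections and axis exchanges.
    q = 2
    while q != (1 << bits):
        p = q - 1
        for i in range(n - 1, -1, -1):
            if x[i] & q:
                x[0] ^= p
            else:
                t = (x[0] ^ x[i]) & p
                x[0] ^= t
                x[i] ^= t
        q <<= 1
    return tuple(x)


def rgb_to_hilbert_index(rgb, bits=8):
    """Inverse of hilbert_index_to_rgb."""
    n = 3
    x = list(rgb)
    top = 1 << (bits - 1)
    # Re-apply the reflections and exchanges, from the top bit down.
    q = top
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p
            else:
                t = (x[0] ^ x[i]) & p
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    # Gray encode.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = top
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    x = [v ^ t for v in x]
    # Interleave the per-axis words back into a single index.
    index = 0
    for b in range(bits - 1, -1, -1):
        for i in range(n):
            index = (index << 1) | ((x[i] >> b) & 1)
    return index


def encode_depth(depth, d_max=100.0, gamma=0.5):
    """Power transform, quantize to 2**24 levels, map to RGB (d_max, gamma assumed)."""
    u = min(max(depth / d_max, 0.0), 1.0) ** gamma
    index = min(int(u * (1 << 24)), (1 << 24) - 1)
    return hilbert_index_to_rgb(index)


def decode_depth(rgb, d_max=100.0, gamma=0.5):
    """Invert encode_depth, returning the centre of the quantization bin."""
    u = (rgb_to_hilbert_index(rgb) + 0.5) / (1 << 24)
    return d_max * u ** (1.0 / gamma)
```

One caveat worth keeping in mind: the locality guarantee runs in one direction. Consecutive indices always land on adjacent RGB cells, but two similar colors can sit far apart along the curve, so a decoder of this kind is sensitive to small color errors in the generated image.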
A surface normal is a 3D unit vector (nx, ny, nz) at each pixel, indicating which direction the surface faces. The encoding is elegantly simple:
R = (n_x + 1) / 2 × 255 // X-component → Red
G = (n_y + 1) / 2 × 255 // Y-component → Green
B = (n_z + 1) / 2 × 255 // Z-component → Blue
Since normal components range from −1 to +1, the linear rescaling maps them to 0–255. This is the standard normal-map encoding used in 3D graphics for decades — Vision Banana simply borrows it.
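As a small sketch of that rescaling and its inverse, the Python below encodes a unit normal to 8-bit RGB and decodes it again. The function names are illustrative, and the renormalization after decoding is an assumption about how one would clean up quantization error, not something the paper specifies.

```python
import math


def encode_normal(normal):
    """Map a unit normal with components in [-1, 1] to 8-bit (R, G, B)."""
    return tuple(int((c + 1.0) / 2.0 * 255.0 + 0.5) for c in normal)


def decode_normal(rgb):
    """Map 8-bit (R, G, B) back to a unit vector."""
    v = [c / 255.0 * 2.0 - 1.0 for c in rgb]
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("color decodes to a zero-length normal")
    # 8-bit quantization leaves the vector slightly off unit length.
    return tuple(c / length for c in v)


# A surface facing straight along +Z gets the familiar light-blue normal-map color.
front_facing = encode_normal((0.0, 0.0, 1.0))  # (128, 128, 255)
```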
For semantic and instance segmentation, the text prompt specifies which color maps to which class:
Prompt: "Segment this image. Use red (255,0,0) for person,
blue (0,0,255) for car, green (0,255,0) for tree."
Output: An RGB image where each pixel is colored according
to its semantic class.
For instance segmentation (distinguishing individual objects of the same class), Vision Banana runs one class at a time — generating a separate mask per category. Each instance within that class gets a different shade (e.g., person 1 = bright red, person 2 = dark red). This per-class strategy avoids the combinatorial explosion of having to assign unique colors to potentially hundreds of instances simultaneously.
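To make the prompt-defined palette concrete, here is a hedged Python sketch of both ends of the pipeline: building a prompt from a class-to-color mapping, and turning a generated RGB image back into class labels. The prompt wording, the function names, and the nearest-color decoding rule are all assumptions for illustration. A generated image will rarely hit the palette colors exactly, which is why the decoder snaps each pixel to the closest one.

```python
def build_prompt(palette):
    """Build a segmentation prompt from {class_name: (r, g, b)} (hypothetical wording)."""
    parts = [f"({r},{g},{b}) for {name}" for name, (r, g, b) in palette.items()]
    return "Segment this image. Use " + ", ".join(parts) + "."


def decode_segmentation(pixels, palette):
    """Assign each pixel the class whose palette color is nearest in RGB space.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    """
    names = list(palette)

    def nearest(px):
        return min(
            names,
            key=lambda name: sum((a - b) ** 2 for a, b in zip(px, palette[name])),
        )

    return [[nearest(px) for px in row] for row in pixels]


palette = {"person": (255, 0, 0), "car": (0, 0, 255), "tree": (0, 255, 0)}
# A tiny 2x2 "generated" output with slightly off-palette colors.
output = [[(250, 10, 5), (3, 2, 240)], [(10, 245, 12), (255, 0, 0)]]
labels = decode_segmentation(output, palette)
```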
Vision Banana instruction-tunes Nano Banana Pro (built on Gemini 3 Pro) with these key details:
| Component | Detail |
|---|---|
| Base model | Nano Banana Pro (Gemini 3 Pro multimodal transformer) |
| Training data | Task-specific (input image, output RGB annotation) pairs |
| Depth data | Primarily synthetic — rendered 3D scenes with perfect ground-truth depth |
| Generation retention | Low mixing ratio of original generation data prevents forgetting |
| What’s learned | Output formatting, not visual understanding (the model already “sees” — it just learns to express what it sees in the right format) |
This is the key distinction: instruction tuning teaches the model how to express its already-existing visual understanding in a specific output format. It doesn’t teach the model to see — the generation pretraining already did that. This is directly analogous to how instruction-tuning an LLM teaches it to follow instructions, not to understand language.
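The generation-retention row in the table can be pictured as a weighted sampler over two data pools. The sketch below is only an illustration of the idea: the paper is described here as using a "low" mixing ratio without a number, so `generation_ratio=0.05` is a made-up placeholder, as are the function and parameter names.

```python
import random


def mixed_batches(task_examples, generation_examples, generation_ratio=0.05,
                  batch_size=8, num_batches=4, seed=0):
    """Draw batches that are mostly task data with a small share of generation data."""
    rng = random.Random(seed)
    batches = []
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            # With probability generation_ratio, replay an original generation example.
            use_generation = rng.random() < generation_ratio
            pool = generation_examples if use_generation else task_examples
            batch.append(rng.choice(pool))
        batches.append(batch)
    return batches
```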
One of the most surprising results: Vision Banana achieves metric (absolute) depth estimation using only synthetic training data and without any camera intrinsics (focal length, sensor size, etc.). Traditional depth estimation methods need camera parameters to convert relative depth to actual distances. Vision Banana learns to infer these implicitly from the image content itself — the model’s generative pretraining has internalized enough about real-world geometry to know that a standard doorway is about 2 meters tall.
Generative models have a structural advantage over discriminative models for pixel-level prediction:
This is why Vision Banana’s depth maps look sharper than those from regression-based models — it’s a fundamental property of how generative models produce outputs.
The elephant in the room: Vision Banana is expensive. Running a full generative model to produce a depth map or segmentation mask is orders of magnitude slower than a purpose-built discriminative model. SAM 3 or Depth Anything V3 can process images in milliseconds; Vision Banana requires the full diffusion/generation process. The paper argues this is an acceptable tradeoff for quality and generality, similar to how early LLMs were too slow for production but their quality advantage drove adoption.
Vision Banana’s technical contribution is showing that the “hard part” of vision tasks is developing visual representations — and generation pretraining already does this. The instruction tuning is just formatting: teaching the model to express its existing knowledge as Hilbert-encoded depth, channel-mapped normals, or prompt-defined segmentation colors. The representations transfer because they were learned by modeling the full visual manifold.
Vision Banana doesn’t emerge from a vacuum. It sits at the end of a clear intellectual trajectory:
| Paper | Key idea | Limitation |
|---|---|---|
| Marigold (2023) | Fine-tune a diffusion model (Stable Diffusion) for monocular depth. First to show generative representations transfer to geometric tasks. | Affine-invariant depth only (relative, not metric). Limited to depth. Slow multi-step diffusion inference. |
| Lotus (2024) | Reformulate as single-step “noise prediction → annotation prediction.” No iterative denoising needed. Faster and more accurate. | Still affine-invariant. Still limited to depth/normals. Still fine-tuning a UNet-based diffusion model. |
| Lotus-2 (2025) | Extend Lotus to more tasks including segmentation. Better training recipes. | Still UNet-based (limited scale). Still reliant on diffusion model architecture. Still affine depth. |
| Vision Banana (2026) | Replace UNet diffusion with a full multimodal transformer (Gemini 3 Pro). Generation pretraining at scale. Metric depth. All tasks unified. | Computationally expensive. No video support. Potential hallucination. |
The key jump from Lotus-2 to Vision Banana is architectural: moving from a UNet-based diffusion model to a multimodal transformer that was pretrained at massive scale on both text and images. This is what enables the “generation = pretraining” thesis — the model has seen enough of the visual world during generation training to develop genuinely general representations.
Vision Banana is built on Nano Banana Pro, which is itself built on Gemini 3 Pro — Google DeepMind’s multimodal transformer. Key architectural properties:
Previous generative-to-discriminative transfer (Marigold, Lotus) used UNet-based diffusion models designed primarily for generation. Their architecture — encoder-decoder with skip connections — wasn’t designed for the kind of global reasoning needed for scene understanding.
Multimodal transformers like Gemini 3 Pro have full self-attention across all tokens — every image patch can attend to every other image patch and to the text prompt. This gives them:
Three paradigms for using generative models in vision tasks:
| Approach | Training | Inference | Example |
|---|---|---|---|
| Noise prediction | Train diffusion model to denoise images. Fine-tune to predict task output from noisy version of ground truth. | Iterative denoising (10–50 steps) | Marigold |
| Annotation prediction | Reformulate: predict clean annotation directly (no noise). Still uses diffusion model architecture. | Single forward pass | Lotus / Lotus-2 |
| Full generative pretraining | Pretrain a massive generative model on image generation at scale. Then instruction-tune for downstream tasks. | Full generation process | Vision Banana |
The distinction matters: Marigold and Lotus fine-tune an existing diffusion model (pretrained for generation) toward a specific task. Vision Banana argues that the generation pretraining itself is the valuable part — you just need a big enough model and enough data, and the representations will be general enough for any task.
A critical technical distinction:
Affine-invariant depth recovers depth only up to an unknown scale and shift: d_pred = a · d + b, where a and b are arbitrary. That is useful for some applications, but you can’t extract real-world distances from it. This is what Marigold, Lotus, and Lotus-2 provide.

Going from affine to metric is a massive jump in difficulty. It requires the model to have internalized an understanding of real-world scale, which Vision Banana’s generation pretraining provides. The model has seen millions of images and learned the statistical regularities of how the 3D world projects onto 2D images.
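The relation d_pred = a · d + b also explains how the two kinds of model are typically scored. An affine-invariant prediction is first aligned to the ground truth with a least-squares scale and shift, while a metric prediction has to be right as-is. The Python sketch below illustrates that difference on made-up numbers; the function names and the toy depths are assumptions for the example.

```python
def fit_scale_shift(pred, target):
    """Least-squares s, t such that s * pred + t best matches target."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    s = cov / var_p
    return s, mean_t - s * mean_p


def abs_rel_error(pred, target):
    """Mean absolute relative error, a common depth metric."""
    return sum(abs(p - t) / t for p, t in zip(pred, target)) / len(pred)


true_depth_m = [1.0, 2.0, 4.0, 8.0]
# An affine-invariant prediction: correct structure, arbitrary scale and shift.
affine_pred = [0.5 * d + 3.0 for d in true_depth_m]

s, t = fit_scale_shift(affine_pred, true_depth_m)
aligned = [s * p + t for p in affine_pred]

raw_error = abs_rel_error(affine_pred, true_depth_m)   # large: units are meaningless
aligned_error = abs_rel_error(aligned, true_depth_m)   # near zero after alignment
```

The alignment step needs ground-truth depth, which is exactly what is unavailable at deployment time. A metric model removes that dependency.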
This paper is a direct competitive threat to Meta’s SAM (Segment Anything Model) line:
Vision Banana represents the culmination of a clear research trajectory from Marigold through Lotus to full generative pretraining. The jump to a multimodal transformer at Gemini 3 Pro scale is what enables the “generation = pretraining” thesis to actually work — previous attempts with smaller UNet models could only go so far. The result challenges the entire specialist-model paradigm in computer vision.
Vision Banana enters a three-way contest for how to build general-purpose visual representations:
| Paradigm | Core idea | Learns | Weak at |
|---|---|---|---|
| CLIP (contrastive) | Align image and text embeddings via contrastive learning | Semantic concepts (WHAT is in the image) | Spatial/geometric understanding (WHERE things are, HOW DEEP they are) |
| DINOv2 (self-supervised) | Learn visual features without labels, via self-distillation combined with masked patch prediction | Dense local features good for matching and retrieval | Generation capability; may miss high-level semantics |
| Generative (Vision Banana) | Learn by generating images — must model the full visual manifold | Everything: semantics, geometry, materials, lighting, occlusion | Computational cost; potential hallucination |
Vision Banana’s argument is that generation is the most information-rich pretraining objective. CLIP learns to classify (what), DINOv2 learns to match (which parts correspond), but generation must learn to reconstruct the entire visual world — geometry, physics, materials, and semantics all at once. This is the “full manifold” argument.
Vision Banana and Transfusion (Meta, 2024) arrive at the same thesis from opposite directions:
Transfusion: Start with a language-model architecture and add diffusion-based image generation as a second training objective. Both objectives are trained together from scratch in one unified model. The thesis: you can build one model that does language AND vision generation.
Vision Banana: Start with an already-trained image generator and instruction-tune it for discriminative vision tasks. The thesis: generation pretraining already gives you the representations needed for understanding.
Both conclude that generation and understanding should be unified. Transfusion builds the unified model from day one; Vision Banana shows that an existing generator already has what’s needed — you just need to unlock it with instruction tuning.
If Vision Banana’s thesis is correct, the implications cascade across computer vision:
The most obvious extension is “Video Banana” — applying the same thesis to video generation models:
The challenge: video generation models are even more expensive than image generation models, and temporal consistency in generation is itself an unsolved problem at the frontier.
The deepest philosophical question raised by this paper: is generation a necessary condition for understanding?
The practical path forward for deploying Vision Banana’s insights: knowledge distillation. Train a massive generative model (the teacher), use it to produce high-quality predictions on large-scale unlabeled data, then train a fast discriminative model (the student) on those predictions.
This gives you the best of both worlds: the quality of generative pretraining with the speed of a discriminative model. It’s the same pattern that worked for language (GPT-4 → distilled smaller models) and could be the way Vision Banana’s insights actually reach production.
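The teacher-student loop can be sketched in a few lines of Python. Everything here is a toy stand-in: `slow_teacher` plays the role of the expensive generative model, and the student is a one-dimensional linear model fitted by least squares, chosen only so the example runs without dependencies.

```python
def distill(teacher, unlabeled_inputs, fit_student):
    """Label unlabeled data with the teacher, then fit a student on those labels."""
    pseudo_labels = [teacher(x) for x in unlabeled_inputs]
    return fit_student(unlabeled_inputs, pseudo_labels)


def fit_linear_student(xs, ys):
    """Fit y = w * x + c by least squares and return the fitted predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    c = mean_y - w * mean_x
    return lambda x: w * x + c


def slow_teacher(x):
    """Hypothetical expensive model; here just a fixed function."""
    return 3.0 * x + 1.0


student = distill(slow_teacher, [0.0, 1.0, 2.0, 5.0], fit_linear_student)
prediction = student(10.0)  # the student answers without calling the teacher
```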
| Dimension | Score | Notes |
|---|---|---|
| Novelty | 8/10 | The thesis (generation = pretraining for vision) isn’t brand new (Marigold started it), but the scale and completeness of the demonstration are unprecedented. |
| Impact | 9/10 | If correct, this reshapes the entire computer vision research agenda. The “specialist model” paradigm may be ending. |
| Reproducibility | 3/10 | Built on closed-source Gemini 3 Pro. No open weights. Hard to reproduce without comparable-scale multimodal generators. |
| Technical depth | 7/10 | Clever RGB encodings (Hilbert curve), solid experimental design, but the method itself is simple — the power comes from scale. |
| Writing | 8/10 | Clear exposition, good framing of the generative-vs-discriminative debate. Strong analogies to LLM trajectory. |
| Longevity | 8/10 | The generation-as-pretraining thesis will likely endure as a core principle, even if specific methods evolve. Video extension is the obvious next step. |
Vision Banana is to computer vision what GPT-1 was to NLP — not necessarily the final form, but the proof of concept that changes the research direction. The specialist-model era may be winding down. The key open questions: Can video generators achieve the same unification? Can distillation make this practical? And does Meta (with SAM, Chameleon, and Transfusion) respond by unifying its own generation and understanding stacks?