Chameleon Team (Meta FAIR) — May 2024 (ICLR 2025)
Most AI models are specialists. GPT writes text. DALL-E makes images. They can’t truly work together. What if you wanted an AI that could write a travel guide with photos, weaving them naturally?
Chameleon says: turn everything into tokens. Text is already tokens (words broken into pieces). Images? Chop them into 1,024 discrete tokens too via a VQ-VAE. Then one transformer processes the whole thing — text, images, code — as a single stream.
Imagine a United Nations where every delegate speaks a different language. The old approach: hire specialized translators. Chameleon’s approach: teach everyone Esperanto. Every image is translated into the same “language” as text — discrete numbered tokens. One brain processes it all, no translators needed. The tradeoff? Some nuance is lost in translation.
Transfusion keeps images as continuous patches and uses diffusion to generate them. Chameleon takes the opposite bet: convert images to discrete tokens (like text) and use next-token prediction for everything. Simpler architecture, one unified loss — but the image-to-token conversion is lossy, and you need 1,024 tokens per image instead of 256 patches.
Training mixed-modal models is unstable. Chameleon’s biggest contribution might not be the architecture itself, but figuring out how to train it without it blowing up. They invented QK-Norm and reordered layer norms to prevent divergence. Without these tricks, training collapses ~20% in.
Interleaved generation actually works. One of the first models where you can prompt “show me cool birds and tell me about them” and get a coherent response weaving text and generated images naturally.
Chameleon proves that the “tokenize everything” approach can produce a genuinely unified multimodal model that beats GPT-4V on mixed-modal tasks. The cost: image quality is capped by the VQ-VAE tokenizer, and generation requires 4× more tokens per image than Transfusion.
| Data type | Description | Scale |
|---|---|---|
| Text | Web text, books, code (similar to LLaMA) | ~4.5T text tokens |
| Image-text pairs | Image + caption datasets | ~1.4B pairs |
| Interleaved documents | Web pages with images and text naturally mixed | ~400B tokens |
Total: ~10 trillion tokens — 5× more data than Transfusion’s 2T. The interleaved data is crucial: this teaches the model to naturally weave text and images. Most prior work only trained on paired data (one image, one caption).
| | Chameleon-7B | Chameleon-34B |
|---|---|---|
| Parameters | 7B | 34B |
| Layers | 32 | 48 |
| Hidden dim | 4,096 | 8,192 |
| Attention heads | 32 | 64 |
| Training tokens | ~4.4T | ~9.2T |
| Component | Tokens | Notes |
|---|---|---|
| BPE text vocabulary | 65,536 | Standard text tokens |
| Image codebook | 8,192 | Discrete image tokens from VQ-VAE |
| Special tokens | ~100 | <image_start>, <image_end>, etc. |
| Total vocabulary | ~73,828 | One unified softmax |
Text and image tokens live in the same embedding space. The model doesn’t know it’s switching modalities — it just predicts the next token from a ~73K vocabulary.
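The shared id space can be sketched with a few constants. This is a minimal sketch mirroring the counts in the table above; the exact offsets and helper names are our assumptions, not the paper's:

```python
# Hypothetical vocabulary layout (ordering and helper names are illustrative).
TEXT_VOCAB_SIZE = 65_536   # BPE text tokens occupy ids [0, 65_535]
IMAGE_CODEBOOK = 8_192     # image tokens take the next 8_192 ids
SPECIAL_TOKENS = 100       # <image_start>, <image_end>, etc. at the end

IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPECIAL_OFFSET = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK
VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK + SPECIAL_TOKENS  # 73,828

def image_code_to_token_id(code: int) -> int:
    """Map a VQ-VAE codebook index (0..8191) into the shared vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK
    return IMAGE_OFFSET + code

def token_id_to_image_code(token_id: int) -> int:
    """Inverse mapping, used when handing tokens back to the VQ-VAE decoder."""
    assert IMAGE_OFFSET <= token_id < SPECIAL_OFFSET
    return token_id - IMAGE_OFFSET
```

From the transformer's point of view there is nothing special about the boundary at 65,536 — it is one softmax over all 73,828 ids.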
Chameleon uses a VQ-VAE derived from Meta’s Make-A-Scene image tokenizer:
Encoding:

```
Input image (512×512×3)
    ↓ Encoder CNN
Latent grid (32×32×256)
    ↓ Quantize each position → nearest of 8,192 codebook entries
Token grid (32×32) = 1,024 discrete tokens
    ↓ Flatten row-by-row
Token sequence [t_1, t_2, ..., t_1024]
```

Decoding:

```
Token sequence → codebook lookup → latent grid → Decoder CNN → image
```
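A toy-scale sketch of the quantize step above, with the grid, codebook, and latent dimension shrunk from 32×32 / 8,192 / 256 for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 4          # toy latent grid (real model: 32x32)
D = 8              # toy latent dim (real: 256)
K = 16             # toy codebook size (real: 8,192)

z_e = rng.normal(size=(H, W, D))        # encoder output
codebook = rng.normal(size=(K, D))      # codebook E = {e_k}

# Nearest-neighbour lookup per grid position: k* = argmin_k ||z_e(i,j) - e_k||
dists = ((z_e[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
indices = dists.argmin(-1)              # token grid, shape (H, W)
tokens = indices.flatten()              # row-major token sequence for the transformer

z_q = codebook[indices]                 # decoding side: lookup back to latents
```

Each grid position becomes one integer; the decoder only ever sees the looked-up codebook vectors, which is exactly why reconstruction quality is capped by the codebook.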
The VQ-VAE is frozen during Chameleon training. Its reconstruction quality caps the model’s generation quality:
| Metric | Chameleon’s VQ-VAE | Stable Diffusion’s VAE |
|---|---|---|
| rFID (reconstruction) | ~1.5–2.0 | ~0.5–1.0 |
| Resolution | 512×512 | 512×512+ |
| Representation | 1,024 discrete tokens | 4,096 continuous floats |
This is arguably Chameleon’s most important contribution. Standard transformer training diverges ~20% into mixed-modal training.
Problem 1: Softmax attention explodes. When training on mixed modalities, the norms of Q and K vectors grow unboundedly. Image tokens and text tokens produce very different activation magnitudes. Once norms get large enough, softmax saturates (one weight → 1.0, all others → 0.0) and gradients vanish. Training collapses.
The fix (QK-Norm): normalize Q and K to unit vectors before computing attention, with a learnable temperature τ per head. This bounds attention logits to a fixed range regardless of input magnitudes. Without QK-Norm, training diverges at ~500B tokens; with it, training is stable through 10T+.
Problem 2: Layer norm placement. Standard Pre-Norm isn’t enough. Chameleon uses a revised Pre-Norm with QK-Norm applied after the query/key projections. The combination of Pre-Norm + QK-Norm prevents gradient explosions.
Problem 3: Image-text loss ratio instabilities. During training, image loss and text loss oscillate in anti-correlation — when image loss drops, text loss spikes, and vice versa. The modalities compete for model capacity. Fix: careful data scheduling with changing ratios during training (more text early, more interleaved later).
Text generation: Identical to any language model — autoregressive next-token prediction.
Image generation: When the model predicts <image_start>, it generates 1,024 image tokens autoregressively — each sampled from the 8,192-entry image vocabulary using the same softmax as text. No diffusion, no iterative denoising. Just next-token prediction, 1,024 times. Then the tokens are decoded through the VQ-VAE decoder into pixels.
Interleaved generation: The model decides when to insert images based on context, generates them inline, then continues text conditioned on everything before.
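One hedged sketch of how such an interleaved decode loop might be structured — `sample_fn`, the sentinel ids, and the explicit mode switch are illustrative assumptions, not the paper's actual sampler:

```python
# Hypothetical control flow for interleaved decoding (names are ours).
def decode_interleaved(model, prompt_tokens, max_len, sample_fn,
                       IMAGE_START, IMAGE_END, tokens_per_image=1024):
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        next_tok = sample_fn(model, tokens, image_mode=False)
        tokens.append(next_tok)
        if next_tok == IMAGE_START:
            # The model chose to insert an image: emit exactly 1,024 image
            # tokens, restricting sampling to the image vocabulary, then close.
            for _ in range(tokens_per_image):
                tokens.append(sample_fn(model, tokens, image_mode=True))
            tokens.append(IMAGE_END)
    return tokens
```

The key point: the model itself emits `<image_start>`; the outer loop only enforces the fixed-length image span and the vocabulary restriction.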
| | Chameleon | Transfusion |
|---|---|---|
| Forward passes per image | 1,024 (one per token, sequential) | 250 (one per denoising step; all patches in parallel each step) |
| Parallelizable? | No (autoregressive) | Yes (all patches denoised simultaneously per step) |
| Dimension | Chameleon | Transfusion | Winner |
|---|---|---|---|
| Architecture simplicity | One loss, one vocabulary | Two losses, two representations | 🏆 Chameleon |
| Image quality (FID) | Higher (worse) | Lower (better) at same compute | 🏆 Transfusion |
| Compute efficiency | 1,024 tokens × quadratic attention | 256 patches × cheaper attention | 🏆 Transfusion |
| Text quality | Competitive with LLaMA-2 | Competitive with LLaMA-1 | 🏆 Chameleon |
| Mixed-modal generation | ✔ Native, demonstrated | ✔ Possible, not benchmarked | 🏆 Chameleon |
| Training stability | Hard — needed QK-Norm innovations | Easier — diffusion loss is smoother | 🏆 Transfusion |
| Scalability | Proven at 34B | Only tested to 7B | 🏆 Chameleon |
| Information preservation | ✘ VQ-VAE quantization loss | ✔ Continuous, no quantization | 🏆 Transfusion |
| Inference flexibility | Fixed: 1,024 tokens always | Tunable: 16–256 patches, adjustable steps | 🏆 Transfusion |
Text benchmarks (Chameleon-34B):
| Benchmark | Chameleon-34B | Mixtral 8x7B | Gemini-Pro | LLaMA-2 70B |
|---|---|---|---|---|
| MMLU | 62.0 | 70.6 | 71.8 | 69.8 |
| ARC-Challenge | 78.1 | 81.4 | — | 78.3 |
| HellaSwag | 83.9 | 86.5 | — | 85.3 |
| WinoGrande | 77.0 | 81.2 | — | 80.2 |
Competitive but doesn’t beat text-only specialists. Small “multimodal tax” — image training doesn’t destroy text capability, but doesn’t help either.
Image captioning (where Chameleon shines):
| Benchmark | Chameleon-34B | LLaVA-1.5 |
|---|---|---|
| COCO CIDEr | 141.1 | 137.2 |
| NoCaps CIDEr | 124.8 | 117.5 |
| Flickr30K CIDEr | 106.3 | 97.8 |
State-of-the-art on image captioning — early fusion helps the model deeply understand images.
Mixed-modal human evaluation:
| | Chameleon preferred | Tie | Other preferred |
|---|---|---|---|
| vs GPT-4V | 51.6% | 8.2% | 40.2% |
| vs Gemini-Pro | 60.4% | 6.1% | 33.5% |
| Limitation | Detail |
|---|---|
| Image quality | Generated images decent but not competitive with SDXL/DALL-E 3 — capped by VQ-VAE |
| Training cost | 10T tokens at 34B params = enormous compute budget |
| Safety gating | 7B released with image generation disabled; 34B restricted access |
| No video | Text + images only |
| Self-created benchmark | Mixed-modal human eval designed by the authors — no independent validation |
The unified objective. Chameleon’s beauty is its simplicity — one loss for everything:
L = -Σ_{i=1}^{N} log P_θ(x_i | x_{<i})
Where x_i can be a text token OR an image token. Same cross-entropy, same softmax, same backpropagation path. Compare to Transfusion’s dual loss: L_LM + λ · L_DDPM. No balancing hyperparameter λ, no noise scheduling, no timestep conditioning.
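The one-loss claim is easy to see in code. A minimal sketch with random stand-in logits (no real model involved):

```python
import torch
import torch.nn.functional as F

VOCAB = 73_828
torch.manual_seed(0)

# Pretend sequence: three text tokens, then two image tokens (ids offset
# past the 65,536 text ids into the image range).
targets = torch.tensor([17, 402, 9_001, 65_536 + 12, 65_536 + 7_777])
logits = torch.randn(len(targets), VOCAB)  # stand-in for model outputs

# One cross-entropy term per position, identical for both modalities:
loss = F.cross_entropy(logits, targets)
```

There is no branch on modality anywhere in the loss — the position holding an image token is penalized exactly like the position holding a text token.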
Given encoder output z_e and codebook E = {e_k} with K = 8,192 entries:
Quantization (forward):

```
z_q(i,j) = e_{k*}  where  k* = argmin_k ||z_e(i,j) - e_k||_2
```

Straight-through estimator (gradient hack):

```
z_q = z_e + sg(z_q - z_e)
Forward:  equals z_q
Backward: gradient flows through z_e only (pretend quantization didn't happen)
```
The argmin is non-differentiable. The straight-through estimator (STE) copies the gradient from z_q directly to z_e — mathematically unjustified but empirically works.
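In PyTorch the STE is one line, since `sg` is `.detach()`. A toy sketch with random tensors standing in for the encoder:

```python
import torch

torch.manual_seed(0)
z_e = torch.randn(4, 8, requires_grad=True)   # stand-in encoder output
codebook = torch.randn(16, 8)                 # stand-in codebook

dists = torch.cdist(z_e, codebook)            # ||z_e - e_k|| for every pair
z_q = codebook[dists.argmin(dim=-1)]          # non-differentiable argmin lookup

z_q_ste = z_e + (z_q - z_e).detach()          # forward == z_q, grad flows to z_e

loss = (z_q_ste ** 2).sum()
loss.backward()                               # z_e.grad exists despite the argmin
```

The forward value is numerically identical to `z_q`, but the backward pass treats the quantization residual as a constant, so the gradient lands on `z_e` as if quantization never happened.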
VQ-VAE training loss (3 terms):

```
L_VQ-VAE = ||x - D(z_q)||^2        // reconstruction: make output look like input
         + ||sg[z_e] - z_q||^2     // codebook: move codebook entries toward encoder outputs
         + β||z_e - sg[z_q]||^2    // commitment: prevent encoder from drifting from codebook
```
| Term | Gradient flows to | Purpose |
|---|---|---|
| Reconstruction | Encoder + Decoder (via STE) | Make reconstructions look good |
| Codebook | Codebook vectors only | Move codebook entries toward encoder outputs |
| Commitment | Encoder only | Prevent encoder from “running away” from codebook |
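The three terms map directly to code. A minimal sketch — β and the tensor shapes are placeholders, and the encoder/decoder are assumed to exist elsewhere:

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x, x_recon, z_e, z_q, beta=0.25):
    """Three-term VQ-VAE loss; sg[.] is implemented with .detach()."""
    recon = F.mse_loss(x_recon, x)             # trains encoder+decoder (via STE)
    codebook = F.mse_loss(z_q, z_e.detach())   # moves codes toward encoder outputs
    commit = F.mse_loss(z_e, z_q.detach())     # keeps encoder near the codebook
    return recon + codebook + beta * commit
```

Note how each `.detach()` decides which table row the gradient reaches: the codebook term stops gradients into the encoder, the commitment term stops them into the codebook.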
In practice, Chameleon uses EMA codebook updates instead of the gradient-based codebook loss:
e_k ← γ · e_k + (1 - γ) · mean(z_e mapped to k) (γ = 0.99)
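A sketch of that EMA rule (omitting the cluster-size smoothing the original VQ-VAE EMA variant also tracks; the loop form is for clarity, not speed):

```python
import numpy as np

def ema_update(codebook, z_e_flat, assignments, gamma=0.99):
    """e_k <- gamma * e_k + (1 - gamma) * mean of encoder outputs mapped to k."""
    new_codebook = codebook.copy()
    for k in range(len(codebook)):
        assigned = z_e_flat[assignments == k]
        if len(assigned) > 0:  # dead codes (no assignments) are left untouched
            new_codebook[k] = gamma * codebook[k] + (1 - gamma) * assigned.mean(axis=0)
    return new_codebook
```

Because unassigned codes are never pulled toward the data, EMA updates alone do nothing for dead entries — which is exactly the utilization problem described next.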
With K = 8,192 entries, typical utilization is only 40–70%. Thousands of entries go unused — the classic codebook-collapse failure: codes initialized far from the encoder's output distribution are never selected, and a rich-get-richer dynamic keeps updating only the codes already in use.
Mitigations: code reset (replace dead codes with sampled encoder outputs), EMA decay, entropy regularization. Even with mitigations, the effective information capacity per position is less than the theoretical log2(8192) = 13 bits.
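Code reset is simple to sketch. This toy version re-seeds dead entries with randomly sampled encoder outputs — an assumption about the mitigation's exact form, not Chameleon's documented recipe:

```python
import numpy as np

def reset_dead_codes(codebook, assignments, z_e_flat, rng):
    """Replace never-assigned codebook entries with sampled encoder outputs."""
    used = np.unique(assignments)
    utilization = len(used) / len(codebook)
    new_codebook = codebook.copy()
    dead = np.setdiff1d(np.arange(len(codebook)), used)
    if len(dead) > 0:
        samples = z_e_flat[rng.integers(0, len(z_e_flat), size=len(dead))]
        new_codebook[dead] = samples
    return new_codebook, utilization
```

Re-seeding from real encoder outputs guarantees the revived codes sit where the data actually lives, so they have a chance of being selected on the next batch.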
Standard multi-head attention computes:
attn_logits = Q · K^T / √(d_h)
When ||Q|| and ||K|| grow large (which happens with mixed modalities), logits explode → softmax saturates → one-hot attention → gradient vanishing.
Chameleon’s fix:

```
Q_hat = Q / ||Q||_2                          // normalize to unit vector
K_hat = K / ||K||_2                          // normalize to unit vector
attn_logits = τ_h · Q_hat · K_hat^T / √(d_h)
```

where τ_h is a learnable temperature per head.
Now dot products are bounded to [-1, 1], and τ_h controls sharpness: higher τ = sharper focus on specific positions, lower τ = broader attention.
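A quick numeric check of that bound — toy numpy arrays standing in for exploding activations, not real model states:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 64
Q = 100.0 * rng.normal(size=(8, d_h))   # simulate blown-up activation norms
K = 100.0 * rng.normal(size=(8, d_h))
tau = 1.0

raw = Q @ K.T / np.sqrt(d_h)            # magnitudes in the thousands -> softmax saturates

Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
normed = tau * (Qn @ Kn.T) / np.sqrt(d_h)   # cosine similarities: always in [-1, 1]
```

However large the activations grow, `normed` never exceeds τ/√(d_h) in magnitude, so the softmax can never be pushed into its saturated regime by norm growth alone.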
| Configuration | Training status |
|---|---|
| Standard attention, no QK-Norm | Diverges at ~500B tokens |
| QK-Norm, standard LayerNorm | Diverges at ~2T tokens |
| QK-Norm + revised Pre-Norm | Stable through 10T+ tokens ✔ |
QK-Norm appeared in ViT-22B and nGPT, but Chameleon’s contribution is proving it’s essential for mixed-modal early fusion at scale — make-or-break, not nice-to-have.
Text loss and image loss exhibit anti-correlated oscillations. When gradient updates optimize for image prediction, shared weights shift toward image-favorable representations — text prediction temporarily suffers, and vice versa.
Management: data ratio scheduling (more text early, more interleaved later), gradient norm monitoring, and a two-stage training process (pre-training on all modalities, then alignment with curated safety-filtered data).
```python
def chameleon_train_step(batch, model, vqvae, optimizer):
    optimizer.zero_grad()
    total_loss = 0
    for document in batch:
        token_sequence = []
        for element in document:
            if element.type == "text":
                token_sequence.extend(bpe_tokenize(element.text))
            elif element.type == "image":
                with torch.no_grad():  # VQ-VAE is frozen
                    z_e = vqvae.encoder(element.pixels)
                    indices = vqvae.quantize(z_e)  # [32, 32] ints
                img_tokens = indices.flatten().tolist()
                # Offset past the text vocabulary into the shared id space
                img_tokens = [t + TEXT_VOCAB_SIZE for t in img_tokens]
                token_sequence.append(IMAGE_START)
                token_sequence.extend(img_tokens)
                token_sequence.append(IMAGE_END)
        # Standard causal LM - no special attention mask needed
        input_ids = token_sequence[:-1]
        target_ids = token_sequence[1:]
        logits = model(input_ids)  # [seq_len, 73828]
        # ONE cross-entropy loss over ALL positions
        loss = cross_entropy(logits, target_ids)
        total_loss += loss
    total_loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```
Note the simplicity compared to Transfusion’s training step: no noise sampling, no timestep conditioning, no dual losses.
```python
class QKNormAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_h = d_model // n_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)
        self.tau = nn.Parameter(torch.ones(n_heads, 1, 1))  # per-head temperature

    def forward(self, x, causal_mask):
        B, S, _ = x.shape
        # Project, then reshape to [B, heads, S, d_h]
        Q = self.W_Q(x).view(B, S, self.n_heads, self.d_h).transpose(1, 2)
        K = self.W_K(x).view(B, S, self.n_heads, self.d_h).transpose(1, 2)
        V = self.W_V(x).view(B, S, self.n_heads, self.d_h).transpose(1, 2)
        Q = F.normalize(Q, dim=-1)  # unit vectors: dot products bounded to [-1, 1]
        K = F.normalize(K, dim=-1)
        logits = self.tau * (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_h)
        logits = logits.masked_fill(~causal_mask, float("-inf"))
        weights = F.softmax(logits, dim=-1)
        out = (weights @ V).transpose(1, 2).reshape(B, S, -1)
        return self.W_O(out)
```
```python
def generate_image(model, text_context, vqvae, temp=0.9, top_p=0.95):
    tokens = tokenize(text_context) + [IMAGE_START]
    for i in range(1024):
        logits = model(tokens)[-1]
        # Mask out text tokens - only sample from the image vocabulary
        logits[:TEXT_VOCAB_SIZE] = float("-inf")
        probs = softmax(logits / temp, dim=-1)
        # Nucleus (top-p) sampling
        sorted_p, sorted_idx = torch.sort(probs, descending=True)
        cumsum = torch.cumsum(sorted_p, dim=-1)
        mask = (cumsum - sorted_p) > top_p  # drop tokens outside the nucleus
        sorted_p[mask] = 0.0
        sorted_p /= sorted_p.sum()
        next_token = sorted_idx[torch.multinomial(sorted_p, 1)].item()
        tokens.append(next_token)
    tokens.append(IMAGE_END)
    # Decode via the frozen VQ-VAE
    img_indices = [t - TEXT_VOCAB_SIZE for t in tokens[-1025:-1]]
    indices = torch.tensor(img_indices).reshape(32, 32)
    with torch.no_grad():
        z_q = vqvae.codebook_lookup(indices)
        image = vqvae.decoder(z_q)
    return image
```
| Contribution | Novelty | Notes |
|---|---|---|
| Early fusion at 34B scale | ⭐⭐⭐⭐⭐ 5/5 | First to prove tokenize-everything works at this scale for generation |
| QK-Norm for mixed-modal stability | ⭐⭐⭐⭐ 4/5 | QK-Norm existed, but proving it’s essential for mixed-modal is new |
| Interleaved generation | ⭐⭐⭐⭐⭐ 5/5 | First model to convincingly generate naturally interleaved text-image documents |
| Alignment for multimodal | ⭐⭐⭐ 3/5 | RLHF-style alignment applied to multimodal — needed to be done |
| Architecture itself | ⭐⭐ 2/5 | Standard transformer + VQ-VAE — innovation is in training, not architecture |
No FID reported. The paper never reports FID on a standard benchmark. The VQ-VAE bottleneck makes FID uncompetitive with SDXL/DALL-E 2 — this is a strategic omission. They focus on mixed-modal evaluation where they’re stronger.
Self-created human eval. The “beats GPT-4V 51.6%” headline comes from a benchmark the authors designed, with prompts they chose, criteria they set, and annotation they ran. 51.6% is barely above a coin flip. No independent replication.
Missing compute comparisons. No total training FLOPs, no GPU-hours, no inference latency. We can’t tell if the multimodal capability is “free” or expensive relative to text-only.
Safety gating tells a story. Meta released the 7B model with image generation disabled and safety-gated the 34B. This means: Meta is confident in understanding capabilities (released openly), but NOT confident they’ve solved safety for generation (restricted access). The alignment stage likely reduced but didn’t eliminate harmful image generation.
| | Chameleon | Transfusion |
|---|---|---|
| Core thesis | Simplicity wins — one loss, one vocabulary | Quality wins — continuous is worth the complexity |
| Strongest evidence | Mixed-modal generation works; scales to 34B; human eval | Better FID at lower compute; explicit scaling curves |
| Weakest evidence | No FID; self-created benchmark; compute not reported | Never tested interleaved generation; only 7B |
| Real-world readiness | Closer — actually generates documents | Image quality too low without upsampler |
| What it needs | Better VQ-VAE (or switch to continuous) | Scale to 34B+; test interleaved; add resolution |
These papers are complementary, not competitive. Chameleon proved the training recipe and the product concept. Transfusion proved the representation and efficiency. The model that wins in production will combine both — Chameleon’s training stability and interleaved capability with Transfusion’s continuous image representation.
Chameleon, alongside Transfusion, kicked off a wave of unified multimodal models. Here’s what it sparked and what comes next.
| Paper | Date | Approach | Key result |
|---|---|---|---|
| Emu3 (BAAI → Nature) | Sep 2024 | Tokenize everything, but much better visual tokenizer (SBER-MoVQGAN, 32K codebook) | Proves Chameleon’s thesis: bottleneck was VQ-VAE quality, not the approach. Matches SDXL on FID. |
| Janus-Pro (DeepSeek) | Jan 2025 | Separate vision encoders for understanding vs generation | Key insight: what makes good visual representation for understanding ≠ generation |
| JanusFlow (DeepSeek) | Jan 2025 | Janus’s decoupled encoders + rectified flow (continuous) | Bridges Chameleon vs Transfusion — unified training + continuous generation |
| Discrete Diffusion Timestep Tokens | Apr 2025 | Discrete tokens + diffusion scheduling hybrid | Gets best of both: Chameleon’s simplicity + diffusion’s iterative refinement |
| Show-o | 2024 | Discrete diffusion for images (masking/unmasking) | Another hybrid: Chameleon’s vocabulary + diffusion-like generation |
Emu3 deserves special attention: same approach (tokenize everything, single next-token prediction loss), but with a much better visual tokenizer. Published in Nature — rare for an ML paper. Matches SDXL on FID while also being a strong language model. This proves Chameleon’s core claim was right. The bottleneck wasn’t the discrete approach — it was the VQ-VAE quality.
Janus-Pro’s key finding: what makes a good visual representation for understanding (semantic, abstract) is different from what makes a good representation for generation (pixel-precise, detailed). Use SigLIP/CLIP for understanding, VQ-VAE/VAE for generation, share the transformer backbone. The cost of two vision encoders is minimal compared to the transformer.
```
May 2024: Chameleon                    Aug 2024: Transfusion
  "Tokenize everything"                  "Continuous + diffusion"
          |                                      |
          v                                      v
Sep 2024: Emu3                         Jan 2025: JanusFlow
  "Better tokenizer fixes it"            "Decoupled encoders + flow"
          |                                      |
          +------------------+------------------+
                             |
                             v
                  2025–2026: Convergence
```
Emerging consensus:
1. Separate encoders for understanding vs generation
2. Better tokenizers (continuous OR high-quality discrete)
3. Unified transformer backbone
4. Training stability tricks (QK-Norm) are essential
Chameleon’s 8,192-entry VQ-VAE with 40–70% utilization is the biggest bottleneck. Paths: larger codebook (32K–64K), SBER-MoVQGAN (Emu3’s approach), Finite Scalar Quantization (FSQ — eliminates codebook collapse by construction), or Lookup-Free Quantization (LFQ — exponential codebook without explicit entries).
One VQ-VAE serving double duty is a forced compromise. Use SigLIP/CLIP for understanding (optimized for semantics), VQ-VAE/VAE for generation (optimized for pixel reconstruction), share the transformer. Janus-Pro proved this works.
1,024 tokens per image is expensive (quadratic attention, sequential generation). Paths: higher compression VQ-VAE (256 tokens), hierarchical coarse-to-fine (64 + 256), variable-length tokenization (simple images get fewer tokens), or multi-scale with upsampler. Dropping to 256 tokens would make Chameleon compute-competitive with Transfusion.
At 1,024 tokens per frame and 24 fps: 1 second = 24,576 tokens, 10 seconds = 245,760 tokens. Computationally intractable with current sequence lengths. Needs: temporal compression (1 token-set per keyframe), 3D VQ-VAE for video volumes, sparse attention across frames.
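The arithmetic above checks out directly:

```python
# Sanity check of the video cost estimate at Chameleon's current tokenizer rate.
tokens_per_frame = 1024
fps = 24
one_second = tokens_per_frame * fps   # tokens for one second of video
ten_seconds = one_second * 10         # tokens for a ten-second clip
```

At a quarter of a million tokens for ten seconds, quadratic attention over raw per-frame tokens is clearly a non-starter without temporal compression.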
Does the multimodal tax shrink at larger scale? Hypothesis: yes — at 70B+, the transformer has enough capacity that text and image objectives stop competing. Only a handful of labs can attempt this.
| Dimension | Rating | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ 4/5 | Early fusion with discrete tokens at scale for generation + understanding. QK-Norm contribution is real. |
| Rigor | ⭐⭐⭐ 3/5 | Good breadth but key metrics missing (FID, compute). Self-created benchmark. Safety gating limits repro. |
| Impact | ⭐⭐⭐⭐⭐ 5/5 | ICLR 2025. Spawned Emu3 (Nature), Janus-Pro, and the entire “tokenize everything” direction. |
| Clarity | ⭐⭐⭐⭐ 4/5 | Well-written but some key details vague (training data ratios, alignment recipe). Stability section is excellent. |
| Relevance | ⭐⭐⭐⭐⭐ 5/5 | Closest existence proof to a media gen agent product. Proves interleaved generation is viable. |
| Overall | ⭐⭐⭐⭐ 4.2/5 | |
| | Chameleon | Transfusion |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ 4/5 | ⭐⭐⭐⭐ 4/5 |
| Rigor | ⭐⭐⭐ 3/5 | ⭐⭐⭐⭐ 4/5 |
| Impact | ⭐⭐⭐⭐⭐ 5/5 | ⭐⭐⭐⭐⭐ 5/5 |
| Clarity | ⭐⭐⭐⭐ 4/5 | ⭐⭐⭐⭐⭐ 5/5 |
| Relevance | ⭐⭐⭐⭐⭐ 5/5 | ⭐⭐⭐⭐⭐ 5/5 |
Together, Chameleon and Transfusion define the design space. Chameleon is the “simplicity” pole, Transfusion is the “quality” pole. Every model since sits somewhere on the spectrum. For your media gen agent: start with Chameleon’s training recipe (QK-Norm, interleaved data), use a better tokenizer (Emu3’s or continuous), consider Janus’s decoupled encoders. Multi-image consistency is still unsolved — that’s your product’s biggest technical risk.
| Improvement vector | Status | Key work |
|---|---|---|
| Fix tokenizer | Addressed by Emu3 | SBER-MoVQGAN, FSQ, LFQ |
| Decouple encoders | Addressed by Janus | Separate understanding vs generation |
| Reduce tokens/image | Area to explore | Higher compression, variable-length |
| Video generation | Area to explore | Temporal compression, 3D VQ-VAE |
| Scale to 70B+ | Area to explore | Multimodal tax at larger scale |