Zhou, Yu, Alon, Levy et al. — August 2024 (ICLR 2025 Oral)
AI today has two brilliant employees. One is amazing at writing — give it a prompt and it’ll write a perfect essay, word by word. The other is amazing at painting — give it a description and it’ll create a masterpiece. But they work in completely different buildings, speak different languages, and can’t collaborate.
Language models (like LLaMA, GPT) are great at text. Diffusion models (like Stable Diffusion, DALL-E) are great at images. But they’re separate systems stitched together with duct tape. Transfusion says: what if one brain could do both?
The old way (Chameleon) takes a beautiful steak photo, chops it into tiny numbered LEGO pieces (discrete tokens), and has the text chef reassemble it. The chef is great at words but the LEGO steak looks blocky — you lost information when you chopped it up. Transfusion keeps the image smooth and continuous. The same chef switches between writing (text) and painting (images), using different techniques but the same brain.
Transfusion trains a single transformer on two objectives simultaneously. It uses a different “recipe” for each type of data, but the same brain processes both: when it sees text tokens, it does next-word prediction; when it sees image patches, it does denoising. The two losses are simply added together.
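The combined objective can be sketched in a few lines. This is a toy numpy sketch, not the paper's code; `total_loss`, the shapes, and the softmax-by-hand are all illustrative:

```python
import numpy as np

def total_loss(text_logits, text_targets, noise_pred, noise_true, lam=1.0):
    # Cross-entropy over text positions (next-token prediction)
    shifted = text_logits - text_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    lm = -np.mean(np.log(probs[np.arange(len(text_targets)), text_targets]))
    # Diffusion MSE over image patch positions (noise prediction)
    ddpm = np.mean((noise_pred - noise_true) ** 2)
    return lm + lam * ddpm

rng = np.random.default_rng(0)
loss = total_loss(rng.standard_normal((5, 10)),   # logits for 5 text positions
                  np.array([1, 2, 3, 4, 5]),      # next-token targets
                  rng.standard_normal((4, 8)),    # predicted noise for 4 patches
                  rng.standard_normal((4, 8)))    # true noise
```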
Why it matters: this paper from Meta is the architectural thesis for unified media generation and understanding. For a media-generation agent producing comics, magazines, and slide decks, Transfusion can natively produce both text and images in one pass, not two separate systems glued together.
Transfusion trains on a carefully balanced diet:
| Data type | Source | Proportion |
|---|---|---|
| Text | Tokenized text corpus (same recipe as LLaMA) | ~50% of tokens |
| Image-text pairs | Paired datasets (image + caption, VAE-encoded) | ~50% of tokens |
Total: 2 trillion tokens for the 7B model.
Sequence layout: [text] [BOI] [patches] [EOI] [more text]

Balancing: λ = 1 (equal weighting) works well — the two objectives are surprisingly compatible.
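A pure-Python toy of how one interleaved sequence is laid out. The `BOI`/`EOI` markers are special tokens; the `patch_i` strings stand in for continuous latent vectors:

```python
# Toy layout of one interleaved training sequence. In the real model the
# patch slots hold continuous VAE latents, not strings.
BOI, EOI = "<BOI>", "<EOI>"

def layout(text_before, n_patches, text_after):
    return (list(text_before) + [BOI]
            + [f"patch_{i}" for i in range(n_patches)]
            + [EOI] + list(text_after))

seq = layout(["A", "photo", "of"], 4, ["on", "a", "table"])
```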
| | Standard diffusion | Transfusion |
|---|---|---|
| Denoiser | U-Net predicts noise | Transformer predicts noise |
| Architecture | Separate model | Same model that does text |
| Operates on | Full image latent | Per-patch vectors in a sequence |
| Text conditioning | Cross-attention | Causal attention (text precedes image) |
Each image patch gets a timestep embedding added before entering the transformer — this tells the model “these patches have noise level t.”
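One way to realize this, as a numpy sketch with illustrative dimensions (the sinusoidal form is the standard DDPM choice, assumed here rather than taken from the paper):

```python
import numpy as np

def timestep_embedding(t, dim):
    # Sinusoidal embedding of the scalar timestep t into a vector of size dim
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

patches = np.zeros((256, 64))                       # [num_patches, d_model]
patches_in = patches + timestep_embedding(37, 64)   # same t broadcast to every patch
```

Because every patch of one image shares the same noise level, a single embedding vector is broadcast across all of that image's patches.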
Step 1 — Text generation. The model generates text tokens one at a time, left to right, exactly like GPT. When it predicts [BOI] (beginning of image), it switches to image mode.
Step 2 — Image generation. Start from pure noise. Run N denoising passes through the transformer, each pass predicting and subtracting noise, producing progressively cleaner patches. After N steps, decode patches to pixels via the VAE decoder.
Step 3 — Resume text. The model continues generating text autoregressively, now conditioned on both the preceding text AND the generated image. When it predicts another [BOI], it generates another image.
| Component | Cost |
|---|---|
| Text generation | 1 forward pass per token (standard) |
| Image generation | N forward passes per image (N = denoising steps, typically 250) |
| Image decoding | 1 VAE decoder pass (cheap) |
| Model size | Parameters | Key finding |
|---|---|---|
| 0.16B | 160M | Even tiny models benefit from the dual objective |
| 0.37B | 370M | Transfusion pulls ahead of Chameleon on images |
| 0.76B | 760M | Gap widens — efficiency advantage grows with scale |
| 7B | 7B | Matches LLaMA-1 on text, beats DALL-E 2 on images |
The scaling curve tells two stories: text quality scales similarly for both approaches, but Transfusion scales much better for images. The gap grows with compute.
| Model | Type | Text | Image | Unified? |
|---|---|---|---|---|
| LLaMA-1 7B | Text-only LM | ✔ Strong | ✘ None | No |
| DALL-E 2 | Image-only diffusion | ✘ None | ✔ Good | No |
| SDXL | Image-only diffusion | ✘ None | ✔ Better | No |
| Chameleon 7B | Unified (discrete tokens) | ✔ Strong | ⚪ Okay | Yes |
| Show-o | Unified (mixed) | ⚪ Decent | ⚪ Decent | Yes |
| Transfusion 7B | Unified (continuous) | ✔ Strong | ✔ Good | Yes |
Transfusion is the first model competitive with dedicated systems on both modalities simultaneously. Previous unified models always sacrificed one for the other.
| Limitation | Detail |
|---|---|
| Image resolution | Trained at 256×256 only — modern models generate at 1024×1024+ |
| No video | Framework could extend but wasn’t tested |
| VAE dependency | Image quality capped by the VAE’s reconstruction ability |
| Inference speed | 250 denoising steps per image is slow |
| No interleaved training | Trained on pairs, not true interleaved documents |
| 7B only | Didn’t push to 70B+ where advantages would likely grow |
GenEval tests compositional image generation — not “does the image look pretty” but “did the model actually generate what you asked for?”
| Skill tested | Example | What it checks |
|---|---|---|
| Single object | “a backpack” | Can it generate one object correctly? |
| Two objects | “a cat and a dog” | Both present, not just one? |
| Counting | “three apples” | Correct count? |
| Colors | “a red car and a blue truck” | Right colors on right objects? |
| Position | “cat to the left of a dog” | Spatial relationships? |
| Color attribution | “green apple and red chair” | Binding color to correct object (hardest) |
Scoring uses an object detector (DDETR) to check presence, color, position, and count.
| Model | Single | Two obj | Count | Colors | Position | Color attr | Overall |
|---|---|---|---|---|---|---|---|
| DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.01 | 0.52 |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| Transfusion 7B | 0.99 | 0.78 | 0.42 | 0.87 | 0.16 | 0.40 | 0.63 |
Transfusion dominates on color attribution (0.40 vs 0.23 vs 0.01) — the hardest compositional skill. Still struggles with counting and position, which remain unsolved even for dedicated image models.
Chameleon converts images into 1,024 discrete tokens, each processed like a text token. Attention cost scales as O(n²) with sequence length.
Saving #1 — Fewer patches.
| Representation | Tokens/patches per image | Relative length |
|---|---|---|
| Chameleon | 1,024 tokens | 1.0× |
| Transfusion (default) | 256 patches | 0.25× |
| Transfusion (compressed) | 16 patches | 0.016× |
Saving #2 — Compression at inference. 256 → 64 patches: FID goes from 6.78 to 7.03 (barely noticeable). 256 → 16 patches: FID goes to 11.0 (still decent, 64× cheaper). You can choose your quality–cost tradeoff at inference time.
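A numpy sketch of the shape arithmetic behind compression. Merging k×k neighborhoods of latent patches into one vector turns a 16×16 grid (256 patches) into 8×8 (64) or 4×4 (16); the mean-pooling here is a stand-in for the model's learned merge, which this summary does not specify:

```python
import numpy as np

def compress_patches(z, k):
    # z: [H, W, d] grid of latent patch vectors; merge k*k neighborhoods
    H, W, d = z.shape
    z = z.reshape(H // k, k, W // k, k, d)
    return z.mean(axis=(1, 3)).reshape(-1, d)   # [(H/k)*(W/k), d]

z = np.random.randn(16, 16, 8)
assert compress_patches(z, 2).shape == (64, 8)   # 256 -> 64 patches
assert compress_patches(z, 4).shape == (16, 8)   # 256 -> 16 patches
```

Since attention cost is quadratic in sequence length, the 256 → 16 reduction is what delivers the 64× saving quoted above.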
Saving #3 — No codebook bottleneck. No quantization loss, no codebook collapse, no straight-through estimator hacks.
Saving #4 — Better scaling curve. At every FLOP budget, Transfusion produces better images. To match Transfusion at 10²¹ FLOPs, Chameleon needs ~3.3× more compute.
Shared backbone: All transformer layers (attention + FFN) process both text and image data. The model learns shared representations across modalities.
What’s separate: Input embeddings (text uses token embeddings; images use a VAE encoder + linear projection) and output heads (text uses softmax over vocabulary; images use a U-Net-inspired decoder).
Attention masking — the critical design choice:
U-Net skip connections: Intermediate representations from earlier transformer layers are concatenated with the final layer output and fed to the image decoder. This gives access to both low-level features (textures, edges) and high-level semantics.
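A toy sketch of the idea with assumed shapes; `W_out` stands in for the image decoder's first projection, not the paper's actual U-Net blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_patches = 8, 4

early = rng.standard_normal((n_patches, d_model))    # from an early transformer layer
final = rng.standard_normal((n_patches, d_model))    # final layer output
W_out = rng.standard_normal((2 * d_model, d_model))  # image decoder projection (toy)

# Concatenate early and final features so the decoder sees both
decoder_input = np.concatenate([early, final], axis=-1)  # [n_patches, 2*d_model]
pred_noise = decoder_input @ W_out                       # [n_patches, d_model]
```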
| Innovation | What | Why it matters |
|---|---|---|
| Continuous patches | Images as VAE-encoded vectors, not discrete tokens | No information loss, fewer patches needed |
| Dual loss | Cross-entropy + diffusion MSE, added together | One model learns both objectives |
| Bidirectional intra-image | Image patches attend to each other freely | Images aren’t sequential — holistic reasoning |
| Causal inter-modal | Text is autoregressive, images condition on text | Maintains language model properties |
| U-Net skip connections | Early-layer features fed to image decoder | Better low-level detail in generated images |
| Patch compression | 256 → 64 → 16 patches | Tunable quality–cost tradeoff at inference |
The joint objective:
L_total = L_LM + λ · L_DDPM
Language modelling loss (standard next-token cross-entropy over text positions):
L_LM = -Σ_{i ∈ T} log P_θ(x_i | x_{<i})
Diffusion loss (noise prediction over image positions):
L_DDPM = Σ_{j ∈ I} E_{t,ε} [ || ε - ε_θ(z_j^(t), t, c) ||² ]
where z_j^(t) = √(ᾱ_t) · z_j + √(1 - ᾱ_t) · ε
Here z_j is the clean latent patch vector from the VAE encoder, t is a uniformly sampled timestep, ε is the Gaussian noise added, and ᾱ_t is the cumulative noise schedule (cosine schedule from Nichol & Dhariwal 2021).
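The forward noising step can be coded directly from the formula. A numpy sketch; T = 1000 and the latent shape are illustrative:

```python
import numpy as np

def alpha_bar(t, T, s=0.008):
    # Cosine cumulative schedule (Nichol & Dhariwal 2021 form)
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def add_noise(z, t, T, rng):
    # z^(t) = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps
    eps = rng.standard_normal(z.shape)
    ab = alpha_bar(t, T)
    return np.sqrt(ab) * z + np.sqrt(1 - ab) * eps, eps

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 8))        # clean latent patches from the VAE
z_t, eps = add_noise(z, t=500, T=1000, rng=rng)
```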
Given text positions T and image positions I_k for the k-th image, the attention mask M has five rules:
| Case | Condition | M = 1? |
|---|---|---|
| Within same image | i, j ∈ I_k | ✔ Bidirectional |
| Text → text | i, j ∈ T, j ≤ i | ✔ Causal |
| Image → preceding text | i ∈ I_k, j < min(I_k) | ✔ Allowed |
| Text → preceding image | i ∈ T, j ∈ I_k, max(I_k) < i | ✔ Allowed |
| Across different images | i ∈ I_k, j ∈ I_{k′}, k ≠ k′ | ✘ Blocked |
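The rules in the table translate directly into a mask builder. A small numpy sketch, where `images` lists each image's [start, end) index range and everything else is text:

```python
import numpy as np

def build_mask(n, images):
    # Map each position to its image index (None = text)
    in_image = [None] * n
    for k, (s, e) in enumerate(images):
        for i in range(s, e):
            in_image[i] = k
    M = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            same_img = in_image[i] is not None and in_image[i] == in_image[j]
            cross_img = (in_image[i] is not None and in_image[j] is not None
                         and in_image[i] != in_image[j])
            if same_img:
                M[i, j] = True          # bidirectional within one image
            elif j <= i and not cross_img:
                M[i, j] = True          # causal for everything else
    return M

M = build_mask(6, images=[(2, 4)])      # [t, t, img, img, t, t]
assert M[2, 3] and M[3, 2]              # intra-image attention is bidirectional
assert M[1, 0] and not M[0, 1]          # text-to-text attention is causal
```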
```python
def train_step(batch, model, vae_encoder, lambda_=1.0):
    for sequence in batch:
        text_embeds, patches = [], []
        for element in sequence:
            if element.type == "text":
                text_embeds.append(model.text_embed(element.token_id))
            elif element.type == "image":
                z = vae_encoder(element.pixels)      # [256, d_vae]
                z = model.image_proj(z)              # [256, d_model]
                t = randint(1, T)                    # sample one timestep per image
                eps = randn_like(z)                  # sample noise
                z_noisy = sqrt(alpha_bar[t]) * z + sqrt(1 - alpha_bar[t]) * eps
                z_noisy += model.timestep_embed(t)   # add timestep info
                patches.append((z_noisy, eps, t))
        # Interleave text embeddings and noisy patches; build the attention mask
        mixed_sequence, mask = build_sequence(text_embeds, patches)
        hidden = model.transformer(mixed_sequence, mask)
        # Dual loss: cross-entropy over text positions + MSE over image positions
        lm_loss = cross_entropy(hidden[text_pos], next_tokens)
        dif_loss = mse(model.image_decoder(hidden[img_pos]), target_noise)
        loss = lm_loss + lambda_ * dif_loss
        loss.backward()
```
```python
def generate_document(model, vae_decoder, prompt):
    tokens = tokenize(prompt)
    while not done:
        # --- Autoregressive text ---
        next_token = model.predict_next(tokens)
        tokens.append(next_token)
        if next_token == BOI:
            # --- Iterative image denoising ---
            latents = randn(256, d_vae)              # start from pure noise
            for step in range(N, 0, -1):
                t_emb = model.timestep_embed(step)
                pred_noise = model.forward_image(tokens, latents + t_emb)
                latents = ddpm_update(latents, pred_noise, step)
            image = vae_decoder(latents)             # decode patches to pixels
            tokens.append(image_repr)                # generated patches re-enter the context
            tokens.append(EOI)
    return tokens
```
DDPM update step:
z^(t-1) = (1/√α_t) · (z^(t) - (1-α_t)/√(1-ᾱ_t) · ε_θ) + σ_t · w
where w ~ N(0, I) for t > 1, σ_1 = 0
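A minimal numpy sketch of this update. The schedule values are illustrative, and σ_t = √(1 − α_t) is one common choice (the summary above doesn't pin it down):

```python
import numpy as np

def ddpm_update(z_t, eps_pred, t, alphas, alpha_bars, rng):
    # z^(t-1) = (1/sqrt(a_t)) * (z^(t) - (1-a_t)/sqrt(1-abar_t) * eps) + sigma_t * w
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (z_t - (1 - a_t) / np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(a_t)
    if t > 1:
        sigma_t = np.sqrt(1 - a_t)               # assumed choice for sigma_t
        return mean + sigma_t * rng.standard_normal(z_t.shape)
    return mean                                  # sigma_1 = 0: last step is deterministic

T = 10
betas = np.linspace(1e-4, 0.02, T + 1)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))
z_prev = ddpm_update(z, rng.standard_normal((4, 8)), t=5,
                     alphas=alphas, alpha_bars=alpha_bars, rng=rng)
```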
Continuous patches — significant novelty. Demonstrating that you don’t need to discretize images, and that keeping them continuous is strictly better, is an important empirical finding.
Bidirectional intra-image attention — elegant solution. The attention mask design is the architectural innovation that makes the whole thing work.
Thorough scaling study — four model sizes, clear compute–quality curves, good ablations on attention masks and skip connections.
U-Net skip connections in a transformer — nice practical contribution to image generation quality.
“Matches LLaMA-1 on text” — true at 7B/2T, but LLaMA-1 is a 2023 model. Small but consistent text degradation when image data ratio increases. The Pareto frontier isn’t fully quantified.
“Beats DALL-E 2 and SDXL” — on GenEval (compositional accuracy), not on FID (perceptual quality). Human preference was never tested. The paper cherry-picks the benchmark where it wins.
256×256 only — the elephant in the room. Scaling to 1024 means 16K patches → quadratic attention cost explodes.
No interleaved document evaluation — the paper’s core promise is unified multimodal generation, but text and image benchmarks are tested separately. Nobody tested actual interleaved output quality.
VAE as black box — image quality ceiling set by the VAE, unexplored. How sensitive are results to VAE quality?
Single timestep per image — each training step samples one noise level per image. Is the model getting enough diffusion signal given ~50% of compute goes to text?
| Year | Milestone | Contribution |
|---|---|---|
| 2022 | Flamingo | Frozen LM + frozen vision, glue layer — “you can combine them” |
| 2023 | Chameleon | Discrete tokens for everything — “you can unify them” |
| 2024 | Transfusion | Continuous diffusion inside LM — “you can unify them without information loss” |
| 2025+ | ??? | High-res, multi-image consistency, video, editing — “the real product” |
Transfusion is a strong proof of concept that continuous diffusion inside a language model is viable and efficient. But it’s exactly that — a proof of concept at 256px. The production version needs higher resolution, multi-image consistency, controllability, and faster inference.
Transfusion received an ICLR 2025 Oral (top ~1% of submissions). Here’s what it sparked and what comes next.
| Paper | Date | Key idea | Relationship |
|---|---|---|---|
| Janus-Pro (DeepSeek) | Jan 2025 | Unified understanding + generation, separate vision encoders | Validates unified thesis but uses discrete tokens — opposite bet |
| Show-o Turbo | Feb 2025 | Accelerated unified model, discrete diffusion | Parallel work, focuses on inference speed |
| MADFormer | Jun 2025 | Mixed autoregressive + diffusion transformer | Direct evolution — explores which patches should be AR vs diffused |
| Lumina-mGPT 2.0 | Jul 2025 | Standalone AR image gen, no diffusion needed | Challenges claim that you need diffusion for images |
| ImAgent | 2025 | Training-free unified multimodal agent | Uses unified models as agent components |
The community has divided into two camps: Continuous + Diffusion (Transfusion’s bet — MADFormer, flow-matching variants) vs Discrete Tokens (Chameleon’s bet — Janus-Pro, Emu3, Show-o). Consensus as of early 2026: continuous representations are winning for image quality, but discrete tokens are simpler to scale. No clear knockout yet.
256px is a toy. 1024px with 8× downsampling = 16,384 patches — impossible with standard O(n²) attention. Paths forward: hierarchical generation (generate low-res, super-resolve), windowed/sparse attention, progressive patch compression (coarse-to-fine), or latent cascade with a lightweight upsampler.
Generate a 4-page comic → the fox looks different on every page. Paths: entity embeddings (persistent identity vector injected into each image), cross-image attention (let later image patches attend to earlier images), reference image conditioning. The attention mask needs a new rule allowing later images to reference earlier ones.
250 denoising steps × full transformer forward pass = very slow. Paths: DDIM (→ 50 steps), DPM-Solver++ (→ 20–25), consistency distillation (→ 1–4), flow matching (→ 10–20). Flow matching (behind SD3 and Flux) is the likely winner — mathematically simpler, straighter trajectories, fewer steps. Fully compatible with Transfusion’s framework.
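For reference, the deterministic DDIM step (η = 0) that cuts step count without retraining looks like this, as a numpy sketch with illustrative ᾱ values:

```python
import numpy as np

def ddim_update(z_t, eps_pred, ab_t, ab_prev):
    # Estimate the clean latent, then jump to the previous noise level
    z0_hat = (z_t - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * z0_hat + np.sqrt(1 - ab_prev) * eps_pred

z = np.zeros((4, 8))
eps = np.ones((4, 8))
z_prev = ddim_update(z, eps, ab_t=0.5, ab_prev=0.9)
```

Because the same trained noise predictor drives both samplers, this is a drop-in replacement for `ddpm_update` in the generation loop.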
No way to condition on pose, depth, edges, or style references. Paths: ControlNet-style adapters, concatenating control embeddings with image patches, T2I-Adapter approach. The continuous patch representation makes this easier than adding control to discrete-token approaches.
The framework naturally extends to video. Each frame → patches with bidirectional intra-frame attention, plus temporal attention across frames. Frames can be diffused jointly or autoregressively (generate frame 1, condition frame 2 on it).
Image quality is capped by VAE reconstruction quality. Low-hanging fruit: swapping to SDXL’s VAE (~20% better reconstruction), consistency decoder for sharper high-frequency detail, adversarial training (VAE-GAN), or higher-resolution latents with less spatial downsampling.
| Dimension | Rating | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ 4/5 | Continuous diffusion inside LM is meaningful; pieces existed but the combination is new |
| Rigor | ⭐⭐⭐⭐ 4/5 | Thorough scaling experiments and ablations; loses a point for 256px and no interleaved eval |
| Impact | ⭐⭐⭐⭐⭐ 5/5 | ICLR 2025 Oral; spawned a research direction; every unified model paper references it |
| Clarity | ⭐⭐⭐⭐⭐ 5/5 | Exceptionally well-written; implementable from the paper alone |
| Relevance | ⭐⭐⭐⭐⭐ 5/5 | Architectural thesis for a media generation agent |
| Reproducibility | ⭐⭐⭐⭐ 4/5 | Clear methodology but requires significant compute; no public model weights |
| Overall | ⭐⭐⭐⭐½ 4.5/5 | |
| Improvement vector | Status | Key work |
|---|---|---|
| High resolution | Area to explore | No published high-res Transfusion variant |
| Multi-image consistency | Area to explore | Attention mask extension needed |
| Faster inference | Partially addressed | Flow matching, consistency distillation |
| Controllability | Area to explore | ControlNet-style adapters possible |
| Video extension | Area to explore | Temporal attention straightforward |
| Better VAE | Partially addressed | SDXL VAE, consistency decoder |
Transfusion proved that continuous diffusion inside a language model isn’t just possible — it’s strictly better than discretizing images, and the field is now racing to build on that foundation. The single highest-impact next step: high-resolution generation with multi-image consistency for interleaved document production.