
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Zhou, Yu, Alon, Levy et al. — August 2024 (ICLR 2025 Oral)

arXiv:2408.11039

TL;DR: A single transformer that uses next-token prediction for text and diffusion denoising for images simultaneously — achieving image quality matching DALL-E 2 and SDXL while preserving LLaMA-1-level text ability, at less than 1/3 the compute of discrete tokenization approaches like Chameleon.

Level 1 — Beginner

What problem does this solve?

AI today has two brilliant employees. One is amazing at writing — give it a prompt and it’ll write a perfect essay, word by word. The other is amazing at painting — give it a description and it’ll create a masterpiece. But they work in completely different buildings, speak different languages, and can’t collaborate.

Language models (like LLaMA, GPT) are great at text. Diffusion models (like Stable Diffusion, DALL-E) are great at images. But they’re separate systems stitched together with duct tape. Transfusion says: what if one brain could do both?

The cooking analogy

The old way (Chameleon) takes a beautiful steak photo, chops it into tiny numbered LEGO pieces (discrete tokens), and has the text chef reassemble it. The chef is great at words but the LEGO steak looks blocky — you lost information when you chopped it up. Transfusion keeps the image smooth and continuous. The same chef switches between writing (text) and painting (images), using different techniques but the same brain.

What does it actually do?

Transfusion trains one single transformer to do two things simultaneously:

  1. For text: Predict the next word (like GPT — “The cat sat on the ___”)
  2. For images: Remove noise from a fuzzy image step by step (like diffusion models — start with TV static, gradually sharpen into a picture)

It uses a different “recipe” for each type of data, but the same brain processes both. When it sees text tokens, it does next-word prediction. When it sees image patches, it does denoising. The losses are simply added together.

Why does this matter?

Three big reasons:

  1. No information loss. When you convert images to discrete tokens, you’re compressing a smooth photo into numbered LEGO blocks. Transfusion keeps images as smooth, continuous data.
  2. Way more efficient. Transfusion matches Chameleon’s image quality using less than 1/3 the compute. Massive cost saving at scale.
  3. One model does everything. Text generation, image generation, image captioning, mixed content — all from a single model.

Key results

  • 0.63 — GenEval Overall (beats DALL-E 2 & SDXL)
  • <1/3 — Compute vs Chameleon (same image quality)
  • 7B — Parameters (matches LLaMA-1 on text)

Key takeaway

This paper from Meta is the architectural thesis for unified media generation + understanding. For a media gen agent producing comics, magazines, and slide decks — Transfusion can natively produce both text and images in one pass, not two separate systems glued together.

Quiz — Level 1
1. What are the TWO different training objectives Transfusion uses within a single model?
The model predicts the next word for text and removes noise from images — two fundamentally different objectives in one model.
2. Why does Transfusion avoid converting images into discrete tokens?
VQ-VAE quantization snaps continuous values to the nearest codebook entry, losing fine detail. Continuous VAE patches avoid this.
3. In the cooking analogy, what does “chopping the steak into LEGO pieces” represent?
VQ-VAE converts smooth continuous images into a finite set of discrete codes — like replacing a smooth photograph with LEGO blocks.
4. Compared to Chameleon, Transfusion achieves similar image quality using approximately how much compute?
At every compute budget, Transfusion produces better images. To match Transfusion’s quality, Chameleon needs ~3.3× more FLOPs.
5. What makes Transfusion particularly relevant for products that create interleaved image-text documents?
Since one model handles both modalities, generating a mixed document is a natural sequence — text tokens then image patches then more text.

Level 2 — Intermediate

Training: how do you teach one model two skills?

Transfusion trains on a carefully balanced diet:

Data type | Source | Proportion
Text | Tokenized text corpus (same recipe as LLaMA) | ~50% of tokens
Image-text pairs | Paired datasets (image + caption, VAE-encoded) | ~50% of tokens

Total: 2 trillion tokens for the 7B model.

The training loop

  1. Encode images — Run each image through a frozen VAE encoder → 256 continuous patch vectors per image
  2. Build the mixed sequence — Interleave text tokens and image patches: [text] [BOI] [patches] [EOI] [more text]
  3. Add noise to image patches — Sample a random diffusion timestep t, add Gaussian noise to each image patch
  4. Forward pass — The transformer processes the entire mixed sequence
  5. Compute dual loss — Cross-entropy for text positions, MSE for image positions
  6. Backpropagate — Both losses flow through the shared transformer weights

Balancing: λ = 1 (equal weighting) works well — the two objectives are surprisingly compatible.
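The dual loss in step 5 can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the shapes, vocabulary size, and latent dimension are made up, and real training uses a deep transformer rather than random arrays.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Mean negative log-likelihood over the text positions.
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def mse(pred_noise, true_noise):
    # Mean squared error over the image-patch positions.
    return ((pred_noise - true_noise) ** 2).mean()

def transfusion_loss(text_logits, next_tokens, pred_eps, true_eps, lam=1.0):
    # L_total = L_LM + lambda * L_DDPM, with lambda = 1 in the paper.
    return cross_entropy(text_logits, next_tokens) + lam * mse(pred_eps, true_eps)

rng = np.random.default_rng(0)
logits  = rng.normal(size=(8, 100))        # 8 text positions, toy vocab of 100
targets = rng.integers(0, 100, size=8)     # next-token labels
eps_hat = rng.normal(size=(256, 8))        # predicted noise for 256 patches
eps     = rng.normal(size=(256, 8))        # true noise that was added
loss = transfusion_loss(logits, targets, eps_hat, eps)
```

Because both terms are ordinary scalars, gradients from each flow through the shared transformer weights with no special machinery.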

Diffusion inside a transformer

Aspect | Standard diffusion | Transfusion
Denoiser | U-Net predicts noise | Transformer predicts noise
Architecture | Separate model | Same model that does text
Operates on | Full image latent | Per-patch vectors in a sequence
Text conditioning | Cross-attention | Causal attention (text precedes image)

Each image patch gets a timestep embedding added before entering the transformer — this tells the model “these patches have noise level t.”
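A sketch of how that timestep embedding might be computed and added, assuming the standard sinusoidal scheme from the diffusion literature (the paper's exact parameterization may differ):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    # Sinusoidal embedding of a scalar timestep t into a dim-sized vector,
    # the usual scheme in diffusion models (illustrative, not the paper's exact choice).
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(t=250, dim=64)
patches = np.zeros((256, 64)) + emb   # the same vector is added to every patch at noise level t
```

Distinct timesteps map to distinct vectors, so the transformer can tell "lightly noised" patches from "pure static" ones.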

Inference: generating a mixed document

Step 1 — Text generation. The model generates text tokens one at a time, left to right, exactly like GPT. When it predicts [BOI] (beginning of image), it switches to image mode.

Step 2 — Image generation. Start from pure noise. Run N denoising passes through the transformer, each pass predicting and subtracting noise, producing progressively cleaner patches. After N steps, decode patches to pixels via the VAE decoder.

Step 3 — Resume text. The model continues generating text autoregressively, now conditioned on both the preceding text AND the generated image. When it predicts another [BOI], it generates another image.

Component | Cost
Text generation | 1 forward pass per token (standard)
Image generation | N forward passes per image (N = denoising steps, typically 250)
Image decoding | 1 VAE decoder pass (cheap)

Scaling experiments

Model size | Parameters | Key finding
0.16B | 160M | Even tiny models benefit from the dual objective
0.37B | 370M | Transfusion pulls ahead of Chameleon on images
0.76B | 760M | Gap widens — efficiency advantage grows with scale
7B | 7B | Matches LLaMA-1 on text, beats DALL-E 2 on images

The scaling curve tells two stories: text quality scales similarly for both approaches, but Transfusion scales much better for images. The gap grows with compute.

The competition

Model | Type | Text | Image | Unified?
LLaMA-1 7B | Text-only LM | ✔ Strong | ✘ None | No
DALL-E 2 | Image-only diffusion | ✘ None | ✔ Good | No
SDXL | Image-only diffusion | ✘ None | ✔ Better | No
Chameleon 7B | Unified (discrete tokens) | ✔ Strong | ⚪ Okay | Yes
Show-o | Unified (mixed) | ⚪ Decent | ⚪ Decent | Yes
Transfusion 7B | Unified (continuous) | ✔ Strong | ✔ Good | Yes

Key takeaway

Transfusion is the first model competitive with dedicated systems on both modalities simultaneously. Previous unified models always sacrificed one for the other.

Limitations the authors acknowledge

Limitation | Detail
Image resolution | Trained at 256×256 only — modern models generate at 1024×1024+
No video | Framework could extend but wasn’t tested
VAE dependency | Image quality capped by the VAE’s reconstruction ability
Inference speed | 250 denoising steps per image is slow
No interleaved training | Trained on pairs, not true interleaved documents
7B only | Didn’t push to 70B+ where advantages would likely grow

Deep dive: GenEval

GenEval tests compositional image generation — not “does the image look pretty” but “did the model actually generate what you asked for?”

Skill tested | Example | What it checks
Single object | “a backpack” | Can it generate one object correctly?
Two objects | “a cat and a dog” | Both present, not just one?
Counting | “three apples” | Correct count?
Colors | “a red car and a blue truck” | Right colors on right objects?
Position | “cat to the left of a dog” | Spatial relationships?
Color attribution | “green apple and red chair” | Binding color to correct object (hardest)

Scoring uses an object detector (DDETR) to check presence, color, position, and count.

Model | Single | Two obj | Count | Colors | Position | Color attr | Overall
DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.01 | 0.52
SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55
Transfusion 7B | 0.99 | 0.78 | 0.42 | 0.87 | 0.16 | 0.40 | 0.63

Transfusion dominates on color attribution (0.40 vs 0.23 vs 0.01) — the hardest compositional skill. Still struggles with counting and position, which remain unsolved even for dedicated image models.

Deep dive: efficiency

Chameleon converts images into 1,024 discrete tokens, each processed like a text token. Attention cost scales as O(n²) with sequence length.
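The quadratic saving is easy to verify with back-of-envelope arithmetic (per-image attention only; surrounding text tokens shrink the effective gap):

```python
# Attention is O(n^2) in sequence length, so per-image attention cost
# scales with the square of the image's token/patch count.
chameleon_len   = 1024   # discrete tokens per image
transfusion_len = 256    # continuous latent patches per image

ratio = (chameleon_len ** 2) / (transfusion_len ** 2)
# ratio == 16.0: the discrete-token representation pays 16x more
# attention compute per image before any quality difference is counted.
```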

Saving #1 — Fewer patches.

Representation | Tokens/patches per image | Relative length
Chameleon | 1,024 tokens | 1.0×
Transfusion (default) | 256 patches | 0.25×
Transfusion (compressed) | 16 patches | 0.016×

Saving #2 — Compression at inference. 256 → 64 patches: FID goes from 6.78 to 7.03 (barely noticeable). 256 → 16 patches: FID goes to 11.0 (still decent, 64× cheaper). You can choose your quality–cost tradeoff at inference time.

Saving #3 — No codebook bottleneck. No quantization loss, no codebook collapse, no straight-through estimator hacks.

Saving #4 — Better scaling curve. At every FLOP budget, Transfusion produces better images. To match Transfusion at 10²¹ FLOPs, Chameleon needs ~3.3× more compute.

Deep dive: architecture

Shared backbone: All transformer layers (attention + FFN) process both text and image data. The model learns shared representations across modalities.

What’s separate: Input embeddings (text uses token embeddings; images use a VAE encoder + linear projection) and output heads (text uses softmax over vocabulary; images use a U-Net-inspired decoder).

Attention masking — the critical design choice:

  • Text tokens: standard causal (left-to-right) attention
  • Image patches: bidirectional attention within the same image — every patch sees every other patch
  • Image patches CAN attend to all preceding text (conditioned on caption)
  • Text after an image CAN attend to the image patches
  • Patches from different images do NOT attend to each other

U-Net skip connections: Intermediate representations from earlier transformer layers are concatenated with the final layer output and fed to the image decoder. This gives access to both low-level features (textures, edges) and high-level semantics.

Innovation | What | Why it matters
Continuous patches | Images as VAE-encoded vectors, not discrete tokens | No information loss, fewer patches needed
Dual loss | Cross-entropy + diffusion MSE, added together | One model learns both objectives
Bidirectional intra-image | Image patches attend to each other freely | Images aren’t sequential — holistic reasoning
Causal inter-modal | Text is autoregressive, images condition on text | Maintains language model properties
U-Net skip connections | Early-layer features fed to image decoder | Better low-level detail in generated images
Patch compression | 256 → 64 → 16 patches | Tunable quality–cost tradeoff at inference

Quiz — Level 2
1. How does the model know whether to apply cross-entropy loss or MSE loss at a given sequence position?
Each position in the sequence has a modality type — text tokens get cross-entropy, image patches get MSE.
2. During image generation at inference, the model produces a clean image by:
Standard diffusion inference — start from pure noise, iteratively denoise over N steps.
3. Why does Transfusion use bidirectional attention within image patches but causal attention for text?
The top-left corner of a photo is just as important as the bottom-right — there’s no “reading order.” Text is inherently sequential.
4. In the GenEval benchmark, Transfusion’s biggest advantage over DALL-E 2 and SDXL is in:
0.40 vs 0.23 (SDXL) vs 0.01 (DALL-E 2). The hardest compositional skill by far.
5. Transfusion achieves better image quality than Chameleon at <1/3 the compute primarily because:
256 patches vs 1,024 tokens = shorter sequences = quadratically less attention compute, plus no information lost to quantization.

Level 3 — Expert

Mathematical formulations

The joint objective:

L_total = L_LM + λ · L_DDPM

Language modelling loss (standard next-token cross-entropy over text positions):

L_LM = -Σ_{i ∈ T} log P_θ(x_i | x_{<i})

Diffusion loss (noise prediction over image positions):

L_DDPM = Σ_{j ∈ I} E_{t,ε} [ || ε - ε_θ(z_j^(t), t, c) ||² ]

where z_j^(t) = √(ᾱ_t) · z_j + √(1 - ᾱ_t) · ε

Here z_j is the clean latent patch vector from the VAE encoder, t is a uniformly sampled timestep, ε is the Gaussian noise added, and ᾱ_t is the cumulative noise schedule (cosine schedule from Nichol & Dhariwal 2021).
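The noising equation above, with the cosine schedule, can be sketched directly (toy shapes; T = 1000 is an assumption for illustration):

```python
import numpy as np

def alpha_bar(t, T, s=0.008):
    # Cosine cumulative schedule (Nichol & Dhariwal 2021):
    # alpha_bar(t) = f(t) / f(0), with f(u) = cos^2(((u/T + s)/(1 + s)) * pi/2)
    f = lambda u: np.cos(((u / T + s) / (1.0 + s)) * np.pi / 2.0) ** 2
    return f(t) / f(0)

def add_noise(z, t, T, rng):
    # z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)
    abar = alpha_bar(t, T)
    eps = rng.standard_normal(z.shape)
    return np.sqrt(abar) * z + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 8))            # clean VAE patch latents (toy dims)
z_t, eps = add_noise(z, t=500, T=1000, rng=rng)
```

Note that ᾱ_0 = 1 (no noise) and ᾱ_t decays toward 0 as t grows, matching the "TV static" endpoint of the forward process.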

The attention mask (formally)

Given text positions T and image positions I_k for the k-th image, the attention mask M has five rules:

Case | Condition | M = 1?
Within same image | i, j ∈ I_k | ✔ Bidirectional
Text → text | i, j ∈ T, j ≤ i | ✔ Causal
Image → preceding text | i ∈ I_k, j ∈ T, j < min(I_k) | ✔ Allowed
Text → preceding image | i ∈ T, j ∈ I_k, max(I_k) < i | ✔ Allowed
Across different images | i ∈ I_k, j ∈ I_{k′}, k ≠ k′ | ✘ Blocked
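The five rules can be implemented as an explicit boolean mask. The `kinds` encoding (−1 for a text position, image index k for a patch of image k) is an illustrative convention of this sketch, not the paper's:

```python
import numpy as np

def build_mask(kinds):
    # kinds[i] = -1 for a text position, or k >= 0 for a patch of image k.
    # Returns M with M[i, j] = True where position i may attend to position j.
    n = len(kinds)
    M = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if kinds[i] == -1 and kinds[j] == -1:
                M[i, j] = j <= i          # text -> text: causal
            elif kinds[i] >= 0 and kinds[i] == kinds[j]:
                M[i, j] = True            # same image: bidirectional
            elif kinds[j] == -1:
                M[i, j] = j < i           # image patch -> preceding text
            elif kinds[i] == -1:
                M[i, j] = j < i           # text -> preceding image
            # patches of different images: stays False (blocked)
    return M

# Toy sequence: 2 text tokens, 3 patches of image 0, 1 text token, 2 patches of image 1
kinds = [-1, -1, 0, 0, 0, -1, 1, 1]
M = build_mask(kinds)
```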

Training pseudocode

def train_step(batch, model, vae_encoder):
    for sequence in batch:
        patches, text_tokens = [], []
        for element in sequence:
            if element.type == "text":
                text_tokens.append(model.text_embed(element.token_id))
            elif element.type == "image":
                z = vae_encoder(element.pixels)        # [256, d_vae]
                z = model.image_proj(z)                 # [256, d_model]
                t = randint(1, T)                       # sample timestep
                eps = randn_like(z)                     # sample noise
                z_noisy = sqrt(alpha_bar[t]) * z + sqrt(1 - alpha_bar[t]) * eps
                z_noisy += model.timestep_embed(t)      # add timestep info
                patches.append((z_noisy, eps, t))

        # Build the mixed sequence and its attention mask (pseudocode helpers)
        mixed_sequence, mask = build_sequence(text_tokens, patches)
        hidden = model.transformer(mixed_sequence, mask)

        # Dual loss: cross-entropy at text positions, MSE between the
        # predicted noise and the eps sampled above at image positions
        lm_loss  = cross_entropy(hidden[text_pos], next_tokens)
        dif_loss = mse(model.image_decoder(hidden[img_pos]), target_noise)
        loss = lm_loss + lambda_ * dif_loss
        loss.backward()

Inference pseudocode

def generate_document(model, vae_decoder, prompt):
    tokens = tokenize(prompt)
    while not done:
        # --- Autoregressive text ---
        next_token = model.predict_next(tokens)
        tokens.append(next_token)
        if next_token == BOI:
            # --- Iterative image denoising ---
            latents = randn(256, d_vae)         # pure noise
            for step in range(N, 0, -1):
                t_emb = model.timestep_embed(step)
                pred_noise = model.forward_image(tokens, latents + t_emb)
                latents = ddpm_update(latents, pred_noise, step)
            image = vae_decoder(latents)
            tokens.extend(latents)              # denoised patches re-enter the context
            tokens.append(EOI)
    return tokens

DDPM update step:

z^(t-1) = (1/√α_t) · (z^(t) - (1-α_t)/√(1-ᾱ_t) · ε_θ) + σ_t · w

where w ~ N(0, I) for t > 1, σ_1 = 0
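A sketch of the `ddpm_update` used in the inference pseudocode, applying the reverse step above. The choice σ_t = √(1 − α_t) is one common option; the paper may use another, and the schedule here is a toy:

```python
import numpy as np

def ddpm_update(z_t, pred_eps, t, alphas, alpha_bars, rng):
    # One reverse step:
    # z_{t-1} = (z_t - (1 - a_t)/sqrt(1 - abar_t) * eps_hat) / sqrt(a_t) + sigma_t * w
    a_t, abar_t = alphas[t], alpha_bars[t]
    mean = (z_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * pred_eps) / np.sqrt(a_t)
    if t > 1:
        sigma_t = np.sqrt(1.0 - a_t)     # one common sigma_t choice (assumption)
        return mean + sigma_t * rng.standard_normal(z_t.shape)
    return mean                          # sigma_1 = 0: the last step is deterministic

# Toy linear schedule, just to exercise the update
betas = np.linspace(1e-4, 0.02, 11)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
z       = rng.standard_normal((256, 8))
eps_hat = rng.standard_normal((256, 8))
z_prev  = ddpm_update(z, eps_hat, t=10, alphas=alphas, alpha_bars=alpha_bars, rng=rng)
z_final = ddpm_update(z, eps_hat, t=1, alphas=alphas, alpha_bars=alpha_bars, rng=rng)
```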

Critical evaluation

Strengths

Continuous patches — significant novelty. Demonstrating that you don’t need to discretize images, and that keeping them continuous is strictly better, is an important empirical finding.

Bidirectional intra-image attention — elegant solution. The attention mask design is the architectural innovation that makes the whole thing work.

Thorough scaling study — four model sizes, clear compute–quality curves, good ablations on attention masks and skip connections.

U-Net skip connections in a transformer — nice practical contribution to image generation quality.

Weaknesses

“Matches LLaMA-1 on text” — true at 7B/2T, but LLaMA-1 is a 2023 model. Small but consistent text degradation when image data ratio increases. The Pareto frontier isn’t fully quantified.

“Beats DALL-E 2 and SDXL” — on GenEval (compositional accuracy), not on FID (perceptual quality). Human preference was never tested. The paper cherry-picks the benchmark where it wins.

256×256 only — the elephant in the room. Scaling to 1024 means 16K patches → quadratic attention cost explodes.

No interleaved document evaluation — the paper’s core promise is unified multimodal generation, but text and image benchmarks are tested separately. Nobody tested actual interleaved output quality.

VAE as black box — image quality ceiling set by the VAE, unexplored. How sensitive are results to VAE quality?

Single timestep per image — each training step samples one noise level per image. Is the model getting enough diffusion signal given ~50% of compute goes to text?

Where this paper sits in the field

Year | Milestone | Contribution
2022 | Flamingo | Frozen LM + frozen vision, glue layer — “you can combine them”
2024 | Chameleon | Discrete tokens for everything — “you can unify them”
2024 | Transfusion | Continuous diffusion inside LM — “you can unify them without information loss”
2025+ | ??? | High-res, multi-image consistency, video, editing — “the real product”

Key takeaway

Transfusion is a strong proof of concept that continuous diffusion inside a language model is viable and efficient. But it’s exactly that — a proof of concept at 256px. The production version needs higher resolution, multi-image consistency, controllability, and faster inference.

Quiz — Level 3
1. In the diffusion loss L_DDPM, the model is trained to predict:
Standard ε-prediction — the model learns to identify what noise was added, enabling iterative removal.
2. In Transfusion’s attention mask, image patches from different images in the same sequence:
The attention mask blocks cross-image attention. Each image is a self-contained unit for patch-level reasoning.
3. The paper’s claim of “beating DALL-E 2 and SDXL” is strongest on GenEval but weakest on:
GenEval tests compositional accuracy, not perceptual quality or aesthetics. FID is competitive but not SOTA, and human preference was never measured.
4. During the DDPM reverse step at inference, after subtracting the predicted noise the model:
The stochastic reverse process adds σ_t · w noise at each step (except t = 1) to maintain the correct distribution.
5. What is the most significant evaluation gap given Transfusion’s stated goal of unified multi-modal generation?
The paper’s promise is unified multimodal generation, but text and image quality are evaluated independently. No one tested interleaved document generation quality.

Phase 4 — Frontier

Transfusion received an ICLR 2025 Oral (top ~1% of submissions). Here’s what it sparked and what comes next.

What happened since August 2024

Paper | Date | Key idea | Relationship
Janus-Pro (DeepSeek) | Jan 2025 | Unified understanding + generation, separate vision encoders | Validates unified thesis but uses discrete tokens — opposite bet
Show-o Turbo | Feb 2025 | Accelerated unified model, discrete diffusion | Parallel work, focuses on inference speed
MADFormer | Jun 2025 | Mixed autoregressive + diffusion transformer | Direct evolution — explores which patches should be AR vs diffused
Lumina-mGPT 2.0 | Jul 2025 | Standalone AR image gen, no diffusion needed | Challenges claim that you need diffusion for images
ImAgent | 2025 | Training-free unified multimodal agent | Uses unified models as agent components

The field split

The community has divided into two camps: Continuous + Diffusion (Transfusion’s bet — MADFormer, flow-matching variants) vs Discrete Tokens (Chameleon’s bet — Janus-Pro, Emu3, Show-o). Consensus as of early 2026: continuous representations are winning for image quality, but discrete tokens are simpler to scale. No clear knockout yet.

Improvement vectors

1. High resolution

Area to explore

256px is a toy. 1024px with 8× downsampling = 16,384 patches — impossible with standard O(n²) attention. Paths forward: hierarchical generation (generate low-res, super-resolve), windowed/sparse attention, progressive patch compression (coarse-to-fine), or latent cascade with a lightweight upsampler.
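The patch-count arithmetic behind that claim:

```python
# Patch count and attention cost at 1024 px, following the document's
# assumption of 8x VAE downsampling (256 px -> 256 patches as the baseline).
latent_side  = 1024 // 8                  # 128 latent positions per side
patches_1024 = latent_side ** 2           # 16,384 patches per image
cost_vs_256  = (patches_1024 / 256) ** 2  # O(n^2) attention blow-up vs 256 patches
```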

2. Multi-image consistency

Area to explore

Generate a 4-page comic → the fox looks different on every page. Paths: entity embeddings (persistent identity vector injected into each image), cross-image attention (let later image patches attend to earlier images), reference image conditioning. The attention mask needs a new rule allowing later images to reference earlier ones.

3. Faster inference

Partially addressed

250 denoising steps × full transformer forward pass = very slow. Paths: DDIM (→ 50 steps), DPM-Solver++ (→ 20–25), consistency distillation (→ 1–4), flow matching (→ 10–20). Flow matching (behind SD3 and Flux) is the likely winner — mathematically simpler, straighter trajectories, fewer steps. Fully compatible with Transfusion’s framework.
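For intuition, DDIM's deterministic update (η = 0) reuses the same ε-prediction network but skips the noise re-injection, which is what allows far fewer steps. A hedged sketch, not the paper's method:

```python
import numpy as np

def ddim_update(z_t, pred_eps, abar_t, abar_prev):
    # Deterministic DDIM step (eta = 0): estimate the clean latent from the
    # predicted noise, then move it to the previous (lower) noise level.
    z0_hat = (z_t - np.sqrt(1.0 - abar_t) * pred_eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * z0_hat + np.sqrt(1.0 - abar_prev) * pred_eps

# Sanity check: with the exact noise, one big DDIM jump recovers the clean latent.
rng = np.random.default_rng(0)
z0  = rng.standard_normal((256, 8))
eps = rng.standard_normal((256, 8))
abar_t = 0.5
z_t   = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps
z_rec = ddim_update(z_t, eps, abar_t, abar_prev=1.0)
```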

4. Controllability

Area to explore

No way to condition on pose, depth, edges, or style references. Paths: ControlNet-style adapters, concatenating control embeddings with image patches, T2I-Adapter approach. The continuous patch representation makes this easier than adding control to discrete-token approaches.

5. Video extension

Area to explore

The framework naturally extends to video. Each frame → patches with bidirectional intra-frame attention, plus temporal attention across frames. Frames can be diffused jointly or autoregressively (generate frame 1, condition frame 2 on it).

6. Better VAE

Partially addressed

Image quality is capped by VAE reconstruction quality. Low-hanging fruit: swapping to SDXL’s VAE (~20% better reconstruction), consistency decoder for sharper high-frequency detail, adversarial training (VAE-GAN), or higher-resolution latents with less spatial downsampling.

Scorecard

Dimension | Rating | Notes
Novelty | ⭐⭐⭐⭐ 4/5 | Continuous diffusion inside LM is meaningful; pieces existed but the combination is new
Rigor | ⭐⭐⭐⭐ 4/5 | Thorough scaling experiments and ablations; loses a point for 256px and no interleaved eval
Impact | ⭐⭐⭐⭐⭐ 5/5 | ICLR 2025 Oral; spawned a research direction; every unified model paper references it
Clarity | ⭐⭐⭐⭐⭐ 5/5 | Exceptionally well-written; implementable from the paper alone
Relevance | ⭐⭐⭐⭐⭐ 5/5 | Architectural thesis for a media generation agent
Reproducibility | ⭐⭐⭐⭐ 4/5 | Clear methodology but requires significant compute; no public model weights
Overall | ⭐⭐⭐⭐½ 4.5/5 |

Improvement vector | Status | Key work
High resolution | Area to explore | No published high-res Transfusion variant
Multi-image consistency | Area to explore | Attention mask extension needed
Faster inference | Partially addressed | Flow matching, consistency distillation
Controllability | Area to explore | ControlNet-style adapters possible
Video extension | Area to explore | Temporal attention straightforward
Better VAE | Partially addressed | SDXL VAE, consistency decoder

Bottom line

Transfusion proved that continuous diffusion inside a language model isn’t just possible — it’s strictly better than discretizing images, and the field is now racing to build on that foundation. The single highest-impact next step: high-resolution generation with multi-image consistency for interleaved document production.
