
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Zhou, Yu, Alon, Levy et al. — August 2024 (ICLR 2025 Oral)

arXiv:2408.11039

TL;DR: A single transformer that uses next-token prediction for text and diffusion denoising for images simultaneously — achieving image quality matching DALL-E 2 and SDXL while preserving LLaMA-1-level text ability, at less than 1/3 the compute of discrete tokenization approaches like Chameleon.

Level 1 — Beginner

What problem does this solve?

AI today has two brilliant employees. One is amazing at writing — give it a prompt and it’ll write a perfect essay, word by word. The other is amazing at painting — give it a description and it’ll create a masterpiece. But they work in completely different buildings, speak different languages, and can’t collaborate.

Language models (like LLaMA, GPT) are great at text. Diffusion models (like Stable Diffusion, DALL-E) are great at images. But they’re separate systems stitched together with duct tape. Transfusion says: what if one brain could do both?

The cooking analogy

The old way (Chameleon) takes a beautiful steak photo, chops it into tiny numbered LEGO pieces (discrete tokens), and has the text chef reassemble it. The chef is great at words but the LEGO steak looks blocky — you lost information when you chopped it up. Transfusion keeps the image smooth and continuous. The same chef switches between writing (text) and painting (images), using different techniques but the same brain.

What does it actually do?

Transfusion trains one single transformer to do two things simultaneously:

  1. For text: Predict the next word (like GPT — “The cat sat on the ___”)
  2. For images: Remove noise from a fuzzy image step by step (like diffusion models — start with TV static, gradually sharpen into a picture)

It uses a different “recipe” for each type of data, but the same brain processes both. When it sees text tokens, it does next-word prediction. When it sees image patches, it does denoising. The losses are simply added together.

Why does this matter?

Three big reasons:

  1. No information loss. When you convert images to discrete tokens, you’re compressing a smooth photo into numbered LEGO blocks. Transfusion keeps images as smooth, continuous data.
  2. Way more efficient. Transfusion matches Chameleon’s image quality using less than 1/3 the compute. Massive cost saving at scale.
  3. One model does everything. Text generation, image generation, image captioning, mixed content — all from a single model.

Key results

  • 0.63 — GenEval Overall (beats DALL-E 2 & SDXL)
  • <1/3 — Compute vs Chameleon (same image quality)
  • 7B — Parameters (matches LLaMA-1 on text)

Key takeaway

This paper from Meta is the architectural thesis for unified media generation + understanding. For a media gen agent producing comics, magazines, and slide decks — Transfusion can natively produce both text and images in one pass, not two separate systems glued together.

Quiz — Level 1
1. What are the TWO different training objectives Transfusion uses within a single model?
The model predicts the next word for text and removes noise from images — two fundamentally different objectives in one model.
2. Why does Transfusion avoid converting images into discrete tokens?
VQ-VAE quantization snaps continuous values to the nearest codebook entry, losing fine detail. Continuous VAE patches avoid this.
3. In the cooking analogy, what does “chopping the steak into LEGO pieces” represent?
VQ-VAE converts smooth continuous images into a finite set of discrete codes — like replacing a smooth photograph with LEGO blocks.
4. Compared to Chameleon, Transfusion achieves similar image quality using approximately how much compute?
At every compute budget, Transfusion produces better images. To match Transfusion’s quality, Chameleon needs ~3.3× more FLOPs.
5. What makes Transfusion particularly relevant for products that create interleaved image-text documents?
Since one model handles both modalities, generating a mixed document is a natural sequence — text tokens then image patches then more text.

Level 2 — Intermediate

Training: how do you teach one model two skills?

Transfusion trains on a carefully balanced diet:

Data type | Source | Proportion
Text | Tokenized text corpus (same recipe as LLaMA) | ~50% of tokens
Image-text pairs | Paired datasets (image + caption, VAE-encoded) | ~50% of tokens

Total: 2 trillion tokens for the 7B model.

The training loop

  1. Encode images — Run each image through a frozen VAE encoder → 256 continuous patch vectors per image
  2. Build the mixed sequence — Interleave text tokens and image patches: [text] [BOI] [patches] [EOI] [more text]
  3. Add noise to image patches — Sample a random diffusion timestep t, add Gaussian noise to each image patch
  4. Forward pass — The transformer processes the entire mixed sequence
  5. Compute dual loss — Cross-entropy for text positions, MSE for image positions
  6. Backpropagate — Both losses flow through the shared transformer weights

Balancing: λ = 1 (equal weighting) works well — the two objectives are surprisingly compatible.
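The dual loss in step 5 can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the shapes, vocabulary size, and latent dimension are made up, and real training uses a deep transformer rather than random arrays.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Mean negative log-likelihood over the text positions.
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def mse(pred_noise, true_noise):
    # Mean squared error over the image-patch positions.
    return ((pred_noise - true_noise) ** 2).mean()

def transfusion_loss(text_logits, next_tokens, pred_eps, true_eps, lam=1.0):
    # L_total = L_LM + lambda * L_DDPM, with lambda = 1 in the paper.
    return cross_entropy(text_logits, next_tokens) + lam * mse(pred_eps, true_eps)

rng = np.random.default_rng(0)
logits  = rng.normal(size=(8, 100))        # 8 text positions, toy vocab of 100
targets = rng.integers(0, 100, size=8)     # next-token labels
eps_hat = rng.normal(size=(256, 8))        # predicted noise for 256 patches
eps     = rng.normal(size=(256, 8))        # true noise that was added
loss = transfusion_loss(logits, targets, eps_hat, eps)
```

Because both terms are ordinary scalars, gradients from each flow through the shared transformer weights with no special machinery.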

Diffusion inside a transformer

Aspect | Standard diffusion | Transfusion
Denoiser | U-Net predicts noise | Transformer predicts noise
Architecture | Separate model | Same model that does text
Operates on | Full image latent | Per-patch vectors in a sequence
Text conditioning | Cross-attention | Causal attention (text precedes image)

Each image patch gets a timestep embedding added before entering the transformer — this tells the model “these patches have noise level t.”
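A sketch of how that timestep embedding might be computed and added, assuming the standard sinusoidal scheme from the diffusion literature (the paper's exact parameterization may differ):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    # Sinusoidal embedding of a scalar timestep t into a dim-sized vector,
    # the usual scheme in diffusion models (illustrative, not the paper's exact choice).
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(t=250, dim=64)
patches = np.zeros((256, 64)) + emb   # the same vector is added to every patch at noise level t
```

Distinct timesteps map to distinct vectors, so the transformer can tell "lightly noised" patches from "pure static" ones.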

Inference: generating a mixed document

Step 1 — Text generation. The model generates text tokens one at a time, left to right, exactly like GPT. When it predicts [BOI] (beginning of image), it switches to image mode.

Step 2 — Image generation. Start from pure noise. Run N denoising passes through the transformer, each pass predicting and subtracting noise, producing progressively cleaner patches. After N steps, decode patches to pixels via the VAE decoder.

Step 3 — Resume text. The model continues generating text autoregressively, now conditioned on both the preceding text AND the generated image. When it predicts another [BOI], it generates another image.

Component | Cost
Text generation | 1 forward pass per token (standard)
Image generation | N forward passes per image (N = denoising steps, typically 250)
Image decoding | 1 VAE decoder pass (cheap)

Scaling experiments

Model size | Parameters | Key finding
0.16B | 160M | Even tiny models benefit from the dual objective
0.37B | 370M | Transfusion pulls ahead of Chameleon on images
0.76B | 760M | Gap widens — efficiency advantage grows with scale
7B | 7B | Matches LLaMA-1 on text, beats DALL-E 2 on images

The scaling curve tells two stories: text quality scales similarly for both approaches, but Transfusion scales much better for images. The gap grows with compute.

The competition

Model | Type | Text | Image | Unified?
LLaMA-1 7B | Text-only LM | ✔ Strong | ✘ None | No
DALL-E 2 | Image-only diffusion | ✘ None | ✔ Good | No
SDXL | Image-only diffusion | ✘ None | ✔ Better | No
Chameleon 7B | Unified (discrete tokens) | ✔ Strong | ⚪ Okay | Yes
Show-o | Unified (mixed) | ⚪ Decent | ⚪ Decent | Yes
Transfusion 7B | Unified (continuous) | ✔ Strong | ✔ Good | Yes

Key takeaway

Transfusion is the first model competitive with dedicated systems on both modalities simultaneously. Previous unified models always sacrificed one for the other.

Limitations the authors acknowledge

Limitation | Detail
Image resolution | Trained at 256×256 only — modern models generate at 1024×1024+
No video | Framework could extend but wasn’t tested
VAE dependency | Image quality capped by the VAE’s reconstruction ability
Inference speed | 250 denoising steps per image is slow
No interleaved training | Trained on pairs, not true interleaved documents
7B only | Didn’t push to 70B+ where advantages would likely grow

Deep dive: GenEval

GenEval tests compositional image generation — not “does the image look pretty” but “did the model actually generate what you asked for?”

Skill tested | Example | What it checks
Single object | “a backpack” | Can it generate one object correctly?
Two objects | “a cat and a dog” | Both present, not just one?
Counting | “three apples” | Correct count?
Colors | “a red car and a blue truck” | Right colors on right objects?
Position | “cat to the left of a dog” | Spatial relationships?
Color attribution | “green apple and red chair” | Binding color to correct object (hardest)

Scoring uses an object detector (DDETR) to check presence, color, position, and count.

Model | Single | Two obj | Count | Colors | Position | Color attr | Overall
DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.01 | 0.52
SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55
Transfusion 7B | 0.99 | 0.78 | 0.42 | 0.87 | 0.16 | 0.40 | 0.63

Transfusion dominates on color attribution (0.40 vs 0.23 vs 0.01) — the hardest compositional skill. Still struggles with counting and position, which remain unsolved even for dedicated image models.

Deep dive: efficiency

Chameleon converts images into 1,024 discrete tokens, each processed like a text token. Attention cost scales as O(n²) with sequence length.
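The quadratic saving is easy to verify with back-of-envelope arithmetic (per-image attention only; surrounding text tokens shrink the effective gap):

```python
# Attention is O(n^2) in sequence length, so per-image attention cost
# scales with the square of the image's token/patch count.
chameleon_len   = 1024   # discrete tokens per image
transfusion_len = 256    # continuous latent patches per image

ratio = (chameleon_len ** 2) / (transfusion_len ** 2)
# ratio == 16.0: the discrete-token representation pays 16x more
# attention compute per image before any quality difference is counted.
```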

Saving #1 — Fewer patches.

Representation | Tokens/patches per image | Relative length
Chameleon | 1,024 tokens | 1.0×
Transfusion (default) | 256 patches | 0.25×
Transfusion (compressed) | 16 patches | 0.016×

Saving #2 — Compression at inference. 256 → 64 patches: FID goes from 6.78 to 7.03 (barely noticeable). 256 → 16 patches: FID goes to 11.0 (still decent, 64× cheaper). You can choose your quality–cost tradeoff at inference time.

Saving #3 — No codebook bottleneck. No quantization loss, no codebook collapse, no straight-through estimator hacks.

Saving #4 — Better scaling curve. At every FLOP budget, Transfusion produces better images. To match Transfusion at 10²¹ FLOPs, Chameleon needs ~3.3× more compute.

Deep dive: architecture

Shared backbone: All transformer layers (attention + FFN) process both text and image data. The model learns shared representations across modalities.

What’s separate: Input embeddings (text uses token embeddings; images use a VAE encoder + linear projection) and output heads (text uses softmax over vocabulary; images use a U-Net-inspired decoder).

Attention masking — the critical design choice:

  • Text tokens: standard causal (left-to-right) attention
  • Image patches: bidirectional attention within the same image — every patch sees every other patch
  • Image patches CAN attend to all preceding text (conditioned on caption)
  • Text after an image CAN attend to the image patches
  • Patches from different images do NOT attend to each other

U-Net skip connections: Intermediate representations from earlier transformer layers are concatenated with the final layer output and fed to the image decoder. This gives access to both low-level features (textures, edges) and high-level semantics.

Innovation | What | Why it matters
Continuous patches | Images as VAE-encoded vectors, not discrete tokens | No information loss, fewer patches needed
Dual loss | Cross-entropy + diffusion MSE, added together | One model learns both objectives
Bidirectional intra-image | Image patches attend to each other freely | Images aren’t sequential — holistic reasoning
Causal inter-modal | Text is autoregressive, images condition on text | Maintains language model properties
U-Net skip connections | Early-layer features fed to image decoder | Better low-level detail in generated images
Patch compression | 256 → 64 → 16 patches | Tunable quality–cost tradeoff at inference

Quiz — Level 2
1. How does the model know whether to apply cross-entropy loss or MSE loss at a given sequence position?
Each position in the sequence has a modality type — text tokens get cross-entropy, image patches get MSE.
2. During image generation at inference, the model produces a clean image by:
Standard diffusion inference — start from pure noise, iteratively denoise over N steps.
3. Why does Transfusion use bidirectional attention within image patches but causal attention for text?
The top-left corner of a photo is just as important as the bottom-right — there’s no “reading order.” Text is inherently sequential.
4. In the GenEval benchmark, Transfusion’s biggest advantage over DALL-E 2 and SDXL is in:
0.40 vs 0.23 (SDXL) vs 0.01 (DALL-E 2). The hardest compositional skill by far.
5. Transfusion achieves better image quality than Chameleon at <1/3 the compute primarily because:
256 patches vs 1,024 tokens = shorter sequences = quadratically less attention compute, plus no information lost to quantization.

Level 3 — Expert

Mathematical formulations

The joint objective:

L_total = L_LM + λ · L_DDPM

Language modelling loss (standard next-token cross-entropy over text positions):

L_LM = -Σ_{i ∈ T} log P_θ(x_i | x_{<i})

Diffusion loss (noise prediction over image positions):

L_DDPM = Σ_{j ∈ I} E_{t,ε} [ || ε - ε_θ(z_j^(t), t, c) ||² ]

where z_j^(t) = √(ᾱ_t) · z_j + √(1 - ᾱ_t) · ε

Here z_j is the clean latent patch vector from the VAE encoder, t is a uniformly sampled timestep, ε is the Gaussian noise added, and ᾱ_t is the cumulative noise schedule (cosine schedule from Nichol & Dhariwal 2021).
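The noising equation above, with the cosine schedule, can be sketched directly (toy shapes; T = 1000 is an assumption for illustration):

```python
import numpy as np

def alpha_bar(t, T, s=0.008):
    # Cosine cumulative schedule (Nichol & Dhariwal 2021):
    # alpha_bar(t) = f(t) / f(0), with f(u) = cos^2(((u/T + s)/(1 + s)) * pi/2)
    f = lambda u: np.cos(((u / T + s) / (1.0 + s)) * np.pi / 2.0) ** 2
    return f(t) / f(0)

def add_noise(z, t, T, rng):
    # z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)
    abar = alpha_bar(t, T)
    eps = rng.standard_normal(z.shape)
    return np.sqrt(abar) * z + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 8))            # clean VAE patch latents (toy dims)
z_t, eps = add_noise(z, t=500, T=1000, rng=rng)
```

Note that ᾱ_0 = 1 (no noise) and ᾱ_t decays toward 0 as t grows, matching the "TV static" endpoint of the forward process.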

The attention mask (formally)

Given text positions T and image positions I_k for the k-th image, the attention mask M has five rules:

Case | Condition | M = 1?
Within same image | i, j ∈ I_k | ✔ Bidirectional
Text → text | i, j ∈ T, j ≤ i | ✔ Causal
Image → preceding text | i ∈ I_k, j ∈ T, j < min(I_k) | ✔ Allowed
Text → preceding image | i ∈ T, j ∈ I_k, max(I_k) < i | ✔ Allowed
Across different images | i ∈ I_k, j ∈ I_{k′}, k ≠ k′ | ✘ Blocked
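The five rules can be implemented as an explicit boolean mask. The `kinds` encoding (−1 for a text position, image index k for a patch of image k) is an illustrative convention of this sketch, not the paper's:

```python
import numpy as np

def build_mask(kinds):
    # kinds[i] = -1 for a text position, or k >= 0 for a patch of image k.
    # Returns M with M[i, j] = True where position i may attend to position j.
    n = len(kinds)
    M = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if kinds[i] == -1 and kinds[j] == -1:
                M[i, j] = j <= i          # text -> text: causal
            elif kinds[i] >= 0 and kinds[i] == kinds[j]:
                M[i, j] = True            # same image: bidirectional
            elif kinds[j] == -1:
                M[i, j] = j < i           # image patch -> preceding text
            elif kinds[i] == -1:
                M[i, j] = j < i           # text -> preceding image
            # patches of different images: stays False (blocked)
    return M

# Toy sequence: 2 text tokens, 3 patches of image 0, 1 text token, 2 patches of image 1
kinds = [-1, -1, 0, 0, 0, -1, 1, 1]
M = build_mask(kinds)
```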

Training pseudocode

def train_step(batch, model, vae_encoder):
    for sequence in batch:
        patches, text_tokens = [], []
        for element in sequence:
            if element.type == "text":
                text_tokens.append(model.text_embed(element.token_id))
            elif element.type == "image":
                z = vae_encoder(element.pixels)        # [256, d_vae]
                z = model.image_proj(z)                 # [256, d_model]
                t = randint(1, T)                       # sample timestep
                eps = randn_like(z)                     # sample noise
                z_noisy = sqrt(alpha_bar[t]) * z + sqrt(1 - alpha_bar[t]) * eps
                z_noisy += model.timestep_embed(t)      # add timestep info
                patches.append((z_noisy, eps, t))

        # Build the mixed sequence and its attention mask (pseudocode helpers)
        mixed_sequence, mask = build_sequence(text_tokens, patches)
        hidden = model.transformer(mixed_sequence, mask)

        # Dual loss: cross-entropy at text positions, MSE between the
        # predicted noise and the eps sampled above at image positions
        lm_loss  = cross_entropy(hidden[text_pos], next_tokens)
        dif_loss = mse(model.image_decoder(hidden[img_pos]), target_noise)
        loss = lm_loss + lambda_ * dif_loss
        loss.backward()

Inference pseudocode

def generate_document(model, vae_decoder, prompt):
    tokens = tokenize(prompt)
    while not done:
        # --- Autoregressive text ---
        next_token = model.predict_next(tokens)
        tokens.append(next_token)
        if next_token == BOI:
            # --- Iterative image denoising ---
            latents = randn(256, d_vae)         # pure noise
            for step in range(N, 0, -1):
                t_emb = model.timestep_embed(step)
                pred_noise = model.forward_image(tokens, latents + t_emb)
                latents = ddpm_update(latents, pred_noise, step)
            image = vae_decoder(latents)
            tokens.extend(latents)              # denoised patches re-enter the context
            tokens.append(EOI)
    return tokens

DDPM update step:

z^(t-1) = (1/√α_t) · (z^(t) - (1-α_t)/√(1-ᾱ_t) · ε_θ) + σ_t · w

where w ~ N(0, I) for t > 1, σ_1 = 0
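A sketch of the `ddpm_update` used in the inference pseudocode, applying the reverse step above. The choice σ_t = √(1 − α_t) is one common option; the paper may use another, and the schedule here is a toy:

```python
import numpy as np

def ddpm_update(z_t, pred_eps, t, alphas, alpha_bars, rng):
    # One reverse step:
    # z_{t-1} = (z_t - (1 - a_t)/sqrt(1 - abar_t) * eps_hat) / sqrt(a_t) + sigma_t * w
    a_t, abar_t = alphas[t], alpha_bars[t]
    mean = (z_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * pred_eps) / np.sqrt(a_t)
    if t > 1:
        sigma_t = np.sqrt(1.0 - a_t)     # one common sigma_t choice (assumption)
        return mean + sigma_t * rng.standard_normal(z_t.shape)
    return mean                          # sigma_1 = 0: the last step is deterministic

# Toy linear schedule, just to exercise the update
betas = np.linspace(1e-4, 0.02, 11)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)
z       = rng.standard_normal((256, 8))
eps_hat = rng.standard_normal((256, 8))
z_prev  = ddpm_update(z, eps_hat, t=10, alphas=alphas, alpha_bars=alpha_bars, rng=rng)
z_final = ddpm_update(z, eps_hat, t=1, alphas=alphas, alpha_bars=alpha_bars, rng=rng)
```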

Critical evaluation

Strengths

Continuous patches — significant novelty. Demonstrating that you don’t need to discretize images, and that keeping them continuous is strictly better, is an important empirical finding.

Bidirectional intra-image attention — elegant solution. The attention mask design is the architectural innovation that makes the whole thing work.

Thorough scaling study — four model sizes, clear compute–quality curves, good ablations on attention masks and skip connections.

U-Net skip connections in a transformer — nice practical contribution to image generation quality.

Weaknesses

“Matches LLaMA-1 on text” — true at 7B/2T, but LLaMA-1 is a 2023 model. Small but consistent text degradation when image data ratio increases. The Pareto frontier isn’t fully quantified.

“Beats DALL-E 2 and SDXL” — on GenEval (compositional accuracy), not on FID (perceptual quality). Human preference was never tested. The paper cherry-picks the benchmark where it wins.

256×256 only — the elephant in the room. Scaling to 1024 means 16K patches → quadratic attention cost explodes.

No interleaved document evaluation — the paper’s core promise is unified multimodal generation, but text and image benchmarks are tested separately. Nobody tested actual interleaved output quality.

VAE as black box — image quality ceiling set by the VAE, unexplored. How sensitive are results to VAE quality?

Single timestep per image — each training step samples one noise level per image. Is the model getting enough diffusion signal given ~50% of compute goes to text?

Where this paper sits in the field

Year | Milestone | Contribution
2022 | Flamingo | Frozen LM + frozen vision, glue layer — “you can combine them”
2024 | Chameleon | Discrete tokens for everything — “you can unify them”
2024 | Transfusion | Continuous diffusion inside LM — “you can unify them without information loss”
2025+ | ??? | High-res, multi-image consistency, video, editing — “the real product”

Key takeaway

Transfusion is a strong proof of concept that continuous diffusion inside a language model is viable and efficient. But it’s exactly that — a proof of concept at 256px. The production version needs higher resolution, multi-image consistency, controllability, and faster inference.

Quiz — Level 3
1. In the diffusion loss L_DDPM, the model is trained to predict:
Standard ε-prediction — the model learns to identify what noise was added, enabling iterative removal.
2. In Transfusion’s attention mask, image patches from different images in the same sequence:
The attention mask blocks cross-image attention. Each image is a self-contained unit for patch-level reasoning.
3. The paper’s claim of “beating DALL-E 2 and SDXL” is strongest on GenEval but weakest on:
GenEval tests compositional accuracy, not perceptual quality or aesthetics. FID is competitive but not SOTA, and human preference was never measured.
4. During the DDPM reverse step at inference, after subtracting the predicted noise the model:
The stochastic reverse process adds σ_t · w noise at each step (except t = 1) to maintain the correct distribution.
5. What is the most significant evaluation gap given Transfusion’s stated goal of unified multi-modal generation?
The paper’s promise is unified multimodal generation, but text and image quality are evaluated independently. No one tested interleaved document generation quality.

Phase 4 — Frontier

Transfusion received an ICLR 2025 Oral (top ~1% of submissions). Here’s what it sparked and what comes next.

What happened since August 2024

Paper | Date | Key idea | Relationship
Janus-Pro (DeepSeek) | Jan 2025 | Unified understanding + generation, separate vision encoders | Validates unified thesis but uses discrete tokens — opposite bet
Show-o Turbo | Feb 2025 | Accelerated unified model, discrete diffusion | Parallel work, focuses on inference speed
MADFormer | Jun 2025 | Mixed autoregressive + diffusion transformer | Direct evolution — explores which patches should be AR vs diffused
Lumina-mGPT 2.0 | Jul 2025 | Standalone AR image gen, no diffusion needed | Challenges claim that you need diffusion for images
ImAgent | 2025 | Training-free unified multimodal agent | Uses unified models as agent components

The field split

The community has divided into two camps: Continuous + Diffusion (Transfusion’s bet — MADFormer, flow-matching variants) vs Discrete Tokens (Chameleon’s bet — Janus-Pro, Emu3, Show-o). Consensus as of early 2026: continuous representations are winning for image quality, but discrete tokens are simpler to scale. No clear knockout yet.

Improvement vectors

1. High resolution

Area to explore

256px is a toy. 1024px with 8× downsampling = 16,384 patches — impossible with standard O(n²) attention. Paths forward: hierarchical generation (generate low-res, super-resolve), windowed/sparse attention, progressive patch compression (coarse-to-fine), or latent cascade with a lightweight upsampler.
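The patch-count arithmetic behind that claim:

```python
# Patch count and attention cost at 1024 px, following the document's
# assumption of 8x VAE downsampling (256 px -> 256 patches as the baseline).
latent_side  = 1024 // 8                  # 128 latent positions per side
patches_1024 = latent_side ** 2           # 16,384 patches per image
cost_vs_256  = (patches_1024 / 256) ** 2  # O(n^2) attention blow-up vs 256 patches
```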

2. Multi-image consistency

Area to explore

Generate a 4-page comic → the fox looks different on every page. Paths: entity embeddings (persistent identity vector injected into each image), cross-image attention (let later image patches attend to earlier images), reference image conditioning. The attention mask needs a new rule allowing later images to reference earlier ones.

3. Faster inference

Partially addressed

250 denoising steps × full transformer forward pass = very slow. Paths: DDIM (→ 50 steps), DPM-Solver++ (→ 20–25), consistency distillation (→ 1–4), flow matching (→ 10–20). Flow matching (behind SD3 and Flux) is the likely winner — mathematically simpler, straighter trajectories, fewer steps. Fully compatible with Transfusion’s framework.
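For intuition, DDIM's deterministic update (η = 0) reuses the same ε-prediction network but skips the noise re-injection, which is what allows far fewer steps. A hedged sketch, not the paper's method:

```python
import numpy as np

def ddim_update(z_t, pred_eps, abar_t, abar_prev):
    # Deterministic DDIM step (eta = 0): estimate the clean latent from the
    # predicted noise, then move it to the previous (lower) noise level.
    z0_hat = (z_t - np.sqrt(1.0 - abar_t) * pred_eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * z0_hat + np.sqrt(1.0 - abar_prev) * pred_eps

# Sanity check: with the exact noise, one big DDIM jump recovers the clean latent.
rng = np.random.default_rng(0)
z0  = rng.standard_normal((256, 8))
eps = rng.standard_normal((256, 8))
abar_t = 0.5
z_t   = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps
z_rec = ddim_update(z_t, eps, abar_t, abar_prev=1.0)
```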

4. Controllability

Area to explore

No way to condition on pose, depth, edges, or style references. Paths: ControlNet-style adapters, concatenating control embeddings with image patches, T2I-Adapter approach. The continuous patch representation makes this easier than adding control to discrete-token approaches.

5. Video extension

Area to explore

The framework naturally extends to video. Each frame → patches with bidirectional intra-frame attention, plus temporal attention across frames. Frames can be diffused jointly or autoregressively (generate frame 1, condition frame 2 on it).

6. Better VAE

Partially addressed

Image quality is capped by VAE reconstruction quality. Low-hanging fruit: swapping to SDXL’s VAE (~20% better reconstruction), consistency decoder for sharper high-frequency detail, adversarial training (VAE-GAN), or higher-resolution latents with less spatial downsampling.

Scorecard

Dimension | Rating | Notes
Novelty | ⭐⭐⭐⭐ 4/5 | Continuous diffusion inside LM is meaningful; pieces existed but the combination is new
Rigor | ⭐⭐⭐⭐ 4/5 | Thorough scaling experiments and ablations; loses a point for 256px and no interleaved eval
Impact | ⭐⭐⭐⭐⭐ 5/5 | ICLR 2025 Oral; spawned a research direction; every unified model paper references it
Clarity | ⭐⭐⭐⭐⭐ 5/5 | Exceptionally well-written; implementable from the paper alone
Relevance | ⭐⭐⭐⭐⭐ 5/5 | Architectural thesis for a media generation agent
Reproducibility | ⭐⭐⭐⭐ 4/5 | Clear methodology but requires significant compute; no public model weights
Overall | ⭐⭐⭐⭐½ 4.5/5 |

Improvement vector | Status | Key work
High resolution | Area to explore | No published high-res Transfusion variant
Multi-image consistency | Area to explore | Attention mask extension needed
Faster inference | Partially addressed | Flow matching, consistency distillation
Controllability | Area to explore | ControlNet-style adapters possible
Video extension | Area to explore | Temporal attention straightforward
Better VAE | Partially addressed | SDXL VAE, consistency decoder

Bottom line

Transfusion proved that continuous diffusion inside a language model isn’t just possible — it’s strictly better than discretizing images, and the field is now racing to build on that foundation. The single highest-impact next step: high-resolution generation with multi-image consistency for interleaved document production.
