OpenAI — October 2023
Every image generation model has a dirty secret: the training data captions are terrible. A photo of a golden retriever playing fetch on a beach at sunset gets labeled “dog.” A complex scene with three people, specific clothing, and spatial relationships gets a two-word alt-text.
DALL-E 3’s core insight: don’t build a better model — build better training data. They trained a custom image captioner that writes detailed, accurate descriptions of every image in the training set, then retrained the image generator on these synthetic captions. The quality jump was massive.
Real-world image-text datasets (LAION, etc.) are scraped from the web. The “captions” are actually:
| Source | Example Caption | What’s Actually in the Image |
|---|---|---|
| Alt text | “IMG_2847.jpg” | A family of four at a lake |
| Product listing | “Buy now!” | A red sneaker on white background |
| Social media | “vibes 🔥” | Sunset over Manhattan skyline |
| News article | “CEO announces merger” | Person at podium with specific background |
The model learns from what you tell it. If you train on garbage captions, the model learns a loose, noisy mapping between text and images. It can generate “a dog” but struggles with “a golden retriever wearing a blue bandana sitting to the LEFT of a tabby cat on a red couch.”
Step 1: Train a captioner. Take a vision-language model and fine-tune it to write extremely detailed image descriptions. Not “a dog” but “A golden retriever with a wet coat stands on a sandy beach, mouth open, catching a red tennis ball mid-air. The sun is setting behind, casting an orange glow on the wet sand.”
Step 2: Re-caption everything. Run this captioner over the entire training dataset. Every image now has a rich, detailed, accurate description instead of garbage alt-text.
Step 3: Retrain the generator. Same model architecture, same compute — but dramatically better training signal. The model can now learn fine-grained associations: spatial relationships, object counts, colors, textures, styles, text rendering.
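The three steps can be sketched as a minimal pipeline. `caption_model` here is a stub standing in for the fine-tuned captioner; a real run would batch this over hundreds of millions of images.

```python
# Sketch of the recaptioning pipeline. `caption_model` is a stub
# standing in for the fine-tuned image captioner (Step 1).

def caption_model(image_id):
    # Placeholder: a real captioner returns a detailed, accurate
    # description of the image's contents.
    return f"A detailed description of {image_id}"

def recaption_dataset(dataset):
    """Step 2: pair every image with a synthetic caption, keeping the
    original web caption around for the training-mix blend."""
    return [(img, caption_model(img), web) for img, web in dataset]

dataset = [("img_001", "dog"), ("img_002", "vibes")]
recaptioned = recaption_dataset(dataset)
# Step 3: retrain the generator on `recaptioned` -- same architecture,
# same loss, only the conditioning text has changed.
```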
This paper flipped the conventional wisdom. The AI field was obsessed with model architecture — bigger transformers, better diffusion schedulers, more parameters. DALL-E 3 proved that data quality trumps model complexity.
A mediocre model trained on excellent captions beats an excellent model trained on mediocre captions.
DALL-E 3 also uses GPT-4 at inference time to rewrite user prompts before generating. When you type “a cat,” GPT-4 expands it to a detailed scene description. This bridges the gap between how humans write prompts (short, vague) and how the model was trained (long, detailed captions).
Remember the Transfusion vs Chameleon debate? Both papers fought over model architecture (continuous vs. discrete). DALL-E 3 says: “You’re both missing the point — fix the data first, and the architecture matters less.” The data-centric insight is architecture-agnostic and benefits any generation approach.
DALL-E 3 is a data engineering paper disguised as a generation paper. The model is a standard diffusion model. The breakthrough is the synthetic recaptioning pipeline — proving that a custom-trained captioner can fix the fundamental data quality problem that limits all image generators.
DALL-E 3’s captioner is built on a CoCa-style (Contrastive Captioner) architecture — a vision-language model that combines two training objectives: contrastive learning (CLIP-style image-text matching) and autoregressive captioning (generating text descriptions token by token).
Image → Vision Encoder (ViT) → Cross-attention → Language Decoder → Caption
They took a pre-trained CoCa model and fine-tuned it on a curated dataset of high-quality, human-written image descriptions — not web-scraped alt-text, but detailed descriptions specifically collected for this purpose.
| Caption Type | Purpose | Example |
|---|---|---|
| Short | Quick identification (~5–10 words) | “A golden retriever on a beach at sunset” |
| Descriptive | Exhaustive detail (~50–150 words) | “A golden retriever with a wet, dark-gold coat stands on packed sand near the waterline. Its mouth is open mid-pant, tongue hanging left. Behind the dog, gentle waves break with white foam. The sky is a gradient from deep orange at the horizon to pale blue above…” |
The descriptive captions capture: objects and attributes (wet coat, dark-gold), spatial relationships (stands on, behind the dog), counts (no other people or animals), style (low angle, eye level), and negatives (what’s NOT in the image).
The captioner was trained on a relatively small curated dataset (hundreds of thousands, not billions). But because each example was a high-quality, detailed description, the captioner learned to generalize this level of detail to any image. Quality of supervision > quantity of supervision.
| Type | Length | Content |
|---|---|---|
| Ground truth | Variable | Original web-scraped alt-text, preserved as-is |
| Short synthetic | ~20 tokens | Brief accurate description generated by the captioner |
| Long synthetic | ~100–200 tokens | Exhaustive descriptive caption from the captioner |
They blend synthetic and original captions:
Training batch composition:
├─ 95% synthetic captions (long descriptive)
└─ 5% ground truth (original web captions)
Why keep 5% ground truth? Two reasons: it regularizes the generator against the captioner’s stylistic quirks and systematic errors, and it keeps the model matched to the distribution of short, informal text that real users actually type.
| Training Mix | Human Preference | Prompt Following |
|---|---|---|
| 100% ground truth (baseline) | 48% | Low |
| 100% synthetic long | 65% | High for detailed, poor for short |
| 95% synthetic + 5% ground truth | 71.7% | High across both |
The 95/5 blend outperforms both extremes. Pure synthetic loses distribution matching; pure ground truth has the noisy caption problem.
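The batch composition amounts to a per-example coin flip. A minimal sketch (the 95/5 ratio is the paper’s; the helper name is ours):

```python
import random

def pick_caption(long_synthetic, web_caption, p_synthetic=0.95, rng=random):
    """Sample the conditioning caption for one training example,
    mixing synthetic and original captions at a fixed ratio
    (95/5 in the paper)."""
    if rng.random() < p_synthetic:
        return long_synthetic
    return web_caption

rng = random.Random(0)
picks = [pick_caption("long synthetic caption", "alt-text", rng=rng)
         for _ in range(10_000)]
frac_synthetic = picks.count("long synthetic caption") / len(picks)
# frac_synthetic lands near 0.95 by construction
```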
Spatial awareness:
Web caption: “family photo”
Synthetic: “Three adults stand in a row. A woman in a blue
dress is on the left, a tall man in a gray suit
is in the center...”
Counting:
Web caption: “flowers”
Synthetic: “Seven sunflowers in a clear glass vase. Five
are fully bloomed, two are still partially closed.”
Text recognition:
Web caption: “street scene”
Synthetic: “A busy city street with a red stop sign
reading ‘STOP’. Behind it, a green street
sign reads ‘BROADWAY’.”
User prompt: “a cat”
↓
GPT-4 rewrite
↓
Expanded: “A fluffy orange tabby cat sits on a windowsill,
looking out at a rainy day. Soft natural light illuminates
its fur. The window frame is white-painted wood...”
↓
DALL-E 3 diffusion model
↓
Generated image
GPT-4 is instructed to: preserve user intent, add plausible details, be specific, and vary outputs. Users can opt out and see the rewritten prompt. This bridges the distribution gap between how users write and how the model was trained.
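The rewrite stage can be sketched as a thin wrapper around an LLM call. The instruction text below paraphrases the behaviors described above, not OpenAI’s actual system prompt, and `rewrite_fn` is a stand-in for the GPT-4 call:

```python
# Hypothetical prompt-upsampling wrapper. UPSAMPLER_INSTRUCTIONS is a
# paraphrase for illustration, not the real system prompt.

UPSAMPLER_INSTRUCTIONS = (
    "Rewrite the user's image prompt into a detailed scene description. "
    "Preserve the user's intent, add plausible visual details, be "
    "specific about objects, colors, and layout, and vary your outputs."
)

def upsample_prompt(user_prompt, rewrite_fn):
    """Expand a short user prompt into a detailed caption-style prompt."""
    try:
        expanded = rewrite_fn(UPSAMPLER_INSTRUCTIONS, user_prompt)
    except Exception:
        expanded = None
    # Fall back to the raw prompt if rewriting fails or returns nothing.
    return expanded if expanded else user_prompt

# Stubbed rewriter for demonstration:
expanded = upsample_prompt("a cat",
                           lambda sys, p: f"A detailed scene featuring {p}")
```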
T2I-CompBench (automated) tests compositional generation:
| Category | What It Tests | Example |
|---|---|---|
| Attribute binding | Correct color/texture on correct object | “A red cube and a blue sphere” |
| Spatial relationships | Objects in correct positions | “A cat sitting on top of a piano” |
| Object count | Correct number of objects | “Three apples on a wooden table” |
Human preference studies: Side-by-side comparisons with SDXL, rated by human evaluators for quality and prompt following. DALL-E 3 achieved 71.7% preference.
De-emphasis of FID: FID measures distributional similarity to real images, not prompt following. A model can have great FID but still generate the wrong objects and colors. DALL-E 3 prioritized compositional accuracy and human preference over FID.
The 95/5 training mix is the paper’s most practical contribution. It shows that synthetic data alone isn’t enough — you need a small amount of real-world signal to maintain distribution matching. This principle applies to any synthetic data pipeline.
The captioner isn’t trained from scratch. OpenAI starts with a pre-trained CoCa model and fine-tunes it:
| Starting Point | Caption Quality | Why |
|---|---|---|
| CLIP encoder + random decoder | Mediocre | Good vision, no generation ability yet |
| CoCa (contrastive + captioning) | Best | Both vision understanding AND text generation warm-started |
| Pure captioning model (no contrastive) | Good but less robust | Generates fluently but misidentifies objects more often |
The contrastive pretraining gives the vision encoder discriminative features — it knows the difference between a golden retriever and a labrador. Without it, the captioner writes fluent but less precise descriptions.
One training principle is crucial: the captioner should describe what it sees, not what it knows. This teaches the generator to render visual properties, not pattern-match labels.
At recaptioning time, the captioner uses nucleus sampling (top-p = 0.9) with temperature τ = 0.7, rather than beam search:
| Method | Behavior | Tradeoff |
|---|---|---|
| Beam search | Deterministic, picks highest-probability sequence | Repetitive, generic — converges to safe, bland descriptions |
| Nucleus sampling | Samples randomly from top-p probability mass | Diverse but occasionally inconsistent — same image gets different captions |
Nucleus sampling’s adaptive nucleus size is key: when the model is confident, the nucleus shrinks (near-deterministic). When uncertain, it expands (more creative). Diversity in captions → diversity in what the generator learns.
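The decoding scheme is standard top-p sampling; a self-contained sketch with the stated hyperparameters (top-p = 0.9, temperature = 0.7):

```python
import math
import random

def nucleus_sample(logits, top_p=0.9, temperature=0.7, rng=random):
    """Top-p (nucleus) sampling with temperature over one token's logits."""
    # Temperature-scaled softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Smallest set of tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize within the nucleus and sample from it.
    r = rng.random() * mass
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]

# A confident distribution collapses to a near-deterministic choice:
token = nucleus_sample([10.0, 0.0, 0.0, 0.0])
```

Note how the adaptive nucleus falls out of the loop: one dominant logit fills the top-p mass alone, while a flat distribution keeps several candidates in play.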
The paper is deliberately vague about the generator (proprietary), but the architecture is a U-Net with text cross-attention:
Noisy latent z_t
        ↓
┌──────────────────────────────────┐
│              U-Net               │
│  Down: [Conv → ResBlock → Attn]  │
│            Bottleneck            │
│  Up:   [Conv → ResBlock → Attn]  │ ← Cross-attention to
│                                  │   text embeddings
└──────────────────────────────────┘
        ↓
Predicted noise ε_θ
Training objective: Standard DDPM noise prediction:
L_diffusion = E[||ε - ε_θ(z_t, t, c)||²]
Where:
z_0 = clean latent (image encoded by VAE)
z_t = noisy latent at timestep t
ε = actual noise added
c = text conditioning (caption embedding)
The only change from DALL-E 2 is c. Same architecture, same loss, same noise schedule — but c went from garbage alt-text embeddings to rich descriptive caption embeddings. That’s the entire paper’s contribution to the diffusion model itself.
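The objective can be written out for a single example in a few lines of plain Python (lists instead of tensors; `alpha_bar_t` comes from the noise schedule, which the paper does not specify):

```python
import math
import random

def noisy_latent(z0, eps, alpha_bar_t):
    """Forward process: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def ddpm_loss(eps, eps_pred):
    """L_diffusion = ||eps - eps_theta(z_t, t, c)||^2, one sample (MSE)."""
    return sum((x - y) ** 2 for x, y in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
z0 = [rng.gauss(0, 1) for _ in range(8)]    # clean latent (VAE-encoded image)
eps = [rng.gauss(0, 1) for _ in range(8)]   # sampled Gaussian noise
z_t = noisy_latent(z0, eps, alpha_bar_t=0.5)
perfect = ddpm_loss(eps, eps)               # a perfect predictor scores 0
```

The caption embedding `c` enters only through the network `eps_theta`, which is exactly why changing the captions changes nothing else in this loop.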
At inference, DALL-E 3 uses CFG to amplify text conditioning:
ε̂_θ = ε_θ(z_t, t, ∅) + s · [ε_θ(z_t, t, c) - ε_θ(z_t, t, ∅)]
Run the model twice — once with the caption, once without. The difference is the “direction” the caption pushes the image. Scale s amplifies that direction (typically 7–15). Higher s = more faithful to the prompt but less diverse. During training, captions are randomly dropped ~10% of the time to enable this.
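Both pieces, guidance at inference and caption dropout at training, are a few lines each. A sketch using the formula and the ~10% drop rate above:

```python
import random

def cfg_noise(eps_uncond, eps_cond, s=7.5):
    """Classifier-free guidance:
    eps_hat = eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def maybe_drop_caption(caption, p_drop=0.1, rng=random):
    """Training-time caption dropout (~10%): the model also learns the
    unconditional prediction eps_theta(z_t, t, null)."""
    return "" if rng.random() < p_drop else caption

# s = 1 recovers the conditional prediction; larger s pushes harder
# along the caption direction.
eps_hat = cfg_noise([0.0, 0.0], [1.0, -1.0], s=7.5)
```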
| Parameter | Value |
|---|---|
| Dataset size | ~600M+ images |
| Caption types per image | 2 (short + long) |
| Total captions generated | ~1.2B+ |
| Compute cost | Thousands of GPU-hours |
Quality filtering: Length filter (10–300 tokens), repetition filter, hallucination detection via CLIP alignment scores. Captions with alignment below a threshold are regenerated or replaced with the original ground truth.
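A sketch of the filtering logic. The 10–300 token range is the one stated above; the CLIP-alignment cutoff and the fallback-to-ground-truth behavior are illustrative placeholders:

```python
def filter_caption(synthetic, web_caption, n_tokens, clip_alignment,
                   min_tokens=10, max_tokens=300, min_alignment=0.25):
    """Keep a synthetic caption only if it passes the length and
    CLIP-alignment checks; otherwise fall back to the original web
    caption. min_alignment is an illustrative threshold, not the
    paper's value. (A repetition filter would slot in here too.)"""
    if not (min_tokens <= n_tokens <= max_tokens):
        return web_caption
    if clip_alignment < min_alignment:   # likely hallucination
        return web_caption
    return synthetic
```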
| Avg Caption Length | Human Preference |
|---|---|
| ~10 tokens (short only) | 58% |
| ~50 tokens (medium) | 66% |
| ~120 tokens (long descriptive) | 71% |
| ~200+ tokens (extremely verbose) | 69% |
There’s a sweet spot. Too short = insufficient detail. Too long = noise and redundancy creep in.
| Captioner CIDEr | Generator Human Pref |
|---|---|
| 95 (weak) | 59% |
| 115 (medium) | 65% |
| 135 (strong) | 71.7% |
Near-linear relationship. Better captioner → better generator. The pipeline hasn’t saturated — an even better captioner would likely produce an even better generator.
A better generator enables a better captioner through synthetic training pairs:
Step 1: Use generator to create images from detailed prompts
Prompt → Generator → [perfectly-aligned image]
Step 2: Use (prompt, generated_image) pairs to fine-tune captioner
Captioner learns precise descriptions from perfectly-aligned data
The cycle:
Weak captioner → OK generator
OK generator → better training pairs → better captioner
Better captioner → better generator → ...
Catch: The flywheel can amplify errors. Real images in the mix (that 5% ground truth) keep the loop grounded in reality.
T2I safety classifier filters both prompts and generated images. Provenance metadata (C2PA) embedded in outputs. Prompt-level filtering prevents generation of public figures and harmful content.
The diffusion model is essentially unchanged from DALL-E 2. The only variable that changed is the conditioning signal c. This is the purest possible demonstration that better data > better architecture for image generation.
DALL-E 3 triggered a paradigm shift. Every major lab adopted recaptioning within months. Here’s what it sparked and what comes next.
Prompts with 4+ objects and complex spatial relationships degrade. Colors bleed, attributes bind to wrong objects. Better captions help but don’t solve the diffusion model’s fundamental attribute binding problem — “red” needs to attach to “cube” not “sphere,” and cross-attention’s capacity for this scales poorly.
Diffusion models lack an explicit counting mechanism. The model learns statistical associations (“7 apples” correlates with “many apples”), not discrete counting. The captioner correctly writes “7” but the generator can’t enforce it.
DALL-E 3 renders “STOP” and “OPEN” correctly (short, common words memorized as visual patterns), but novel strings >~10 characters get garbled. The model learned text as visual textures, not as a compositional character-level system.
Despite nucleus sampling, the captioner describes all images in the same clinical “museum placard” style. Slang, humor, technical jargon, and cultural context from original web captions get flattened. The generator loses some ability to produce diverse visual styles from informal prompts.
DALL-E 3 established recaptioning as standard practice. Every subsequent SOTA model adopted it:
| Model | Date | Captioner Used | Key Difference |
|---|---|---|---|
| DALL-E 3 | Oct 2023 | Custom CoCa fine-tune | First to prove the approach |
| Stable Diffusion 3 | Feb 2024 | CogVLM | Added rectified flow + MM-DiT |
| Imagen 3 (Google) | Mid 2024 | Gemini-based captioner | Leveraged stronger vision model |
| FLUX (Black Forest Labs) | Aug 2024 | T5-based + recaptioning | Open-weights, flow matching |
| Ideogram 2 | Late 2024 | Proprietary | Best-in-class text rendering |
DALL-E 3 era:
Text encoder → U-Net with cross-attention → Image
(Convolutional backbone, attention at select layers)
Post-DALL-E 3:
Text encoder → Pure transformer (DiT) → Image
(No convolutions, attention everywhere, scales better)
DiT replaces the U-Net with a pure transformer, scaling more predictably. SD3 and FLUX adopted this. DiT + recaptioning pushed quality beyond DALL-E 3.
DDPM (DALL-E 3):
Forward: add noise gradually over 1000 steps
Reverse: predict and remove noise over 1000 steps
Path: curved, complex noise schedule
Flow Matching (SD3, FLUX):
Forward: linear interpolation from data to noise
Reverse: follow straight path back
Path: straight lines, simpler, faster
Flow matching achieves comparable quality with fewer denoising steps (20–30 vs 50–100), making generation faster.
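The two forward processes side by side, for a single scalar value (illustrative, not the models’ actual schedules):

```python
import math

def ddpm_forward(x0, eps, alpha_bar_t):
    """DDPM forward: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps.
    The path from data to noise is curved, shaped by the schedule."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

def flow_forward(x0, eps, t):
    """Flow matching forward: x_t = (1 - t)*x0 + t*eps.
    A straight line from data (t=0) to noise (t=1)."""
    return (1.0 - t) * x0 + t * eps
```

Both interpolate between data and noise; the straight-line path is what lets flow-matching samplers take larger, fewer steps on the way back.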
DALL-E 3: [Single text encoder] → cross-attention → U-Net
SD3:  [CLIP-L] ───┐
      [CLIP-G] ───┼→ concatenate → cross-attention → DiT
      [T5-XXL] ───┘
FLUX: [CLIP-L] ───┐
      [T5-XXL] ───┴→ concatenate → cross-attention → DiT
SD3 uses three encoders; FLUX uses two (dropping CLIP-G).
Each captures different aspects:
CLIP-L: visual concepts, style
CLIP-G: scene composition, global semantics (SD3 only)
T5-XXL: language understanding, long prompts
The hottest shift post-DALL-E 3 is moving away from diffusion entirely. Autoregressive models (Transfusion, Chameleon) generate image tokens left-to-right, like writing text. They naturally unify text + image generation with variable resolution. Both approaches still benefit from recaptioning — the data insight is architecture-agnostic.
DALL-E 3 treated the captioner as a preprocessing step. The frontier is making the captioner and generator the same model — a single unified model that can both understand and generate images (Gemini, GPT-4o, Transfusion). No separate pipeline, no distribution mismatch.
The same principle applies to video, but captions must describe temporal dynamics: actions, transitions, cause-and-effect, camera motion. A 10-second clip needs a caption covering what happens over time, not just a static description.
| Approach | Idea | Status |
|---|---|---|
| Layout-guided generation | Specify bounding boxes for each object | Works but requires extra input |
| Attention manipulation | Directly edit cross-attention maps | Promising (Attend-and-Excite) |
| LLM-planned generation | LLM decomposes scene, generates each object separately | Emerging |
| Autoregressive | Sequential token generation naturally handles binding | Most promising long-term |
| Metric | Measures | Limitation |
|---|---|---|
| FID | Distribution similarity | Ignores prompt following |
| CLIP Score | Text-image alignment | Coarse — misses fine-grained errors |
| T2I-CompBench | Compositional accuracy | Limited prompt categories |
| GenAI-Bench (2024) | Holistic generation quality | Newer, less validated |
| DPG-Bench (2024) | Dense prompt following | Promising, tests long prompts |
No single metric captures “is this a good image for this prompt?” The field still relies on human evals for serious comparisons.
Before DALL-E 3: “How do we make image generation better?”
→ Bigger model
→ Better architecture
→ More training compute
→ Better noise schedule
After DALL-E 3: “How do we make image generation better?”
→ Better training data descriptions ← new top priority
→ Then bigger model
→ Then better architecture
This mirrors the broader data-centric AI movement: for mature architectures, improving data quality yields more than improving the model. DALL-E 3 is the most dramatic demonstration in generative AI.
| Dimension | Rating | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ 4/5 | Core idea (fix the data) is simple but the execution and proof are novel. Not a new architecture. |
| Impact | ⭐⭐⭐⭐⭐ 5/5 | Changed standard practice for the entire field within months. |
| Reproducibility | ⭐⭐ 2/5 | Key components are proprietary (captioner, GPT-4 upsampling, training data). |
| Technical depth | ⭐⭐⭐ 3/5 | Deliberate simplicity — the contribution is the pipeline, not complex math. |
| Writing quality | ⭐⭐⭐⭐ 4/5 | Clear, well-structured, honest about limitations. |
| Longevity | ⭐⭐⭐⭐⭐ 5/5 | The data-centric insight will outlast every specific architecture. |
A field-defining paper that won by asking a different question. While everyone competed on model architecture, OpenAI asked “what if the training labels are just bad?” and proved the answer mattered more than anyone expected. Every image generator released since — SD3, FLUX, Imagen 3, Ideogram — uses synthetic recaptioning. The data-centric insight is DALL-E 3’s most lasting contribution, and it transfers to video, 3D, and any future modality.
| Improvement vector | Status | Key work |
|---|---|---|
| Better architectures (DiT, flow matching) | Addressed by SD3/FLUX | Complementary to recaptioning, not replacement |
| Triple text encoders | Addressed by SD3/FLUX | Richer conditioning signal |
| Autoregressive generation | Area to explore | Transfusion, Chameleon — still benefits from recaptioning |
| Unified captioner-generator | Area to explore | Gemini, GPT-4o — no separate pipeline |
| Video recaptioning | Area to explore | Temporal descriptions, camera motion |
| Compositional generation | Partially explored | Layout-guided, attention manipulation, LLM-planned |