Coursework · Generative AI
DreamBooth
Two-week reimplementation of DreamBooth (Ruiz et al., 2022) on Stable Diffusion v1.5, fine-tuned on three photos of an Apple Watch. CLIP-T above the paper, DINO and CLIP-I below.
Context
The Generative AI module at ESAIP runs over two weeks. Each team is assigned a paper to reimplement, drawn at random by the professor. Ours was DreamBooth (Ruiz et al., 2022).
Team of six. The graded deliverable was a 40-minute oral defense rather than the code itself, so we shared the implementation and worked through the theory together. I built the Streamlit interface around the pipeline.
The problem
Standard text-to-image models cannot generate a specific subject from a prompt alone — your watch, your dog, your friend. Naive fine-tuning on a few photos creates a worse problem than it solves: the model overfits to the subject and forgets what the broader class means. After a few hundred steps, prompting a watch generates only the trained instance. DreamBooth's contribution is a loss formulation that lets you fine-tune on the subject without losing the general concept.
The approach
Two losses run in parallel during training. The subject loss
is a standard diffusion MSE — noise added to the subject's
latents, the UNet predicts the noise, gradients flow. The
prior preservation loss runs on around 200 class images
generated by the frozen base model before training begins,
with the class prompt — a watch, no rare token. Both
losses sum with equal weight:
L = L_subject + λ × L_prior, with
λ = 1.0. Without the prior term, after a few
hundred steps the model forgets that watch refers to
the entire category — it just generates the trained instance.
With it, the general concept stays anchored.
We fine-tuned for 1000 steps with a learning rate of 5e-6: the
base model already knows what a watch is, we only needed to
slip in the information that [V_ohwx] refers to
this watch specifically. The text encoder was trained
alongside the UNet for better subject fidelity, at the cost of
more VRAM and more time — about a quarter of a day on the
hardware we had. Seed 1234 for reproducibility.
Result
Three reference metrics from the paper — DINO (subject fidelity), CLIP-I (image-image similarity), CLIP-T (text-image alignment) — measured on three prompts (a beach, a snowy scene, a Monet-style painting), four images per prompt.
CLIP-T came in at 0.32, above the paper's 0.305. DINO at 0.36 against the paper's 0.668; CLIP-I at 0.72 against 0.803. The model follows text prompts reliably; subject fidelity is below.
The gap is interpretable. We trained on three photos rather than the five to ten the paper assumes. An Apple Watch is a manufactured object with fine-grained details — screen text, crown button — that DINO captures less reliably because it was trained mostly on organic subjects. On the Monet-style prompt specifically, DINO drops to 0.24: a stylized treatment naturally diverges from the original photographs. CLIP-T above the paper is what matters in practice — the model integrates new prompts cleanly, which is the prerequisite for the rest.
Around the model
That kind of reading only holds up if the engineering underneath is sound. Three choices we made to keep it that way. Pydantic-validated YAML for the entire configuration — typed parameters, explicit defaults, no hardcoded numbers. Test-first development: every function has a failing test before its implementation. No silent fallback: preconditions raise.
The five-page Streamlit interface on top — subject registry, training launch, models, generation, evaluation — was not required by the module. A 1000-step run takes a quarter of a day, and watching metrics live is the only honest way to interpret what fine-tuning is doing to the model.