Coursework · Generative AI

DreamBooth

Two-week reimplementation of DreamBooth (Ruiz et al., 2022) on Stable Diffusion v1.5, fine-tuned on three photos of an Apple Watch. CLIP-T above the paper, DINO and CLIP-I below.

Context

The Generative AI module at ESAIP runs over two weeks. Each team is assigned a paper to reimplement, drawn at random by the professor. Ours was DreamBooth (Ruiz et al., 2022).

Team of six. The graded deliverable was a 40-minute oral defense rather than the code itself, so we shared the implementation and worked through the theory together. I built the Streamlit interface around the pipeline.

The problem

Standard text-to-image models cannot generate a specific subject from a prompt alone — your watch, your dog, your friend. Naive fine-tuning on a few photos creates a worse problem than it solves: the model overfits to the subject and forgets what the broader class means. After a few hundred steps, prompting a watch generates only the trained instance. DreamBooth's contribution is a loss formulation that lets you fine-tune on the subject without losing the general concept.

The approach

Two losses run in parallel during training. The subject loss is a standard diffusion MSE — noise added to the subject's latents, the UNet predicts the noise, gradients flow. The prior preservation loss runs on around 200 class images generated by the frozen base model before training begins, with the class prompt — a watch, no rare token. Both losses sum with equal weight: L = L_subject + λ × L_prior, with λ = 1.0. Without the prior term, after a few hundred steps the model forgets that watch refers to the entire category — it just generates the trained instance. With it, the general concept stays anchored.

We fine-tuned for 1000 steps with a learning rate of 5e-6: the base model already knows what a watch is, we only needed to slip in the information that [V_ohwx] refers to this watch specifically. The text encoder was trained alongside the UNet for better subject fidelity, at the cost of more VRAM and more time — about a quarter of a day on the hardware we had. Seed 1234 for reproducibility.

Training photo: the watch on a wooden surface beside a red plush strawberry — The three photos given to the model.

Training photo: the watch resting on a white surface — The three photos given to the model.

Result

Three reference metrics from the paper — DINO (subject fidelity), CLIP-I (image-image similarity), CLIP-T (text-image alignment) — measured on three prompts (a beach, a snowy scene, a Monet-style painting), four images per prompt.

CLIP-T came in at 0.32, above the paper's 0.305. DINO at 0.36 against the paper's 0.668; CLIP-I at 0.72 against 0.803. The model follows text prompts reliably; subject fidelity is below.

The gap is interpretable. We trained on three photos rather than the five to ten the paper assumes. An Apple Watch is a manufactured object with fine-grained details — screen text, crown button — that DINO captures less reliably because it was trained mostly on organic subjects. On the Monet-style prompt specifically, DINO drops to 0.24: a stylized treatment naturally diverges from the original photographs. CLIP-T above the paper is what matters in practice — the model integrates new prompts cleanly, which is the prerequisite for the rest.

Generated: the watch on a pebble beach — Generated samples in three contexts: a beach, a snowy scene, and a Monet-style painting.

Generated: the watch in a snowy scene — Generated samples in three contexts: a beach, a snowy scene, and a Monet-style painting.

Around the model

That kind of reading only holds up if the engineering underneath is sound. Three choices we made to keep it that way. Pydantic-validated YAML for the entire configuration — typed parameters, explicit defaults, no hardcoded numbers. Test-first development: every function has a failing test before its implementation. No silent fallback: preconditions raise.

The five-page Streamlit interface on top — subject registry, training launch, models, generation, evaluation — was not required by the module. A 1000-step run takes a quarter of a day, and watching metrics live is the only honest way to interpret what fine-tuning is doing to the model.