Coursework · AI Challenge V2025

PatchCore

A one-day exam: reimplement PatchCore (Roth et al., 2022) from the paper, without using the open-source implementation that already exists. By 8 PM, the reimplementation reached 99.03% image AUROC and 98.01% pixel AUROC on MVTec AD, within 0.13 percentage points of the official Amazon Science baseline, with a Streamlit dashboard for the operator on top.

Context

The AI Challenge module at ESAIP discloses its subject only on the morning of the exam. By 8 PM the same day, a working POC and a written report must be submitted — no preparation, no preceding course. The 2025 subject was industrial defect detection on MVTec AD.

Before starting, I proposed using anomalib — the open-source PatchCore implementation maintained by Intel's Open Edge Platform team, Apache 2.0, with the best published numbers on MVTec AD and a complete dashboard already in place. The exercise required us to act as if it did not exist. So I rebuilt the method from the paper instead.

Team of three. I led the technical implementation end-to-end — model, evaluation, alignment with the official baseline, Streamlit dashboard — while my teammates supported the surrounding deliverables, including the report that counted for half of the grade.

The problem

Cold-start defect detection on industrial parts is a hard problem because anomalies are rare, varied, and unknown at training time. The standard approach is to learn a representation of "normal" parts and flag anything statistically distant. PatchCore (Roth et al., 2022) set the bar for this approach by combining frozen pretrained features, patch-level memory banks, and coreset reduction.

The approach

I reimplemented the full pipeline in PyTorch, GPU-first and without FAISS. Features come from a frozen WideResNet50, layers 2 and 3, patchified via Unfold with patchsize 3 and stride 1, then reduced per layer to 1024 dimensions before final aggregation. The coreset is built with approximate greedy reduction at 10%, projected to 128 dimensions, with 10 starting points. Nearest-neighbor search uses exact L2 via torch.cdist on CUDA, processed in chunks. Anomaly heatmaps use bilinear upsampling to 224×224 and Gaussian smoothing with σ = 4.
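As a rough illustration of two of the steps above, the sketch below shows a plain k-center greedy coreset selection and a chunked exact L2 nearest-neighbor lookup with torch.cdist. It is a simplified sketch, not the code written during the exam: it uses a single random starting point and omits the 128-dimensional projection and the 10 starting points described above. The function names and the chunk_size and seed parameters are hypothetical.

```python
import torch


def greedy_coreset_indices(features: torch.Tensor, ratio: float = 0.1, seed: int = 0) -> torch.Tensor:
    """Plain k-center greedy selection (simplified: one random start, no projection).

    Repeatedly picks the point that is farthest from everything selected so far.
    """
    n = features.shape[0]
    n_select = max(1, int(n * ratio))
    gen = torch.Generator().manual_seed(seed)
    first = int(torch.randint(n, (1,), generator=gen))
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    min_dist = torch.cdist(features, features[first:first + 1]).squeeze(1)
    for _ in range(n_select - 1):
        nxt = int(torch.argmax(min_dist))
        selected.append(nxt)
        new_dist = torch.cdist(features, features[nxt:nxt + 1]).squeeze(1)
        min_dist = torch.minimum(min_dist, new_dist)
    return torch.tensor(selected)


def nearest_patch_distances(queries: torch.Tensor, memory_bank: torch.Tensor,
                            chunk_size: int = 4096) -> torch.Tensor:
    """Exact L2 distance from each query patch to its nearest memory-bank entry.

    Queries are processed in chunks so the (chunk, M) distance matrix fits in memory.
    """
    out = []
    for start in range(0, queries.shape[0], chunk_size):
        chunk = queries[start:start + chunk_size]
        dists = torch.cdist(chunk, memory_bank)  # Euclidean (p=2) by default
        out.append(dists.min(dim=1).values)
    return torch.cat(out)
```

Both functions run on whatever device their inputs live on, so moving the feature tensors to CUDA beforehand is enough to make the search GPU-side. The chunk size trades peak memory against the number of cdist calls.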

Result

On MVTec AD's 15 categories, average image-level AUROC reached 99.03% and pixel-level AUROC reached 98.01%, within 0.13 percentage points of the official Amazon Science reference. Both figures were reached by 8 PM the same day, working from the paper.

What made the difference

Getting to within 0.13 points of the official baseline didn't come from a clever idea; it came from three forms of discipline. Pixel-level AUROC is not as simple as it looks: most reproductions skip edge cases or use library shortcuts that drift from the paper's exact definition. Getting that right was the prerequisite for any meaningful comparison.
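To make the metric concrete, here is a minimal sketch of one common convention, assumed here rather than taken from the exam code: pool every pixel of every test image in a category and compute a single AUROC against the binarized ground-truth masks. The AUROC is computed from ranks (the Mann-Whitney statistic), with tied scores sharing their average rank, and raises an error when only one class is present. The function names are illustrative.

```python
def auroc(scores, labels):
    """AUROC from the Mann-Whitney U statistic, using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC is undefined unless both classes are present")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores [i, j) and give each member the mean rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def pixel_auroc(heatmaps, masks):
    """Pool all pixels of all images, then compute one AUROC over the pooled set.

    heatmaps and masks are lists of 2D nested lists with matching shapes;
    any mask value greater than 0 counts as a defect pixel.
    """
    scores = [s for hm in heatmaps for row in hm for s in row]
    labels = [m > 0 for mask in masks for row in mask for m in row]
    return auroc(scores, labels)
```

Averaging per-image AUROCs instead of pooling pixels gives a different number and fails on defect-free images, which is one way two reproductions can disagree while both reporting "pixel AUROC".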

The image score uses raw_max over patch scores, not a reweighted variant; switching to raw_max produced an immediate jump on the categories that were furthest behind. The patch geometry (patchsize 3, stride 1, grid alignment after Unfold) matches the reference implementation exactly. Deterministic seeding everywhere kept the runs reproducible end to end.
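The three points above can be sketched in a few lines of PyTorch. This is an illustrative sketch with hypothetical helper names, not the exam code. The padding of (patchsize - 1) // 2 is an assumption: it is what keeps the patch grid the same size as the feature map at stride 1, but the text above does not state it.

```python
import random

import torch
import torch.nn.functional as F


def seed_everything(seed: int = 0) -> None:
    """Seed the Python and PyTorch RNGs (CPU and, if present, CUDA)."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


def patchify(feature_map: torch.Tensor, patchsize: int = 3, stride: int = 1):
    """(B, C, H, W) -> (B, L, C * patchsize**2) patches plus the (h, w) patch grid.

    Padding of (patchsize - 1) // 2 is an assumption made for this sketch.
    """
    padding = (patchsize - 1) // 2
    unfolded = F.unfold(feature_map, kernel_size=patchsize, stride=stride, padding=padding)
    h = (feature_map.shape[-2] + 2 * padding - patchsize) // stride + 1
    w = (feature_map.shape[-1] + 2 * padding - patchsize) // stride + 1
    return unfolded.transpose(1, 2), (h, w)


def image_score(patch_scores: torch.Tensor) -> torch.Tensor:
    """Raw max over patch scores: (B, L) or (B, h, w) -> (B,)."""
    return patch_scores.flatten(1).max(dim=1).values
```

Returning the (h, w) grid alongside the patches makes it straightforward to reshape per-patch distances back into a 2D map before upsampling, which is where grid misalignment tends to creep in.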

The dashboard

Behind the metrics is the actual deliverable. A Streamlit interface sits on top of the pipeline as a small operator UX, not a demo gimmick. The operator picks a category, scrolls through inspected images alongside their anomaly heatmaps, applies an accept/reject filter, and the screen surfaces the most suspicious patch driving each reject decision. Quality teams looking at this kind of system don't need a static report — they need a screen they can scan and act on.

[Screenshot] Dashboard overview: sidebar with category, decision, and threshold filters; main panel showing inspection metrics, score histogram, and the per-image inspection table.
[Screenshot] Per-image explanation view, reached by clicking into a flagged item: the original part next to its anomaly heatmap, with the patch that drove the reject decision highlighted and the score, threshold, and decision listed above.