Lab · Deep Learning

Flowers Classifier

CNN vs ViT tournament on five flower species, graded on unseen images. ConvNeXt V2 Large for the CNN side, DeiT-Small/16 for the ViT side, both fine-tuned via timm with transfer learning and test-time augmentation. Plus ten theory questions on OCR and Vision Transformers.

Context

The Deep Learning module at ESAIP (S7) sets a series of labs alongside its larger projects. TP3 is the CNN-vs-ViT tournament: train two image classifiers on a small flower dataset, submit weights and an inference interface, and the teacher grades by ranking on a held-out test set the students never see. There are five classes: daisy, dandelion, rose, sunflower, and tulip. It is a solo lab. Two-thirds of the grade comes from the two model ranks (one third each), and the remaining third from a written exam on OCR and Vision Transformer theory.

The problem

The dataset is small enough that a model trained from scratch will memorise the training images rather than generalise. The grading set is unseen by design, so any model that overfits to the training set loses points. The brief flags this explicitly and recommends transfer learning, data augmentation, and early stopping. The real balance is between relying on a strong pretrained backbone (which supplies general-purpose visual features for free) and running a careful fine-tuning loop (which protects against the small-dataset trap).
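Of the three recommendations, early stopping is the easiest to show in isolation. The following is a minimal sketch, not the lab's actual training code: it assumes validation accuracy is the monitored metric, and the class name, the patience value, and the accuracy curve are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop once validation accuracy has stopped improving."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # smallest gain that counts as improvement
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best + self.min_delta:
            self.best = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative (made-up) validation curve that peaks at epoch 2.
stopper = EarlyStopping(patience=2)
history = [0.81, 0.88, 0.91, 0.90, 0.89, 0.92]
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break
```

In a real loop the weights from best_epoch would be saved whenever the score improves, so that the submitted checkpoint is the best one rather than the last one.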

The approach

For the CNN side, ConvNeXt V2 Large (convnextv2_large.fcmae_ft_in22k_in1k via timm): a modern CNN that absorbs ViT-era ideas (LayerNorm, GELU, inverted bottlenecks) while keeping the convolutional inductive bias. As the checkpoint tag indicates, it was pretrained self-supervised with FCMAE (a fully convolutional masked autoencoder), then fine-tuned on ImageNet-22k and ImageNet-1k. For the ViT side, DeiT-Small/16 (deit_small_patch16_224): a data-efficient Vision Transformer designed to train well on ImageNet-1k alone, without the pretraining sets of hundreds of millions of images that the original ViT relied on. Both were loaded through the same timm registry, which kept the comparison mechanically clean.

Both models share the same training discipline: standard ImageNet normalisation, data augmentation tuned for small datasets (random crops, horizontal flips, colour jitter), and early stopping on validation accuracy. At inference, the predict() function applies test-time augmentation: it averages the logits from the original image and its horizontal flip. This costs only one extra forward pass per image and typically shaves a fraction of a percent off the error rate in tournament-style evaluation.
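The flip-averaging step is small enough to show in full. This is a sketch under the assumption that images arrive as an already-normalised batch tensor of shape (N, 3, H, W); the function name tta_logits is hypothetical.

```python
import torch


@torch.no_grad()
def tta_logits(model, batch):
    """Average the logits of a batch and of its horizontal mirror image.

    batch: float tensor of shape (N, 3, H, W), already normalised.
    """
    model.eval()
    # Dimension 3 is the width axis, so flipping it mirrors left to right.
    flipped = torch.flip(batch, dims=[3])
    return (model(batch) + model(flipped)) / 2
```

The predicted class is then the argmax over the averaged logits. Averaging logits rather than softmax probabilities is what the text describes; either choice is common, and they can occasionally disagree on borderline images.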

The submission interface follows the teacher's exact contract: load_cnn_model(), load_vit_model(), and predict(model, image_path) returning one of the five class strings. Both checkpoints save the model name, class names, image size, and normalisation parameters alongside the weights, so the loader doesn't have to hardcode assumptions that the trainer might silently change.

The theory exam

A third of the grade is a written exam on the two units the module covered late in the semester: OCR (CER vs WER, the OCR pipeline, CTC loss, EAST vs CRAFT detectors, BiLSTM context) and Vision Transformers (patch tokenisation, the inductive-bias gap with CNNs, scaled dot-product attention, Swin's shifted windows, the role of the [CLS] token). Ten questions, one point each. The exam doesn't show on the tournament leaderboard, but it carries the same weight as either of the two model ranks, a reminder that explaining why an implementation makes sense matters as much as either model does.
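As an illustration of the first OCR topic, the sketch below computes CER and WER using their standard definitions: edit distance divided by reference length, measured over characters and over words respectively. The function names and the sample strings are made up for this example and are not taken from the exam.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the red tulip"
hypothesis = "the red tu1ip"  # one misread character
char_rate = cer(reference, hypothesis)  # 1 edit over 13 characters
word_rate = wer(reference, hypothesis)  # 1 wrong word out of 3
```

A single misread character gives a CER of 1/13 but a WER of 1/3, which is why the two metrics can rank the same OCR systems differently.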