Classifying Content with Foundation Models
Premise
Apps like Pinterest, Spotify, Instagram, and X use classification to label content and serve it better — recommendations, feeds, discovery all depend on it. Images, music, text — each modality has its own pipeline, but the underlying mechanics are similar.
Images are the easiest modality to inspect by eye, so we’ll use them as the primary example — though the same principles apply to text, audio, and anything else you can embed.
Two ways to classify from an embedding: zero-shot or a trained head
Foundation Models
CLIP is the most popular but not the only option with a shared embedding space:
- SigLIP — drop-in alternative to CLIP, increasingly the go-to choice. Better embeddings, same API.
- MetaCLIP — same CLIP architecture, just better training data curation.
- EVA-CLIP — same architecture with improved training recipes.
- ALIGN — Google’s take on the same dual-encoder idea, trained on noisy alt-text at scale. The original weights were never fully open-sourced; some reproductions exist.
These all work the same way — encode image and text into the same vector space, compare with cosine similarity. They’re all fast, batchable, and work for both zero-shot and as embedding backbones for training heads.
One caveat: if you swap the foundation model (say, CLIP for SigLIP), every stored embedding is invalidated. You re-embed your entire corpus. Choose your backbone carefully — it’s a commitment.
Zero-Shot Classification
Models like CLIP can classify images against arbitrary text labels with no training data. Provide candidate labels (“cat”, “dog”, “car”), the model embeds both the image and each label into the same vector space, and the highest cosine similarity wins. This works well for broad categories — “human” vs “no human” is reliable for obvious cases, though it falls apart at the edges (a tiny face on a badge, a statue, a cartoon character).
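The mechanics reduce to cosine similarity in a shared vector space. A minimal sketch (the vectors here are toy stand-ins for what CLIP’s image and text encoders would actually produce):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot(image_emb, label_embs):
    # Score the image against every candidate label; highest similarity wins.
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for CLIP encoder outputs.
image_emb = np.array([0.9, 0.1, 0.2])
label_embs = {
    "cat": np.array([0.8, 0.2, 0.1]),
    "dog": np.array([0.1, 0.9, 0.3]),
    "car": np.array([0.2, 0.1, 0.9]),
}

best, scores = zero_shot(image_emb, label_embs)
print(best, scores)
```

In a real pipeline the label embeddings come from encoding prompt strings ("a photo of a cat") through the model’s text tower, and are computed once per label, not per image.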
Where it breaks down: content rating (G vs PG vs R vs X). The boundary between PG and R is culturally defined, context-dependent, and not something CLIP learned from alt-text on the internet. Zero-shot has no way to learn a specific threshold.
Here’s a real example — a golden retriever and a gray cat cuddling, classified with CLIP (openai/clip-vit-base-patch32). Source script. This image is tricky for zero-shot because the dog dominates the frame — CLIP embeds the whole scene at once and the cat gets lost in the dog’s visual weight. Each prompt is a binary contest: positive label vs negative label, highest similarity wins.
Round 1 — simple positive/negative framing:
| Positive label | Negative label | Result | Correct? |
|---|---|---|---|
| contains a cat: positive | contains a cat: negative | positive (0.654) | barely |
| contains only a cat: positive | contains only a cat: negative | positive (0.744) | no |
| pictures a family: positive | pictures a family: negative | positive (0.769) | debatable |
Round 2 — descriptive labels:
| Positive label | Negative label | Result | Correct? |
|---|---|---|---|
| a photo containing a cat | a photo with no cat | negative (0.966) | no |
| a photo of only a cat | a photo of a cat with other animals | negative (0.957) | yes |
| a family portrait photo | a photo that is not a family portrait | positive (0.834) | debatable |
Better prompts fixed “only a cat” but broke “contains a cat.” CLIP does scene matching, not object detection — it compares the whole image holistically, and the negative label shapes the result as much as the positive. Round 1 gave low-confidence wrong answers; round 2 gave high-confidence wrong answers. There’s no prompt-engineering around this.
A trained head would get this right. The cat’s features are captured in the embedding — the head just learns to look for them from labeled examples, regardless of what else is in the frame or how a prompt is worded.
What Is a Training Head
A foundation model (CLIP, ViT, etc.) produces embeddings — dense vector representations of content. An embedding captures what an image “is about” as a list of numbers, where similar images end up with similar numbers. A training head is a small neural network layer trained on top of those embeddings for a specific classification task. The foundation model stays frozen; only the head is trained.
Why this matters: you get the representation power of a model trained on billions of examples, but you only need a small labeled dataset (20–100+ images per category) to teach it your specific taxonomy. Training is fast (minutes, not days), cheap, and the head is tiny — easy to version, swap, or A/B test independently of the base model.
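A minimal head can be as simple as a logistic regression over the frozen embeddings. This sketch uses scikit-learn with synthetic vectors standing in for real foundation-model output; the cluster means and sizes are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen foundation-model embeddings:
# two clusters in a 512-dim space, one per class.
dim = 512
pos = rng.normal(loc=0.5, scale=1.0, size=(40, dim))   # e.g. "has human"
neg = rng.normal(loc=-0.5, scale=1.0, size=(40, dim))  # e.g. "no human"

X = np.vstack([pos, neg])
y = np.array([1] * 40 + [0] * 40)

# The "head": a single linear layer, trained in seconds on CPU.
# The foundation model that produced X never changes.
head = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new embedding.
new_emb = rng.normal(loc=0.5, scale=1.0, size=(1, dim))
print(head.predict_proba(new_emb)[0, 1])  # probability of the positive class
```

The same pattern scales from 40 examples to thousands; only the `.fit()` call gets slower, and never by much.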
Binary heads
The simplest case — two buckets. “Human” vs “no human”. “Safe for work” vs “not safe for work”. You collect examples of each, train a head, and it outputs a score between 0 and 1. Pick a threshold and you have a binary classifier. Binary heads are easy to reason about, easy to evaluate, and often the first thing you build — a single head that filters out content you don’t want before anything else runs.
Multi-bucket heads
When a dimension has more than two states, you add more buckets. Content rating is a natural example — G (general audience), PG (some mild content), R (restricted, adult themes), X (explicit). The buckets are ordered, mutually exclusive, and the set is closed. The head outputs a score per bucket and you take the highest. Multi-bucket heads need more training data (each bucket needs its own examples) but the mechanics are the same.
Stacking heads
Each head handles one dimension. In practice, content runs through several — an explicit content filter, then a genre classifier, then a quality scorer. Each head is trained independently on its own dataset, and they all run against the same embedding. This keeps each head simple and lets them be updated or swapped without affecting the others.
Classification Flow
When a new image arrives:
- Embed — pass the image through the foundation model (e.g. CLIP) to get a vector.
- Classify — feed the vector into the training head. Out come label scores.
- Threshold — accept labels above a confidence threshold, discard the rest.
- Store — write the labels to the content record. Downstream systems (feed ranking, search, recommendations) consume them from there.
That’s it. The embedding step is the expensive part (~10–50ms on GPU). The head itself is a matrix multiply — negligible.
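The four steps above can be stitched together in a few lines. This is a sketch: `embed_image` is a stub for the foundation-model call, the head weights are a toy stand-in, and the 0.6 threshold is an assumed value you would tune per head:

```python
import numpy as np

THRESHOLD = 0.6  # assumed confidence cutoff; tune per head

def embed_image(image_bytes):
    # Stub for the expensive foundation-model step (~10-50ms on GPU).
    # A real implementation would call CLIP/SigLIP here.
    rng = np.random.default_rng(len(image_bytes))
    return rng.normal(size=512)

def run_head(embedding, weights, bias, labels):
    # The head is a matrix multiply plus softmax: negligible cost.
    logits = weights @ embedding + bias
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return dict(zip(labels, probs))

def classify(image_bytes, weights, bias, labels):
    embedding = embed_image(image_bytes)                  # 1. embed
    scores = run_head(embedding, weights, bias, labels)   # 2. classify
    kept = {l: float(p) for l, p in scores.items()
            if p >= THRESHOLD}                            # 3. threshold
    return {"labels": kept, "embedding": embedding}       # 4. store both

labels = ["G", "PG", "R", "X"]
weights = np.zeros((4, 512))          # toy head: weights do nothing,
bias = np.array([5.0, 0.0, 0.0, 0.0]) # bias pushes everything toward G
record = classify(b"fake image bytes", weights, bias, labels)
print(record["labels"])
```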
One architectural detail worth getting right early: store the embedding alongside the labels. If you persist the vector (in pgvector, a vector DB, or even a parquet file), you can run new heads against existing embeddings without touching a GPU again. That’s what makes head swapping, A/B testing, and retraining cheap — the expensive embedding step only happens once per image.
This also lets you split the pipeline — GPU machines extract embeddings, a queue moves them, and CPU machines run the heads, which need no GPU.
Training Data Requirements
Source images are required for training. Minimum ~20 images per bucket for a niche classification head, scaling up to hundreds per bucket for production quality. Small datasets work fine when the distinction is clear and specific — “contains a car” vs “doesn’t contain a car” needs fewer examples than “luxury interior” vs “budget interior” where the boundary is subjective.
Two things matter equally: clean labels and diversity. At small dataset sizes, even a few mislabeled images can push the head in the wrong direction — it learns the noise. And 100 near-identical images of the same thing teach nothing useful. Each bucket needs variety: different angles, lighting, compositions, contexts.
The negative bucket deserves special attention. If you’re training a “has human” head, the “no human” bucket shouldn’t be all landscapes — it should contain the full range of content your app actually sees. Animals, objects, interiors, text screenshots, abstract art. If the negative bucket is too narrow, the head doesn’t learn “no human” — it learns “landscape vs human”, and everything else becomes a coin flip.
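A common layout is one directory per bucket, with the negative bucket deliberately varied. The bucket and file names below are illustrative:

```
training-data/
├── has_human/
│   ├── portrait_01.jpg
│   ├── crowd_02.jpg
│   └── partial_face_03.jpg
└── no_human/
    ├── landscape_01.jpg
    ├── cat_02.jpg
    ├── interior_03.jpg
    ├── screenshot_04.jpg
    └── abstract_05.jpg
```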
Keep the source images around even after extracting embeddings. You’ll need them to visually review misclassifications, to retrain when you add new buckets, and to re-embed your entire dataset if you ever swap the foundation model.
Class imbalance is worth watching. If you have 500 G-rated images but only 30 X-rated ones, the head will bias toward G — it’s seen it more, so it defaults there. Balance the buckets by collecting more data for underrepresented ones, or use class weights during training to compensate — this tells the model to weight rare buckets more heavily without duplicating data.
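In scikit-learn, class weighting is one flag. A sketch on synthetic imbalanced embeddings (cluster shapes and sizes invented for illustration): `class_weight="balanced"` reweights each class inversely to its frequency, so the rare bucket counts as much as the common one during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Imbalanced toy embeddings: 500 majority vs 30 minority examples.
major = rng.normal(loc=-0.2, scale=1.0, size=(500, 32))
minor = rng.normal(loc=0.2, scale=1.0, size=(30, 32))
X = np.vstack([major, minor])
y = np.array([0] * 500 + [1] * 30)

# "balanced" upweights the 30 rare examples so the head
# doesn't just default to the majority bucket.
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
plain = LogisticRegression(max_iter=1000).fit(X, y)

# Fraction of rare-bucket examples each head recovers.
rare_recall_balanced = balanced.predict(minor).mean()
rare_recall_plain = plain.predict(minor).mean()
print(rare_recall_balanced, rare_recall_plain)
```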
Evaluation
A head that trains without errors isn’t necessarily a head that works. Run it against images you haven’t trained on and look at what it gets wrong.
Start with a visual review — scroll through the misclassifications and look at the actual images. You’ll see why the head is confused: an ambiguous image, a weird crop, a category you didn’t account for. For a binary head or a small test set, this is usually enough. You get more signal from seeing 10 wrong predictions than from any metric.
When the numbers get too big to scan — many buckets, hundreds of test images — a confusion matrix helps. Set aside a batch of hand-labeled images that weren’t used for training. Run the head against them and compare predictions to your labels:
| | Predicted G | Predicted PG | Predicted R | Predicted X |
|---|---|---|---|---|
| Actual G | 92 | 7 | 1 | 0 |
| Actual PG | 5 | 81 | 13 | 1 |
| Actual R | 0 | 11 | 83 | 6 |
| Actual X | 0 | 1 | 8 | 91 |
Read it row by row. The G row: 92 correctly labeled G, 7 misclassified as PG, 1 as R. The off-diagonal numbers show which buckets bleed into each other. Here, PG and R overlap the most — that’s where you need more or better training data.
To build one: label a separate batch of images by hand (same directory-per-bucket structure as training data, just kept aside), run your head against each image, and compare. scikit-learn and Cleanlab both handle the comparison and reporting.
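With scikit-learn the comparison is one call. The labels below are hypothetical hand-labels and head predictions for a small held-out batch:

```python
from sklearn.metrics import confusion_matrix

# Hand-labels for a held-out batch vs the head's predictions.
actual    = ["G", "G", "PG", "PG", "R", "R", "X", "X"]
predicted = ["G", "PG", "PG", "R", "R", "R", "X", "X"]

labels = ["G", "PG", "R", "X"]  # fixes row/column order
cm = confusion_matrix(actual, predicted, labels=labels)

# Rows are actual buckets, columns predicted; off-diagonal
# cells show which buckets bleed into each other.
print(cm)
```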
The Feedback Loop
Evaluation isn’t a one-time step — it’s the start of a cycle. A user flags a misclassification. That correction gets stored. The next dataset export includes it. The head gets retrained. The new head is better in exactly the region where it was wrong. This is active learning, and it’s what makes classification systems improve over time rather than degrade. The tighter this loop, the faster the head converges on your actual content.
Zero-shot and trained heads aren’t a binary choice — they’re a progression. Start with zero-shot to get labels immediately, no training needed. Collect human corrections over time. Once there’s enough labeled data, train a head and deploy it alongside the zero-shot fallback. The trained head takes priority where it exists; zero-shot covers dimensions that haven’t been trained for yet.
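That progression can be expressed as a simple dispatch: use the trained head where one exists, otherwise fall back to zero-shot. Both classifier functions here are stand-ins:

```python
def classify_dimension(embedding, dimension, trained_heads, zero_shot_fn):
    # Prefer the trained head where one exists; fall back to zero-shot
    # for dimensions that haven't accumulated training data yet.
    head = trained_heads.get(dimension)
    if head is not None:
        return head(embedding)
    return zero_shot_fn(embedding, dimension)

# Stand-in classifiers for illustration.
trained_heads = {"has_human": lambda emb: "positive"}
zero_shot = lambda emb, dim: f"zero-shot guess for {dim}"

a = classify_dimension([0.1], "has_human", trained_heads, zero_shot)
b = classify_dimension([0.1], "content_rating", trained_heads, zero_shot)
print(a, b)
```

As corrections accumulate for a dimension, you train a head for it and add it to the mapping; nothing else in the pipeline changes.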
Where Training Data Lives
Inside the app
When training images are already present in the app (user uploads, catalog, etc.), training can be tightly coupled to the application. The app serves as both the source of truth for content and the training data store. This simplifies the pipeline — no syncing, no external storage, labels can reference app entities directly.
Retraining can trigger from app events — a batch of new uploads, a moderation decision, a user-reported misclassification. The feedback loop is tight: bad label spotted, training data updated, head retrained, all within the same system.
The downside is data lifecycle coupling. When images get deleted or moderated out, training data changes under you. A retrain might produce different results just because the dataset shifted. You need to handle this explicitly — snapshot training sets, or accept that the head drifts with the data.
Outside the app
When source images shouldn’t persist in the app (e.g. moderated content, licensed samples, ephemeral data), an external filesystem-driven approach is preferred. Training data lives outside the app, organized by directory structure.
This also applies outside of app development entirely. Standalone research — exploring a dataset, prototyping a classifier, benchmarking embeddings — benefits from the same workflow. You don’t need an app to iterate on training data; a filesystem and a CLI are enough.
Retraining is a deliberate step — you curate a dataset, run a command, get a head. Nothing changes until you decide it should. The dataset is stable, versioned, and doesn’t depend on what the app does with its content.
Tooling
The ecosystem splits into three layers: dataset management, training, and managed platforms.
For dataset management — inspecting, cleaning, and curating images before training — FiftyOne gives you a visual grid to spot mislabeled or duplicate images. Label Studio handles annotation when you need humans labeling from scratch. DVC versions your datasets alongside git. Git LFS keeps large files (images, model checkpoints) out of your git history — it stores lightweight pointers in the repo while the actual binaries live on a remote server.
For training, it depends on how much control you want. fastai gets you from a directory of images to a trained model in a few lines of Python. PyTorch’s ImageFolder gives full control if you want to write your own training loop. If you’ve already extracted embeddings, scikit-learn can train a classifier on the vectors with no GPU at all.
Headmaster sits between these — a CLI tool that handles both the embedding and training steps. Point it at a directory of images organized by label, pick a foundation model, and it trains a classification head. No custom code, no training loop, no configuration. The shortest path from “folders of images” to “working classifier.”
For teams that want a fully managed pipeline — labeling, training, hosting, API — Roboflow, Hugging Face AutoTrain, and Google Vertex AI handle everything in the cloud.