Classifying Content with Foundation Models
Premise
Apps like Pinterest, Spotify, Instagram, and X use classification to label content and serve it better — recommendations, feeds, discovery all depend on it. Images, music, text — each modality has its own pipeline, but the underlying mechanics are similar.
Images are the easiest modality to inspect by eye, so we’ll use them as the primary example — though the same principles apply to text, audio, and anything else you can embed.
Two ways to classify from an embedding: zero-shot or a trained head
Foundation Models
CLIP is the most popular but not the only option with a shared embedding space:
- SigLIP — drop-in alternative to CLIP, increasingly the go-to choice. Better embeddings, same API.
- MetaCLIP — same CLIP architecture, just better training data curation.
- EVA-CLIP — same architecture with improved training recipes.
- ALIGN — Google’s take on the same dual-encoder idea, trained on noisy alt-text at scale. The original weights were never fully open-sourced; some reproductions exist.
These all work the same way — encode image and text into the same vector space, compare with cosine similarity. They’re all fast, batchable, and work for both zero-shot and as embedding backbones for training heads.
One caveat: if you swap the foundation model (say, CLIP for SigLIP), every stored embedding is invalidated. You re-embed your entire corpus. Choose your backbone carefully — it’s a commitment.
Zero-Shot Classification
Models like CLIP can classify images against arbitrary text labels with no training data. Provide candidate labels (“cat”, “dog”, “car”), the model embeds both the image and each label into the same vector space, and the highest cosine similarity wins. This works well for broad categories — “human” vs “no human” is reliable for obvious cases, though it falls apart at the edges (a tiny face on a badge, a statue, a cartoon character).
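The mechanics reduce to cosine similarity in a shared vector space. A minimal sketch (the vectors here are toy stand-ins for what CLIP’s image and text encoders would actually produce):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot(image_emb, label_embs):
    # Score the image against every candidate label; highest similarity wins.
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for CLIP encoder outputs.
image_emb = np.array([0.9, 0.1, 0.2])
label_embs = {
    "cat": np.array([0.8, 0.2, 0.1]),
    "dog": np.array([0.1, 0.9, 0.3]),
    "car": np.array([0.2, 0.1, 0.9]),
}

best, scores = zero_shot(image_emb, label_embs)
print(best, scores)
```

In a real pipeline the label embeddings come from encoding prompt strings ("a photo of a cat") through the model’s text tower, and are computed once per label, not per image.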
Where it breaks down: content rating (G vs PG vs R vs X). The boundary between PG and R is culturally defined, context-dependent, and not something CLIP learned from alt-text on the internet. Zero-shot has no way to learn a specific threshold.
Here’s a real example — a golden retriever and a gray cat cuddling, classified with CLIP (openai/clip-vit-base-patch32). Source script. This image is tricky for zero-shot because the dog dominates the frame — CLIP embeds the whole scene at once and the cat gets lost in the dog’s visual weight. Each prompt is a binary contest: positive label vs negative label, highest similarity wins.
Round 1 — simple positive/negative framing:
| Positive label | Negative label | Result | Correct? |
|---|---|---|---|
| contains a cat: positive | contains a cat: negative | positive (0.654) | barely |
| contains only a cat: positive | contains only a cat: negative | positive (0.744) | no |
| pictures a family: positive | pictures a family: negative | positive (0.769) | debatable |
Round 2 — descriptive labels:
| Positive label | Negative label | Result | Correct? |
|---|---|---|---|
| a photo containing a cat | a photo with no cat | negative (0.966) | no |
| a photo of only a cat | a photo of a cat with other animals | negative (0.957) | yes |
| a family portrait photo | a photo that is not a family portrait | positive (0.834) | debatable |
Better prompts fixed “only a cat” but broke “contains a cat.” CLIP does scene matching, not object detection — it compares the whole image holistically, and the negative label shapes the result as much as the positive. Round 1 gave low-confidence wrong answers; round 2 gave high-confidence wrong answers. There’s no prompt-engineering around this.
A trained head would get this right. The cat’s features are captured in the embedding — the head just learns to look for them from labeled examples, regardless of what else is in the frame or how a prompt is worded.
What Is a Training Head
A foundation model (CLIP, ViT, etc.) produces embeddings — dense vector representations of content. An embedding captures what an image “is about” as a list of numbers, where similar images end up with similar numbers. A training head is a small neural network layer trained on top of those embeddings for a specific classification task. The foundation model stays frozen; only the head is trained.
Why this matters: you get the representation power of a model trained on billions of examples, but you only need a small labeled dataset (20–100+ images per category) to teach it your specific taxonomy. Training is fast (minutes, not days), cheap, and the head is tiny — easy to version, swap, or A/B test independently of the base model.
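A minimal head can be as simple as a logistic regression over the frozen embeddings. This sketch uses scikit-learn with synthetic vectors standing in for real foundation-model output; the cluster means and sizes are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen foundation-model embeddings:
# two clusters in a 512-dim space, one per class.
dim = 512
pos = rng.normal(loc=0.5, scale=1.0, size=(40, dim))   # e.g. "has human"
neg = rng.normal(loc=-0.5, scale=1.0, size=(40, dim))  # e.g. "no human"

X = np.vstack([pos, neg])
y = np.array([1] * 40 + [0] * 40)

# The "head": a single linear layer, trained in seconds on CPU.
# The foundation model that produced X never changes.
head = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new embedding.
new_emb = rng.normal(loc=0.5, scale=1.0, size=(1, dim))
print(head.predict_proba(new_emb)[0, 1])  # probability of the positive class
```

The same pattern scales from 40 examples to thousands; only the `.fit()` call gets slower, and never by much.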
Binary heads
The simplest case — two buckets. “Human” vs “no human”. “Safe for work” vs “not safe for work”. You collect examples of each, train a head, and it outputs a score between 0 and 1. Pick a threshold and you have a binary classifier. Binary heads are easy to reason about, easy to evaluate, and often the first thing you build — a single head that filters out content you don’t want before anything else runs.
Multi-bucket heads
When a dimension has more than two states, you add more buckets. Content rating is a natural example — G (general audience), PG (some mild content), R (restricted, adult themes), X (explicit). The buckets are ordered, mutually exclusive, and the set is closed. The head outputs a score per bucket and you take the highest. Multi-bucket heads need more training data (each bucket needs its own examples) but the mechanics are the same.
Stacking heads
Each head handles one dimension. In practice, content runs through several — an explicit content filter, then a genre classifier, then a quality scorer. Each head is trained independently on its own dataset, and they all run against the same embedding. This keeps each head simple and lets them be updated or swapped without affecting the others.
Classification Flow
When a new image arrives:
- Embed — pass the image through the foundation model (e.g. CLIP) to get a vector.
- Classify — feed the vector into the training head. Out come label scores.
- Threshold — accept labels above a confidence threshold, discard the rest.
- Store — write the labels to the content record. Downstream systems (feed ranking, search, recommendations) consume them from there.
That’s it. The embedding step is the expensive part (~10–50ms on GPU). The head itself is a matrix multiply — negligible.
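The four steps above can be stitched together in a few lines. This is a sketch: `embed_image` is a stub for the foundation-model call, the head weights are a toy stand-in, and the 0.6 threshold is an assumed value you would tune per head:

```python
import numpy as np

THRESHOLD = 0.6  # assumed confidence cutoff; tune per head

def embed_image(image_bytes):
    # Stub for the expensive foundation-model step (~10-50ms on GPU).
    # A real implementation would call CLIP/SigLIP here.
    rng = np.random.default_rng(len(image_bytes))
    return rng.normal(size=512)

def run_head(embedding, weights, bias, labels):
    # The head is a matrix multiply plus softmax: negligible cost.
    logits = weights @ embedding + bias
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return dict(zip(labels, probs))

def classify(image_bytes, weights, bias, labels):
    embedding = embed_image(image_bytes)                  # 1. embed
    scores = run_head(embedding, weights, bias, labels)   # 2. classify
    kept = {l: float(p) for l, p in scores.items()
            if p >= THRESHOLD}                            # 3. threshold
    return {"labels": kept, "embedding": embedding}       # 4. store both

labels = ["G", "PG", "R", "X"]
weights = np.zeros((4, 512))          # toy head: weights do nothing,
bias = np.array([5.0, 0.0, 0.0, 0.0]) # bias pushes everything toward G
record = classify(b"fake image bytes", weights, bias, labels)
print(record["labels"])
```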
One architectural detail worth getting right early: store the embedding alongside the labels. If you persist the vector (in pgvector, a vector DB, or even a parquet file), you can run new heads against existing embeddings without touching a GPU again. That’s what makes head swapping, A/B testing, and retraining cheap — the expensive embedding step only happens once per image.
This also lets you split the pipeline — GPU machines extract embeddings, a queue moves them, and CPU machines run the heads, which need no GPU.
Training Data Requirements
Source images are required for training. Minimum ~20 images per bucket for a niche classification head, scaling up to hundreds per bucket for production quality. Small datasets work fine when the distinction is clear and specific — “contains a car” vs “doesn’t contain a car” needs fewer examples than “luxury interior” vs “budget interior” where the boundary is subjective.
Two things matter equally: clean labels and diversity. At small dataset sizes, even a few mislabeled images can push the head in the wrong direction — it learns the noise. And 100 near-identical images of the same thing teach nothing useful. Each bucket needs variety: different angles, lighting, compositions, contexts.
The negative bucket deserves special attention. If you’re training a “has human” head, the “no human” bucket shouldn’t be all landscapes — it should contain the full range of content your app actually sees. Animals, objects, interiors, text screenshots, abstract art. If the negative bucket is too narrow, the head doesn’t learn “no human” — it learns “landscape vs human”, and everything else becomes a coin flip.
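A common layout is one directory per bucket, with the negative bucket deliberately varied. The bucket and file names below are illustrative:

```
training-data/
├── has_human/
│   ├── portrait_01.jpg
│   ├── crowd_02.jpg
│   └── partial_face_03.jpg
└── no_human/
    ├── landscape_01.jpg
    ├── cat_02.jpg
    ├── interior_03.jpg
    ├── screenshot_04.jpg
    └── abstract_05.jpg
```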
Keep the source images around even after extracting embeddings. You’ll need them to visually review misclassifications, to retrain when you add new buckets, and to re-embed your entire dataset if you ever swap the foundation model.
Class imbalance is worth watching. If you have 500 G-rated images but only 30 X-rated ones, the head will bias toward G — it’s seen it more, so it defaults there. Balance the buckets by collecting more data for underrepresented ones, or use class weights during training to compensate — this tells the model to weight rare buckets more heavily without duplicating data.
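In scikit-learn, class weighting is one flag. A sketch on synthetic imbalanced embeddings (cluster shapes and sizes invented for illustration): `class_weight="balanced"` reweights each class inversely to its frequency, so the rare bucket counts as much as the common one during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Imbalanced toy embeddings: 500 majority vs 30 minority examples.
major = rng.normal(loc=-0.2, scale=1.0, size=(500, 32))
minor = rng.normal(loc=0.2, scale=1.0, size=(30, 32))
X = np.vstack([major, minor])
y = np.array([0] * 500 + [1] * 30)

# "balanced" upweights the 30 rare examples so the head
# doesn't just default to the majority bucket.
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
plain = LogisticRegression(max_iter=1000).fit(X, y)

# Fraction of rare-bucket examples each head recovers.
rare_recall_balanced = balanced.predict(minor).mean()
rare_recall_plain = plain.predict(minor).mean()
print(rare_recall_balanced, rare_recall_plain)
```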
Evaluation
A head that trains without errors isn’t necessarily a head that works. Run it against images you haven’t trained on and look at what it gets wrong.
Start with a visual review — scroll through the misclassifications and look at the actual images. You’ll see why the head is confused: an ambiguous image, a weird crop, a category you didn’t account for. For a binary head or a small test set, this is usually enough. You get more signal from seeing 10 wrong predictions than from any metric.
When the numbers get too big to scan — many buckets, hundreds of test images — a confusion matrix helps. Set aside a batch of hand-labeled images that weren’t used for training. Run the head against them and compare predictions to your labels:
| | Predicted G | Predicted PG | Predicted R | Predicted X |
|---|---|---|---|---|
| Actual G | 92 | 7 | 1 | 0 |
| Actual PG | 5 | 81 | 13 | 1 |
| Actual R | 0 | 11 | 83 | 6 |
| Actual X | 0 | 1 | 8 | 91 |
Read it row by row. The G row: 92 correctly labeled G, 7 misclassified as PG, 1 as R. The off-diagonal numbers show which buckets bleed into each other. Here, PG and R overlap the most — that’s where you need more or better training data.
To build one: label a separate batch of images by hand (same directory-per-bucket structure as training data, just kept aside), run your head against each image, and compare. scikit-learn and Cleanlab both handle the comparison and reporting.
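With scikit-learn the comparison is one call. The labels below are hypothetical hand-labels and head predictions for a small held-out batch:

```python
from sklearn.metrics import confusion_matrix

# Hand-labels for a held-out batch vs the head's predictions.
actual    = ["G", "G", "PG", "PG", "R", "R", "X", "X"]
predicted = ["G", "PG", "PG", "R", "R", "R", "X", "X"]

labels = ["G", "PG", "R", "X"]  # fixes row/column order
cm = confusion_matrix(actual, predicted, labels=labels)

# Rows are actual buckets, columns predicted; off-diagonal
# cells show which buckets bleed into each other.
print(cm)
```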
The Feedback Loop
Evaluation isn’t a one-time step — it’s the start of a cycle. A user flags a misclassification. That correction gets stored. The next dataset export includes it. The head gets retrained. The new head is better in exactly the region where it was wrong. This is active learning, and it’s what makes classification systems improve over time rather than degrade. The tighter this loop, the faster the head converges on your actual content.
Zero-shot and trained heads aren’t a binary choice — they’re a progression. Start with zero-shot to get labels immediately, no training needed. Collect human corrections over time. Once there’s enough labeled data, train a head and deploy it alongside the zero-shot fallback. The trained head takes priority where it exists; zero-shot covers dimensions that haven’t been trained for yet.
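That progression can be expressed as a simple dispatch: use the trained head where one exists, otherwise fall back to zero-shot. Both classifier functions here are stand-ins:

```python
def classify_dimension(embedding, dimension, trained_heads, zero_shot_fn):
    # Prefer the trained head where one exists; fall back to zero-shot
    # for dimensions that haven't accumulated training data yet.
    head = trained_heads.get(dimension)
    if head is not None:
        return head(embedding)
    return zero_shot_fn(embedding, dimension)

# Stand-in classifiers for illustration.
trained_heads = {"has_human": lambda emb: "positive"}
zero_shot = lambda emb, dim: f"zero-shot guess for {dim}"

a = classify_dimension([0.1], "has_human", trained_heads, zero_shot)
b = classify_dimension([0.1], "content_rating", trained_heads, zero_shot)
print(a, b)
```

As corrections accumulate for a dimension, you train a head for it and add it to the mapping; nothing else in the pipeline changes.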
Where Training Data Lives
Inside the app
When training images are already present in the app (user uploads, catalog, etc.), training can be tightly coupled to the application. The app serves as both the source of truth for content and the training data store. This simplifies the pipeline — no syncing, no external storage, labels can reference app entities directly.
Retraining can trigger from app events — a batch of new uploads, a moderation decision, a user-reported misclassification. The feedback loop is tight: bad label spotted, training data updated, head retrained, all within the same system.
The downside is data lifecycle coupling. When images get deleted or moderated out, training data changes under you. A retrain might produce different results just because the dataset shifted. You need to handle this explicitly — snapshot training sets, or accept that the head drifts with the data.
Outside the app
When source images shouldn’t persist in the app (e.g. moderated content, licensed samples, ephemeral data), an external filesystem-driven approach is preferred. Training data lives outside the app, organized by directory structure.
This also applies outside of app development entirely. Standalone research — exploring a dataset, prototyping a classifier, benchmarking embeddings — benefits from the same workflow. You don’t need an app to iterate on training data; a filesystem and a CLI are enough.
Retraining is a deliberate step — you curate a dataset, run a command, get a head. Nothing changes until you decide it should. The dataset is stable, versioned, and doesn’t depend on what the app does with its content.
Tooling
The ecosystem splits into three layers: dataset management, training, and managed platforms.
For dataset management — inspecting, cleaning, and curating images before training — FiftyOne gives you a visual grid to spot mislabeled or duplicate images. Label Studio handles annotation when you need humans labeling from scratch. DVC versions your datasets alongside git. Git LFS keeps large files (images, model checkpoints) out of your git history — it stores lightweight pointers in the repo while the actual binaries live on a remote server.
For training, it depends on how much control you want. fastai gets you from a directory of images to a trained model in a few lines of Python. PyTorch’s ImageFolder gives full control if you want to write your own training loop. If you’ve already extracted embeddings, scikit-learn can train a classifier on the vectors with no GPU at all.
Headmaster sits between these — a CLI tool that handles both the embedding and training steps. Point it at a directory of images organized by label, pick a foundation model, and it trains a classification head. No custom code, no training loop, no configuration. The shortest path from “folders of images” to “working classifier.”
For teams that want a fully managed pipeline — labeling, training, hosting, API — Roboflow, Hugging Face AutoTrain, and Google Vertex AI handle everything in the cloud.