Face Recognition with PyTorch: Building AI Systems Instead of Just Using Them

The conversation around AI development tends to focus on a narrow slice of the work: prompting an LLM, calling someone else’s API, integrating a hosted model into your application. That’s a real and useful skill set. But there’s another kind of AI work that gets less attention—building systems that actually run the models, training them on your data, and shipping them as part of your product.

That’s the conversation I wanted to have at the Sin City Programmers meetup in April 2026: not how to use AI through somebody else’s service, but how to build your own AI capability from the ground up using PyTorch and a working face recognition demo.

Video: Sin City Programmers Meetup at Tech Alley Vegas - April 2026

The full codebase walked through in the session is open source and available at gitlab.com/kusog-opensource/pytorch-image. It’s less than 1,500 lines of Python total, runs on Windows, macOS, or Linux, and works with any standard webcam. You can clone it and have it running locally in a few minutes.

This post captures the methodology and the architectural choices behind that demo—why it’s structured the way it is, where the design tradeoffs live, and how the same patterns generalize beyond face recognition to the broader landscape of building AI-amplified systems.


The Fundamental Distinction

Most developers interacting with AI today are calling someone else’s hosted service. You send a prompt to OpenAI or Anthropic, you get a response back. That’s a perfectly valid layer to operate at, and for many applications it’s the right choice.

But it’s not the only layer. Underneath those services are the actual models—neural networks running on hardware, processing inputs, producing outputs. When you build at that layer, you control more: the model itself, the training data, the deployment cost, the latency, the privacy posture. You also take on more: managing GPU memory, structuring training pipelines, handling the operational complexity of running models locally.

The face recognition demo sits squarely in the second category. No external API calls. No hosted services. Everything runs on the local machine—webcam capture, face detection, identity recognition, and live retraining all happening on a laptop CPU.

The point isn’t that local is always better than hosted. The point is that knowing how to build at this layer opens up a class of problems the API-only approach can’t solve.


Two Models, Two Problems

The first architectural decision is recognizing that “face recognition” is actually two problems stacked on top of each other.

Face detection answers “where are the faces in this image?” It takes a full video frame and returns bounding box coordinates around every face it finds.

Face identification answers “whose face is this?” It takes a tightly cropped face image and returns a name (or “unknown”).

These run in sequence. Detect first, then feed each cropped face into identification. The codebase uses MTCNN for detection and InceptionResnetV1 (pretrained on the VGGFace2 dataset) for identification.

This is the two-stage detector pattern—the same architectural shape behind Faster R-CNN, license plate recognition systems, and retail shelf analytics. One model finds candidate regions. A second specialized model identifies what’s in each region.
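Here’s a minimal sketch of that detect → crop → identify sequence using the facenet-pytorch library, which provides both models. The setup and file name are illustrative, not the repo’s actual code:

```python
# Two-stage pipeline sketch with facenet-pytorch (pip install facenet-pytorch).
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(keep_all=True)                              # stage 1: find all faces
resnet = InceptionResnetV1(pretrained='vggface2').eval()  # stage 2: identify each one

frame = Image.open('frame.jpg')  # stand-in for a webcam frame

# Stage 1: detection returns bounding boxes plus aligned 160x160 face crops
boxes, probs = mtcnn.detect(frame)
faces = mtcnn(frame)  # tensor of shape (n, 3, 160, 160), or None if no faces

# Stage 2: each cropped face goes through the recognizer independently
if faces is not None:
    with torch.no_grad():
        embeddings = resnet(faces)  # (n, 512) identity embeddings
```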

It’s worth being honest about the model choices: MTCNN and InceptionResnetV1 are 2016-vintage. Fresh builds today would reach for RetinaFace or SCRFD on the detection side, and ArcFace or AdaFace on the recognition side. The models have moved on; the pattern hasn’t. Detect → crop → specialized recognizer is still how this class of system is built in 2026.

The pattern persists for one reason in particular: only the second stage needs retraining when the set of “whats” changes. The detector knows “this is a face” once and forever. Adding a new person doesn’t touch it. That property is what makes live learning possible without retraining the whole system.


Two Ways to Recognize a Face

The next architectural decision is how recognition itself works. There are two reasonable approaches, and the demo uses both.

Classifier Head

A classifier replaces the final layer of the recognition network with one that outputs N slots—one per person you care about. You train it with labeled images, and it produces a probability distribution over your known people. This is the standard supervised learning shape: feed it a face, get back a softmax probability for each known identity.

Strengths: precise when confident, and the softmax probability gives you a calibrated confidence value you can threshold against.

Weaknesses: adding a new person requires retraining—at minimum the final layer, and ideally a fine-tuning pass over the whole network.
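As a sketch of the shape (not the repo’s exact code), facenet-pytorch lets you attach a classification head directly; `num_people` here is an illustrative stand-in:

```python
# Classifier-head sketch: classify=True swaps the embedding output for
# an N-way classification layer, one slot per known person.
import torch
from facenet_pytorch import InceptionResnetV1

num_people = 5  # illustrative
model = InceptionResnetV1(pretrained='vggface2', classify=True,
                          num_classes=num_people).eval()

face = torch.randn(1, 3, 160, 160)  # stand-in for a cropped, normalized face
with torch.no_grad():
    logits = model(face)                   # (1, num_people)
    probs = torch.softmax(logits, dim=1)   # confidence values to threshold against
```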

Embedding + Cosine Similarity

The other approach treats the network not as a classifier but as a feature extractor. You strip off the final classification layer and use the 512-dimensional vector that comes out of the penultimate layer as an “embedding” of the face. Faces of the same person produce similar embeddings; faces of different people produce dissimilar ones.

To recognize someone, you compute the embedding of their face, then compare it via cosine similarity to a database of stored embeddings (one prototype per known person, computed by averaging that person’s training photos).

Strengths: adding a new person is instant. Compute their embedding once, drop it in the database, done. No retraining required.

Weaknesses: all the quality comes from the pretrained embedding network. If it doesn’t separate identities well, you have no dial to turn.
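A minimal sketch of the embedding path, assuming a `prototypes` dict that maps each name to an averaged 512-dimensional embedding:

```python
# Embedding-database sketch: enrollment is an average, recognition is a
# cosine-similarity search with a threshold (0.7, matching the post).
import torch
import torch.nn.functional as F

def enroll(name, embeddings, prototypes):
    """Average a person's face embeddings into one prototype. No retraining."""
    prototypes[name] = torch.stack(embeddings).mean(dim=0)

def identify(embedding, prototypes, threshold=0.7):
    """Return the closest known identity, or 'unknown' below the threshold."""
    best_name, best_sim = 'unknown', threshold
    for name, proto in prototypes.items():
        sim = F.cosine_similarity(embedding, proto, dim=0).item()
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim
```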

The Hybrid Rule

The demo uses both. The video loop runs the classifier first. If it’s confident (softmax probability above 0.7) and the prediction isn’t “unknown,” that’s the answer. Otherwise, it falls back to the embedding database and returns the closest match if cosine similarity exceeds 0.7.

This is the trick that makes live learning feel instant: when you teach the system a new face, the embedding database updates immediately and recognition starts working on the very next frame. The classifier fine-tune happens in a background thread and atomically swaps in once it’s done. You get the precision of a trained classifier and the instant enrollment of an embedding system.
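Sketched roughly, reusing the `identify()` helper from the embedding sketch above; `classifier`, `embedder`, `prototypes`, and `names` are assumed stand-ins, and the thresholds follow the post:

```python
# Hybrid rule sketch: confident classifier wins, embedding database is the fallback.
import torch

def recognize(face, classifier, embedder, prototypes, names,
              clf_threshold=0.7, sim_threshold=0.7):
    with torch.no_grad():
        probs = torch.softmax(classifier(face.unsqueeze(0)), dim=1)[0]
        conf, idx = probs.max(dim=0)
        idx = int(idx)
        if conf.item() >= clf_threshold and names[idx] != 'unknown':
            return names[idx]                       # confident classifier answer
        embedding = embedder(face.unsqueeze(0))[0]  # fall back to embeddings
    name, _ = identify(embedding, prototypes, threshold=sim_threshold)
    return name
```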

If you’ve ever worked with Retrieval-Augmented Generation, the embedding path will feel familiar. It’s the same encode → index → cosine similarity → threshold pattern used to retrieve document chunks for an LLM. Same shape, different modality.


The Live Learning Flow

The piece of the demo that surprises people most is what happens when you click on an unknown face.

  1. Click a face in the video window
  2. Type a name in the terminal
  3. The system auto-captures ~50 face crops over the next couple of seconds, tracking that specific face across frames using embedding similarity
  4. The embedding database rebuilds immediately—recognition works right now
  5. A background thread starts fine-tuning the classifier on the new images
  6. Once training completes, the new classifier atomically swaps into the running video loop—no restart

The video keeps running smoothly through all of it. Detection, recognition, classifier fine-tuning, and the live video display happen concurrently on the same machine.

A small architectural detail makes this work: the classifier fine-tune happens on CPU, even on machines with a GPU available. Why? Because the GPU is already busy processing video frames in real time. Letting the heavy training run pull GPU cycles would cause the video to stutter. Fine-tuning on CPU is slower in absolute terms, but it’s running in the background—there’s no UX cost to it taking a few minutes.
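A rough sketch of that arrangement; `fine_tune()` is a hypothetical stand-in for the actual training call:

```python
# Background fine-tune sketch: train a CPU copy, then swap it in with a
# single reference assignment so the video loop never sees a half-loaded model.
import copy
import threading

class Recognizer:
    """The video loop reads self.classifier on every frame."""

    def __init__(self, classifier, infer_device='cpu'):
        self.classifier = classifier
        self.infer_device = infer_device

    def learn_person(self, images, labels):
        def worker():
            # Train on a CPU copy so the GPU stays free for live video
            candidate = copy.deepcopy(self.classifier).to('cpu')
            fine_tune(candidate, images, labels)  # hypothetical trainer
            # Atomic swap: one reference assignment, no restart required
            self.classifier = candidate.to(self.infer_device).eval()

        threading.Thread(target=worker, daemon=True).start()
```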

That kind of resource-allocation decision is the sort of thing you control when you build at the model-execution layer. An API-only approach can’t make this trade because it doesn’t see the resources at all.


How Models Actually Learn

A 60-second primer for the audience members who hadn’t trained a model end-to-end before:

Start with pretrained weights. Don’t train from scratch. The pretrained InceptionResnetV1 weights came from training on VGGFace2 (3.3 million images of 9,131 people). That foundation is far better than anything you could bootstrap on a few dozen family photos. This is transfer learning: leverage what the model already knows about faces in general, then specialize it for the specific people you care about.

The training loop. Feed in a batch of face images. The model produces predictions. Compute a loss—a single number summarizing how wrong those predictions are. Compute the gradient—how each weight contributed to that loss. Use an optimizer to nudge each weight in the direction that reduces loss. Repeat.

The codebase uses cross-entropy loss (standard for classification), the Adam optimizer (a sensible default that doesn’t need much tuning), and a learning rate of 0.001 for initial training and 0.0005 for fine-tuning. The fine-tuning rate is deliberately gentler—you don’t want to disturb the pretrained weights too aggressively.

An epoch is one full pass over the training data. You typically run several. As training progresses, the loss should trend downward. The training loop in model.py’s fit() function is the same five-step cycle you’ll see in every PyTorch training script you ever read (a minimal sketch follows the list):

  1. Forward pass: model produces predictions
  2. Compute the loss
  3. loss.backward() to fill in gradients
  4. optimizer.step() to update the weights
  5. optimizer.zero_grad() to clear gradients for the next batch
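Here’s that cycle as a minimal sketch, using the settings described above (cross-entropy loss, Adam, lr=0.001); `model` and `train_loader` are assumed stand-ins for the repo’s actual objects:

```python
# The five-step PyTorch training cycle, in order.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 10  # illustrative

for epoch in range(num_epochs):            # one epoch = one full pass over the data
    for images, labels in train_loader:
        outputs = model(images)            # 1. forward pass: predictions
        loss = criterion(outputs, labels)  # 2. compute the loss
        loss.backward()                    # 3. fill in gradients
        optimizer.step()                   # 4. update the weights
        optimizer.zero_grad()              # 5. clear gradients for the next batch
```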

Recognize that pattern and most PyTorch training code becomes much easier to follow.


Why One Photo Per Person Isn’t Enough

Real-world conditions vary: lighting, angle, expression, background, glasses, hats, hair changes. A model trained on a single photo of someone has learned essentially nothing about that person—it’s memorized one specific image.

The codebase addresses this two ways, used together:

Collect many frames. The click-to-capture flow grabs 50 frames over a couple of seconds. The user can move their head slightly during capture, providing some natural variation.

Data augmentation. During training, the codebase randomly rotates, shifts, brightens, recolors, and adds noise to each image every epoch. The model sees fresh variants of each photo every pass through the training data. This uses the Albumentations library—a fast image augmentation pipeline that complements PyTorch the same way torchvision.transforms does, but with a wider catalog and faster execution.
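A hedged sketch of what such an Albumentations pipeline looks like; the specific transforms and ranges here are illustrative, not the repo’s exact configuration:

```python
# Augmentation pipeline sketch: each call produces a fresh random variant,
# so the model sees different versions of each photo every epoch.
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.Rotate(limit=15, p=0.5),              # random rotation
    A.ShiftScaleRotate(shift_limit=0.05,
                       scale_limit=0.1,
                       rotate_limit=0, p=0.5),  # random shift and scale
    A.RandomBrightnessContrast(p=0.5),      # brightness/contrast jitter
    A.HueSaturationValue(p=0.3),            # recoloring
    A.GaussNoise(p=0.3),                    # sensor-style noise
    A.Normalize(),
    ToTensorV2(),
])

# Applied per image, per epoch:
# augmented = train_transform(image=numpy_image)['image']
```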

The rule of thumb most practitioners eventually internalize: more varied data beats a fancier model almost every time. A simple model trained on diverse data outperforms a sophisticated model trained on narrow data, and it’s not even close.


Architecture Choices Worth Understanding

A few decisions in the codebase are worth understanding because they generalize beyond face recognition:

Two-phase training. The training pipeline runs twice—first with simple transforms (resize, crop, normalize), then again with aggressive augmentations. The model converges on clean data first, then learns to be robust on harder data. Whether this beats mixing augmentations from the start is a legitimate open question.

Device abstraction. A small get_device() function checks for CUDA (NVIDIA GPUs), then MPS (Apple Silicon), then falls back to CPU. The same code runs on Linux GPU servers, MacBooks, and plain CPUs without changes. Eleven lines, but it removes a class of “works on my machine” friction.
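A plausible sketch of that function (not necessarily the repo’s exact code):

```python
# Device fallback chain: NVIDIA CUDA, then Apple Silicon MPS, then CPU.
import torch

def get_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device('cuda')
    if torch.backends.mps.is_available():
        return torch.device('mps')
    return torch.device('cpu')
```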

In-memory caching in datasets. The training dataset caches processed images after the first epoch—face detection and transforms run once, then subsequent epochs reuse the cached crops. Trades memory for speed. Fine for small datasets, worth reconsidering at scale.

Atomic classifier swap. The fine-tuned model replaces the running one with a single reference assignment, so the video loop never sees a half-loaded model. Small detail, important consequence.

Separation of concerns. The package structure (config.py, device.py, data.py, datasets.py, transforms.py, model.py, training.py, face_detection.py, inference.py) is over-decomposed for a 1,500-line demo on purpose. The structure is what would let you lift pieces into a production system without dragging the whole demo with them.


A Specific Pattern: Python as a Child Process

One architectural choice from the talk that generated questions: how does this kind of Python-based AI capability fit into a larger production system written in something else?

The pattern that’s worked well in practice: keep the main service in a language built for service work (C#, Java, Go), and run Python as a child process for the AI workload. The C# or Java service starts the Python process when it needs an inference, pipes the input through standard in, and reads the result from standard out. Streaming responses flow incrementally over standard out as they’re generated. For large outputs like generated images, the Python process writes to a media store and returns a URL.

This isn’t the only valid pattern—running Python as a service via FastAPI or gRPC is a perfectly reasonable alternative, and many teams ship that way successfully. The child-process approach trades some serialization overhead for fault isolation and explicit resource control. If the Python process crashes, the parent service stays up and just restarts the child. You can pin GPU memory per process. You can apply different resource limits to AI workloads than to request-handling workloads.

The deeper point is that the language of your AI workload doesn’t have to be the language of your service. Once you can shell out to a process and pipe data through it, you have enormous flexibility. That same pattern works for Python AI workloads, C++ TensorFlow services, or even shell scripts that generate PDFs.
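For the Python side of that pattern, a minimal sketch might read JSON requests line by line from standard in and write results to standard out; `run_inference()` is a hypothetical stand-in for the actual model call:

```python
# Child-process worker sketch: one JSON request per stdin line,
# one JSON result per stdout line. The parent in C#/Java/Go just
# spawns this script and pipes data through it.
import json
import sys

def main():
    for line in sys.stdin:                        # parent pipes requests in
        request = json.loads(line)
        result = run_inference(request['input'])  # hypothetical model call
        sys.stdout.write(json.dumps({'result': result}) + '\n')
        sys.stdout.flush()                        # flush so the parent sees it immediately

if __name__ == '__main__':
    main()
```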


Honest Caveats

A demo is a demo. A few things this codebase deliberately doesn’t do, which matter if you’re thinking about deploying anything based on it:

No liveness detection. During the live demo, holding up a phone with a photo on its screen was enough to trigger recognition of the person in the photo. That’s a presentation attack, and there’s no defense against it in this code. Production face systems use depth sensors, near-infrared, challenge-response (blink, turn head), or texture analysis to detect spoofing. The relevant subfield is called Presentation Attack Detection, and it’s its own substantial body of work.

No demographic fairness evaluation. Face recognition systems have well-documented accuracy disparities across skin tone, age, and gender. The NIST FRVT studies showed this clearly. The codebase has no built-in evaluation harness for measuring this, and any deployment without one is operating blind on a known failure mode.

No production-hardening. No authentication, no rate limiting, no observability, no error handling beyond what’s needed for the demo to not crash on stage.

These aren’t reasons not to study or learn from the codebase. They’re reasons to be specific about what changes between “demo I learned from” and “system I deploy.” The architecture is sound; the missing pieces are well-known and addressable, but they’re not free.


Why Build at This Layer At All

The honest question worth asking: if hosted services exist and they work, why bother learning this layer?

A few answers:

Cost at scale. Hosted services are great until you’re processing millions of images a day. At that volume, the math on running your own inference shifts dramatically. Knowing how to build your own gives you the option.

Privacy and data residency. Some workloads can’t send images to third-party services—HIPAA workloads, defense workloads, EU data residency requirements. Local inference is the only option, and someone has to build it.

Latency. A round trip to a hosted service is tens to hundreds of milliseconds. For real-time video applications, that’s already too slow. Local inference can be sub-10ms.

Edge deployment. A drone flying a security perimeter, a camera in a remote location, an IoT device with no reliable internet—these can’t depend on hosted services. They run inference on-device or not at all.

Differentiation. If your AI capability is just “we use OpenAI,” that’s not a moat. Building specialized models trained on data only you have is a moat. That requires the lower-layer skills.

This isn’t about replacing API-based development. It’s about adding another layer to your toolkit so you can pick the right one for the work at hand.


The Underlying Truth About All of This

The deeper takeaway from spending an hour walking through face recognition code: it’s all neural nets.

The same architectural primitives—matrix multiplications through layers, gradients flowing backward, weights updating via an optimizer, embeddings as similarity-friendly vector representations—appear identically in:

  • Large language models (text in, text out)
  • Image generation (text in, pixels out)
  • Speech recognition (audio in, text out)
  • Speech synthesis (text in, audio out)
  • Object detection (image in, bounding boxes out)
  • Face recognition (face image in, identity out)

The data shapes change. The training objectives change. The model sizes change. But the underlying mechanism—pushing tensors through layers, computing loss, backpropagating gradients—is the same machinery in every case.

The face recognition codebase is an unusually good vehicle for understanding that machinery, because it’s small enough to read in an afternoon and concrete enough that you can watch it work in real time. Once you’ve seen the same five-step training loop, the same forward() method, the same embedding similarity search in this domain, you’ve seen them everywhere.

That transferability is the real reason to study this kind of code. The face recognition piece is incidental. The architectural fluency is what compounds.


The Bottom Line

AI as a developer skill set isn’t one thing. There’s the API-integration layer, where you’re calling hosted services and orchestrating them into applications. There’s the prompt-engineering and conversational-partnership layer, where you’re working with models as collaborators. And there’s the system-building layer, where you’re running and training the models yourself.

This meetup talk and codebase live at the third layer. Not because that’s the right layer—they’re all valid—but because it’s the layer that’s least talked about and the one that opens up the broadest set of architectural options.

The full repo is at gitlab.com/kusog-opensource/pytorch-image. Clone it, run it, change the augmentation pipeline, swap in a more modern detector, see what breaks. The code is structured to make that kind of experimentation easy—each piece lives in one place, the seams are clean, and the FAQ in the repo walks through the design choices in detail.

Whether you’re working with hosted AI services, conversational AI partnership, or building your own model-execution stack, the principle is the same: understand the layer you’re operating at, and know what the next layer down can do for you when you need it.


This meetup was hosted at Tech Alley Vegas. For more on the methodology of working with AI as a development partner, see my earlier post: AI Partnership Programming: Two Approaches to Staying in the Driver’s Seat
