TensorLab Example engagement

Scaling face detection & recognition from prototype to production

Refactoring an existing system to reduce manual human input, then rebuilding the pipeline to be GPU-accelerated and reliable under production-scale data volumes.

Scope Face detection • Face recognition • Pipeline refactor
Focus Automation • Reliability • Performance
Outcome 2–3× faster analysis via end-to-end GPU utilization

The problem

The client already had a face detection and recognition pipeline, but it behaved like a prototype in production. It required substantial human input to run, degraded under large data volumes, and was CPU-bound, making it slow and difficult to scale.

Symptoms

  • High manual operational overhead
  • Unreliability at scale
  • CPU-only execution causing throughput bottlenecks

Production risk

  • Unpredictable runtimes and unstable performance
  • Increased operational cost due to manual steps
  • Scaling required system changes, not just more hardware

Constraints

The work had to improve a system already in use, while meeting real production constraints and maintaining stable behavior.

  • Existing pipeline could not be replaced wholesale
  • Large and growing datasets with increasing throughput requirements
  • GPU resources available in production but underutilized
  • Need to reduce human involvement and make execution reproducible
  • Stability and predictability mattered more than “best possible” offline metrics

What TensorLab did

The improvements were delivered in three phases: enabling bulk operation, refactoring for quality and efficiency, then removing the production bottleneck by making the pipeline GPU-aware end-to-end.

01 • Enable scale

Bulk pipeline implementation

  • Implemented batch processing on top of the existing solution
  • Made the workflow deterministic and repeatable
  • Prepared the system for higher-throughput workloads
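The batching approach can be sketched as follows. This is a minimal illustration, not the client's actual implementation: the `batches` helper and the fixed sorted ordering are assumptions chosen to show how a workflow is made deterministic and repeatable.

```python
from typing import Iterator, List

def batches(paths: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches of inputs in a deterministic order."""
    ordered = sorted(paths)  # stable ordering makes repeated runs identical
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Usage: process a set of images in repeatable batches.
frames = ["b.jpg", "a.jpg", "d.jpg", "c.jpg"]
for batch in batches(frames, batch_size=2):
    pass  # run detection/recognition on `batch`
```

Sorting the inputs up front is the simplest way to guarantee that two runs over the same data produce the same batches, which in turn makes failures reproducible and results comparable across runs.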

02 • Improve core

Pipeline refactor + model improvements

  • Full refactor of the face detection/recognition pipeline
  • Integrated improved models for better efficiency and results
  • Improved stage boundaries and maintainability

03 • Fix production bottleneck

GPU utilization + post-processing rewrite

After deployment, production runtimes were significantly worse than local runs. Investigation showed GPU resources were not effectively utilized because key post-processing was CPU-bound in OpenCV. The solution was to move compute-heavy steps onto the GPU by rewriting post-processing using PyTorch.

  • Identified GPU underutilization in production
  • Rebuilt OpenCV-based post-processing logic in PyTorch
  • Improved data flow to reduce CPU-to-GPU transfer overhead
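The shape of the rewrite can be illustrated with a minimal sketch. This is not the client's code: `postprocess`, the score threshold, and the crop shapes are hypothetical, and the point is only that masking and normalization, which OpenCV would perform on CPU arrays, stay on the GPU as tensor operations when one is available.

```python
import torch

# Run on the GPU if present, with a CPU fallback for local development.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def postprocess(crops: torch.Tensor, scores: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
    """Keep high-confidence face crops and normalize them to [0, 1],
    without round-tripping data back to the CPU."""
    crops = crops.to(device, non_blocking=True)
    scores = scores.to(device, non_blocking=True)
    keep = scores >= threshold          # boolean mask computed on-device
    return crops[keep].float() / 255.0  # normalization also stays on-device

# Usage: four hypothetical 112×112 face crops with detection scores.
crops = torch.randint(0, 256, (4, 3, 112, 112), dtype=torch.uint8)
scores = torch.tensor([0.9, 0.2, 0.7, 0.4])
kept = postprocess(crops, scores)  # two crops survive the 0.5 threshold
```

Keeping every step as a tensor operation avoids the pattern that caused the original slowdown: tensors being copied to host memory for NumPy/OpenCV processing and then copied back, leaving the GPU idle between model stages.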

Results

The system transitioned from a fragile, human-assisted pipeline into a stable production system with predictable performance under larger workloads.

Performance ~2–3× faster analysis after GPU-aware post-processing
Operations Minimal human intervention and more reproducible execution
Scalability Reliable behavior under much larger data volumes

Our role

  • Designed and implemented the bulk processing workflow
  • Refactored the face detection & recognition pipeline end-to-end
  • Integrated improved models and increased maintainability
  • Diagnosed and fixed production GPU underutilization
  • Rewrote CPU-bound OpenCV post-processing in PyTorch

Interested in a similar setup?

If you’re dealing with a CV system that works in demos but breaks at scale — or runs too slowly in production — a short intro call can help clarify feasibility, risks, and next steps.

Book a 30-minute intro call
Prefer async? Email: f.c.tensorlab@gmail