Scaling face detection & recognition from prototype to production
Refactoring an existing system to reduce manual input, then rebuilding the pipeline to be GPU-accelerated and reliable under production-scale data volumes.
The problem
The client already had a face detection and recognition pipeline, but it behaved like a prototype in production. It required substantial human input to run, degraded under large data volumes, and was CPU-bound, making it slow and difficult to scale.
Symptoms
- High manual operational overhead
- Unreliability at scale
- CPU-only execution causing throughput bottlenecks
Production risk
- Unpredictable runtimes and unstable performance
- Increased operational cost due to manual steps
- Scaling required system changes, not just more hardware
Constraints
The work had to improve a system already in use, while meeting real production constraints and maintaining stable behavior.
- Existing pipeline could not be replaced wholesale
- Large and growing datasets with increasing throughput requirements
- GPU resources available in production but underutilized
- Need to reduce human involvement and make execution reproducible
- Stability and predictability mattered more than “best possible” offline metrics
What TensorLab did
The improvements were delivered in three phases: enabling bulk operation, refactoring for quality and efficiency, then removing the production bottleneck by making the pipeline GPU-aware end-to-end.
Bulk pipeline implementation
- Implemented batch processing on top of the existing solution
- Made the workflow deterministic and repeatable
- Prepared the system for higher-throughput workloads
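The deterministic, repeatable workflow above can be sketched in a few lines. This is an illustrative example, not the client's code: the function name, the fixed file extension, and the batch size are assumptions.

```python
from pathlib import Path
from typing import Iterator, List

def iter_batches(image_dir: str, batch_size: int = 32) -> Iterator[List[Path]]:
    """Yield image paths in fixed-size batches, in a deterministic order."""
    # Sorting by path means every run visits files in the same order,
    # so results are reproducible across re-runs and machines.
    paths = sorted(Path(image_dir).glob("*.jpg"))
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]
```

The key property is that nothing depends on filesystem enumeration order or wall-clock time, so the same input directory always produces the same batches.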
Pipeline refactor + model improvements
- Full refactor of the face detection/recognition pipeline
- Integrated improved models for better efficiency and results
- Improved stage boundaries and maintainability
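"Improved stage boundaries" in practice means each stage exposes a small typed interface, so a model can be swapped without touching the rest of the pipeline. A minimal sketch of that idea (all names here are hypothetical, not the client's API):

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol
import numpy as np

@dataclass
class Face:
    box: tuple                             # (x1, y1, x2, y2) in pixels
    embedding: Optional[np.ndarray] = None

class Detector(Protocol):
    def detect(self, image: np.ndarray) -> List[Face]: ...

class Recognizer(Protocol):
    def embed(self, image: np.ndarray, faces: List[Face]) -> List[Face]: ...

def run_pipeline(image: np.ndarray,
                 detector: Detector,
                 recognizer: Recognizer) -> List[Face]:
    # Each stage has one typed responsibility; integrating an improved
    # model only requires satisfying the same interface.
    return recognizer.embed(image, detector.detect(image))
```

Because the stage boundary is an explicit contract rather than shared internal state, improved detection or recognition models drop in behind the same signatures.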
GPU utilization + post-processing rewrite
After deployment, production runtimes were significantly worse than local runs. Investigation showed GPU resources were not effectively utilized because key post-processing was CPU-bound in OpenCV. The solution was to move compute-heavy steps onto the GPU by rewriting post-processing using PyTorch.
- Identified GPU underutilization in production
- Rebuilt OpenCV-based post-processing logic in PyTorch
- Improved data flow to reduce CPU-to-GPU transfer overhead
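The shape of that rewrite can be sketched as follows: per-image OpenCV calls on the CPU become batched tensor operations on the GPU, with pinned-memory, asynchronous transfers to reduce copy overhead. This is a generic illustration under assumed shapes and normalization constants, not the production code.

```python
import torch
import torch.nn.functional as F

def postprocess(frames: torch.Tensor, size: int = 224,
                device: str = "cuda") -> torch.Tensor:
    """Resize and normalize a batch of frames on the GPU.

    frames: uint8 tensor of shape (N, H, W, 3), as decoded on the CPU.
    Returns a float tensor of shape (N, 3, size, size).
    """
    if device == "cuda" and torch.cuda.is_available():
        # Pinned host memory plus non_blocking=True lets the copy
        # overlap with GPU compute instead of stalling the pipeline.
        frames = frames.pin_memory().to(device, non_blocking=True)
    # NHWC uint8 -> NCHW float in one pass on the device.
    x = frames.permute(0, 3, 1, 2).float().div_(255.0)
    # One batched bilinear resize replaces N separate cv2.resize calls.
    x = F.interpolate(x, size=(size, size), mode="bilinear",
                      align_corners=False)
    # Channel-wise normalization (ImageNet stats, as an example).
    mean = torch.tensor([0.485, 0.456, 0.406], device=x.device).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device=x.device).view(1, 3, 1, 1)
    return (x - mean) / std
```

The win comes from batching: instead of looping over images in Python and calling OpenCV on each, the whole batch is transformed by a handful of kernel launches, which is what keeps the GPU fed.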
Results
The system transitioned from a fragile, human-assisted pipeline into a stable production system with predictable performance under larger workloads.
Our role
- Designed and implemented the bulk processing workflow
- Refactored the face detection & recognition pipeline end-to-end
- Integrated improved models and improved maintainability
- Diagnosed and fixed production GPU underutilization
- Rewrote CPU-bound OpenCV post-processing in PyTorch
Interested in a similar setup?
If you’re dealing with a CV system that works in demos but breaks at scale — or runs too slowly in production — a short intro call can help clarify feasibility, risks, and next steps.