Serving Deep Learning Models for Enterprises: Part 1
Find out how Triton solves the most common pain points of serving DL Models
MLOps is a wild world. Everyone builds their own pipelines with different trade-offs, and even an imperfect pipeline is usually better than nothing.
A fan favorite for model deployment, though, is NVIDIA's Triton Inference Server. It has a learning curve, but with the right expertise you can get a serving pipeline off the ground with Triton quickly. That's exactly what we did.
Here's the story.
A client of ours had a set of heavy-duty generative AI models: more than 10 DL models that needed to run in production, each with mutually incompatible dependencies. Bummer.
Initially, they deployed on AWS Lambda. That didn't hold up under large spikes in usage: inference times took a huge hit. On top of that, Lambda offered no GPU acceleration and was very costly.
They needed a robust platform to deploy and serve their DL models, and to monitor usage and other metrics.
The good thing about Triton is that you can make it work with almost anything. It fully supports both HTTP and gRPC, so it slots in behind standard infrastructure such as load balancers.
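As a concrete sketch, Triton's HTTP endpoint speaks the KServe v2 inference protocol: you POST a JSON body describing the inputs to `/v2/models/<model_name>/infer`. The model and tensor names below are illustrative, not from the client's actual pipeline:

```python
import json

def build_infer_request(input_name, data, shape, datatype="FP32"):
    """Build a request body for Triton's HTTP inference API (KServe v2 protocol)."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": datatype,
                "data": data,
            }
        ]
    }

# Hypothetical model with a single 1x3 FP32 input tensor.
body = build_infer_request("INPUT__0", [0.1, 0.2, 0.3], [1, 3])

# This body would be POSTed to, e.g.:
#   http://<triton-host>:8000/v2/models/<model_name>/infer
print(json.dumps(body))
```

In practice you would usually use the official `tritonclient` Python package rather than building requests by hand, but the wire format above is what any HTTP client, load balancer, or gateway ends up passing along.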
But it doesn’t stop there. You get model analytics for free:
- Average Inference Time
- Requests Received
- GPU Utilization
- Server Throughput
- Server Latency, and more
You get the point.
But what if one model is written in PyTorch and another in TensorFlow?
No worries there either. Triton is framework-agnostic, to a large extent: it supports TensorRT, TensorFlow, PyTorch, ONNX, and OpenVINO backends, just to name a few.
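Mixing frameworks works because each model lives in its own directory inside a model repository, with a small `config.pbtxt` telling Triton which backend to load it with. A sketch with illustrative model names (the `platform` strings are Triton's actual backend identifiers):

```
model_repository/
├── detector/              # TorchScript model
│   ├── config.pbtxt       # platform: "pytorch_libtorch"
│   └── 1/
│       └── model.pt
└── classifier/            # ONNX model
    ├── config.pbtxt       # platform: "onnxruntime_onnx"
    └── 1/
        └── model.onnx
```

And a minimal `config.pbtxt` for one of them, with hypothetical tensor names and shapes:

```
# classifier/config.pbtxt (illustrative names and shapes)
name: "classifier"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [1000]
  }
]
```

Each backend runs in its own environment, which is what sidesteps the incompatible-dependencies problem that sank the Lambda setup.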
And it does a whole lot more:

- Model versioning
- Request prioritization based on parameters
- Ensemble models
- Model profiling to optimize your models
- Support for multi-GPU environments
- Running multiple instances of a model
- Automated restarts in case of crashes
- Optimized GPU usage, and so on
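Several of these features are plain configuration. As a sketch, a `config.pbtxt` fragment might enable versioning, multiple instances across GPUs, and prioritized dynamic batching like this (all values are illustrative):

```
# config.pbtxt fragment (illustrative values)

# Keep the two most recent model versions live.
version_policy: { latest: { num_versions: 2 } }

# Run two instances of the model, spread across two GPUs.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0, 1]
  }
]

# Batch incoming requests on the fly, with two priority levels.
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
  priority_levels: 2
  default_priority_level: 1
}
```

No application code changes are needed for any of this; Triton reads the config when the model loads.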
So that's a pretty good sales pitch for Triton. But what exactly did it accomplish?
- Increased uptime to 99%
- Reduced inference times by ~3.4x
- Reduced cloud GPU bills by 16%
In the next post, we'll cover how we extended Triton to support Kaldi models for a client who needed to serve their automatic speech recognition (ASR) pipeline at scale.