Serving Deep Learning Models for Enterprises: Part 2

Learn how to extend Triton’s Kaldi backend to support multiple GPUs

Let’s get deep into the weeds.

The problem was that our client had an Automatic Speech Recognition (ASR) model that they wanted to deploy at scale. The model was Kaldi-based, and Kaldi isn’t the most developer-friendly framework to work with.

The goal was to serve multiple instances of the model on multiple GPUs using the Triton Inference Server. Unfortunately, Triton’s Kaldi backend doesn’t come with multi-GPU support, and the example Python notebooks provided by NVIDIA ran into connection issues (namely, a gRPC UNIMPLEMENTED error).

Solution 1

As a first step, we tried different versions of the libraries involved (gRPC/protobuf), but that didn’t work. It turned out that the gRPC library used in the example notebooks wasn’t even compatible with the latest version of the Triton server.

Luckily, NVIDIA’s DeepLearningExamples repository includes a minimal C++ client for testing a model loaded on the Triton Inference Server. By default, the server loads the LibriSpeech model and the client computes WER (Word Error Rate), SER (Sentence Error Rate), and text transcriptions for a sample OpenSLR dataset.

That meant we had to develop our own C++ inference client with multi-GPU support and provide Python bindings for it.

Solution 2

Implementing a Custom Client

The first step was to investigate the kaldi-asr-client. Too convoluted: multiple Docker images involved, held together with some hacks. To use the client from Python bindings, it needed to be compiled as a shared library (or DLL).

Getting the client to build as a shared library was a lot of trouble: the dependencies we needed later were not preserved by the original build process.

We then managed to build the object files fine, but the final link failed because the CMake files don’t carry the appropriate dependency information. We tried linking the dependencies as shared libraries instead, but that failed because we needed OpenSSL 1.x.

The OpenSSL error was solved by building OpenSSL 1.x from source, bundling it alongside the client, and setting an appropriate RPATH with patchelf.

At this point, our minimal client was working fine and was ready for all the steroids we’d be injecting into it later.

Creating Python Bindings

Frankly, we didn’t have any experience building Python bindings for C++ clients, so that required some upskilling (mainly learning about ctypes in Python). Once you know how ctypes works, implementing the bindings is pretty straightforward.
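
For a flavour of what that looks like, here’s a minimal ctypes sketch. The shared-library name and the C function signature below are hypothetical stand-ins for illustration, not our client’s actual interface:

import ctypes

# Hypothetical shared library built from the C++ client
_lib = ctypes.CDLL("./libkaldi_asr_client.so")

# Hypothetical C API:
#   int client_infer(const char* server, const char* wav, size_t wav_len,
#                    char* out, size_t out_len);
_lib.client_infer.argtypes = [
    ctypes.c_char_p,
    ctypes.c_char_p,
    ctypes.c_size_t,
    ctypes.c_char_p,
    ctypes.c_size_t,
]
_lib.client_infer.restype = ctypes.c_int

def infer(server: str, wav: bytes) -> str:
    # Allocate an output buffer, call into the native client, decode the result
    out = ctypes.create_string_buffer(4096)
    status = _lib.client_infer(server.encode(), wav, len(wav), out, len(out))
    if status != 0:
        raise RuntimeError(f"inference failed with status {status}")
    return out.value.decode()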

As finishing touches to the wrapper, we did the following:

  • Elided all copies that were being created unnecessarily
  • Added support for batched inference
  • Propagated server-side exceptions to the client appropriately (see the sketch below)
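
One ctypes pattern that works well for that last point: attach an errcheck hook to every fallible native function so that non-zero return codes become Python exceptions. A minimal sketch of the idea, again with a hypothetical library name and error-reporting function rather than our client’s actual interface:

import ctypes

_lib = ctypes.CDLL("./libkaldi_asr_client.so")  # hypothetical library name

class TritonClientError(RuntimeError):
    """Raised when the native client or the Triton server reports a failure."""

# Hypothetical C function: const char* client_last_error(void);
_lib.client_last_error.restype = ctypes.c_char_p

def _check_status(status, func, args):
    # ctypes calls this hook with the return value of every wrapped call;
    # a non-zero status is turned into a Python exception carrying the
    # server-side error message.
    if status != 0:
        raise TritonClientError(_lib.client_last_error().decode())
    return status

# Attach the hook to each fallible function exposed by the library
_lib.client_infer.restype = ctypes.c_int
_lib.client_infer.errcheck = _check_status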

Adding Multi-GPU Support

When using the TensorFlow or PyTorch backends, you can specify in the model configuration (config.pbtxt) which GPUs you want to use, how many instances of the given model to load, and on which GPUs to load them.
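
For reference, that part of config.pbtxt looks roughly like this for a backend that supports it; the instance count and GPU indices below are just placeholders:

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]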

This feature, unfortunately, is lacking in Triton’s Kaldi backend. But there’s a hack:

We launched a separate Triton server for each GPU on our host machine, each responsible for orchestrating the model on the GPU it ran on. Our model took 18 GB of VRAM when loaded, leaving a buffer of only 6 GB, which is good to have when you’re expecting a lot of traffic. The C++ client takes the addresses of all the Triton servers running on the host machine from the user, establishes connections with them, and distributes the incoming load evenly among them, maximizing their utilization and minimizing inference time.

Using this hack, we can scale the solution to any number of GPUs on the host machine and run inference on all of them in parallel.
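
To make the hack concrete, here’s a rough sketch of how one server per GPU might be launched; the model repository path is a placeholder, and the only requirement is that each server gets its own set of ports:

import os
import subprocess

MODEL_REPO = "/models"  # placeholder path to the Triton model repository
NUM_GPUS = 2

servers = []
for gpu in range(NUM_GPUS):
    # Pin each server to a single GPU
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    servers.append(subprocess.Popen(
        [
            "tritonserver",
            f"--model-repository={MODEL_REPO}",
            f"--grpc-port={8001 + gpu}",     # each server gets its own gRPC port
            f"--http-port={8101 + gpu}",
            f"--metrics-port={8201 + gpu}",
        ],
        env=env,
    ))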

Multi-GPU Python Bindings

So far our Python bindings were only using a single GPU. To achieve our target, we needed to scale them to multiple GPUs and also handle load-balancing.

The above hack took care of that: just specify the ports of the different Triton servers in the Python bindings and you’re good to go.
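
Under the hood, the load balancing doesn’t need to be fancy; assigning utterances to the configured servers round-robin is enough to keep every GPU busy. A simplified Python sketch of the idea (the actual distribution happens inside the C++ client):

from typing import Dict, List

def distribute(wavs: List[bytes], servers: List[str]) -> Dict[str, List[bytes]]:
    # Assign utterance i to server i % len(servers) so the load is spread evenly
    assignment: Dict[str, List[bytes]] = {server: [] for server in servers}
    for i, wav in enumerate(wavs):
        assignment[servers[i % len(servers)]].append(wav)
    return assignment

# Example: five utterances over two servers -> 3 for the first, 2 for the second
print(distribute([b"a", b"b", b"c", b"d", b"e"], ["localhost:8001", "localhost:8002"]))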

Now our Python client could talk to the Triton servers over gRPC, distribute the load evenly among them, propagate server-side exceptions to the caller, and, importantly, it was Pythonic.

The Result

Check out the resulting inference script:

from kaldi_asr_client import Client

files = [
    "./data/1.wav",
    "./data/2.wav",
]

# (filename, raw bytes) pairs for each file
wavs = []

for file in files:
    with open(file, "rb") as f:
        wavs.append((file, f.read()))

# Using a context manager to avoid memory leaks
with Client(
    samp_freq=16000,
    servers=["localhost:8001", "localhost:8002"],  # multiple Triton servers
    model_name="kaldi_online",
    ncontextes=10,
    chunk_length=8160,
    verbose=False,
) as client:
    wav_bytes = [wav for _, wav in wavs]
    for index, inference in enumerate(client.infer(wav_bytes)):
        wav_file = wavs[index][0]
        print(f"Inference for {wav_file}: {inference}")

We’re also planning to release an async version of the inference client in a future update to further improve response times.

Cross-posted from: https://muhammadrassam.substack.com/p/serving-deep-learning-models-for-7ee

Subscribe to the newsletter for weekly updates, delivered directly to your inbox: https://muhammadrassam.substack.com/


Antematter is a software development company specializing in Blockchain, DeFi & SaaS.
