Model Deployments GPU

Troubleshoot GPU model deployments.

Bootstrap Failed Because of the Model Size

In general, the model size must be greater than 0 and less than the memory of the selected shape. Check the size of the model and ensure that it's no more than 70% of the memory of the GPU or CPU attached to the Compute shape.
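As a quick sanity check before deploying, you can compare the on-disk size of the model artifact against the 70% budget. The following is a minimal sketch; fits_in_shape, the artifact path, and the 24 GB memory figure are illustrative, not part of any service API.

```python
import os

def fits_in_shape(model_dir: str, shape_memory_gb: float, headroom: float = 0.70) -> bool:
    """Return True if the on-disk model artifact is within the headroom budget.

    shape_memory_gb is the GPU (or CPU) memory of the chosen Compute shape;
    the 70% headroom mirrors the guidance above.
    """
    size_bytes = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(model_dir)
        for f in files
    )
    size_gb = size_bytes / (1024 ** 3)
    return 0 < size_gb <= headroom * shape_memory_gb

# Example: a shape with 24 GB of GPU memory (hypothetical value).
print(fits_in_shape("./model_artifact", shape_memory_gb=24))
```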

Runtime CUDA Out of Memory Error

If a CUDA out of memory (OOM) error occurs, it might be because the payload is too large and there isn't enough space on the GPU to hold the input and output tensors. To optimize performance, adjust the WEB_CONCURRENCY environment variable in the application when using a service-managed inference server.
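One mitigation inside the inference code itself is to split an oversized payload when the GPU runs out of memory. The following is a minimal PyTorch sketch, assuming the payload is a batched tensor; safe_infer is a hypothetical helper, not part of the service.

```python
import torch

def safe_infer(model, batch):
    """Run inference; on CUDA OOM, free cached memory and retry on smaller halves."""
    try:
        with torch.no_grad():
            return model(batch.to("cuda"))
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        if batch.shape[0] > 1:
            half = batch.shape[0] // 2
            # Process the payload in two pieces so the input and output
            # tensors of each piece fit in the remaining GPU memory.
            return torch.cat([safe_infer(model, batch[:half]),
                              safe_infer(model, batch[half:])])
        raise  # A single item still doesn't fit; a larger shape is needed.
```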

Starting with a lower number, such as 1 or 2, can be beneficial because of the variability in model types, frameworks, and input and output sizes. Although Data Science attempts to estimate the optimal number of model replicas for increased throughput, issues might still occur at runtime. When they do, manage the number of model replicas on a GPU by adjusting WEB_CONCURRENCY. The default WEB_CONCURRENCY value computed by Data Science appears in the model deployment logs.
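To illustrate the kind of estimate involved (this heuristic is not the service's actual computation; check the model deployment logs for the value the service chose), a replica count can be bounded by how many model copies fit in the memory budget:

```python
def estimate_replicas(gpu_memory_gb: float, model_size_gb: float,
                      headroom: float = 0.70) -> int:
    """Rough replica estimate: how many model copies fit in the headroom budget.

    Illustrative only; it ignores working memory for input and output
    tensors, which also counts against the GPU at runtime.
    """
    return max(1, int((headroom * gpu_memory_gb) // model_size_gb))

# Example: a 6 GB model on a 24 GB GPU leaves room for roughly 2 replicas.
print(estimate_replicas(gpu_memory_gb=24, model_size_gb=6))
```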

When using a BYOC (bring your own container) deployment, we recommend reducing the number of replicas loaded onto the GPU. If these options don't suffice, upgrading to a larger GPU Compute shape might be necessary.
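In a BYOC entrypoint, the replica count can be pinned explicitly rather than left to defaults. The following is a minimal sketch assuming a uvicorn-based inference server; the app:app module path and port are illustrative. Both gunicorn and uvicorn fall back to the WEB_CONCURRENCY environment variable when a worker count isn't passed, so reading it explicitly keeps the behavior visible and deterministic.

```python
import os
import uvicorn

if __name__ == "__main__":
    # Default to a single replica; raise WEB_CONCURRENCY only after
    # confirming the GPU has memory to spare for additional copies.
    workers = int(os.environ.get("WEB_CONCURRENCY", "1"))
    uvicorn.run("app:app", host="0.0.0.0", port=8080, workers=workers)
```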