Model Inference Endpoint Request Throttling
Inference endpoint requests might be throttled based on activity and resource consumption over time.
Throttling maintains high availability and fair use of resources by protecting the model serving application servers from being overwhelmed by too many requests, and helps prevent denial-of-service attacks. If you make too many requests too quickly, some might succeed while others fail. When a request fails because of throttling, the service returns response code 429 with one of the following error codes and descriptions:
{
  "code": "TooManyRequests",
  "message": "Tenancy request-rate limit exceeded. Please use the OCI Console to submit a support ticket for your tenancy to increase the request limit."
}
Or
{
  "code": "TooManyRequests",
  "message": "LB bandwidth limit exceeded. Consider increasing the provisioned load balancer bandwidth to avoid these errors."
}