Rate Limiting

The Private AI Services Container lets you limit the number of requests per minute that it accepts for each category of endpoint. This improves scalability and helps prevent abuse, including denial-of-service (DoS) attacks.

The container distinguishes between two types of endpoints, each with an independently configurable rate limit:

Monitor endpoints: These are used for operational monitoring and health checks, including /health and /metrics, and are controlled by monitor_requests_per_min (default value 60) in the configuration file.
Service endpoints: These include all of the other API functions such as /v1/embeddings and /v1/models. These are controlled by service_requests_per_min (default value 3000) in the configuration file.

This separation ensures that health checks and metrics scraping do not interfere with the container's ability to serve regular API traffic from users or client applications.

Container administrators can configure rate limiting by adding a "ratelimiter" section to the JSON configuration file with the "monitor_requests_per_min" and "service_requests_per_min" keys, as in the following example:

{
  "ratelimiter": {
    "service_requests_per_min": 3000,
    "monitor_requests_per_min": 60
  }
}

This example configuration sets a limit of 3000 API (service) requests per minute per IP address and 60 monitor (health, metrics) requests per minute per IP address.
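As a minimal sketch of how these settings could be read, the following Python snippet parses a configuration string and fills in the documented defaults (3000 and 60) for any key that is absent. The function name load_rate_limits is hypothetical, and the per-key fallback is an assumption; how the container itself parses the file is not described here.

```python
import json

# Defaults documented for each endpoint group.
DEFAULT_LIMITS = {
    "service_requests_per_min": 3000,
    "monitor_requests_per_min": 60,
}


def load_rate_limits(config_text: str) -> dict:
    """Return the rate limits found in a JSON configuration string.

    Hypothetical helper. Assumption: keys missing from the
    "ratelimiter" section fall back to the documented defaults.
    """
    config = json.loads(config_text)
    section = config.get("ratelimiter", {})
    limits = dict(DEFAULT_LIMITS)
    for key in DEFAULT_LIMITS:
        if key in section:
            limits[key] = section[key]
    return limits


# Only the service limit is overridden; the monitor limit keeps its default.
example = '{"ratelimiter": {"service_requests_per_min": 500}}'
limits = load_rate_limits(example)
print(limits)
```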

Each IP address is tracked independently for the service and monitor groups, and the counters reset every minute. If a client exceeds its quota within a 60-second window, the server responds with HTTP 429 (Too Many Requests). This logic protects the core APIs from being overwhelmed while ensuring that observability endpoints remain responsive.
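The behavior described above resembles a fixed-window counter kept per IP address and per endpoint group. The following Python sketch illustrates that idea only; it is not the container's implementation. The class name FixedWindowLimiter, the group labels "service" and "monitor", and the choice to start a window at a client's first request are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple

WINDOW_SECONDS = 60


class FixedWindowLimiter:
    """Illustrative per-IP, per-group fixed-window counter."""

    def __init__(self, limits: Dict[str, int],
                 clock: Callable[[], float] = time.monotonic):
        self.limits = limits  # for example {"service": 3000, "monitor": 60}
        self.clock = clock    # injectable so the sketch can be tested
        # (ip, group) -> (window start time, requests counted in that window)
        self.windows: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def check(self, ip: str, group: str) -> int:
        """Count one request and return 200 if allowed, 429 if over quota."""
        now = self.clock()
        start, count = self.windows.get((ip, group), (now, 0))
        if now - start >= WINDOW_SECONDS:
            start, count = now, 0  # the window elapsed, so the counter resets
        if count >= self.limits[group]:
            self.windows[(ip, group)] = (start, count)
            return 429
        self.windows[(ip, group)] = (start, count + 1)
        return 200


# Small limits and a fake clock keep the demonstration short.
fake_now = [0.0]
limiter = FixedWindowLimiter({"service": 2, "monitor": 1},
                             clock=lambda: fake_now[0])
statuses = [limiter.check("10.0.0.1", "service") for _ in range(3)]
other_ip = limiter.check("10.0.0.2", "service")   # separate counter per IP
monitor = limiter.check("10.0.0.1", "monitor")    # separate counter per group
fake_now[0] = 60.0                                # one minute later
after_reset = limiter.check("10.0.0.1", "service")
```

In this sketch the third service request from 10.0.0.1 is rejected, while the second IP address and the monitor group are unaffected because each has its own counter.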

Three response headers help clients monitor their request usage:

x-ratelimit-limit-requests: The maximum number of requests allowed per minute per IP address. The value matches "service_requests_per_min" or "monitor_requests_per_min" in the configuration file, depending on the endpoint.
x-ratelimit-remaining-requests: The number of requests remaining in the current window before the counter resets.
x-ratelimit-reset-requests: The number of seconds left until the counter resets and the full quota is available again.
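A client can read these headers to decide when to pause. The Python sketch below is one possible approach, not an official client. The names RateLimitStatus, parse_rate_limit_headers, and seconds_to_wait are hypothetical, and the code assumes that the reset value is a number of seconds that may carry a trailing "s".

```python
from typing import Mapping, NamedTuple


class RateLimitStatus(NamedTuple):
    limit: int
    remaining: int
    reset_seconds: float


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitStatus:
    """Parse the three rate-limit headers from a response header mapping."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    reset = lowered["x-ratelimit-reset-requests"].strip()
    # Assumption: the value may end with "s", for example "12s".
    if reset.endswith("s"):
        reset = reset[:-1]
    return RateLimitStatus(
        limit=int(lowered["x-ratelimit-limit-requests"]),
        remaining=int(lowered["x-ratelimit-remaining-requests"]),
        reset_seconds=float(reset),
    )


def seconds_to_wait(status: RateLimitStatus) -> float:
    """Return how long to pause before sending another request."""
    return status.reset_seconds if status.remaining <= 0 else 0.0


# Illustrative header values for a client that has used its whole quota.
status = parse_rate_limit_headers({
    "X-RateLimit-Limit-Requests": "3000",
    "X-RateLimit-Remaining-Requests": "0",
    "X-RateLimit-Reset-Requests": "12s",
})
wait = seconds_to_wait(status)
```

Waiting until the counter resets, rather than retrying immediately, avoids a burst of HTTP 429 responses once the quota is exhausted.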

If, for example, service_requests_per_min has a value of 500 and a client sends 400 scoring or inference requests in the first 25 seconds, the response will include: x-ratelimit-limit-requests = 500, x-ratelimit-remaining-requests = 100 (500 - 400), and x-ratelimit-reset-requests = 35s (60 seconds - 25 seconds).

Similarly, if monitor_requests_per_min has a value of 120 and a client sends 60 health check requests in the first 15 seconds, the response will include: x-ratelimit-limit-requests = 120, x-ratelimit-remaining-requests = 60 (120 - 60), and x-ratelimit-reset-requests = 45s (60 seconds - 15 seconds).
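The arithmetic in these two examples can be reproduced with a few lines of Python. The function expected_headers is a hypothetical helper written for this illustration, and it assumes that the one-minute window began when the client sent its first request.

```python
WINDOW_SECONDS = 60


def expected_headers(limit: int, requests_sent: int, seconds_elapsed: int) -> dict:
    """Compute the header values expected partway through a one-minute window."""
    return {
        "x-ratelimit-limit-requests": limit,
        # Remaining quota never drops below zero.
        "x-ratelimit-remaining-requests": max(limit - requests_sent, 0),
        "x-ratelimit-reset-requests": f"{WINDOW_SECONDS - seconds_elapsed}s",
    }


# Service example: limit 500, 400 requests sent in the first 25 seconds.
service = expected_headers(limit=500, requests_sent=400, seconds_elapsed=25)

# Monitor example: limit 120, 60 requests sent in the first 15 seconds.
monitor = expected_headers(limit=120, requests_sent=60, seconds_elapsed=15)
```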
