Predict with Response Streaming Endpoint

Use the /predictWithResponseStream endpoint for real-time streaming of inference results over HTTP/1.1.

It supports both standard chunked transfer encoding and Server-Sent Events (SSE), letting clients receive prediction outputs incrementally as they're generated. This is especially useful for applications that require low-latency responses, such as interactive AI experiences or large language model outputs.
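
For example, a client can call the endpoint directly over HTTP and consume the body as it arrives. The following is a minimal sketch using the Python requests library together with the OCI request signer; <model-deployment-url> stands for the model deployment's invoke endpoint and the payload is illustrative:

# Minimal sketch: call /predictWithResponseStream directly over HTTP with the
# requests library and read the body incrementally. <model-deployment-url> is the
# model deployment invoke endpoint; the request payload shown is illustrative.
import oci
import requests

config = oci.config.from_file("~/.oci/config")
auth = oci.signer.Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = "<model-deployment-url>/predictWithResponseStream"
with requests.post(endpoint, json={"data": "data"}, auth=auth, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None):
        print(chunk)  # each chunk is available as soon as the model emits it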

The API returns the following responses. Each entry lists the HTTP status code, the error code, a description, and whether the request should be retried.

HTTP Status Code: 200
Error Code: None
Description: Success. The inference output is returned as a stream of data chunks:

{
   <data_chunk_1>,
   <data_chunk_2>,
   ...
   <data_chunk_n>
}

The response headers are:

{
  "headers": {
    "Content-Type": "text/event-stream",
    "transfer-encoding": "chunked",
    "opc-request-id": "",
    "Trailer": "StreamFailure"
  },
  "status": "200 OK"
}

Retry: None

HTTP Status Code: 404
Error Code: NotAuthorizedOrNotFound
Description: Model deployment not found or authorization failed.
Retry: No

HTTP Status Code: 405
Error Code: MethodNotAllowed
Description: Method not allowed.
Retry: No

HTTP Status Code: 409
Error Code: ExternalServerIncorrectState
Description: An unexpected nonstreaming response was detected from the deployed model. Check that the inference server returns a streaming response.
Retry: No

HTTP Status Code: 411
Error Code: LengthRequired
Description: Missing Content-Length header.
Retry: No

HTTP Status Code: 413
Error Code: PayloadTooLarge
Description: The payload exceeds the 10 MB size limit.
Retry: No

HTTP Status Code: 429
Error Code: TooManyRequests
Description: Too many requests. This error is returned in two cases:

  • Load Balancer bandwidth limit exceeded.

    Consider increasing the provisioned Load Balancer bandwidth by editing the model deployment to avoid these errors.

  • Tenancy request-rate limit exceeded.

    The maximum number of requests per second per tenancy is set to 150. If you're consistently receiving this error after increasing the LB bandwidth, use the OCI Console to submit a support ticket for the tenancy. Include the following details in the ticket:

      • Describe the issue, include the error message that occurred, and indicate the new requests per second needed for the tenancy.
      • Indicate that it's a minor loss of service.
      • Indicate Analytics & AI and Data Science.
      • Indicate that the issue is creating and managing models.

Retry: Yes, with backoff

HTTP Status Code: 500
Error Code: InternalServerError
Description: Internal server error. Common causes are:

  • Service timeout.

    A 60-second timeout exists for the endpoint. This timeout value can't be changed.

  • The score.py file returns an exception.

Retry: Yes, with backoff

HTTP Status Code: 503
Error Code: ServiceUnavailable
Description: Model server unavailable.
Retry: Yes, with backoff
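
Because the 429, 500, and 503 responses are retryable, clients typically wrap the call in a retry loop with exponential backoff. The following is a minimal sketch; the client, model deployment OCID, and request body are constructed as in the OCI Python SDK example later in this topic, and the retry settings are illustrative:

# Minimal sketch: retry the status codes marked as retryable above (429, 500, 503)
# with exponential backoff. The client, model deployment OCID, and request body are
# passed in; the retry parameters are illustrative.
import time

import oci

def predict_stream_with_backoff(client, model_deployment_ocid, request_body,
                                max_attempts=5, base_delay_seconds=1):
    for attempt in range(max_attempts):
        try:
            return client.predict_with_response_stream(
                model_deployment_id=model_deployment_ocid,
                request_body=request_body)
        except oci.exceptions.ServiceError as e:
            # Retry only the retryable status codes, and give up after the last attempt.
            if e.status not in (429, 500, 503) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_seconds * (2 ** attempt))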

Trailer Header

With HTTP/1.1 streaming, once any part of the response body has been sent to the client, the response headers and status code can no longer be changed. As a result, even in cases of partial success, where not all the expected inference data is received, the client still sees a 200 OK status code.

To provide better visibility into these model stream failures, the service uses HTTP trailers. A trailer response header sends extra metadata back to the caller after the entire response stream or message body has been sent. This metadata is placed in the trailer section of the response body, which clients can read separately from the regular data chunks.

The following example shows the trailer headers response for a partial failure. The client receives a 200 Success status, but the stream is interrupted before all chunks are sent:

{
   <data_chunk_1>,
   <data_chunk_2>,
   // failure occurred resulting in dropping of subsequent chunks
}

The response headers are:

{
  "headers": {
    "Content-Type": "text/event-stream",
    "transfer-encoding": "chunked",
    "opc-request-id": "",
    "Trailer": "StreamFailure"
  },
  "status": "200 OK"
}

The trailer passed in the HTTP response is:

"StreamFailure" : {
       "ErrorCode" : "<error_code>",
       "ErrorReason" : "<error_reason>",
       "HttpCode" : <http_error_code>
}
StreamFailure: This trailer header field is set in the response body at the end of the stream to pass extra metadata to the client if any failures occur after the request is accepted. It captures the following types of errors:

  • ErrorCode: RequestTimeout
    HttpCode: 408
    ErrorReason: ServiceTimeout or ModelResponseTimeExceeded
      ServiceTimeout: The same as the 60-second idle timeout.
      ModelResponseTimeExceeded: The model failed to finish sending the entire response within the 5-minute time window.
  • ErrorCode: InternalServerError
    HttpCode: 500
    ErrorReason: InternalServerError
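
Once the HTTP client in use has surfaced the StreamFailure trailer (how trailers are exposed depends on the client library), its JSON value can be inspected to decide how to handle the partial response. The following is a minimal, illustrative sketch based on the trailer format above:

# Minimal sketch: interpret a StreamFailure trailer value after the HTTP client has
# surfaced it. The field names match the trailer format shown above; the retry
# decision is illustrative.
import json

def should_retry(stream_failure_json):
    failure = json.loads(stream_failure_json)
    print("Stream failed:", failure.get("ErrorCode"),
          failure.get("ErrorReason"), failure.get("HttpCode"))
    # Both RequestTimeout (408) and InternalServerError (500) mean the 200 OK stream
    # ended early, so the client can reissue the request.
    return failure.get("ErrorCode") in ("RequestTimeout", "InternalServerError")

print(should_retry('{"ErrorCode": "RequestTimeout", "ErrorReason": "ServiceTimeout", "HttpCode": 408}'))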

Invoking with the OCI Python SDK

This example code is a reference to help you invoke your model deployment:
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/iaas/Content/API/SDKDocs/pythonsdk.htm
import oci
from oci.signer import Signer
from oci.model_deployment import ModelDeploymentClient
  
config = oci.config.from_file("~/.oci/config") # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])
  
# For security token based authentication
# token_file = config['security_token_file']
# token = None
# with open(token_file, 'r') as f:
#     token = f.read()
# private_key = oci.signer.load_private_key_from_file(config['key_file'])
# auth = oci.auth.signers.SecurityTokenSigner(token, private_key)
  
model_deployment_ocid = '${modelDeployment.id}'
request_body = {} # payload goes here
  
model_deployment_client = ModelDeploymentClient({'region': config['region']}, signer=auth)
 
# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/python/2.152.0/sdk_behaviors/realm_specific_endpoint_template.html
model_deployment_client.base_client.client_level_realm_specific_endpoint_template_enabled = True
  
response = model_deployment_client.predict_with_response_stream(model_deployment_id=model_deployment_ocid,
                                                                request_body=request_body)
# Read chunks from the response stream and write them to a file
with open('example_file_retrieved_streaming', 'wb') as f:
    for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
        print(chunk)
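
If the deployed model emits Server-Sent Events, the streamed bytes can be decoded and split into events on blank lines. The following is a minimal sketch that assumes a fresh response from predict_with_response_stream and UTF-8 encoded "data:" lines; adapt the parsing to whatever format the inference server actually returns:

# Minimal sketch: decode the streamed bytes as Server-Sent Events (SSE).
# Assumes the inference server emits UTF-8 "data: ..." lines separated by blank
# lines; adjust the parsing to the actual output format of the deployed model.
buffer = ""
for chunk in response.data.raw.stream(1024, decode_content=False):
    buffer += chunk.decode("utf-8")
    while "\n\n" in buffer:
        event, buffer = buffer.split("\n\n", 1)
        for line in event.splitlines():
            if line.startswith("data:"):
                print(line[len("data:"):].strip())  # one prediction fragment per event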

Invoking with the OCI CLI

Use a model deployment by invoking it from the CLI.

The CLI is included in the OCI Cloud Shell environment, and is preauthenticated. This example invokes a model deployment with the CLI:

# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/oci-cli/3.56.0/oci_cli_docs/oci.html#cmdoption-realm-specific-endpoint
export OCI_REALM_SPECIFIC_SERVICE_ENDPOINT_TEMPLATE_ENABLED=true
 
oci model-deployment inference-result predict-with-response-stream \
 --file '-' --model-deployment-id <model-deployment-ocid> --request-body '{"data": "data"}'