Predict with Response Streaming Endpoint

Use the /predictWithResponseStream endpoint for real-time streaming of inference results over HTTP/1.1.

It supports both standard chunked transfer encoding and Server-Sent Events (SSE), letting clients receive prediction outputs incrementally as they're generated. This is especially useful for applications that require low-latency responses, such as interactive AI experiences or large language model outputs.
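
For example, a client can call the endpoint directly over HTTP and consume the body as it arrives. The following is a minimal sketch using the Python requests library together with the OCI request signer; <model-deployment-url> stands for the model deployment's invoke endpoint and the payload is illustrative:

# Minimal sketch: call /predictWithResponseStream directly over HTTP with the
# requests library and read the body incrementally. <model-deployment-url> is the
# model deployment invoke endpoint; the request payload shown is illustrative.
import oci
import requests

config = oci.config.from_file("~/.oci/config")
auth = oci.signer.Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = "<model-deployment-url>/predictWithResponseStream"
with requests.post(endpoint, json={"data": "data"}, auth=auth, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None):
        print(chunk)  # each chunk is available as soon as the model emits it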

The API returns the following responses. Each entry lists the HTTP status code, the error code, a description, and whether the request should be retried.

HTTP Status Code: 200
Error Code: None
Description: Success. The inference output is returned as a stream of data chunks:

{
   <data_chunk_1>,
   <data_chunk_2>,
   ...
   <data_chunk_n>
}

The response headers are:

{
  "headers": {
    "Content-Type": "text/event-stream",
    "transfer-encoding": "chunked",
    "opc-request-id": "",
    "Trailer": "StreamFailure"
  },
  "status": "200 OK"
}

Retry: None

HTTP Status Code: 404
Error Code: NotAuthorizedOrNotFound
Description: Model deployment not found or authorization failed.
Retry: No

HTTP Status Code: 405
Error Code: MethodNotAllowed
Description: Method not allowed.
Retry: No

HTTP Status Code: 409
Error Code: ExternalServerIncorrectState
Description: An unexpected nonstreaming response was detected from the deployed model. Check that the inference server returns a streaming response.
Retry: No

HTTP Status Code: 411
Error Code: LengthRequired
Description: Missing Content-Length header.
Retry: No

HTTP Status Code: 413
Error Code: PayloadTooLarge
Description: The payload exceeds the 10 MB size limit.
Retry: No

HTTP Status Code: 429
Error Code: TooManyRequests
Description: Too many requests. This error is returned in two cases:

  • Load Balancer bandwidth limit exceeded.

    Consider increasing the provisioned Load Balancer bandwidth by editing the model deployment to avoid these errors.

  • Tenancy request-rate limit exceeded.

    The maximum number of requests per second per tenancy is set to 150. If you're consistently receiving this error after increasing the LB bandwidth, use the OCI Console to submit a support ticket for the tenancy. Include the following details in the ticket:

      • Describe the issue, include the error message that occurred, and indicate the new requests per second needed for the tenancy.
      • Indicate that it's a minor loss of service.
      • Indicate Analytics & AI and Data Science.
      • Indicate that the issue is creating and managing models.

Retry: Yes, with backoff

HTTP Status Code: 500
Error Code: InternalServerError
Description: Internal server error. Common causes are:

  • Service timeout.

    A 60-second timeout exists for the endpoint. This timeout value can't be changed.

  • The score.py file returns an exception.

Retry: Yes, with backoff

HTTP Status Code: 503
Error Code: ServiceUnavailable
Description: Model server unavailable.
Retry: Yes, with backoff
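
Because the 429, 500, and 503 responses are retryable, clients typically wrap the call in a retry loop with exponential backoff. The following is a minimal sketch; the client, model deployment OCID, and request body are constructed as in the OCI Python SDK example later in this topic, and the retry settings are illustrative:

# Minimal sketch: retry the status codes marked as retryable above (429, 500, 503)
# with exponential backoff. The client, model deployment OCID, and request body are
# passed in; the retry parameters are illustrative.
import time

import oci

def predict_stream_with_backoff(client, model_deployment_ocid, request_body,
                                max_attempts=5, base_delay_seconds=1):
    for attempt in range(max_attempts):
        try:
            return client.predict_with_response_stream(
                model_deployment_id=model_deployment_ocid,
                request_body=request_body)
        except oci.exceptions.ServiceError as e:
            # Retry only the retryable status codes, and give up after the last attempt.
            if e.status not in (429, 500, 503) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_seconds * (2 ** attempt))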

Trailer Header

With HTTP/1.1 streaming, once any part of the response body has been sent to the client, the response headers and status code can no longer be changed. As a result, even in cases of partial success, where not all the expected inference data is received, the client still sees a 200 OK status code.

To provide better visibility into these model stream failures, the service uses HTTP trailers. A trailer response header sends extra metadata back to the caller after the entire response stream or message body has been sent. This metadata is placed in the trailer section of the response body, which clients can read separately from the regular data chunks.

The following example shows the trailer headers response for a partial failure. The client receives a 200 Success status, but the stream is interrupted before all chunks are sent:

{
   <data_chunk_1>,
   <data_chunk_2>,
   // failure occurred resulting in dropping of subsequent chunks
}

The response headers are:

{
  "headers": {
    "Content-Type": "text/event-stream",
    "transfer-encoding": "chunked",
    "opc-request-id": "",
    "Trailer": "StreamFailure"
  },
  "status": "200 OK"
}

The trailer passed in the HTTP response is:

"StreamFailure" : {
       "ErrorCode" : "<error_code>",
       "ErrorReason" : "<error_reason>",
       "HttpCode" : <http_error_code>
}
StreamFailure: This trailer header field is set in the response body at the end of the stream to pass extra metadata to the client if any failures occur after the request is accepted. It captures the following types of errors:

  • ErrorCode: RequestTimeout
    HttpCode: 408
    ErrorReason: ServiceTimeout or ModelResponseTimeExceeded
      ServiceTimeout: The same as the 60-second idle timeout.
      ModelResponseTimeExceeded: The model failed to finish sending the entire response within the 5-minute time window.
  • ErrorCode: InternalServerError
    HttpCode: 500
    ErrorReason: InternalServerError
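
Once the HTTP client in use has surfaced the StreamFailure trailer (how trailers are exposed depends on the client library), its JSON value can be inspected to decide how to handle the partial response. The following is a minimal, illustrative sketch based on the trailer format above:

# Minimal sketch: interpret a StreamFailure trailer value after the HTTP client has
# surfaced it. The field names match the trailer format shown above; the retry
# decision is illustrative.
import json

def should_retry(stream_failure_json):
    failure = json.loads(stream_failure_json)
    print("Stream failed:", failure.get("ErrorCode"),
          failure.get("ErrorReason"), failure.get("HttpCode"))
    # Both RequestTimeout (408) and InternalServerError (500) mean the 200 OK stream
    # ended early, so the client can reissue the request.
    return failure.get("ErrorCode") in ("RequestTimeout", "InternalServerError")

print(should_retry('{"ErrorCode": "RequestTimeout", "ErrorReason": "ServiceTimeout", "HttpCode": 408}'))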

Invoking with the OCI Python SDK

This example code is a reference to help you invoke your model deployment:
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/iaas/Content/API/SDKDocs/pythonsdk.htm
import oci
from oci.signer import Signer
from oci.model_deployment import ModelDeploymentClient
  
config = oci.config.from_file("~/.oci/config") # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])
  
# For security token based authentication
# token_file = config['security_token_file']
# token = None
# with open(token_file, 'r') as f:
#     token = f.read()
# private_key = oci.signer.load_private_key_from_file(config['key_file'])
# auth = oci.auth.signers.SecurityTokenSigner(token, private_key)
  
model_deployment_ocid = '${modelDeployment.id}'
request_body = {} # payload goes here
  
model_deployment_client = ModelDeploymentClient({'region': config['region']}, signer=auth)
 
# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/python/2.152.0/sdk_behaviors/realm_specific_endpoint_template.html
model_deployment_client.base_client.client_level_realm_specific_endpoint_template_enabled = True
  
response = model_deployment_client.predict_with_response_stream(model_deployment_id=model_deployment_ocid,
                                                                request_body=request_body)
# Read chunks from the response stream and write them to a file
with open('example_file_retrieved_streaming', 'wb') as f:
    for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
        print(chunk)
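
If the deployed model emits Server-Sent Events, the streamed bytes can be decoded and split into events on blank lines. The following is a minimal sketch that assumes a fresh response from predict_with_response_stream and UTF-8 encoded "data:" lines; adapt the parsing to whatever format the inference server actually returns:

# Minimal sketch: decode the streamed bytes as Server-Sent Events (SSE).
# Assumes the inference server emits UTF-8 "data: ..." lines separated by blank
# lines; adjust the parsing to the actual output format of the deployed model.
buffer = ""
for chunk in response.data.raw.stream(1024, decode_content=False):
    buffer += chunk.decode("utf-8")
    while "\n\n" in buffer:
        event, buffer = buffer.split("\n\n", 1)
        for line in event.splitlines():
            if line.startswith("data:"):
                print(line[len("data:"):].strip())  # one prediction fragment per event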

Invoking with the OCI CLI

Use a model deployment by invoking it from the CLI.

The CLI is included in the OCI Cloud Shell environment, and is preauthenticated. This example invokes a model deployment with the CLI:

# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/oci-cli/3.56.0/oci_cli_docs/oci.html#cmdoption-realm-specific-endpoint
export OCI_REALM_SPECIFIC_SERVICE_ENDPOINT_TEMPLATE_ENABLED=true
 
oci model-deployment inference-result predict-with-response-stream \
 --file '-' --model-deployment-id <model-deployment-ocid> --request-body '{"data": "data"}'