Predict with Response Streaming Endpoint
Use the /predictWithResponseStream endpoint for real-time streaming of inference results over HTTP/1.1.
It supports both standard chunked transfer encoding and Server-Sent Events (SSE), letting clients receive prediction outputs incrementally as they're generated. This is especially useful for applications that require low-latency responses, such as interactive AI experiences or large language model outputs.
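As an illustration of the raw HTTP flow, the sketch below consumes the stream directly with the requests library and an OCI request signer. The invoke URL and payload shape are placeholders (use your model deployment's invoke endpoint and whatever body your inference server expects), and the SSE parsing assumes the server emits data: lines.
# A minimal sketch (not the SDK example shown later) of consuming
# /predictWithResponseStream directly over HTTP/1.1 with a signed request.
# The invoke URL and payload below are placeholders.
import oci
import requests

config = oci.config.from_file("~/.oci/config")
auth = oci.signer.Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = "<model-deployment-invoke-url>/predictWithResponseStream"  # placeholder
payload = {"prompt": "Hello"}  # hypothetical payload; depends on your model server

with requests.post(endpoint, json=payload, auth=auth, stream=True) as response:
    response.raise_for_status()
    # For SSE, each event arrives as a "data: ..." line; print each one as it streams in.
    for line in response.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())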
HTTP Status Code | Error Code | Description | Retry |
---|---|---|---|
200 | None | Success. | None |
404 | NotAuthorizedOrNotFound | Model deployment not found or authorization failed. | No |
405 | MethodNotAllowed | Method not allowed. | No |
409 | ExternalServerIncorrectState | An unexpected nonstreaming response was detected from the deployed model. Check that the inference server returns a streaming response. | No |
411 | LengthRequired | Missing Content-Length header. | No |
413 | PayloadTooLarge | The payload exceeds the 10 MB size limit. | No |
429 | TooManyRequests | Too many requests. | Yes, with backoff |
500 | InternalServerError | Internal server error. | Yes, with backoff |
503 | ServiceUnavailable | Model server unavailable. | Yes, with backoff |
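Of the codes above, only 429, 500, and 503 are retryable, and then only with backoff. The sketch below shows one way to wrap a call with exponential backoff and jitter; invoke_stream is a hypothetical callable supplied by the caller that performs the signed request and raises requests.exceptions.HTTPError for non-2xx responses.
# Retry-with-backoff sketch for the retryable status codes (429, 500, 503).
# invoke_stream is a hypothetical callable that performs the signed request
# and raises requests.exceptions.HTTPError on failure.
import random
import time

import requests

RETRYABLE_STATUS_CODES = {429, 500, 503}

def predict_with_retries(invoke_stream, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke_stream()
        except requests.exceptions.HTTPError as err:
            status = err.response.status_code if err.response is not None else None
            if status not in RETRYABLE_STATUS_CODES or attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5))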
Trailer Header
With HTTP/1.1 streaming, when any part of the response body has been sent to the client, the response headers and status code can no longer be changed. As a result, even in cases of partial success, where not all expected inference data is received, the client still sees a 200 OK status code.
To provide better visibility into these model stream failures, the service uses HTTP trailers. A trailer response header sends extra metadata about the request back to the client after the entire response stream (message body) has been sent. Clients can read this metadata from the trailer section of the response, separately from the regular data chunks.
Partial Failure | Description |
---|---|
Response | StreamFailure: A trailer header field set in the response body at the end of the stream to pass extra metadata to the client if any failures are seen after the request is accepted. It captures the types of errors, such as model stream failures, encountered after the response stream has started. |
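Schematically, the trailer section arrives after the final zero-length chunk of the chunked HTTP/1.1 response. The exchange below is illustrative only, with placeholder values rather than an actual service response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Trailer: StreamFailure

<streamed prediction chunks>
0
StreamFailure: <metadata describing the failure seen after streaming began>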
Invoking with the OCI Python SDK
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/iaas/Content/API/SDKDocs/pythonsdk.htm
import oci
from oci.signer import Signer
from oci.model_deployment import ModelDeploymentClient

config = oci.config.from_file("~/.oci/config")  # replace with the location of your OCI config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

# For security token based authentication
# token_file = config['security_token_file']
# token = None
# with open(token_file, 'r') as f:
#     token = f.read()
# private_key = oci.signer.load_private_key_from_file(config['key_file'])
# auth = oci.auth.signers.SecurityTokenSigner(token, private_key)

model_deployment_ocid = '${modelDeployment.id}'
request_body = {}  # payload goes here

model_deployment_client = ModelDeploymentClient({'region': config['region']}, signer=auth)
# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/python/2.152.0/sdk_behaviors/realm_specific_endpoint_template.html
model_deployment_client.base_client.client_level_realm_specific_endpoint_template_enabled = True

response = model_deployment_client.predict_with_response_stream(
    model_deployment_id=model_deployment_ocid,
    request_body=request_body)

# Read chunks from the response stream as they arrive and save them to a file.
with open('example_file_retrieved_streaming', 'wb') as f:
    for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
        print(chunk)
Invoking with the OCI CLI
You can invoke a model deployment directly from the CLI. The CLI is included in the OCI Cloud Shell environment and is preauthenticated. This example invokes a model deployment with the CLI:
# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/oci-cli/3.56.0/oci_cli_docs/oci.html#cmdoption-realm-specific-endpoint
export OCI_REALM_SPECIFIC_SERVICE_ENDPOINT_TEMPLATE_ENABLED=true
oci model-deployment inference-result predict-with-response-stream \
  --file '-' \
  --model-deployment-id <model-deployment-ocid> \
  --request-body '{"data": "data"}'