Validate an Architecture
If you want to deploy an architecture similar to the one depicted in this solution, validate it against the criteria in the following topics. Consider GPU selection, cluster setup, model distribution and deployment, autoscaling, performance testing, security, and cost.
Select GPUs
You need to deploy a GPU-powered inference service that converts text segments into speech while ensuring performance, scalability, and cost efficiency. Evaluate GPU models (A10, A100, L40S) to determine the best fit for the use case.
Run benchmark tests with the different GPU types. The following table provides a template: replace X and Y with your measured values and use that data to compare performance against cost. A minimal benchmarking sketch follows the table.
GPU Type | Avg Latency (per 250 segments) | Throughput (segments/sec) | Notes |
---|---|---|---|
A10 | X ms | Y | Lower cost, suitable for dev/test |
A100 | X ms | Y | High performance, higher cost |
L40S | X ms | Y | Good balance between price/performance |
- Recommendation: Select GPU based on workload size, SLA latency requirements, and budget.
- Optimization: Run with FP16/mixed precision (where supported) for reduced latency.
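To fill in the table, you can run a simple load test against the TorchServe inference endpoint on each GPU node pool. The following shell sketch assumes the endpoint URL, the multispk_20250121_tts model name, and the JSON payload shape shown here; replace them with your own values.
#!/bin/bash
# Minimal latency/throughput benchmark sketch.
# Assumptions: the TorchServe inference endpoint is reachable at TS_URL and the
# model accepts a JSON text payload; adjust the URL, model name, and payload.
TS_URL="http://<load-balancer-ip>:8080/predictions/multispk_20250121_tts"
SEGMENTS=250

START=$(date +%s.%N)
for i in $(seq 1 "$SEGMENTS"); do
  curl -s -o /dev/null -X POST "$TS_URL" \
       -H "Content-Type: application/json" \
       -d '{"text": "This is a sample segment for benchmarking."}'
done
END=$(date +%s.%N)

TOTAL=$(echo "$END - $START" | bc -l)
echo "Total time : ${TOTAL}s for ${SEGMENTS} segments"
echo "Avg latency: $(echo "$TOTAL * 1000 / $SEGMENTS" | bc -l) ms per segment"
echo "Throughput : $(echo "$SEGMENTS / $TOTAL" | bc -l) segments/sec"
Repeat the same run on each GPU shape (A10, A100, L40S) and record the results in the table.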
Design OKE Clusters
Design your OKE clusters to suit your deployment.
In this example, we use a cluster setup with two node pools:
- NodePool 1 (CPU Nodes): Runs UI, workers, RabbitMQ, and internal DB
- NodePool 2 (GPU Nodes): Runs TorchServe inference pods
Ensure sufficient block volume (BV) storage on the GPU nodes.
Label the node pools:
- nodepool=cpu on CPU nodes
- nodepool=gpu and nvidia.com/gpu.present=true on GPU nodes
Make sure the GPU node pool label matches the nodeSelector used in the TorchServe deployment.
Ensure that the OKE NVIDIA device plugin add-on is enabled on GPU nodes and that nvidia-smi works.
Validate GPUs
To ensure that GPUs are available to both the OKE node and TorchServe pods, run the following validation steps.
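For example, the following commands are a minimal sketch of those checks; substitute your own node and pod names. They assume the node pool labels described earlier and a TorchServe image that includes nvidia-smi.
# Confirm that the GPU node pool advertises GPUs to Kubernetes.
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"

# Confirm that the TorchServe pod can see the GPU from inside the container.
kubectl exec -it <torchserve-pod-name> -- nvidia-smi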
Design Model Distribution
You have two options for making the .mar model file available to the inference pods:
- Mount an OCI Object Storage bucket as a persistent volume claim (PVC) by using the Container Storage Interface (CSI) driver.
- Transfer your model to OCI File Storage using Secure Copy Protocol (SCP) and use the filesystem as a mount point for a PVC. For example:
scp -i /path/to/private_key /path/to/local/model.mar opc@<file_storage_instance_ip>:/path/to/destination/mount/point/models/
We recommend OCI Object Storage for simplicity, or OCI File Storage if you have multiple models with frequent updates.
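If you choose OCI File Storage, the PVC that the TorchServe deployment references (fss-voiceengine-models) can be backed by a statically provisioned persistent volume through the FSS CSI driver. The following manifest is a minimal sketch; the file system OCID, mount target IP address, export path, and capacity are placeholders to replace with your own values.
fss-voiceengine-models.yaml (example)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fss-voiceengine-models-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fss.csi.oraclecloud.com
    # Format: <filesystem-ocid>:<mount-target-ip>:<export-path>
    volumeHandle: "<filesystem-ocid>:<mount-target-ip>:/models"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fss-voiceengine-models
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""   # empty string binds to the statically provisioned PV
  resources:
    requests:
      storage: 50Gi
  volumeName: fss-voiceengine-models-pv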
Deploy TorchServe
Validate your method for deploying TorchServe.
The following YAML files provide examples for deployment and configuration.
- Expose TorchServe by using OCI Load Balancing (ingress to OKE to the TorchServe pods).
- Secure traffic with TLS at the load balancer.
- Never expose the TorchServe management port (8081) to the public internet.
Note:
- Modify the container image: statement to point to your actual TorchServe container location and version.
- multispk_20250121_tts is a custom model name used here as an example. You can replace it with your own model's name and .mar file.
torchserve-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve
  labels:
    app: torchserve
spec:
  replicas: 1
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
    spec:
      nodeSelector:
        nodepool: torchserve
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      imagePullSecrets:
        - name: ocir-creds
      containers:
        - name: torchserve
          image: ocir.<your-region>.oci.oraclecloud.com/<tenancy-namespace>/<torchserve_image>:<version>
          ports:
            - containerPort: 8080
            - containerPort: 8081
            - containerPort: 8082
            - containerPort: 7070
            - containerPort: 7071
          env:
            - name: GPUS__DEVICES
              value: "0"
            - name: METRICS_MODE
              value: "logical" # or "endpoint"
            - name: ENABLE_METRICS_API
              value: "true"
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models-volume
              mountPath: /home/model-server/model-store
      volumes:
        - name: models-volume
          persistentVolumeClaim:
            claimName: fss-voiceengine-models
configmap-torchserve.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-properties
  namespace: default # Adjust the namespace as necessary
data:
  config.properties: |
    inference_address=http://0.0.0.0:8080
    management_address=http://0.0.0.0:8081
    metrics_address=http://0.0.0.0:8082
    number_of_netty_threads=32
    job_queue_size=1000
    model_store=/home/model-server/model-store
    workflow_store=/home/model-server/wf-store
    install_py_dep_per_model=true
    max_request_size=196605000
    max_response_size=196605000
    default_workers_per_model=1
    metrics_mode=logical
    load_models=/home/model-server/model-store/multispk_20250121_tts.mar
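To expose the inference endpoint through OCI Load Balancing with TLS, as described above, you can front the deployment with a Service of type LoadBalancer that publishes only the inference port. The following manifest is a minimal sketch; the TLS secret name and the OCI load balancer annotations shown are assumptions to adapt to your environment (you can also terminate TLS at an ingress controller instead).
torchserve-service.yaml (example)
apiVersion: v1
kind: Service
metadata:
  name: torchserve-inference
  annotations:
    # Terminate TLS at the OCI load balancer; the secret must exist in the same namespace.
    service.beta.kubernetes.io/oci-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/oci-load-balancer-tls-secret: torchserve-tls
spec:
  type: LoadBalancer
  selector:
    app: torchserve
  ports:
    - name: https
      port: 443
      targetPort: 8080   # inference port only; do not expose 8081 (management)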
Set Up Autoscaling
Set up autoscaling at the pod and worker level.
Set up autoscaling at the pod level by using Kubernetes Event-driven Autoscaler (KEDA) with Prometheus metrics to scale TorchServe pods based on request queue depth, custom metrics, or CPU/GPU utilization.
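For example, the following ScaledObject is a minimal sketch that scales the TorchServe deployment on a Prometheus query; the Prometheus server address, the TorchServe queue-latency metric used in the query, the threshold, and the replica bounds are assumptions to replace with your own values.
keda-scaledobject.yaml (example)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: torchserve-scaler
spec:
  scaleTargetRef:
    name: torchserve            # the Deployment defined earlier
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        query: sum(ts_queue_latency_microseconds)   # example signal; use your preferred metric
        threshold: "500000"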
Set up worker-level autoscaling (TorchServe).
Note:
In the following examples, we are using TorchServe Model Lifecycle with multispk_20250121_tts.
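The TorchServe management API lets you change the number of workers for a registered model at run time, which is the mechanism that worker-level scaling relies on. The following commands are a minimal sketch; they assume that the management port (8081) is reachable only from inside the cluster, and the worker counts shown are placeholders.
# Check the current worker count for the model.
curl http://<torchserve-service>:8081/models/multispk_20250121_tts

# Scale the workers for the model; synchronous=true waits for the change to complete.
curl -X PUT "http://<torchserve-service>:8081/models/multispk_20250121_tts?min_worker=2&max_worker=4&synchronous=true"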
Considerations for Security and Cost Optimization
When validating your design, consider security and cost optimization factors:
- Security considerations:
- Enforce TLS termination at the load balancer or ingress.
- Keep the TorchServe management API internal-only.
- Use OCI Identity and Access Management and Network Security Groups to limit access.
- Cost optimization considerations:
- Choose your GPU type by balancing service level agreement (SLA) requirements against cost.
- Use scheduled scaling (scale down GPU node pool during non-peak hours).
- Use OCI Object Storage over OCI File Storage if your models are infrequently updated.
- Inference doesn’t always run 24×7. Share unused GPUs with training workloads during idle periods to maximize utilization and reduce costs.