Validate an Architecture

If you want to deploy an architecture similar to that depicted in this solution, you should validate it according to the criteria in the following topics. Consider GPU selection, cluster setup, distribution and deployment, autoscaling, performance testing, security, and cost.

Select GPUs

You need to deploy a GPU-powered inference service that converts text segments into speech while meeting performance, scalability, and cost-efficiency requirements. Evaluate GPU models (NVIDIA A10, A100, and L40S) to determine the best fit for the use case.

You can perform benchmarking tests with different GPU types. The following table provides a template: replace X and Y with your measured values and use that data to weigh performance against cost. A sample benchmarking sketch follows the recommendations below.

GPU Type | Avg Latency (per 250 segments) | Throughput (segments/sec) | Notes
A10      | X ms                           | Y                         | Lower cost, suitable for dev/test
A100     | X ms                           | Y                         | High performance, higher cost
L40S     | X ms                           | Y                         | Good balance between price and performance
  • Recommendation: Select a GPU based on workload size, SLA latency requirements, and budget.
  • Optimization: Run with FP16/mixed precision (where supported) for reduced latency.
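
As a starting point, a simple harness such as the following sketch can produce the latency and throughput values for the table. It assumes a TorchServe inference endpoint on port 8080, the example model name used later in this document, and a sample.json payload; adjust all three to match your environment.

    #!/bin/bash
    # Hypothetical benchmarking sketch: send N sequential requests to a TorchServe
    # inference endpoint and report average latency and throughput.
    ENDPOINT="http://<torchserve_host>:8080/predictions/multispk_20250121_tts"
    N=250
    start=$(date +%s.%N)
    total=0
    for i in $(seq 1 "$N"); do
      t=$(curl -s -o /dev/null -w "%{time_total}" -X POST "$ENDPOINT" \
            -H "Content-Type: application/json" --data-binary @sample.json)
      total=$(echo "$total + $t" | bc -l)
    done
    end=$(date +%s.%N)
    elapsed=$(echo "$end - $start" | bc -l)
    echo "Average latency per segment (s): $(echo "$total / $N" | bc -l)"
    echo "Throughput (segments/sec): $(echo "$N / $elapsed" | bc -l)"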

Design OKE Clusters

Design your OKE clusters to suit your deployment.

In this example, we use a cluster setup with two node pools:

  • NodePool 1 (CPU Nodes): Runs UI, workers, RabbitMQ, and internal DB
  • NodePool 2 (GPU Nodes): Runs TorchServe inference pods

Ensure sufficient block volume (BV) storage on the GPU nodes.

Label the node pools:

  • nodepool=cpu on CPU nodes
  • nodepool=gpu and nvidia.com/gpu.present=true on GPU nodes
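
If the labels were not applied when the node pools were created, you can set them with kubectl. The node names below are placeholders. For example:

    kubectl label node <cpu-node-name> nodepool=cpu
    kubectl label node <gpu-node-name> nodepool=gpu nvidia.com/gpu.present=true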

Ensure the OKE NVIDIA device plugin add-on is enabled on GPU nodes and that nvidia-smi works.

Validate GPUs

To ensure that GPUs are available to both the OKE node and TorchServe pods, run the following validation steps.

If both checks succeed, the NVIDIA device plugin add-on is enabled and no additional driver or software installation is needed.
  1. Verify GPU visibility on the worker node. For example:
    [opc@oke-cvxkfx3tnkq-nkteldpxwna-s3xcmkmmoda-0 ~]$ nvidia-smi -L
    GPU 0: NVIDIA A10 (UUID: GPU-eae0552a-e1d7-7c0f-dc39-886f4eafb439)
  2. Verify GPU access inside the TorchServe pod. For example:
    [opc@glvoicepoc-bastion guru]$ kubectl exec -it torchserve-7859b89965-rtqw9 -- /bin/bash
    model-server@torchserve-7859b89965-rtqw9:~$ nvidia-smi -L
    GPU 0: NVIDIA A10 (UUID: GPU-d6d1852e-6d04-59b9-10de-f68422638fb3)

Design Model Distribution

You have two options for making the .mar model available to inference pods.

  • Use an OCI Object Storage bucket as a persistent volume claim (PVC) by using the Container Storage Interface (CSI) driver.
  • Transfer your model to OCI File Storage using Secure Copy Protocol (SCP) and use the filesystem as a mount point for a PVC. For example:
    scp -i /path/to/private_key /path/to/local/model.mar \
        opc@<file_storage_instance_ip>:/path/to/destination/mount/point/models/

We recommend OCI Object Storage for simplicity, or OCI File Storage if you have multiple models with frequent updates.
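
If you choose OCI File Storage, a static PersistentVolume and PersistentVolumeClaim similar to the following sketch can back the fss-voiceengine-models claim referenced later in the TorchServe deployment. The file system OCID, mount target IP address, export path, and size are placeholders; the driver name and volume handle format follow the OCI File Storage CSI documentation, so verify them against your OKE version.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fss-voiceengine-models-pv
spec:
  capacity:
    storage: 50Gi                  # required field; FSS capacity is not enforced
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fss.csi.oraclecloud.com
    volumeHandle: <filesystem_ocid>:<mount_target_ip>:/<export_path>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fss-voiceengine-models
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: fss-voiceengine-models-pv
  resources:
    requests:
      storage: 50Gi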

Deploy TorchServe

Validate your method for deploying TorchServe.

The following YAML files provide examples for deployment and configuration.

  • Expose the service by using OCI Load Balancing (ingress to OKE to the TorchServe pods); see the example Service below.
  • Secure with TLS at the load balancer.
  • Never expose the TorchServe management port (8081) to the public internet.
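
For example, the following Service sketch exposes only the inference port through an OCI load balancer with TLS terminated at the listener. The TLS secret name is a placeholder, and the annotation names should be verified against the current OKE load balancer documentation.

apiVersion: v1
kind: Service
metadata:
  name: torchserve-lb
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/oci-load-balancer-tls-secret: torchserve-tls
spec:
  type: LoadBalancer
  selector:
    app: torchserve
  ports:
    - name: inference
      port: 443
      targetPort: 8080             # management port 8081 is deliberately not exposed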

Note:

  • Modify the image: statement in the container specification to point to your actual TorchServe container location and version.
  • multispk_20250121_tts is a custom model name used here as an example. You can replace it with your own model’s name and .mar file.

torchserve-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve
  labels:
    app: torchserve
spec:
  replicas: 1
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
    spec:
      nodeSelector:
        nodepool: gpu              # matches the GPU node pool label
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      imagePullSecrets:
        - name: ocir-creds
      containers:
        - name: torchserve
          image: ocir.<your-region>.oci.oraclecloud.com/<tenancy-namespace>/<torchserve_image>:<version>
          ports:
            - containerPort: 8080  # REST inference API
            - containerPort: 8081  # REST management API (keep internal)
            - containerPort: 8082  # metrics API
            - containerPort: 7070  # gRPC inference API
            - containerPort: 7071  # gRPC management API
          env:
            - name: GPUS__DEVICES
              value: "0"
            - name: METRICS_MODE
              value: "logical"  # or "endpoint"
            - name: ENABLE_METRICS_API
              value: "true"  
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models-volume
              mountPath: /home/model-server/model-store
      volumes:
        - name: models-volume
          persistentVolumeClaim:
            claimName: fss-voiceengine-models

configmap-torchserve.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-properties
  namespace: default # Adjust the namespace as necessary
data:
  config.properties: |
    inference_address=http://0.0.0.0:8080
    management_address=http://0.0.0.0:8081
    metrics_address=http://0.0.0.0:8082
    number_of_netty_threads=32
    job_queue_size=1000
    model_store=/home/model-server/model-store
    workflow_store=/home/model-server/wf-store
    install_py_dep_per_model=true
    max_request_size=196605000
    max_response_size=196605000
    default_workers_per_model=1
    metrics_mode=logical
    load_models=/home/model-server/model-store/multispk_20250121_tts.mar
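
After you replace the placeholders, apply the manifests and confirm that the TorchServe pod is scheduled onto a GPU node. For example:

    kubectl apply -f configmap-torchserve.yaml -f torchserve-deployment.yaml
    kubectl get pods -l app=torchserve -o wide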

Set Up Autoscaling

Set up autoscaling at the pod and worker level.

Set up autoscaling at the pod level by using Kubernetes Event-driven Autoscaler (KEDA) with Prometheus metrics to scale TorchServe pods based on request queue depth, custom metrics, or CPU/GPU utilization.
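
For example, a KEDA ScaledObject similar to the following sketch scales the TorchServe deployment on a Prometheus query. The Prometheus server address, the TorchServe metric used in the query, and the threshold are assumptions; adapt them to your monitoring setup.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: torchserve-scaler
spec:
  scaleTargetRef:
    name: torchserve               # the Deployment defined earlier
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
        query: sum(rate(ts_inference_requests_total[1m]))   # illustrative metric
        threshold: "10"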

Set up worker-level autoscaling (TorchServe).

Note:

In the following examples, we use the TorchServe model lifecycle (management) API with the multispk_20250121_tts model.
  1. Register the model. For example, run the following command from any remote client that can reach the TorchServe endpoint over HTTP to register your .mar file with one initial worker.
    curl -X POST "http://10.0.20.160:8081/models?model_name=multispk_20250121_tts&url=multispk_20250121_tts.mar&initial_workers=1"
    
    Configure initial_workers according to the model's expected demand.
    The IP address (10.0.20.160) and port (8081) refer to the TorchServe management API endpoint.
  2. Unregister a model. For example:
    curl -X DELETE "http://10.0.20.160:8081/models/multispk_20250121_tts"
  3. Scale/add workers. For example, to update the number of workers for the running model:
    curl -X PUT "http://10.0.20.160:8081/models/multispk_20250121_tts?min_worker=12"
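  4. Verify the model and its workers. For example, the following query uses the TorchServe describe-model endpoint:
    curl "http://10.0.20.160:8081/models/multispk_20250121_tts"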

Test Performance

Validate your design with performance testing.

  1. From the client application, send 250 concurrent segments (see the example harness after these steps). Capture:
    • p50/p95 latency
    • Throughput (segments/sec)
    • GPU utilization (nvidia-smi)
    • End-to-end job completion time
  2. From the pod, run:
    kubectl exec -it <pod_name> -- nvidia-smi
  3. From the TorchServe metrics API (port 8082), run:
    curl http://10.0.20.160:8082/metrics
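
For step 1, a lightweight harness such as the following sketch can generate the concurrent load and derive p50/p95 latency from the per-request timings. The endpoint, payload, and concurrency level are placeholders.

    #!/bin/bash
    # Hypothetical load-test sketch: send 250 requests with up to 50 in flight,
    # then compute p50/p95 latency from the recorded timings.
    ENDPOINT="http://<torchserve_host>:8080/predictions/multispk_20250121_tts"
    seq 1 250 | xargs -P 50 -I{} \
      curl -s -o /dev/null -w "%{time_total}\n" -X POST "$ENDPOINT" \
           -H "Content-Type: application/json" --data-binary @sample.json \
      > latencies.txt
    sort -n latencies.txt | awk '{a[NR]=$1}
      END {print "p50 (s):", a[int(NR*0.50)]; print "p95 (s):", a[int(NR*0.95)]}'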

Considerations for Security and Cost Optimization

When validating your design, consider security and cost optimization factors:

  • Security considerations:
    • Enforce TLS termination at the load balancer or ingress.
    • Keep the TorchServe management API internal-only.
    • Use OCI Identity and Access Management and Network Security Groups to limit access.
  • Cost optimization considerations:
    • Choose your GPU type by balancing service level agreement (SLA) requirements against cost.
    • Use scheduled scaling (scale down GPU node pool during non-peak hours).
    • Use OCI Object Storage over OCI File Storage if your models are infrequently updated.
    • Inference doesn’t always run 24×7. Share unused GPUs with training workloads during idle periods to maximize utilization and reduce costs.