Build a Scalable MLOps Pipeline on OCI Using OCI-Native Services and MLflow

This reference architecture describes how to implement a scalable and automated MLOps pipeline on Oracle Cloud Infrastructure (OCI).

The architecture helps organizations operationalize machine learning models quickly and consistently, with governance, reproducibility, automated deployment, model lifecycle management, and observability.

The solution integrates OCI DevOps, Oracle Cloud Infrastructure Data Science, and Oracle Cloud Infrastructure Kubernetes Engine (OKE) to automate the machine learning lifecycle end-to-end. Training workloads are containerized and run as Oracle Cloud Infrastructure Data Science jobs triggered by DevOps pipelines, while MLflow deployed on OKE provides experiment tracking and model registry capabilities, with artifacts stored in OCI Object Storage. After training, OCI DevOps automatically deploys the latest approved model from the MLflow Model Registry to OKE, and access to both MLflow and inference services is provided through OCI Load Balancer.

Before You Begin

Before deploying this solution, ensure that the following prerequisites are met:

  • An active Oracle Cloud Infrastructure tenancy with sufficient service limits.
  • Configured compartments for environment isolation, such as development, test, and production.
  • OCI Identity and Access Management policies for:
    • OCI DevOps
    • Oracle Cloud Infrastructure Data Science
    • OKE
    • OCI Object Storage
    • OCI Vault
    • OCI Notifications
    • OCI Load Balancer
  • A configured virtual cloud network (VCN) with:
    • Private subnets for OKE and Oracle Cloud Infrastructure Data Science.
    • A public subnet for OCI Load Balancer if external access is required.
    • OCI Service Gateway.
    • NAT gateway.
  • An OKE cluster provisioned for:
    • MLflow as an MLOps service.
    • Inference workloads.
  • MLflow deployed on OKE, configured with:
    • OCI Object Storage as the artifact store.
    • The Model Registry enabled.
  • An OCI DevOps project with:
    • Source repositories.
    • Build and deployment pipelines.
  • OCI Notifications topics and subscriptions configured.
  • Familiarity with Docker, Kubernetes, and machine learning workflows.
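
As a rough illustration of the IAM policy prerequisite above, the following Python sketch assembles policy statements in the general form "Allow <subject> to <verb> <resource-type> in compartment <name>". The group, dynamic-group, and compartment names are hypothetical, and the pairing of verbs with resource types is an assumption, not a vetted least-privilege policy set. Verify each resource type and verb against the OCI IAM policy reference before use.

```python
# Sketch only: all group, dynamic-group, and compartment names are hypothetical.
VERBS = ("inspect", "read", "use", "manage")  # OCI policy verbs, least to most privileged


def policy_statement(subject_type, subject, verb, resource_type, compartment):
    """Build one OCI IAM policy statement as a plain string."""
    if subject_type not in ("group", "dynamic-group"):
        raise ValueError(f"unsupported subject type: {subject_type}")
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return (
        f"Allow {subject_type} {subject} to {verb} {resource_type} "
        f"in compartment {compartment}"
    )


# Assumed mapping of pipeline actors to resource types (illustrative only).
PIPELINE_GRANTS = [
    ("dynamic-group", "mlops-devops-dg", "manage", "data-science-family"),
    ("dynamic-group", "mlops-devops-dg", "read", "secret-family"),
    ("dynamic-group", "mlops-jobs-dg", "manage", "object-family"),
    ("group", "mlops-admins", "manage", "devops-family"),
    ("group", "mlops-admins", "manage", "cluster-family"),
]


def statements_for(compartment):
    """Return the full list of statements for one compartment."""
    return [
        policy_statement(subject_type, subject, verb, resource_type, compartment)
        for subject_type, subject, verb, resource_type in PIPELINE_GRANTS
    ]


if __name__ == "__main__":
    for line in statements_for("mlops-dev"):
        print(line)
```

Generating the statements per compartment keeps the development, test, and production policies structurally identical, so differences between environments are limited to the compartment name.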

Architecture

This architecture implements an automated MLOps pipeline where Oracle Cloud Infrastructure DevOps builds training containers and triggers Oracle Cloud Infrastructure Data Science jobs for model training.

The training jobs pull container images from OCI Container Registry (OCIR) and access datasets in Oracle Cloud Infrastructure Object Storage through Oracle Cloud Infrastructure Service Gateway. While the jobs run, they log training metrics and artifacts to MLflow running on OCI Kubernetes Engine, and the artifacts are persisted in OCI Object Storage for durability and scalability.

After training completes, the model is registered in the MLflow Model Registry and promoted through defined stages. OCI DevOps automatically triggers a deployment pipeline that retrieves the latest approved model version and deploys it to OCI Kubernetes Engine as an inference service. Both the MLflow service (MLOps control plane) and inference endpoints are exposed through Oracle Cloud Infrastructure Load Balancer, providing a unified and scalable access layer. Throughout the pipeline, Oracle Cloud Infrastructure Notifications delivers real-time updates for build, training, and deployment stages. The solution runs within a secure VCN, using private networking, Oracle Cloud Infrastructure Vault for secrets management, and Oracle Cloud Infrastructure Logging and Oracle Cloud Infrastructure Monitoring for observability.
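
The step in which the deployment pipeline retrieves the latest approved model version can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the ModelVersion record and the latest_approved function are hypothetical, and treating the "Production" stage as the approval marker is an assumption. In a real pipeline the records would come from the MLflow Model Registry rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical stand-in for a model version record from the registry."""
    name: str
    version: int
    stage: str  # for example "None", "Staging", "Production", "Archived"


def latest_approved(
    versions: Iterable[ModelVersion],
    approved_stage: str = "Production",  # assumption: this stage means "approved"
) -> Optional[ModelVersion]:
    """Return the highest-numbered version in the approved stage, or None."""
    approved = [v for v in versions if v.stage == approved_stage]
    return max(approved, key=lambda v: v.version, default=None)


if __name__ == "__main__":
    registry = [
        ModelVersion("churn-model", 1, "Production"),
        ModelVersion("churn-model", 2, "Staging"),
        ModelVersion("churn-model", 3, "Production"),
        ModelVersion("churn-model", 4, "Archived"),
    ]
    print(latest_approved(registry))
```

Returning None when no version is approved lets the deployment stage stop cleanly instead of deploying an unreviewed model.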

The following diagram illustrates this reference architecture.



[Diagram: MLOps pipeline reference architecture on OCI. Source file: auto-mlops-pipeline-ocidevops-arch-oracle.zip]

This architecture has the following components:

  • Infrastructure
    • Availability domain

      Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain shouldn't affect the other availability domains in the region.

    • Compartment

      Compartments are cross-regional logical partitions within an OCI tenancy. Use compartments to organize, control access, and set usage quotas for your Oracle Cloud resources. In a given compartment, you define policies that control access and set privileges for resources.

    • Internet gateway

      An internet gateway allows traffic between the public subnets in a VCN and the public internet.

    • OCI region

      An OCI region is a localized geographic area that contains one or more data centers, hosting availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

    • Security list

      For each subnet, you can create security rules that specify the source, destination, and type of traffic that is allowed in and out of the subnet.

    • Service gateway

      A service gateway provides access from a VCN to other services, such as Oracle Cloud Infrastructure Object Storage. The traffic from the VCN to the Oracle service travels over the Oracle network fabric and does not traverse the internet.

    • Tenancy

      A tenancy is a secure and isolated partition that Oracle sets up within Oracle Cloud when you sign up for OCI. You can create, organize, and administer your resources on OCI within your tenancy. A tenancy is synonymous with a company or organization. Usually, a company will have a single tenancy and reflect its organizational structure within that tenancy. A single tenancy is usually associated with a single subscription, and a single subscription usually only has one tenancy.

    • OCI virtual cloud network and subnet

      A virtual cloud network (VCN) is a customizable, software-defined network that you set up in an OCI region. Like traditional data center networks, VCNs give you control over your network environment. A VCN can have multiple non-overlapping classless inter-domain routing (CIDR) blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Oracle Services Network (OSN)
    • OCI Logging
      Oracle Cloud Infrastructure Logging is a highly scalable and fully managed service that provides access to the following types of logs from your resources in the cloud:
      • Audit logs: Logs related to events produced by OCI Audit.
      • Service logs: Logs published by individual services such as OCI API Gateway, OCI Events, OCI Functions, OCI Load Balancer, OCI Object Storage, and VCN flow logs.
      • Custom logs: Logs that contain diagnostic information from custom applications, other cloud providers, or an on-premises environment.
    • OCI Monitoring

      Oracle Cloud Infrastructure Monitoring actively and passively monitors your cloud resources, and uses alarms to notify you when metrics meet specified triggers.

    • OCI Notifications

      OCI Notifications broadcasts messages to distributed components by using a low latency publish-subscribe pattern, delivering secure, highly reliable, durable messages for applications hosted on OCI.

    • Oracle Services Network

      The Oracle Services Network (OSN) is a conceptual network on OCI that is reserved for Oracle services. These services have public IP addresses that you can reach over the internet. Hosts outside Oracle Cloud can access the OSN privately by using Oracle Cloud Infrastructure FastConnect or VPN Connect. Hosts in your VCNs can access the OSN privately through a service gateway.

    • OCI Vault

      Oracle Cloud Infrastructure Vault enables you to create and centrally manage the encryption keys that protect your data and the secret credentials that you use to secure access to your resources in the cloud. The default key management is Oracle-managed keys. You can also use customer-managed keys which use OCI Vault. OCI Vault offers a rich set of REST APIs to manage vaults and keys.

    • OCI Web Application Firewall

      Oracle Cloud Infrastructure Web Application Firewall (WAF) is a payment card industry (PCI) compliant, regional-based and edge enforcement service that is attached to an enforcement point, such as a load balancer or a web application domain name. WAF protects applications from malicious and unwanted internet traffic. WAF can protect any internet-facing endpoint, providing consistent rule enforcement across your applications.

  • Services and Products
    • OCI Data Science

      Oracle Cloud Infrastructure Data Science is a fully managed, serverless platform that data science teams can use to build, train, and manage machine learning (ML) models on OCI. It integrates with other OCI services such as Oracle Autonomous AI Lakehouse and Oracle Cloud Infrastructure Object Storage. You can build and evaluate high-quality machine learning models that increase business flexibility by putting enterprise-trusted data to work quickly, and you can support data-driven business objectives with easier deployment of ML models.

      The Data Science Jobs feature enables data scientists to define and run repeatable machine learning tasks on fully managed infrastructure.

      The Data Science Model Deployment feature allows data scientists to deploy trained models as fully managed HTTP endpoints that provide predictions in real time, so that applications and business processes can react to relevant events as they occur.

    • OCI DevOps

      Oracle Cloud Infrastructure DevOps (developer operations) is a complete continuous integration/continuous delivery (CI/CD) platform for developers to simplify and automate their software development lifecycle. OCI DevOps enables developers and operators to collaboratively develop, build, test, and deploy software. Developers and operators get visibility across the full development lifecycle with a history of source commit through build, test, and deploy phases.

    • OCI Identity and Access Management

      Oracle Cloud Infrastructure Identity and Access Management (IAM) provides user access control for OCI and Oracle Cloud Applications. The IAM API and the user interface enable you to manage identity domains and the resources within them. Each OCI IAM identity domain represents a standalone identity and access management solution or a different user population.

    • Kubernetes cluster

      A Kubernetes cluster is a set of machines that run containerized applications. Kubernetes provides a portable, extensible, open source platform for managing containerized workloads and services in those nodes. A Kubernetes cluster is formed of worker nodes and control plane nodes.

    • Load balancer

      Oracle Cloud Infrastructure Load Balancer provides automated traffic distribution from a single entry point to multiple servers.

    • OCI Object Storage

      OCI Object Storage provides access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store data directly from applications or from within the cloud platform. You can scale storage without experiencing any degradation in performance or service reliability.

      Use standard storage for "hot" data that you need to access quickly and frequently. Use archive storage for "cold" data that you retain for long periods and rarely access.

  • MLflow (on Kubernetes Engine)

    MLflow is an open source platform for managing the ML lifecycle, including experiment tracking and model registry. It can be deployed on Kubernetes for scalability. In this architecture, MLflow runs on Kubernetes Engine, stores artifacts in OCI Object Storage, and maintains the Model Registry as the source of truth for production models. It enables versioning, governance, and controlled promotion of models.

  • OCI Container Registry (OCIR)

    OCI Container Registry is a managed, private Docker registry for storing and managing container images. It integrates with OCI Identity and Access Management for secure access control. In this architecture, it stores versioned training and serving container images. These images are consumed by Data Science jobs and Kubernetes Engine deployments.
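
To make the idea of versioned training and serving images concrete, the sketch below builds OCIR image references in the commonly used "<region-key>.ocir.io/<tenancy-namespace>/<repository>:<tag>" form. The region key, tenancy namespace, repository names, and the tag scheme (an application version plus a short commit hash) are all assumptions chosen for illustration.

```python
import re

# Container image tags: up to 128 characters drawn from letters, digits,
# underscore, period, and hyphen, not starting with a period or hyphen.
_TAG_PATTERN = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_reference(region_key, namespace, repository, tag):
    """Build a fully qualified OCIR image reference string."""
    if repository != repository.lower():
        raise ValueError("repository names must be lowercase")
    if not _TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"invalid image tag: {tag!r}")
    return f"{region_key}.ocir.io/{namespace}/{repository}:{tag}"


def pipeline_images(region_key, namespace, version, commit):
    """Return matching training and serving references for one build."""
    tag = f"{version}-{commit[:7]}"  # hypothetical tag scheme: version + short commit
    return {
        "training": image_reference(region_key, namespace, "mlops/train", tag),
        "serving": image_reference(region_key, namespace, "mlops/serve", tag),
    }


if __name__ == "__main__":
    # "iad" and "mytenancy" are placeholder values.
    for role, ref in pipeline_images("iad", "mytenancy", "1.4.0", "a1b2c3d4e5").items():
        print(role, ref)
```

Tagging the training and serving images with the same build identifier makes it easier to trace a deployed inference service back to the training run that produced its model.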

Recommendations

These recommendations help improve the security, scalability, and maintainability of the MLOps pipeline.
  • VCN
    • When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN.
    • Use CIDR blocks that are within the standard private IP address space, and select CIDR blocks that do not overlap with any other network in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider to which you intend to set up private connections.
    • After you create a VCN, you can change, add, and remove its CIDR blocks.
    • When you design the subnets, consider your traffic flow and security requirements. Attach all resources within a specific tier or role to the same subnet, which can serve as a security boundary.
    • Use regional subnets.
  • Security
    • Use Oracle Cloud Guard to monitor and maintain the security of your resources in Oracle Cloud Infrastructure proactively. Cloud Guard uses detector recipes that you can define to examine your resources for security weaknesses and to monitor operators and users for risky activities. When any misconfiguration or insecure activity is detected, Cloud Guard recommends corrective actions and assists with taking those actions, based on responder recipes that you can define.
    • For resources that require maximum security, Oracle recommends that you use security zones. A security zone is a compartment associated with an Oracle-defined recipe of security policies that are based on best practices. For example, the resources in a security zone must not be accessible from the public internet and they must be encrypted using customer-managed keys. When you create and update resources in a security zone, Oracle Cloud Infrastructure validates the operations against the policies in the security-zone recipe and denies operations that violate any of the policies.
  • Cloud Guard
    • Clone and customize the default recipes provided by Oracle to create custom detector and responder recipes. These recipes enable you to specify what type of security violations generate a warning and what actions are allowed to be performed on them. For example, you might want to detect Object Storage buckets that have visibility set to public.
    • Apply Oracle Cloud Guard at the tenancy level to cover the broadest scope and to reduce the administrative burden of maintaining multiple configurations.
    • You can also use the Managed List feature to apply certain configurations to detectors.
  • Network security groups (NSGs)
    • You can use NSGs to define a set of ingress and egress rules that apply to specific VNICs. Use NSGs rather than security lists, because NSGs enable you to separate the VCN subnet architecture from the security requirements of your application.
  • OKE
    • Deploy MLflow and inference workloads in separate namespaces. Enable autoscaling and use multiple node pools for workload isolation. Use ingress controllers or load balancers to securely expose inference services.
  • OCI Object Storage
    • Use OCI Object Storage for datasets, trained models, and MLflow artifacts. Enable versioning and lifecycle policies to optimize storage and maintain model lineage.
  • Oracle Cloud Infrastructure Data Science
    • Use containerized jobs for training to ensure reproducibility. Avoid manual notebook-based workflows in production. Integrate MLflow for experiment tracking.
  • OCI Load Balancer
    • Use a load balancer to expose the MLflow UI/API and inference endpoints. Configure listeners and back-end sets for different services. Use HTTPS for secure access and integrate with DNS if needed.
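
The VCN recommendations above (private address space, no overlap with other networks, and one subnet per tier) can be checked before provisioning with Python's standard ipaddress module. This is a sketch under stated assumptions: the 10.0.0.0/16 VCN range, the /24 subnet size, the subnet names, and the example on-premises ranges are hypothetical.

```python
import ipaddress


def plan_subnets(vcn_cidr, new_prefix, names):
    """Assign one equally sized block of the VCN range to each subnet name."""
    vcn = ipaddress.ip_network(vcn_cidr)
    if not vcn.is_private:
        raise ValueError(f"{vcn_cidr} is not in a private address range")
    # subnets() yields non-overlapping blocks in ascending order.
    plan = dict(zip(names, vcn.subnets(new_prefix=new_prefix)))
    if len(plan) < len(names):
        raise ValueError("could not assign a distinct block to every name")
    return plan


def overlapping_networks(vcn_cidr, other_cidrs):
    """Return the entries of other_cidrs that overlap the VCN range."""
    vcn = ipaddress.ip_network(vcn_cidr)
    return [c for c in other_cidrs if vcn.overlaps(ipaddress.ip_network(c))]


if __name__ == "__main__":
    plan = plan_subnets(
        "10.0.0.0/16", 24,
        ["oke-private", "data-science-private", "lb-public"],
    )
    for name, block in plan.items():
        print(f"{name}: {block}")

    # Hypothetical on-premises ranges to check before setting up private connectivity.
    print(overlapping_networks("10.0.0.0/16", ["10.0.128.0/20", "192.168.0.0/24"]))
```

Running the overlap check against every network you plan to connect privately catches conflicts before they surface as routing problems.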

Considerations

Consider the following performance, security, availability, and cost factors when deploying this reference architecture.

  • Performance: Use autoscaling in OKE for inference workloads and optimize Data Science job shapes for training. Ensure that MLflow scales appropriately with Object Storage-backed artifacts and that the load balancer is properly sized to handle traffic.
  • Security: Apply least-privilege OCI Identity and Access Management policies and use OCI Vault for secret management. Restrict access to MLflow, OCI Object Storage, and inference endpoints.
  • Availability: Deploy across availability domains and fault domains. Use OKE high availability features and ensure that MLflow services are resilient.
  • Cost: Use autoscaling to optimize compute usage. Apply lifecycle policies in OCI Object Storage and right-size OKE node pools. Shut down unused resources.
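
For the autoscaling points above, it can help to estimate how many inference replicas a given load would produce. The sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to a replica range. It is a simplified model that ignores the autoscaler's tolerance and stabilization behavior, and the default bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Estimate the replica count an HPA-style rule would request.

    current_metric and target_metric use the same unit, for example
    average CPU utilization in percent across the inference pods.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured replica range.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90, 60))
    # Light load: scale in, but never below the minimum.
    print(desired_replicas(4, 20, 60, min_replicas=2))
```

Comparing the estimated replica count with node pool capacity gives a quick check on whether the node pools are right-sized for expected peak traffic.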

Explore More

To learn more about Data Science, OCI DevOps, OKE, and the related services used in this reference architecture, see the Oracle documentation for each service.

Acknowledgments

  • Author: Prasanth Prasad
  • Contributors: Thangaraj, Karol Stuart