YellowDog: Workload Management Platform Deployment on Oracle Cloud Infrastructure

Since 2015, Bristol-based YellowDog has helped companies achieve massive scale while running their high-performance computing workloads in Oracle Cloud Infrastructure. By using a combination of virtual machines and bare metal servers, MicroK8s clusters, and intelligent provisioning of compute shapes, YellowDog can deploy as many as 198,000 compute cores in just minutes.

Architecture

YellowDog's multicloud workload management platform runs its infrastructure on Oracle Cloud Infrastructure (OCI). The architecture uses VMs inside a MicroK8s Kubernetes cluster node. Each cluster node contains three virtual machines (VMs) in a single subnet spread across multiple availability domains.

The following diagram illustrates this reference architecture.

Illustration: yellowdog-architecture-oci.png

YellowDog has a number of key services, including database, event streaming, observability, and management services, with replica sets deployed across the worker nodes of the MicroK8s cluster. The NGINX ingress gateway manages all incoming traffic, using the Oracle Cloud Infrastructure Domain Name System (DNS) round-robin method. Long-running requests are placed on a messaging queue, which distributes the load further across the cluster and provides intrinsic load balancing.
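The two load-distribution ideas described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not YellowDog's implementation: the ingress addresses, the function names `round_robin` and `run_workers`, and the use of threads in place of cluster worker nodes are all hypothetical.

```python
import itertools
import queue
import threading

# Hypothetical ingress node addresses (assumption, not from the source).
INGRESS_NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]


def round_robin(addresses):
    """Return an iterator that hands out addresses in rotation,
    the way successive round-robin DNS answers would."""
    return itertools.cycle(addresses)


def run_workers(jobs, worker_count=3):
    """Drain a shared queue with several workers.

    Each worker pulls its next job only when it is free, so the load
    spreads across workers without a dedicated load balancer.
    """
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    # Each worker records the jobs it handled in its own list.
    handled = {f"worker-{i}": [] for i in range(worker_count)}

    def worker(name):
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            handled[name].append(job)  # stand-in for real processing

    threads = [threading.Thread(target=worker, args=(name,)) for name in handled]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

Because workers pull from the queue rather than having work pushed to them, a worker that is busy with a long-running request simply takes fewer jobs, which is the sense in which the balancing is "intrinsic".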

YellowDog also has clustered compute environments. One of them is provisioned using Oracle Cloud Infrastructure and another is configured with an on-premises environment.

Illustration: yellowdog-architecture-context-ap.png

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Compute

    The Oracle Cloud Infrastructure Compute service enables you to provision and manage compute hosts in the cloud. You can launch compute instances with shapes that meet your resource requirements for CPU, memory, network bandwidth, and storage. After creating a compute instance, you can access it securely, restart it, attach and detach volumes, and terminate it when you no longer need it.

  • File storage

    The Oracle Cloud Infrastructure File Storage service provides a durable, scalable, secure, enterprise-grade network file system. You can connect to a File Storage service file system from any bare metal, virtual machine, or container instance in a VCN. You can also access a file system from outside the VCN by using Oracle Cloud Infrastructure FastConnect and IPSec VPN.

  • Internet gateway

    The internet gateway allows traffic between the public subnets in a VCN and the public internet.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly, immediately, and frequently. Use archive storage for "cold" storage that you retain for long periods of time and seldom or rarely access.

  • DNS

    The Oracle Cloud Infrastructure Domain Name System (DNS) service is a highly scalable, global anycast DNS network that offers enhanced DNS performance, resiliency, and scalability, so that end users connect to customers’ applications as quickly as possible, from wherever they are.

  • VM DB System

    Oracle VM Database System is an Oracle Cloud Infrastructure (OCI) database service that enables you to build, scale, and manage full-featured Oracle databases on virtual machines. A VM database system uses OCI Block Volumes storage instead of local storage and can run Oracle Real Application Clusters (Oracle RAC) to improve availability.
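The subnet rule described under "Virtual cloud network (VCN) and subnets" (each subnet is a contiguous range inside the VCN that does not overlap any other subnet) can be checked with Python's standard `ipaddress` module. The following is a minimal sketch; the function name `validate_subnets` and the example CIDR blocks are hypothetical and not taken from YellowDog's deployment.

```python
import ipaddress


def validate_subnets(vcn_cidr, subnet_cidrs):
    """Check that every subnet lies inside the VCN CIDR block and that
    no two subnets overlap. Returns the parsed subnets on success and
    raises ValueError otherwise."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]

    # Every subnet must be fully contained in the VCN address range.
    for subnet in subnets:
        if not subnet.subnet_of(vcn):
            raise ValueError(f"{subnet} is not inside VCN {vcn}")

    # Compare each pair of subnets once and reject any overlap.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                raise ValueError(f"{first} overlaps {second}")

    return subnets
```

For example, `validate_subnets("10.0.0.0/16", ["10.0.1.0/24", "10.0.2.0/24"])` succeeds, while replacing the second subnet with `10.0.1.128/25` raises a `ValueError` because it falls inside the first subnet.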

Considerations

YellowDog considered the following points when deploying this architecture.

  • Performance

    Oracle Cloud Infrastructure provides YellowDog's clustered compute environment with strong price-performance and scale. Because the provisioned clusters can include as many as 198,000 cores, the cost and speed of provisioning and deprovisioning instances are critical for YellowDog's customers. YellowDog uses a range of provisioning strategies, including spot instances, on-demand instances, instance pools, VMs, and bare metal instances, based on user requirements. It applies a waterfall strategy that follows each customer's ordered preference of compute requirements: YellowDog first tries to satisfy the request from the highest-priority bucket of nodes, and then moves on to the next priority level. In the future, YellowDog is also looking at provisioning GPU shapes for media and entertainment customers.

  • Security

    For security, YellowDog's main concern is data security for different customer requirements. If a customer has a secure access requirement, YellowDog can serve the data to the customer using IPSec VPN. If secure access is not a concern, the data is served over the public internet using an internet gateway.

  • Availability

    YellowDog uses the concept of intrinsic load balancing. With this technique, long-running requests are placed on the messaging queue, which distributes them further across the cluster so that no separate load balancer is needed for that work.

  • Storage

    YellowDog chose Oracle Cloud Infrastructure Object Storage because it provides them with consistent interactions. YellowDog has a high-level service that uses Oracle Cloud Infrastructure Object Storage to push and pull the input and output dependencies that customers define in their workloads. High-performance computing (HPC) workloads, especially highly interconnected tasks, often require high-performance storage so that worker nodes can collaborate. YellowDog uses the Oracle Cloud Infrastructure File Storage service to meet these high-performance storage requirements.
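The waterfall strategy mentioned under Performance can be illustrated with a short Python sketch: compute sources are listed in the customer's order of preference, and a request is filled from the first source before falling through to the next. This is an assumption-laden illustration rather than YellowDog's actual provisioning logic; the `Source` class, the `waterfall_allocate` function, and the capacity figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical compute source with some number of free cores."""
    name: str
    available_cores: int


def waterfall_allocate(sources, cores_needed):
    """Fill a request from sources in preference order.

    Returns (plan, unmet), where plan is a list of (source name, cores)
    pairs and unmet is the number of cores that could not be provisioned.
    """
    plan = []
    remaining = cores_needed
    for source in sources:
        if remaining == 0:
            break
        # Take as much as this source can give, then fall through.
        take = min(source.available_cores, remaining)
        if take > 0:
            plan.append((source.name, take))
            remaining -= take
    return plan, remaining


# Usage: prefer spot capacity, then on-demand, then bare metal.
preferences = [
    Source("spot", 1000),
    Source("on-demand", 500),
    Source("bare-metal", 2000),
]
plan, unmet = waterfall_allocate(preferences, 1800)
print(plan, unmet)
```

The ordering of the list is the whole policy: changing the customer's preference is a matter of reordering the sources, not of changing the allocation code.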
