Enable Oracle Cloud Infrastructure Service Mesh on your Kubernetes applications

Oracle Cloud Infrastructure (OCI) customers have increasingly moved toward a microservices architecture, which brings new challenges along with its benefits. In a microservices architecture, monolithic applications are broken down into smaller microservices that communicate over the network using an API. This causes a surge in network traffic and increases both the complexity and the overall attack surface of the architecture.

Adding a service mesh to the microservices alleviates many of the challenges introduced with a microservices architecture and provides the following benefits:
  • Allows you to control the microservices traffic flow.
  • Provides visibility into your applications.
  • Enables microservices to connect securely without any changes to the application code.

With OCI Service Mesh, you can deploy a managed service mesh architecture to your Oracle Cloud Infrastructure Kubernetes Engine (OCI Kubernetes Engine or OKE) clusters. This reference architecture provides a detailed example of an OCI Service Mesh architecture deployed in an OKE cluster.

OCI Service Mesh Capabilities

Security
  • Enforcement of security-related policies

    OCI Service Mesh uses access policies to define access rules. Access policies govern communication between microservices and allow only validated requests, whether they originate inside or outside the application. Access policies are also used to define permitted communication to external services.

  • Zero Trust

    OCI Service Mesh implements a zero-trust security architecture automatically across all microservices. Data exchanged between microservices is encrypted. Each microservice must identify itself at the beginning of a communication: the two parties exchange credentials that carry their identity information, which allows each service to determine whether the other is authorized to interact with it. This is implemented with mutual TLS, with automated certificate and key rotation that uses the Oracle Cloud Infrastructure Certificates (OCI Certificates) service and Oracle Key Management Cloud Service to manage certificates and keys.
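The mutual TLS exchange described under Zero Trust can be illustrated outside the mesh with Python's standard `ssl` module. This is only a sketch of the general technique, not how the OCI Service Mesh proxies are implemented: the function names and the optional file-path parameters are hypothetical, and in the mesh the certificates are issued and rotated for you.

```python
import ssl
from typing import Optional


def build_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Server side of mutual TLS: the client MUST present a certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED on a server context is what makes the TLS "mutual".
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity (hypothetical paths supplied by caller).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The CA that signed the client certificates we are willing to trust.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def build_client_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Client side: verifies the server and presents its own certificate."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname checks by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

With a service mesh, the sidecar proxies perform this handshake on behalf of the application, which is why the application code does not need to change.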

Traffic Management
  • Traffic Shifting

    OCI Service Mesh supports canary deployments. When you publish a new version of your code to production, you allow only a portion of the traffic to reach it. This lets you deploy quickly while minimizing disruption to your application. You define routing rules that govern all inter-microservice communication inside the mesh; for example, you might route a portion of the traffic to a specific version of a service.
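The idea behind weighted traffic shifting can be sketched in a few lines of Python. This is an illustration of the concept only: the version names, the weights, and the `pick_version` and `simulate` helpers are hypothetical, and in OCI Service Mesh the split is configured through routing rules rather than application code.

```python
import random
from typing import Dict


def pick_version(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a service version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights: Dict[str, int], requests: int, seed: int = 0) -> Dict[str, int]:
    """Count how many of `requests` simulated requests land on each version."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    counts = {version: 0 for version in weights}
    for _ in range(requests):
        counts[pick_version(weights, rng)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical canary split: 90% to the stable version, 10% to the canary.
    print(simulate({"v1": 90, "v2": 10}, requests=10_000))
```

Raising the canary weight in steps (for example 10, 50, then 100) shifts traffic gradually, and setting it back to 0 acts as a rollback.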

Observability
  • Monitoring and Logging

    Because all inter-microservice communication must pass through it, OCI Service Mesh is well positioned to provide telemetry. The service mesh can capture data such as source, destination, protocol, URL, duration, status code, and latency, along with logs and other detailed statistics. You can export logging information to the Oracle Cloud Infrastructure Logging (OCI Logging) service. OCI Service Mesh provides two types of logs: error logs and traffic logs. You can use these logs to debug issues such as HTTP 404 or 505 responses, or to generate log-based statistics. Metrics and telemetry data can be exported to Prometheus and visualized with Grafana, both of which can be deployed directly into an OKE cluster.
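As an example of log-based statistics, the following sketch counts responses by status code and computes an error rate per destination. The record layout is an assumption made for illustration: the field names (`source`, `destination`, `status_code`, `duration_ms`) mirror the telemetry listed above but are not the actual OCI Logging schema, and the sample records are invented.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical traffic-log records; real log entries will have a different shape.
SAMPLE_TRAFFIC_LOGS: List[dict] = [
    {"source": "ui", "destination": "orders", "status_code": 200, "duration_ms": 12},
    {"source": "ui", "destination": "orders", "status_code": 404, "duration_ms": 3},
    {"source": "orders", "destination": "inventory", "status_code": 200, "duration_ms": 20},
    {"source": "orders", "destination": "inventory", "status_code": 505, "duration_ms": 1500},
    {"source": "ui", "destination": "orders", "status_code": 200, "duration_ms": 9},
]


def status_counts(records: Iterable[dict]) -> Counter:
    """Number of requests seen for each HTTP status code."""
    return Counter(record["status_code"] for record in records)


def error_rate_by_destination(records: Iterable[dict]) -> Dict[str, float]:
    """Fraction of requests per destination that returned status 400 or above."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for record in records:
        totals[record["destination"]] += 1
        if record["status_code"] >= 400:
            errors[record["destination"]] += 1
    return {dest: errors[dest] / totals[dest] for dest in totals}


if __name__ == "__main__":
    print(status_counts(SAMPLE_TRAFFIC_LOGS))
    print(error_rate_by_destination(SAMPLE_TRAFFIC_LOGS))
```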

Architecture

Oracle Cloud Infrastructure Service Mesh (OCI Service Mesh) uses a sidecar model. This architecture encapsulates the code implementing the network functionality into a network proxy and then relies on traffic from and to services to be redirected into the sidecar proxy. It is called a sidecar because a proxy is attached to each application, much like a sidecar attached to a motorbike. In OKE, the application container sits alongside the proxy sidecar container in the same pod. Since they are in the same pod, they share the same network namespace and IP address, allowing the containers to communicate via “localhost.”

OCI Service Mesh has the following two main components:
  • Control plane

    The OCI Service Mesh control plane manages and configures the entire collection of proxies to route traffic. It handles forwarding, health checking, load balancing, authentication, authorization, and aggregation of telemetry. The control plane interacts with the OCI Certificates service and the key management service to provide each proxy with its certificate.

  • Data plane

    The data plane consists of the sidecar proxies deployed in the environment and is responsible for the security, network functions, and observability of the application. The proxies also collect and report telemetry on all mesh traffic. OCI Service Mesh uses the Envoy proxy for its data plane.

The following diagram illustrates this reference architecture.

[Figure: reference architecture diagram; source file oci_service_mesh_oke_arch-oracle.zip]

This reference architecture shows an application with three services deployed in an OKE cluster. The namespace in which the application is deployed has been "meshified." A meshified namespace indicates that services deployed within it are part of a service mesh, and each new pod deployed in it is injected with an Envoy proxy container. As each pod is deployed, the OCI Service Mesh control plane sends configurations and certificates to its proxy container. The control plane communicates with the OCI Certificates service and Oracle Key Management Cloud Service to obtain certificates for each proxy.

An ingress gateway is deployed to provide external access to the application. The ingress gateway is part of the OCI Service Mesh data plane; it is also an Envoy proxy that receives its configuration and certificates from the OCI Service Mesh control plane.

The proxy container is responsible for service discovery, traffic encryption, and authentication with the destination service. The proxy containers also apply network policies, such as how traffic is distributed between different service versions, and enforce access policies. The ingress gateway performs the same functions for traffic that enters the service mesh from outside.
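The access-policy check that a proxy applies can be pictured as an allow-list lookup. The sketch below assumes default-deny semantics (a request is rejected unless some rule allows it) and a `"*"` wildcard; the `AllowRule` type, the `is_allowed` helper, and the service names are hypothetical and do not represent the OCI Service Mesh policy format.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AllowRule:
    """Hypothetical rule: traffic from `source` to `destination` is permitted."""
    source: str       # service name, or "*" for any source
    destination: str  # service name, or "*" for any destination


def is_allowed(rules: Iterable[AllowRule], source: str, destination: str) -> bool:
    """Default deny: return True only if at least one rule matches the request."""
    return any(
        rule.source in (source, "*") and rule.destination in (destination, "*")
        for rule in rules
    )


# Example policy for a three-service application behind an ingress gateway.
RULES = [
    AllowRule("ingress-gateway", "ui"),
    AllowRule("ui", "orders"),
    AllowRule("orders", "inventory"),
]

if __name__ == "__main__":
    print(is_allowed(RULES, "ui", "orders"))     # permitted by the second rule
    print(is_allowed(RULES, "ui", "inventory"))  # no matching rule, so denied
```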

Prometheus and Grafana are deployed within the OKE cluster in a separate namespace that is not part of the service mesh. The service mesh data plane sends key operating statistics like latency, failures, requests, and telemetry to the Prometheus deployment. Grafana pulls data from the Prometheus deployment, which can be used to create dashboards for visualization.

OCI Service Mesh is integrated with the OCI Logging service, and logging can be enabled when the service mesh is created. OCI Service Mesh provides two types of logs: error logs and traffic logs. These logs can be used to debug issues such as HTTP 404 or 505 responses, or to generate log-based statistics.

This architecture uses the following OCI services:

  • OCI Kubernetes Engine (OKE)

    Delivers a highly available, scalable production-ready Kubernetes cluster to deploy your containerized applications in the cloud.

  • Load Balancer

    Provides access to the ingress gateway in the OKE cluster. The ingress directs traffic to requested services in the OKE cluster.

  • Certificate Authority

    Manages the TLS certificates for the OCI Service Mesh service.

  • Key Management

    Manages the keys used by the Certificate Authority service.

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Security list

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.

OCI Service Mesh Resources

When configuring OCI Service Mesh deployed in an OKE cluster, several Kubernetes resources are defined that map to key components of your application.

The following diagram depicts how the configured OCI Service Mesh resources (Access Policies, Ingress Gateway, Virtual Service, and Virtual Deployment) map to your application resources (K8s Service, K8s Service Load Balancer, Deployments, and Pods).


[Figure: oci_service_mesh_oke_config.png]

Recommendations

Use the following recommendations as a starting point. Your requirements might differ from the architecture described here.
  • VCN

    When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

    Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.

    After you create a VCN, you can change, add, and remove its CIDR blocks.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

  • Load balancer bandwidth

    While creating the load balancer, you can either select a predefined shape that provides a fixed bandwidth, or specify a custom (flexible) shape where you set a bandwidth range and let the service scale the bandwidth automatically based on traffic patterns. With either approach, you can change the shape at any time after creating the load balancer.
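The VCN recommendation above (private address space, no overlap with networks you plan to connect to) can be checked before you create anything. The following sketch uses Python's standard `ipaddress` module and handles IPv4 only; the `find_cidr_problems` helper and the example CIDR blocks are hypothetical, and "standard private IP address space" is interpreted here as the RFC 1918 ranges.

```python
import ipaddress
from itertools import combinations
from typing import List, Sequence

# RFC 1918 private IPv4 ranges.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def find_cidr_problems(
    vcn_blocks: Sequence[str], other_networks: Sequence[str] = ()
) -> List[str]:
    """Return human-readable problems; an empty list means the plan looks fine."""
    vcn = [ipaddress.ip_network(block) for block in vcn_blocks]
    others = [ipaddress.ip_network(block) for block in other_networks]
    problems = []

    # Each VCN block should sit inside one of the private ranges.
    for net in vcn:
        if not any(net.subnet_of(private) for private in RFC1918):
            problems.append(f"{net} is outside the RFC 1918 private ranges")

    # VCN blocks must not overlap one another.
    for a, b in combinations(vcn, 2):
        if a.overlaps(b):
            problems.append(f"VCN blocks {a} and {b} overlap")

    # VCN blocks must not overlap networks you intend to connect privately.
    for net in vcn:
        for other in others:
            if net.overlaps(other):
                problems.append(f"{net} overlaps connected network {other}")

    return problems


if __name__ == "__main__":
    # Hypothetical plan: one VCN block, one on-premises network.
    print(find_cidr_problems(["10.0.0.0/16"], ["10.0.1.0/24"]))
```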

Considerations

Consider the following options when deploying this reference architecture.

  • Cost

    There is no charge for the control plane of OCI Service Mesh on the OKE cluster. Customers are charged for the resource utilization of the proxy containers for the Service Mesh data plane. In practice, however, customers are already paying for the resources for the node pool in an OKE cluster, and unless the utilization of the proxy containers pushes the utilization of the node pool over 100 percent, there is no additional charge for adding OCI Service Mesh to your microservice architecture.

  • Availability

    The control plane of the OCI Service Mesh is always deployed with high availability.
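The cost consideration above comes down to simple arithmetic: sidecar proxies add no charge as long as the node pool can absorb them. The sketch below works this out in CPU millicores. All figures are invented for illustration, the `utilization_with_sidecars` helper is hypothetical, and actual proxy resource usage depends on your workload.

```python
def utilization_with_sidecars(
    node_pool_millicores: int,
    app_millicores: int,
    pod_count: int,
    proxy_millicores_per_pod: int,
) -> float:
    """Fraction of node pool CPU used once every pod carries a sidecar proxy."""
    proxy_total = pod_count * proxy_millicores_per_pod
    return (app_millicores + proxy_total) / node_pool_millicores


if __name__ == "__main__":
    # Hypothetical node pool: 12 CPUs in total, applications already using 6.
    utilization = utilization_with_sidecars(
        node_pool_millicores=12_000,
        app_millicores=6_000,
        pod_count=20,
        proxy_millicores_per_pod=100,  # assumed per-proxy usage
    )
    print(f"utilization with sidecars: {utilization:.0%}")
    print("fits in existing node pool:", utilization <= 1.0)
```

If the result exceeds 1.0, the node pool would need more capacity, and that extra capacity is where additional cost would come from.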

Acknowledgments

  • Author: Chiping Hwang
  • Contributors: Dusko Vukmanovic, Anupama Pundpal