Monitor Kubernetes and OKE clusters with OCI Logging Analytics

Kubernetes provides a highly robust and extremely customizable platform for managing automatically deploying and scaling containerized workloads. Building a monitoring and troubleshooting system for this entire environment is a very challenging task, and may take away valuable time from development and IT teams. A large number of Kubernetes based initiatives fail to take off because of lack of a complementary monitoring solution, which is as robust, customizable, scalable, and is automatically deployable. Oracle Cloud Infrastructure (OCI) Logging Analytics bridges this monitoring gap by providing a one-click end-to-end Kubernetes monitoring solution for the underlying infrastructure, Kubernetes platform and cloud native applications.

A Kubernetes-based environment can be divided into three tiers, each consisting of numerous and continuously evolving components driven by business needs.

  • Infrastructure tier: contains numerous components, including networking resources, compute instances, and the Kubernetes nodes hosts.
  • Kubernetes platform tier: contains the various Kubernetes services such as network, kubelet service, and DNS, which power the kubernetes platform.
  • Application tier: contains the different technologies, databases, and applications.

Architecture

This architecture shows how you can use Oracle Cloud Infrastructure (OCI) Logging Analytics to monitor a Kubernetes platform and cloud native applications.

Building a monitoring and troubleshooting system for this entire environment is a very challenging task, and may take away valuable time from development and IT teams. A large number of Kubernetes based initiatives fail to take off because of lack of a complementary monitoring solution, which is as robust, customizable, scalable, and is automatically deployable. OCI Logging Analytics bridges this monitoring gap by providing a one-click end-to-end Kubernetes monitoring solution for the underlying infrastructure, Kubernetes platform and cloud native applications.

The following diagram is a sample topology of a Kubernetes Cluster in a single Oracle Cloud Infrastructure region, as discussed in Set up a Kubernetes cluster for deploying containerized applications on Oracle Cloud solution playbook. It shows the infrastructure tier and the second diagram highlights the kubernetes and application tiers.

Description of kubernetes-master-worker-nodes.png follows
Description of the illustration kubernetes-master-worker-nodes.png

The following diagram illustrates this reference architecture for Kubernetes Monitoring with Logging Analytics. This solutions offers collection of various logs of a Kubernetes cluster into OCI Logging Analytics and offer rich analytics on top of the collected logs. Users can customize the log collection by modifying the out of the box configuration.

Description of k8s-oke-monitoring.png follows
Description of the illustration k8s-oke-monitoring.png

k8s-oke-monitoring-oracle.zip

The architecture has the following components:

  • Tenancy

    A tenancy is a secure and isolated partition that Oracle sets up within Oracle Cloud when you sign up for Oracle Cloud Infrastructure. You can create, organize, and administer your resources in Oracle Cloud within your tenancy. A tenancy is synonymous with a company or organization. Usually, a company will have a single tenancy and reflect its organizational structure within that tenancy. A single tenancy is usually associated with a single subscription, and a single subscription usually only has one tenancy.

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Compartment

    Compartments are cross-region logical partitions within an Oracle Cloud Infrastructure tenancy. Use compartments to organize your resources in Oracle Cloud, control access to the resources, and set usage quotas. To control access to the resources in a given compartment, you define policies that specify who can access the resources and what actions they can perform.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Load balancer

    The Oracle Cloud Infrastructure Load Balancing service provides automated traffic distribution from a single entry point to multiple servers in the back end.

  • Service gateway

    The service gateway provides access from a VCN to other services, such as Oracle Cloud Infrastructure Object Storage. The traffic from the VCN to the Oracle service travels over the Oracle network fabric and never traverses the internet.

  • Logging Analytics

    Logging Analytics is a fully managed SaaS regional service available in more than 27 regions that provides collection, indexing, enrichment, query, visualization, and alerting for logs from any IT component running on on-premises, OCI or 3rd party cloud.

  • Logging Analytics Source

    A configuration resource Logging Analytics that provides specifications for parsing, extractions, labeling, data masking, and other enrichment to ensure logs are properly ingested and indexed for analysis and monitoring. This architecture uses more than 30 pre-defined sources for Kubernetes services, applications, and objects. These sources are continuously enhanced to provide deeper analytics capabilities.

  • Kubernetes System Pods

    Kubernetes System Pods are small deployable units of computing that you can create and manage in Kubernetes. A Pod is one or more containers, with shared storage and network resources, and rules for running the containers.

  • User Pods

    Applications launched on the Kubernetes cluster. All the logs from application pods writing STDOUT/STDERR are typically available under /var/log/containers/. Applications that have custom log handlers may route their logs differently, but in general are available on the node (through a volume).

  • Control Plane Services & Pods

    Kubernetes platform Control Plane Services and pods. The Control Plane manages the worker nodes and the Pods in the Kubernetes cluster. The worker nodes run the containerized applications. Every cluster has at least one worker node. The worker node(s) host the Pods that are the components of the application workload.

  • Node OS Services

    Linux services running on the instance on which Kubernetes is installed. Logs are collected on OS services.

  • Log & Object Collector Pods

    Log & Object Collector Pods are made up of replica sets, FluentD, and daemon sets.

    • FluentD Collector

      FluentD is an open-source data collector that provides a unified logging layer between data sources and backend systems. It allows unified data collection and consumption for a building data processing pipelines. This architecture uses containerized FluentD container that runs as daemon set and replicat set on kubernetes cluster. It uses logging analytics fluentd output plugin to upload logs to Oracle Cloud Logging Analytics.

    • Logging Analytics FluentD Plugin

      The FluentD output plugin that connects to Oracle Cloud Logging Analytics service in your tenancy to upload or ingest logs collected by FluentD collector.

    • Kubernetes Objects

      Kubernetes objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. In this architecture, the following kubernetes object states are collected as logs for historical analysis and troubleshooting:

    • Kubernetes Daemon Set

      A Kubernetes DaemonSet is a type of workload that runs on Kubernetes and ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected.

    • Kubernetes Replica Set

      A Kubernetes ReplicaSet is a type of workload that runs on Kubernetes. It maintains a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods

  • Container Engine for Kubernetes

    Oracle Cloud Infrastructure Container Engine for Kubernetes is a fully managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud. You specify the compute resources that your applications require, and Container Engine for Kubernetes provisions them on Oracle Cloud Infrastructure in an existing tenancy. Container Engine for Kubernetes uses Kubernetes to automate the deployment, scaling, and management of containerized applications across clusters of hosts.

  • Service connectors

    Service Connector Hub is a cloud message bus platform. You can use it to move data between services in Oracle Cloud Infrastructure. Data is moved using service connectors. A service connector specifies the source service that contains the data to be moved, the tasks to perform on the data, and the target service to which the data must be delivered when the specified tasks are completed. One service connector is provisioned in this architecture to collect network and load-balancer logs.

  • OCI Services

    Oracle Cloud Infrastructure (OCI) services are a platform of cloud services that enable you to build and run a wide range of applications in a highly-available, consistently high-performance environment.

  • Service and Audit Logs

    Service and Audit Logs are captured in OCI Logging service. OCI Logging is a highly scalable and fully managed service that is used to access the VCN and Load Balancer service logs through the Service Connector.

By default, Kubernetes System Services Logs and kubernetes object data are collected.

OKE or Kubernetes has built-in services where each one has different responsibilities and they run on one or more nodes in the cluster either as Deployments or DaemonSets.

Kubernetes System Services Linux System Services Kubernetes Control Plane Kubernetes Objects (Default: every 15 mins) Custom Application Logs
  • Kube Proxy
  • Kube Flannel
  • Kubelet
  • CoreDNS
  • CSI Node Driver
  • DNS Autoscaler
  • Cluster Autoscaler
  • Proxymux Client
  • Syslog
  • Secure logs
  • Cron logs
  • Mail logs
  • Audit logs
  • Ksplice Uptrack logs
  • Yum logs
  • Kube API Server
  • Kube Scheduler
  • Kube Controller Manager
  • Cloud Controller Manager
  • etcd
  • Nodes
  • Namespaces
  • Pods
  • DaemonSets
  • Deployments
  • ReplicaSets
  • Events
  • All custom application logs running the cluster are collected. See Recommendations for customizing this behavior.

Note:

Kubernetes control plane logs are not covered as part of out of the box collection, as these logs are not exposed by Oracle Cloud Infrastructure Container Engine for Kubernetes (also known as OKE). Control plan logs from non-OKE Kubernetes clusters can be enabled.

Recommendations

Use the following recommendations as a starting point. Your requirements might differ from the architecture described here.

  • Log Groups

    Multiple Log Groups should be defined provide right access permissions to different teams and avoid sharing sensitive data. Log Groups can be based on Oracle E-Business Suite, Database, OCI Infrastructure, Hosts Logs.

  • Cost Management

    Logging Analytics service is charged on the volume of data in active and archival storage. In order to be allow troubleshooting of day to day issues and get benefits of anomaly detection, pattern detection and other ML capabilities - we recommend active storage period of 90 days and moving logs older than 90 days to archival storage. Logs from archival stored can be recalled on demand quickly.

  • FluentD Multi-worker

    FluentD should be configured in multi-worker mode for time-sensitive logs.

  • Custom Application Logs

    This solution automatically captures all the logs generated by applications running in a Kubernetes cluster. By default, these logs are mapped to Kubernetes Generic Container Logs log source. Application logs specific parser, sources, and enrichment should be created in Oracle Cloud Infrastructure Logging Analytics to extract required fields and attach problem labels to logs.

  • Authentication

    This architecture supports instance principal and Oracle Cloud Infrastructure config file based authentication. Instance principal based authentication is recommended for Oracle Container Engine for Kubernetes (OKE).

Considerations

Consider the following points when deploying this reference architecture.

  • Performance

    Query performance is based on time-range and number of operations such as filters, group-by, and so on. For better query performance it is recommended to enrich logs with specific labels and fields at the time of ingestion. This is a part of continuous improvement for IT operations.

  • Security & RBAC

    Customize Log Source definitions to filter any PII data and enable geolocation enrichment.

  • Availability

    Oracle Cloud Logging Analytics is a fully managed highly available SaaS service.

Deploy

The Kubernetes manifests and helm charts for deploying the Logging Analytics DaemonSets and ReplicaSets are available in GitHub.

  1. Go to GitHub.
  2. Clone or download the repository to your local computer.
  3. Follow the instructions in the README document.

Acknowledgements

Author: Kumar Varun

Contributors: Zubair Ansari, Kiran Palukuri, Santhosh Vuda