About Best Practices for Operating Cloud Deployments Efficiently

Operational efficiency means identifying the processes and procedures that automate and optimize the operation of all cloud services. Consider best practices for deploying, operating, and monitoring applications and infrastructure so that they deliver maximum business value. Day-to-day deployments require visibility into what's happening with your cloud resources. Monitoring must be in place so that you know whether an environment is working correctly and whether adjustments are needed.

Perform Operations as Code

Provision, scale, and manage your environment using automation and an infrastructure as code methodology.
  • Adopt an infrastructure as code (IaC) methodology

    Automate the deployment of workloads and operational procedures, limit human interaction, and improve response to events by using infrastructure as code.

  • Define the workload infrastructure

    When you define the infrastructure as code, it’s possible to automatically and repeatedly provision workloads on a consistent infrastructure. Parameterization allows the reuse of common templates, promoting cross-environment standardization and minimizing rework across teams.

  • Develop and deploy applications

    Automating code deployment on existing infrastructure keeps applications consistent across multiple infrastructure deployments.

  • Manage the infrastructure configuration

    Consistency is crucial when you configure and update multiple cloud resources. Configuration management lets you control how infrastructure configuration is deployed during design, implementation, testing, patching, and new releases.
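The idea of a parameterized, reusable infrastructure definition can be sketched in a few lines of Python. This is a minimal illustration only, not the format of any particular IaC tool: the EnvironmentParams fields, the render_infrastructure function, and the shape names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentParams:
    """Values that differ between environments (all field names are hypothetical)."""
    name: str
    instance_count: int
    instance_shape: str


def render_infrastructure(params: EnvironmentParams) -> dict:
    """Build one environment's definition from the shared template."""
    return {
        "network": {"name": f"{params.name}-network"},
        "instances": [
            {"name": f"{params.name}-app-{i}", "shape": params.instance_shape}
            for i in range(1, params.instance_count + 1)
        ],
    }


# The same template yields structurally identical environments that
# differ only in the parameters supplied.
dev = render_infrastructure(EnvironmentParams("dev", 1, "small"))
prod = render_infrastructure(EnvironmentParams("prod", 3, "large"))
```

Because the template is a pure function of its parameters, rendering it again with the same inputs produces the same definition, which is the property that makes provisioning repeatable.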

Make Frequent and Iterative Deployments

Minimize risk by using automation and an iterative development process when testing and deploying code.

  • Automate your application deployment process

    Automate as many processes as possible. Eliminate manual deployments in production wherever you can. Manual deployments might be acceptable in lower environments to promote velocity and flexibility.

  • Leverage automation to test your code prior to deployment

    Testing for bugs, security vulnerabilities, functionality, performance, and integrations is critical to minimizing the problems that users discover. Test failures should prevent code from being released into production.

  • Implement iterative and incremental deployments

    Reduce risk by testing and validating deployments more frequently. Smaller, more frequent changes reduce exposure to failures and shorten the time it takes to identify issues.
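The rule that test failures should block a release can be expressed as a simple gate. The following is a minimal sketch, assuming each automated check is wrapped in a function that returns True on success; the names failed_checks and release_allowed are hypothetical and do not belong to any specific CI/CD product.

```python
from typing import Callable, Dict, List

# A check is any zero-argument callable that returns True when it passes.
Check = Callable[[], bool]


def failed_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check and return the names of those that did not pass."""
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            # A check that crashes is treated as a failure, not skipped.
            passed = False
        if not passed:
            failed.append(name)
    return failed


def release_allowed(checks: Dict[str, Check]) -> bool:
    """Allow a production release only when every check passes."""
    return not failed_checks(checks)
```

A pipeline step could call release_allowed with its unit, security, and performance checks and stop the deployment when the result is False. Running all checks rather than stopping at the first failure gives developers the full list of problems in one pass.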

Define Operational Procedures

Define operational procedures that use the available tools and automate routine tasks.

  • Automate patching and maintenance

    Leverage tools to automatically update and patch the compute instances, database instances, and servers that you, as the customer, are responsible for maintaining.

  • Leverage configuration management utilities

    Use configuration management tools to automate and reduce risk when updating resource configurations.

  • Monitor system performance metrics

    Understand the metrics provided by the infrastructure services. Set up monitoring and alerting to provide visibility on the state of all workloads and proactive indicators of failure.

  • Document and test your disaster recovery plan

    Write a disaster recovery plan that reflects the business impact of application failures. Understand each application's dependencies and how their failure affects the application. Automate the recovery process as much as possible, and document any manual steps. Regularly test your disaster recovery process to validate and improve the plan.

  • Plan for Oracle Cloud Infrastructure support interactions

    Before the need arises, establish a process for contacting Oracle Cloud Infrastructure support.
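One way configuration management reduces risk is by comparing the configuration you intend a resource to have with the configuration it actually has, and reporting any drift before changes are applied. The following is a minimal sketch of that comparison under simplifying assumptions: configurations are flat dictionaries, and the find_drift function and the setting names are hypothetical.

```python
from typing import Any, Dict, Tuple

# Marker used when a setting is absent on one side of the comparison.
MISSING = "<missing>"


def find_drift(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings for a single server.
desired = {"ntp_server": "ntp.example.com", "ssh_root_login": "no", "patch_level": "2024-06"}
actual = {"ntp_server": "ntp.example.com", "ssh_root_login": "yes", "debug_port": "open"}
report = find_drift(desired, actual)
```

An empty report means the resource matches its definition. A non-empty report can be reviewed, or fed into an automated remediation step, rather than discovered during an incident.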

Expect Failure, and Learn

Unanticipated failures will happen throughout the lifecycle of an application. Learn from a failure and improve response and recovery processes.

  • Learn from failures

    Conduct root-cause analysis and tune operations processes for better and more agile responses to failures in the future.

  • Continually improve incident response

    Share lessons learned from failures and past issues to prevent future problems and reduce mean time to repair (MTTR).

  • Practice for failure

    Periodically test and rehearse incident management and recovery processes to fine-tune future responses.
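To know whether incident response is improving, the repair time has to be measured consistently. The following sketch computes mean time to repair from a list of incident records. It assumes that repair time is measured from detection to resolution (definitions vary between organizations), and the function name and record layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each incident is recorded as (detected_at, resolved_at).
Incident = Tuple[datetime, datetime]


def mean_time_to_repair(incidents: List[Incident]) -> timedelta:
    """Average the detection-to-resolution time across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Two hypothetical incidents: one repaired in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),
]
mttr = mean_time_to_repair(incidents)  # timedelta of 1 hour
```

Tracking this value per quarter, or before and after a process change, gives a concrete signal of whether the lessons learned are shortening recovery.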

Identify and Monitor Workload Key Performance Indicators

Identify baseline performance and key performance indicators (KPIs) for your workloads. Use the KPIs and logs to monitor application workload health and performance.

Consider using the following to monitor workload performance:

  • Implement tracing around service calls

    Baseline performance data gathered from traces can provide trend data that you can use to proactively identify performance problems before they affect users.

  • Implement health checks

    Run health checks and probes regularly from outside the application to identify degradation of application health and performance. The health checks and probes should be more than static page tests; they should reflect the health of the application as a whole.

  • Check long-running workflows

    Catching issues early can minimize the need to roll back the entire workflow or to execute multiple compensating transactions.

  • Maintain system, application, and audit logs

    Utilize a centralized logging service to store and analyze your logs.

  • Set up an early warning system

    Identify the key performance indicators (KPIs) of an application's health, such as transient exceptions and remote call latency, and set appropriate threshold values for each of them. Send an alert to operations when the threshold value is reached.

  • Train multiple operators to monitor the application and to perform manual recovery steps

    Make sure there is always at least one trained operator active.

  • Create scaling policies that take action based on KPIs

    Scaling policies help to provide consistent performance for your end users during periods of high demand, and help you reduce your costs during periods of low demand.
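The early warning system and the scaling policy described above both reduce to comparing a KPI with a threshold and acting on the result. The following is a minimal sketch of that logic, assuming metrics arrive as a dictionary of current values. The KPI names, threshold values, and the breached and desired_instances functions are hypothetical and are not taken from any monitoring or autoscaling service.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """An alerting threshold for one KPI (names and limits are illustrative)."""
    kpi: str
    limit: float


def breached(thresholds: List[Threshold], metrics: Dict[str, float]) -> List[str]:
    """Return the KPIs whose current value has reached its threshold."""
    return [
        t.kpi
        for t in thresholds
        if t.kpi in metrics and metrics[t.kpi] >= t.limit
    ]


def desired_instances(
    current: int,
    cpu_percent: float,
    scale_out_at: float = 75.0,
    scale_in_at: float = 25.0,
    minimum: int = 1,
    maximum: int = 10,
) -> int:
    """Add an instance under high load, remove one under low load,
    and always stay within the configured bounds."""
    if cpu_percent >= scale_out_at:
        return min(current + 1, maximum)
    if cpu_percent <= scale_in_at:
        return max(current - 1, minimum)
    return current


thresholds = [
    Threshold("remote_call_latency_ms", 500.0),
    Threshold("transient_exceptions_per_min", 20.0),
]
metrics = {"remote_call_latency_ms": 620.0, "transient_exceptions_per_min": 3.0}
alerts = breached(thresholds, metrics)  # KPIs that operations should be alerted about
```

Keeping a gap between the scale-out and scale-in thresholds avoids adding and removing instances repeatedly when load hovers near a single value. The minimum and maximum bounds protect availability on one side and cost on the other.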

Leverage Managed Services

Use managed cloud services to ensure that your cloud resources run efficiently and cost-effectively. Your IT organization can offload the tactical, undifferentiated heavy lifting of managing cloud resources and focus on its core competencies.

Identify Your Responsibilities

Cloud providers document what their platform is accountable for and what the customer is responsible for. Identify your responsibilities as a customer and ensure that you have an operational procedure for each of them.