ChorusOS 4.0 Hot Restart Programmer's Guide

Chapter 1 Introduction

This chapter introduces hot restart.

By the end of this chapter, you should have sufficient knowledge of hot restart to understand the information provided in the rest of this book.

1.1 What Is Hot Restart?

The ChorusOS(TM) 4.0 system's hot restart feature has been designed and implemented to address the high-availability requirements of ChorusOS system builders. Hot restart provides an advanced mechanism for restarting ChorusOS applications or the entire system when a serious error or failure occurs. Traditionally, system recovery from such errors or failures involves terminating applications and reloading them from stable storage, or rebooting the system. This causes system downtime, and can mean that important application data is lost. Such behavior is unacceptable for system builders seeking '7 by 24' or 'five nines' system availability.

The ChorusOS 4.0 hot restart feature solves the problem of downtime and data loss by using persistent memory, that is, memory which can persist beyond the lifetime of a particular run-time instance of an actor. When an actor which uses the hot restart feature fails, or terminates abnormally, the system uses the actor data stored in persistent memory to reconstruct the actor without accessing stable storage. This reconstruction of an actor from persistent memory instead of from stable storage is known as hot restarting (or simply restarting) the actor.

Hot restarting one or more actors is significantly faster than conventional failure recovery techniques (application reload or cold system reboot). Because the critical information needed for recovery is kept in persistent memory, the failed portions of a system can be reconstructed quickly, with minimal interruption in service.

1.1.1 Feature Services

ChorusOS hot restart comprises an API and run-time architecture which offer the following services:

The combination of these services provides a powerful framework for highly-available systems and applications, dramatically reducing the time it takes for a failed system or component to return to service.

1.2 Basic Concepts

This section introduces the basic concepts central to the hot restart feature and services. These concepts are: persistent memory, restartable actor, restart group, and site restart.

1.2.1 Persistent Memory

The foundation of the hot restart mechanism is the use of persistent memory to store data which can persist across an actor or site restart. Persistent memory is used internally by the system, to store the actor image (text and data) from which a restartable actor can be reconstructed. Any actor can also allocate persistent memory to store data. This data could, for example, be used to checkpoint application execution.

At the lowest level, persistent memory is a bank of memory loaded by the ChorusOS kernel at cold boot. The content of this bank of memory is preserved across an actor or site restart. In the current implementation, the only supported medium for the persistent memory bank is RAM: in other words, persistent memory is simply a reserved area of physical memory. For this reason, the content of persistent memory survives a hot restart, but not a board reset. The size of the area of RAM reserved for persistent memory is governed by a tunable parameter.

The allocation and de-allocation (freeing) of persistent memory are managed by a ChorusOS actor known as the Persistent Memory Manager (PMM). The Persistent Memory Manager exports an API for this purpose. This API is distinct from the API used for allocating and de-allocating traditional ChorusOS memory regions (rgnAllocate(2K), rgnFree(2K), svPagesAllocate(2K), and svPagesFree(2K)).

The Persistent Memory Manager API is described in more detail in Chapter 3, Programming With Persistent Memory and in the pmmAllocate(2RESTART), pmmFree(2RESTART) and pmmFreeAll(2RESTART) man pages.
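The checkpointing idea described in this section can be sketched in portable C. The sketch below is a simulation only: it does not use the Persistent Memory Manager API, and every name in it (pm_alloc, pm_free, actor_run, the block name "ckpt") is hypothetical. A static array stands in for the persistent memory bank, and a function that returns early stands in for an actor that terminates abnormally.

```c
#include <stddef.h>
#include <string.h>

/* Simulation only: none of these names belong to the ChorusOS API. */
#define PM_MAX_BLOCKS 8    /* hypothetical number of blocks in the bank */
#define PM_BLOCK_SIZE 64   /* hypothetical fixed block size, in bytes */
#define PM_NAME_LEN   16

typedef struct {
    char          name[PM_NAME_LEN];
    int           in_use;
    unsigned char data[PM_BLOCK_SIZE];
} PmBlock;

/* Static storage outlives each call to actor_run(), just as the persistent
 * memory bank outlives one run-time instance of an actor. */
static PmBlock pm_bank[PM_MAX_BLOCKS];

/* Return the block called `name`, allocating a zeroed one if none exists.
 * *existed is set to 1 when the block was left by an earlier run.
 * Returns NULL if the request cannot be satisfied. */
unsigned char *pm_alloc(const char *name, size_t size, int *existed)
{
    PmBlock *free_slot = NULL;
    if (size > PM_BLOCK_SIZE || strlen(name) >= PM_NAME_LEN)
        return NULL;
    for (size_t i = 0; i < PM_MAX_BLOCKS; i++) {
        if (pm_bank[i].in_use && strcmp(pm_bank[i].name, name) == 0) {
            *existed = 1;
            return pm_bank[i].data;
        }
        if (!pm_bank[i].in_use && free_slot == NULL)
            free_slot = &pm_bank[i];
    }
    if (free_slot == NULL)
        return NULL;
    strcpy(free_slot->name, name);           /* length was checked above */
    memset(free_slot->data, 0, PM_BLOCK_SIZE);
    free_slot->in_use = 1;
    *existed = 0;
    return free_slot->data;
}

/* Release the block called `name`, if it exists. */
void pm_free(const char *name)
{
    for (size_t i = 0; i < PM_MAX_BLOCKS; i++)
        if (pm_bank[i].in_use && strcmp(pm_bank[i].name, name) == 0)
            pm_bank[i].in_use = 0;
}

/* One run of a simulated actor that processes items 0..total-1.
 * It resumes from the checkpoint if one exists, and "fails" (returns -1)
 * once it has processed crash_after items in this run.
 * Returns 0 on completion, -2 if no persistent block is available. */
int actor_run(int total, int crash_after, int *started_at)
{
    int existed = 0;
    int next = 0;
    unsigned char *ckpt = pm_alloc("ckpt", sizeof next, &existed);
    if (ckpt == NULL)
        return -2;
    if (existed)
        memcpy(&next, ckpt, sizeof next);    /* recover checkpointed state */
    *started_at = next;
    for (int done = 0; next < total; done++) {
        if (done == crash_after)
            return -1;                       /* simulated abnormal termination */
        next++;                              /* "process" one item */
        memcpy(ckpt, &next, sizeof next);    /* checkpoint progress */
    }
    pm_free("ckpt");                         /* work complete: drop the checkpoint */
    return 0;
}
```

Calling actor_run(10, 3, &start) fails after three items; calling it again then resumes at item 3 rather than item 0, because the checkpoint block was left in the simulated bank.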

1.2.2 Restartable Actors

A restartable actor is any actor which can be rapidly restarted, without accessing stable storage, when it terminates abnormally. A restartable actor is restarted from an actor image which comprises the actor's text and initialized data regions. The actor image is stored in persistent memory (unless the actor is executed in place, in which case the actor image is the actor's executable file, stored in non-persistent, physical memory). Restartable actors can use additional blocks of persistent memory to store their own data.

Figure 1-1 shows the state of a typical restartable actor at its initialization, during execution, and after having been hot restarted as a result of an error. The actor uses persistent memory to store some state data. After hot restart, the actor is reconstructed from its actor image, also in persistent memory. It is then re-executed from its initial entry point, and can retrieve the persistent state data which has been stored.

Figure 1-1 Typical restartable actor

In the hot restart architecture, restartable actors are managed by a ChorusOS supervisor actor known as the Hot Restart Controller. The Hot Restart Controller monitors restartable actors: it detects a restartable actor's abnormal termination and automatically takes the appropriate restart action. In the context of hot restart, abnormal termination cases include unrecoverable errors such as division by zero, a segmentation fault, an unresolved page fault, or an invalid op code.
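The monitor-and-restart pattern can be illustrated by analogy on a POSIX system. The sketch below is not the Hot Restart Controller and uses no ChorusOS interfaces: it assumes POSIX fork() and waitpid(), treats "killed by a signal" as the abnormal termination case, and the names supervise and flaky_actor are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Analogy only: a POSIX parent process plays the supervising role. */
typedef void (*ActorEntry)(int run);

/* Run `entry` in a child process and re-run it from its entry point each
 * time the child is killed by a signal (abnormal termination).
 * Returns the number of restarts performed once the child terminates
 * normally, or -1 if max_restarts is exceeded or fork/waitpid fails. */
int supervise(ActorEntry entry, int max_restarts)
{
    for (int run = 0; run <= max_restarts; run++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {            /* child: execute the actor */
            entry(run);
            _exit(0);
        }
        int status = 0;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (!WIFSIGNALED(status))  /* normal termination: nothing to restart */
            return run;
    }
    return -1;
}

/* Hypothetical actor that terminates abnormally on its first two runs. */
void flaky_actor(int run)
{
    if (run < 2)
        abort();                   /* raises SIGABRT in the child */
}
```

With these definitions, supervise(flaky_actor, 5) returns 2: the actor is restarted twice and completes on its third run.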

Restartable actors, like traditional ChorusOS actors, can be run in either user or supervisor mode. In addition, they can be run from the sysadm.ini file or C_INIT console, or spawned dynamically during system execution. Indeed, the restartable nature of restartable actors remains transparent to system actors such as the AM actor, responsible for loading and starting restartable actors. This is because restartable actors do not declare themselves restartable, but are run as restartable actors. More specifically, the way a restartable actor is initially run determines how it is restarted when a restart occurs: a direct restartable actor is restarted directly, by the system, whereas an indirect restartable actor is restarted indirectly, through spawning.

The distinction between direct and indirect restartable actors provides a useful framework for the construction of restartable groups of actors, as described in the next section.

C_INIT and the Hot Restart Controller provide an interface specifically for running and spawning restartable actors. This interface is described in detail in Chapter 4, Programming With Restartable Actors.

1.2.3 Restart Groups

Many applications are made up of not one but several actors, which cooperate to provide a service. As these actors cooperate closely, any failure in one of them can have repercussions in the others. For example, assume that actors A and B cooperate closely (using CHORUS/IPC, for instance), and that A fails. Simply terminating, reloading or hot-restarting A will probably not be sufficient, and will almost certainly cause B either to fail itself, or to go through some special recovery action. This recovery action may in turn affect other actors which cooperate with actor B. Building cooperating applications which can cope with the large number of potential fault scenarios is a very complex task, as the complexity grows exponentially with the number of actors.

In response to this problem, the hot restart feature uses the concept of a restart group. A restart group in its most common sense is a group of cooperating restartable actors which can be restarted in the event of the failure or abnormal termination of one or more actors within the group. In other words, when one actor in the group fails, all actors in the group are stopped and then restarted (either directly, by the system, or indirectly, through spawning). In this way, closely cooperating actors are guaranteed a consistent, combined operating state.

Every restartable actor in a ChorusOS 4.0 system is a member of a restart group. Restart groups are mutually exclusive: a running actor can only be a member of one restart group (declared when the actor is run), and a group cannot contain another group. A restart group is created dynamically when a direct actor is declared to be a member of the group: thus, each group contains at least one direct actor. An indirect actor is always a member of the same group as the actor which spawned it. A restart group is therefore populated through spawning from one or more direct restartable actors.
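The membership rules above can be modelled with a small table of actors. This is a sketch under stated assumptions, not the data structure used by the Hot Restart Controller: the type and function names (Actor, run_direct, spawn_indirect, fail_actor) are hypothetical. The model assumes that, on a group restart, direct actors are re-run from their entry point and indirect actors are stopped until they are spawned again.

```c
#include <stddef.h>

/* Model only: none of these names belong to the ChorusOS API. */
#define MAX_ACTORS 16

typedef enum { ACTOR_DIRECT, ACTOR_INDIRECT } ActorKind;

typedef struct {
    int       group;   /* restart group identifier */
    ActorKind kind;
    int       alive;
    int       starts;  /* times the actor has run from its entry point */
} Actor;

static Actor actors[MAX_ACTORS];
static int   actor_count;

static int add_actor(int group, ActorKind kind)
{
    if (actor_count == MAX_ACTORS)
        return -1;
    actors[actor_count] = (Actor){ group, kind, 1, 1 };
    return actor_count++;
}

/* Run a direct actor as a member of `group`; the group exists from the
 * moment its first direct actor is declared. Returns the actor index. */
int run_direct(int group)
{
    return add_actor(group, ACTOR_DIRECT);
}

/* Spawn an indirect actor: it always joins its spawner's group. */
int spawn_indirect(int parent)
{
    if (parent < 0 || parent >= actor_count || !actors[parent].alive)
        return -1;
    return add_actor(actors[parent].group, ACTOR_INDIRECT);
}

/* A member fails: the whole group is stopped, then every direct actor is
 * re-run from its entry point. Indirect actors stay stopped until a
 * restarted actor spawns them again. Returns the number of direct actors
 * restarted, or -1 for an invalid or stopped actor. */
int fail_actor(int id)
{
    int restarted = 0;
    if (id < 0 || id >= actor_count || !actors[id].alive)
        return -1;
    int group = actors[id].group;
    for (int i = 0; i < actor_count; i++) {
        if (actors[i].group != group)
            continue;                 /* other groups are unaffected */
        if (actors[i].kind == ACTOR_DIRECT) {
            actors[i].alive = 1;
            actors[i].starts++;
            restarted++;
        } else {
            actors[i].alive = 0;
        }
    }
    return restarted;
}
```

For example, if a group holds two direct actors and one indirect actor, failing the indirect actor restarts both direct actors and leaves actors in other groups untouched.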

Figure 1-2 illustrates the possible organization of restartable actors in groups within a system.

Figure 1-2 Restart Groups in a ChorusOS System

When a group is restarted, it is restarted from the point at which it initially started. Figure 1-3 shows the state of a group of restartable actors when it is initially created, during execution, and when it is restarted following the failure of one of its member actors. The group contains two direct actors and one indirect (spawned) actor. The failure of the indirect actor causes a group restart: the two direct actors automatically re-execute their code from their initial entry point. Time runs vertically down the page.

Figure 1-3 Group restart

Of course, simply restarting a group of actors may still not bring the system to the error-free state desired. Such a situation is possible when the failure which provokes an actor group restart is in fact the consequence of an error or failure elsewhere in the system. For this reason, the hot restart feature supports the concept of site restart, as described in the next section.

1.2.4 Site Restart

A site restart is the reinitialization of an entire ChorusOS site (system) following the repeated failure of a group of restartable actors. It is the most severe action which can be automatically provoked by the Hot Restart Controller. A site restart involves the following:

The precise frequency of group restarts which provokes a site restart is determined by the system's restart policy. The basic policy implemented by the hot restart feature is based on a set of system tunable parameters described in Chapter 2, Getting Started With Hot Restart. You can extend this basic restart policy within your own applications, for example by choosing to provoke a group or site restart when particular application-specific exceptions are raised, or particular events occur.
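One way to picture a frequency-based restart policy is a sliding time window over recent group restarts. The sketch below is an assumption made for illustration, not the policy implemented by the hot restart feature: the threshold, the window length, and all names (MAX_GROUP_RESTARTS, RESTART_WINDOW, on_group_failure) are hypothetical and do not correspond to the actual tunable parameters.

```c
/* Illustration only: hypothetical policy parameters, not ChorusOS tunables. */
#define MAX_GROUP_RESTARTS 3    /* group restarts tolerated per window */
#define RESTART_WINDOW     60   /* window length, in seconds */

typedef enum { ACTION_GROUP_RESTART, ACTION_SITE_RESTART } RestartAction;

typedef struct {
    long times[MAX_GROUP_RESTARTS];  /* timestamps of the latest group restarts */
    int  count;                      /* number of valid entries in times[] */
    int  head;                       /* next slot to overwrite */
} GroupHistory;

/* Decide what to do when a group fails at time `now` (in seconds).
 * If the group has already been restarted MAX_GROUP_RESTARTS times within
 * the last RESTART_WINDOW seconds, escalate to a site restart; otherwise
 * record this restart and restart the group. */
RestartAction on_group_failure(GroupHistory *h, long now)
{
    int recent = 0;
    for (int i = 0; i < h->count; i++)
        if (now - h->times[i] < RESTART_WINDOW)
            recent++;
    if (recent >= MAX_GROUP_RESTARTS)
        return ACTION_SITE_RESTART;
    h->times[h->head] = now;
    h->head = (h->head + 1) % MAX_GROUP_RESTARTS;
    if (h->count < MAX_GROUP_RESTARTS)
        h->count++;
    return ACTION_GROUP_RESTART;
}
```

With these values, failures at 0, 10 and 20 seconds each lead to a group restart, and a fourth failure at 30 seconds escalates to a site restart. An application-specific policy could call the same decision function from its own exception handlers.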

1.3 Architecture Components

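As described in the previous sections, the hot restart feature uses two restart-specific actors to implement hot restart services: the Persistent Memory Manager, which manages the allocation and freeing of persistent memory, and the Hot Restart Controller, which monitors restartable actors and takes the appropriate restart action when one terminates abnormally.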

The Persistent Memory Manager and Hot Restart Controller principally use the services of the following:

The resulting architecture is summarized in the following diagram. Hot restart-specific components appear in gray, together with the API calls they provide. Other components appear in white. An arrow from A to B indicates that A calls functions which are implemented in B.

Figure 1-4 Hot Restart Architecture

Further information about the hot restart API is provided in the rest of this guide, and in the corresponding man pages.