ChorusOS 4.0 Hot Restart Programmer's Guide

1.2.3 Restart Groups

Many applications consist of several actors that cooperate to provide a service. Because these actors cooperate closely, a failure in one of them can have repercussions for the others. For instance, assume that actors A and B cooperate closely (using CHORUS/IPC, for example), and that A fails. Simply terminating, reloading or hot-restarting A will probably not be sufficient: B will almost certainly either fail itself or have to go through some special recovery action. This recovery action may in turn affect other actors which cooperate with B. Building cooperating applications which can cope with the large number of potential fault scenarios is a very complex task, as the complexity grows exponentially with the number of actors.

In response to this problem, the hot restart feature uses the concept of a restart group. A restart group, in its most common sense, is a group of cooperating restartable actors which can be restarted together if one or more actors within the group fail or terminate abnormally. In other words, when one actor in the group fails, all actors in the group are stopped and then restarted (either directly, by the system, or indirectly, through spawning). In this way, closely cooperating actors are guaranteed a consistent, combined operating state.

Every restartable actor in a ChorusOS 4.0 system is a member of a restart group. Restart groups are mutually exclusive: a running actor can be a member of only one restart group (declared when the actor is run), and a group cannot contain another group. A restart group is created dynamically when a direct actor is declared to be a member of it; each group therefore contains at least one direct actor. An indirect actor is always a member of the same group as the actor which spawned it. A restart group is therefore populated through spawning from one or more direct restartable actors.
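
The membership rules above can be modelled in a few lines. The following Python sketch is a toy model only and does not use the ChorusOS API: the class RestartGroups and its methods run_direct and spawn are hypothetical names introduced for illustration, and it assumes that a group is identified by a plain integer.

```python
class RestartGroups:
    """Toy model of restart-group membership (not the ChorusOS API)."""

    def __init__(self):
        self.group_of = {}  # actor name -> group id (exactly one per actor)
        self.members = {}   # group id -> actor names, in creation order

    def run_direct(self, actor, group_id):
        """Run a direct actor, declaring its group; create the group if new."""
        if actor in self.group_of:
            raise ValueError(f"{actor} already belongs to a group")
        self.group_of[actor] = group_id
        self.members.setdefault(group_id, []).append(actor)

    def spawn(self, parent, child):
        """Spawn an indirect actor; it always joins its parent's group."""
        if child in self.group_of:
            raise ValueError(f"{child} already belongs to a group")
        group_id = self.group_of[parent]  # KeyError if the parent is unknown
        self.group_of[child] = group_id
        self.members[group_id].append(child)
        return group_id


groups = RestartGroups()
groups.run_direct("A", 1)   # group 1 is created by its first direct actor
groups.run_direct("B", 1)   # a second direct actor joins the same group
groups.spawn("A", "A1")     # the indirect actor inherits group 1 from A
groups.run_direct("C", 2)   # a separate, non-overlapping group
print(groups.members)
```

In this model a group can only come into existence through run_direct, which is why every group holds at least one direct actor, and an actor can never be recorded in two groups at once.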

Figure 1-2 illustrates the possible organization of restartable actors in groups within a system.

Figure 1-2 Restart Groups in a ChorusOS System


When a group is restarted, it is restarted from the point at which it initially started. Figure 1-3 shows the state of a group of restartable actors when it is initially created, during execution, and when it is restarted following the failure of one of its member actors. The group contains two direct actors and one indirect (spawned) actor. The failure of the indirect actor causes a group restart: the two direct actors automatically re-execute their code from their initial entry point. Time runs vertically down the page.
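
The restart sequence described here can be traced with a small simulation. This is a sketch under stated assumptions, not the behaviour of the real system calls: the Group class, its methods and the actor names A, B and A1 are hypothetical, and the model assumes that the indirect actor reappears after a restart only because its parent spawns it again from its entry point.

```python
class Group:
    """Toy model of a restart group (not the ChorusOS API)."""

    def __init__(self):
        self.direct = {}   # direct actor name -> entry-point function
        self.running = []  # names of the actors currently running
        self.log = []      # ordered record of events

    def add_direct(self, name, entry):
        self.direct[name] = entry

    def start(self):
        """Run every direct actor from its initial entry point."""
        for name, entry in self.direct.items():
            self.running.append(name)
            self.log.append(f"start {name}")
            entry(self)

    def spawn(self, parent, child):
        """Create an indirect actor in the same group as its parent."""
        self.running.append(child)
        self.log.append(f"spawn {child} from {parent}")

    def fail(self, name):
        """One member fails: stop the rest of the group, then restart it."""
        self.log.append(f"fail {name}")
        self.running.remove(name)
        for actor in self.running:
            self.log.append(f"stop {actor}")
        self.running.clear()
        self.start()  # direct actors re-execute from their entry points


def entry_a(group):
    group.spawn("A", "A1")  # A spawns the indirect actor A1


def entry_b(group):
    pass  # B does no spawning in this model


group = Group()
group.add_direct("A", entry_a)
group.add_direct("B", entry_b)
group.start()
group.fail("A1")  # failure of the indirect actor restarts the whole group
print("\n".join(group.log))
```

The events logged after the failure repeat the events logged at the initial start, which is the sense in which the group restarts from the point at which it initially started.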

Figure 1-3 Group Restart


Of course, simply restarting a group of actors may still not bring the system to the desired error-free state. This can happen when the failure which provokes a group restart is in fact the consequence of an error or failure elsewhere in the system. For this reason, the hot restart feature supports the concept of site restart, as described in the next section.