ChorusOS 4.0 Hot Restart Programmer's Guide

Chapter 4 Programming With Restartable Actors

This chapter covers programming and running restartable actors on a ChorusOS 4.0 system. In particular, it provides the following:

4.1 Introduction

As described in Chapter 1, Introduction, a restartable actor is an actor which can be rapidly reconstructed from an actor image (text and data), without accessing stable storage. The management of restartable actors is handled by a ChorusOS supervisor actor known as the Hot Restart Controller. The Hot Restart Controller is responsible for:

This chapter looks at the API provided by the Hot Restart Controller, and the corresponding restart-related commands provided by the C_INIT actor. Before proceeding to a description of the API, however, it is important to understand how restartable actors are managed within the system.

4.1.1 Types of Restartable Actor

As explained in Chapter 1, Introduction, it is important to understand that actors do not explicitly declare themselves restartable, that is, there is no function call to declare an actor restartable at the start of its main() program. Instead, an actor can be run as a restartable actor. More precisely, an actor can be run as either a direct or indirect restartable actor:

This distinction between direct and indirect actors is important for understanding the automatic restart mechanism provided by the Hot Restart Controller. When an error occurs, the Hot Restart Controller will first stop all actors in the group, and then only restart the concerned direct restartable actors. These actors, re-executed from their initial entry point, are responsible for restarting any indirect actors they may have spawned. An illustration of this is provided in Figure 1-3.
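The restart sequence described above can be sketched as a small, self-contained C model. This is only an illustration of the ordering (stop every member, then restart direct members only); the types and the function model_group_restart() are hypothetical and are not part of the Hot Restart Controller API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types: none of these names belong to the ChorusOS API. */
typedef enum { ACTOR_DIRECT, ACTOR_INDIRECT } ActorKind;
typedef enum { ACTOR_STOPPED, ACTOR_RUNNING } ActorState;

typedef struct {
    const char *name;
    ActorKind   kind;
    ActorState  state;
    int         runs;   /* times the actor was started from its entry point */
} ModelActor;

/*
 * Model of the reaction to a failure in a restart group:
 *   1. every member of the group is stopped;
 *   2. only the direct members are restarted.
 * Indirect members stay stopped until a direct actor spawns them again.
 * Returns the number of actors restarted in step 2.
 */
static int model_group_restart(ModelActor *group, size_t count)
{
    size_t i;
    int restarted = 0;

    assert(group != NULL || count == 0);

    for (i = 0; i < count; i++)
        group[i].state = ACTOR_STOPPED;

    for (i = 0; i < count; i++) {
        if (group[i].kind == ACTOR_DIRECT) {
            group[i].state = ACTOR_RUNNING;
            group[i].runs++;
            restarted++;
        }
    }
    return restarted;
}
```

In this model an indirect actor remains in the ACTOR_STOPPED state after the group restart, which mirrors the rule that direct actors are responsible for restarting the indirect actors they spawned.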

4.1.2 Restartable Actor Credentials

Restartable actors, just like traditional ChorusOS actors, are identified in the system by a unique capability and identifier (actor ID). Restartable actors also run in a user group (with a user ID), like traditional ChorusOS actors. The lifetime of each of these credentials is the same as the lifetime of a particular run-time instance of the actor: when a restartable actor is restarted, it is given a new capability, actor ID and user ID.

Hot restartable actors also have two additional credentials, which persist across an actor restart, and serve to characterize them in the Hot Restart Controller:

4.1.3 Restartable Actors and Persistent Memory

As explained in "2.1.2 Memory Requirements and Design Considerations", the system uses persistent memory to store the following data for each executing restartable actor:

This data is stored in three persistent memory blocks: one memory block for the actor image, one memory block for the executed text and one memory block for the actor data. These blocks are allocated and freed by requests from the Hot Restart Controller to the Persistent Memory Manager. Other actors cannot access or free these persistent memory blocks, although restartable actors can place additional blocks which they allocate under the control of the Hot Restart Controller. This is described in "4.2.3 Freeing Persistent Memory".

4.2 The Restartable Actor Lifecycle

One way to understand how the Hot Restart Controller API is used is to consider it in the context of the run-time life-cycle of a restartable actor. A restartable actor's code is not simply executed once, from the start of the main() program to its final return; it may be re-executed many times, once for each restart. Data which is initialized and actors which are loaded during the first execution only need to be retrieved or restarted on subsequent executions. This is why it is important to view the restart API in the context of the first execution, and then of subsequent executions.

For this reason, this section looks at the way the Hot Restart Controller API is used in the context of the life-cycle of a typical restartable actor.

4.2.1 Initial Load

Use the C_INIT command arun with the -g option, or the function call hrfexec() to load a restartable actor from stable storage into persistent memory. Both arun and hrfexec() provide support for specifying the persistent credentials of a restartable actor when the actor is initially loaded.

Actors created directly using actorCreate(2K) or acreate(2K) are not hot restartable and cannot use the Hot Restart Controller API.

When an actor is run as a restartable actor, the Hot Restart Controller checks whether an actor with the specified name is already registered. If no such actor is registered (as is the case for an initial load), the Hot Restart Controller first asks the Persistent Memory Manager to allocate the persistent memory blocks which will store the actor's actor image and executing image. If the allocation succeeds, it registers the name of the new actor as a restartable actor, running in the specified group.

The subsequent load and start of the persistent actor is the same as for an actor run using a member of the afexec(2K) function family (see the man page for a description of this process). The difference is that the actor is loaded from its actor image (in persistent memory) and not from stable storage.
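The name check performed when an actor is run as a restartable actor can be modeled as follows. This is a minimal sketch in plain C, assuming a fixed-size registry; ModelRegistry, ModelLoadPath and model_run_restartable() are hypothetical names used for illustration only, and the failure case simply stands in for a failed persistent memory allocation.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_ACTORS 8
#define MODEL_NAME_LEN   32

/* Outcome of running an actor as restartable in this model. */
typedef enum {
    MODEL_INITIAL_LOAD,   /* name unknown: blocks allocated, name registered */
    MODEL_RESTART,        /* name already registered: existing image reused  */
    MODEL_LOAD_FAILED     /* stands in for a failed persistent allocation    */
} ModelLoadPath;

/* Hypothetical stand-in for the controller's table of registered names. */
typedef struct {
    char names[MODEL_MAX_ACTORS][MODEL_NAME_LEN];
    int  groups[MODEL_MAX_ACTORS];
    int  count;
} ModelRegistry;

static ModelLoadPath model_run_restartable(ModelRegistry *reg,
                                           const char *name, int group)
{
    int i;

    assert(reg != NULL && name != NULL);

    /* A name that is already registered means the actor is being restarted. */
    for (i = 0; i < reg->count; i++) {
        if (strcmp(reg->names[i], name) == 0)
            return MODEL_RESTART;
    }

    /* Initial load: room must be found before the name is registered. */
    if (reg->count == MODEL_MAX_ACTORS || strlen(name) >= MODEL_NAME_LEN)
        return MODEL_LOAD_FAILED;

    strcpy(reg->names[reg->count], name);
    reg->groups[reg->count] = group;
    reg->count++;
    return MODEL_INITIAL_LOAD;
}
```

Running the same name twice in this model yields MODEL_INITIAL_LOAD followed by MODEL_RESTART, and the registry still holds a single entry, which is why two different actors must never share a name.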


Note -

A restartable actor's name remains registered in the Hot Restart Controller for the lifetime of its actor group. The lifetime of the group may extend beyond the lifetime of the actor. It is the programmer's responsibility to ensure that no two restartable actors will attempt to register with the same name in the Hot Restart Controller.


Once a restartable actor has been registered and loaded, it runs under the control of the Hot Restart Controller. If the actor fails, the failure will provoke the restart of all the direct members of its restart group. These direct actors are then responsible for restarting any indirect actors registered in the group. To query an actor's restart group, use hrGetActorGroup(2RESTART):

#include <hr/hr.h>
hrGetActorGroup(int aid)

4.2.2 Group Restart

In the context of hot restart, an actor is considered to have abnormally terminated (and will therefore provoke the restart of its group) if any of the following occur:

There is no single API call which can explicitly force a group of actors to restart. For cases in which it may be desirable to provoke a restart (for example, for testing purposes), the easiest way to do so is to deliberately provoke one of the above cases. In the "hello world" example introduced in the previous chapter, this was done by causing a segmentation fault.

When an actor fails, all actors in the failed actor's restart group stop executing and the Hot Restart Controller restarts all direct actors in the group from their initial entry point. The direct actors are responsible for restarting any indirect actors, using hrfexec(). When hrfexec() is called with a name which is already registered in the Hot Restart Controller, the Controller recognizes the actor name and simply restarts the actor from the actor's actor image, instead of loading it from stable storage.

A restartable actor is always restarted at the same address. Its capability, actor ID and user ID are not guaranteed to be the same after restart. All system resources obtained before the restart are lost: in particular, open files, including those that had been inherited at the time of initial creation, are lost. This may include the standard I/O connected to an rsh connection.

A restarted actor uses the same arguments and environment parameters that were specified when the actor was initially started. For direct restartable actors, a new set of pre-open stdin/stdout/stderr is provided, which are connected to /dev/console. For indirect members, a new set of pre-open stdin/stdout/stderr is provided by the invoker of hrfexec(), just as for afexec(2K).

4.2.3 Freeing Persistent Memory

Just like any actor, a restartable actor can free persistent memory blocks using pmmFree() or pmmFreeAll(). This is described in "3.4.2 Freeing a Persistent Memory Block Explicitly".

Restartable actors which allocate memory with pmmAllocate() can also use a simple, automatic de-allocation mechanism provided by the Hot Restart Controller. This saves the actor from having to free its persistent memory explicitly. Instead, the persistent memory will remain allocated for the lifetime of the actor's group, and then be freed automatically by the Hot Restart Controller when the last member of the actor's restart group terminates cleanly. The disadvantage of this system is that the lifetime of the restart group may extend well beyond the point at which the memory block is no longer needed. In this case the memory block will take up space in persistent memory unnecessarily.

To mark a persistent memory block for automatic de-allocation by the Hot Restart Controller, pass the macros HR_GROUP_KEY and HR_GROUP_KEYSIZE as the delKey and delKeySize arguments respectively in the call to pmmAllocate(). These macros tie the lifetime of the persistent memory block to the lifetime of the calling actor's restart group.

A block marked for automatic de-allocation by the Hot Restart Controller can still be freed explicitly by calling pmmFree() with the block's PmmName. However, attempting to call pmmFreeAll() by passing the HR_GROUP_KEY and HR_GROUP_KEYSIZE macros will result in an error, as this is not permitted.
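The relationship between explicit freeing and automatic de-allocation can be illustrated with a small model. The sketch below does not call the Persistent Memory Manager; ModelPmm and the model_*() functions are hypothetical, an int stands in for a PmmName, and a non-zero group_key plays the role that HR_GROUP_KEY plays in a real pmmAllocate() call.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_BLOCKS 8
#define MODEL_NO_KEY     0   /* block is not tied to any restart group */

/* Hypothetical stand-in for one persistent memory block. */
typedef struct {
    int in_use;
    int name;        /* stands in for the block's PmmName        */
    int group_key;   /* restart group owning the block, or NO_KEY */
} ModelBlock;

typedef struct {
    ModelBlock blocks[MODEL_MAX_BLOCKS];
} ModelPmm;

/* Allocate a block, optionally tied to a restart group. 0 on success. */
static int model_allocate(ModelPmm *pmm, int name, int group_key)
{
    size_t i;

    assert(pmm != NULL);
    for (i = 0; i < MODEL_MAX_BLOCKS; i++) {
        if (!pmm->blocks[i].in_use) {
            pmm->blocks[i].in_use = 1;
            pmm->blocks[i].name = name;
            pmm->blocks[i].group_key = group_key;
            return 0;
        }
    }
    return -1;
}

/* Explicit free by name: allowed whether or not the block has a group key. */
static int model_free(ModelPmm *pmm, int name)
{
    size_t i;

    assert(pmm != NULL);
    for (i = 0; i < MODEL_MAX_BLOCKS; i++) {
        if (pmm->blocks[i].in_use && pmm->blocks[i].name == name) {
            pmm->blocks[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Called when the last member of a group has terminated cleanly:
 * frees every block still tied to that group, returns how many. */
static int model_group_terminated(ModelPmm *pmm, int group_key)
{
    size_t i;
    int freed = 0;

    assert(pmm != NULL && group_key != MODEL_NO_KEY);
    for (i = 0; i < MODEL_MAX_BLOCKS; i++) {
        if (pmm->blocks[i].in_use && pmm->blocks[i].group_key == group_key) {
            pmm->blocks[i].in_use = 0;
            freed++;
        }
    }
    return freed;
}

/* Number of blocks currently allocated. */
static int model_blocks_in_use(const ModelPmm *pmm)
{
    size_t i;
    int n = 0;

    assert(pmm != NULL);
    for (i = 0; i < MODEL_MAX_BLOCKS; i++)
        n += pmm->blocks[i].in_use ? 1 : 0;
    return n;
}
```

The model shows the trade-off described above: a group-keyed block that is no longer needed keeps occupying space until either it is freed explicitly by name or the whole group terminates.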

4.2.4 Clean Termination

As described in "4.2.2 Group Restart", any actor that exits before the expected completion of its task is considered to have terminated abnormally and will cause the restart of its actor group. This is useful for cases in which the actor does indeed exit prematurely as a result of an error. This mechanism is also useful for provoking an actor restart where this is explicitly desired, for example, when an execution problem is detected.

To allow a restartable actor to terminate cleanly without causing a restart, use the HR_EXIT_HDL() macro prior to the call to exit(3STDC):

#include <hr/hr.h>
HR_EXIT_HDL();

The purpose of this macro is to register an additional hot restart exit handler through atexit(3STDC). The hot restart exit handler effectively removes the concerned actor from the Hot Restart Controller's responsibility: once an actor has called HR_EXIT_HDL(), the Hot Restart Controller will no longer monitor it for abnormal termination. As a result, when the actor exits, it will terminate cleanly and will not trigger a restart.

Call the HR_EXIT_HDL() macro shortly before the actor exits. Calling the macro earlier in the actor code will mean that any unexpected exit between the macro call and the final exit will not be detected by the Hot Restart Controller. As a result, the actor will not be restarted if it exits abnormally.

Cleanly terminating an actor does not deregister the actor in the Hot Restart Controller, or remove the actor's actor image and executing image from persistent memory. This is because a cleanly terminated actor will still be restarted if its group is restarted, since a group is always restarted in its initial state. In other words, when a group is restarted, all direct restartable actors will recommence execution at their initial entry point, regardless of whether or not they had already exited before the restart occurred. This is shown in the following diagram. Both direct actor 1 (DA1) and indirect actor 2 (A2) terminate cleanly, but are restarted when direct actor 2 (DA2) crashes.

Figure 4-1 Restart of Cleanly Terminated Actors


Because of this behavior, it can be useful to record the clean termination of restartable actors which will never need to be re-executed completely during a group's lifetime by setting a flag in persistent memory. A restarted actor can check the state of this flag at the start of its execution, and thus detect whether it should re-execute or not.

4.2.4.1 Group Termination

For each group of restartable actors present in a ChorusOS system, the Hot Restart Controller stores a list of the actors in the group in a persistent memory block. An actor is added to the list when it is first started. When an actor cleanly terminates, the Hot Restart Controller notes this in the list. When all actors in the list have terminated cleanly, the Hot Restart Controller does the following:

A group of actors can only terminate if all of its member actors terminate cleanly. This is important to remember in situations where not all indirect actors are restarted after a group restart. This is a matter of execution flow: if certain conditions in a direct actor change the actor's flow from one execution to the next, the direct actor may not restart an indirect actor which was running prior to the restart. As a result, the indirect actor will never terminate cleanly and so the group will not be able to terminate.

For example, consider the situation in the following figure. The direct actor spawns the indirect actor only if a certain condition is fulfilled. This condition is fulfilled the first time the direct actor runs. After the direct actor restarts, the condition is no longer fulfilled, so the indirect actor is no longer spawned.

Figure 4-2 Conditional Spawning of a Restartable Actor


In the situation illustrated above, the actor group will not be able to terminate until the indirect actor has been rerun using hrfexec(), and has terminated cleanly.
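The termination rule can be expressed as a short check over the group's member list. The following sketch is a model only; ModelMember and model_blocking_members() are hypothetical names, and the real list is maintained by the Hot Restart Controller in persistent memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in the group's member list. */
typedef struct {
    const char *name;
    int         terminated_cleanly;   /* non-zero once the actor exited cleanly */
} ModelMember;

/*
 * Counts the members which prevent the group from terminating, and stores
 * up to max_blockers of their names in blockers (which may be NULL).
 * The group can terminate only when the returned count is zero.
 */
static size_t model_blocking_members(const ModelMember *members, size_t count,
                                     const char **blockers, size_t max_blockers)
{
    size_t i;
    size_t n = 0;

    assert(members != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (!members[i].terminated_cleanly) {
            if (blockers != NULL && n < max_blockers)
                blockers[n] = members[i].name;
            n++;
        }
    }
    return n;
}
```

In the conditional spawning scenario, the indirect actor stays in the list without a clean termination, so the count remains at one until that actor has been rerun and has exited cleanly.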

When a restart group cannot terminate because one or more indirect actors are in this situation, the Hot Restart Controller detects the fact and prints the following message on the target console:


HR_CTRL: group gid blocked, some members have not terminated: list_of_actors

gid is the ID of the group concerned, and list_of_actors provides the name of each actor which prevents the group from terminating. When this message appears, a basic solution is to kill the actor group using the akill command with the -g option, as described in "4.3 Killing Restartable Actors". This solution is only useful if none of the indirect actors need to be run to complete the group's task.

A better solution is to use careful application design. If the situation is likely to occur, flags can be stored in persistent memory to indicate indirect actors which have not terminated cleanly. An actor can then be made responsible for cleaning up the group, that is, restarting each indirect actor which is flagged. This clean-up actor can be run using the arun -g command when the Hot Restart Controller notification appears on the target console. Alternatively, the group could be designed so that the clean-up actor is always run just before the group is expected to terminate, in which case the problem is solved without the need for access to the C_INIT console.

4.3 Killing Restartable Actors

At times it may be desirable to circumvent the automatic restart mechanism provided by the Hot Restart Controller and explicitly terminate (kill) a restartable actor. Actors which are killed will not be restarted. Killing an actor automatically kills all actors within the actor's restart group. This is because a restart group must remain consistent, and may not be able to function properly if an actor is no longer present.

Restartable actors can be explicitly killed using either of the following:

Either method has the same result: all actors in the associated restart group are killed, and the Hot Restart Controller terminates the group as though all actors had exited cleanly (see "4.2.4.1 Group Termination").

4.4 Site Restart

A site restart is a hot restart of the whole system. All data of boot actors are reset to their original values from the previously loaded archive, and the system enters its start-up phase again. As C_INIT restarts, sysadm.ini is executed again. Any calls to start restartable actors in the sysadm.ini file are ignored for a site restart, as all direct restartable actors are restarted automatically by the system once sysadm.ini has been read.


Note -

When the system is restarted, previously mounted disks are not automatically remounted. To solve this problem, ensure that they are mounted in the sysadm.ini file, or create a hot restartable actor that will mount the disks.


A site restart can be provoked automatically, by the Hot Restart Controller, according to the tunable parameters defining the system's restart policy. This is described in "2.1.3 Tunable Parameters".

To provoke a site restart programmatically, use the sysShutdown(2K) function call with the -i 1 arguments:

int sysShutdown(int argc, char** argv)

To provoke a site restart from the C_INIT command-line console, use the command shutdown -i 1 or restart(1M).

4.5 Putting It All Together: the restartSpawn Example Program

A programming example, restartSpawn, illustrates many of the function calls covered in this chapter and previous chapters. The example is provided as a framework illustration of the restart mechanism and the use of persistent memory. Parts of the example could be used as the basis of a more complex user application that incorporates hot restart.

The restartSpawn example uses two restartable actors: a parent actor, HR_parent.r, and a child actor, HR_child.r, which is spawned by the parent. Both actors should be compiled as supervisor actors. The source code for the two actors is provided in Appendix B, Example Application Code. The example can be summarized as follows:

The message is printed independently of the number of times the parent actor crashes or the site is restarted.