The ChorusOS(TM) 4.0 system's hot restart feature has been designed and implemented to address the high-availability requirements of ChorusOS system builders. Hot restart provides an advanced mechanism for restarting ChorusOS applications or the entire system when a serious error or failure occurs. Traditionally, system recovery from such errors or failures involves terminating applications and reloading them from stable storage, or rebooting the system. This causes system downtime, and can mean that important application data is lost. Such behavior is unacceptable for system builders seeking '7 by 24' or 'five nines' system availability.
The ChorusOS 4.0 hot restart feature solves the problem of downtime and data loss by using persistent memory, that is, memory that can persist beyond the lifetime of a particular run-time instance of an actor. When an actor that uses the hot restart feature fails or terminates abnormally, the system uses the actor data stored in persistent memory to reconstruct the actor without accessing stable storage. This reconstruction of an actor from persistent memory instead of from stable storage is known as hot restarting (or simply restarting) the actor.
Hot restarting one or more actors is significantly faster than conventional failure recovery techniques (application reload or cold system reboot) because it protects critical information that allows the failed portions of a system to be reconstructed quickly, with minimal interruption in service.
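The idea of data outliving a run-time instance of an actor can be illustrated with a small model. The following Python sketch is a conceptual illustration only, not the ChorusOS hot restart API: the names PersistentMemory, CounterActor, allocate and free are hypothetical, and an ordinary in-process object stands in for system-managed persistent memory.

```python
class PersistentMemory:
    """Toy stand-in for system-managed persistent memory (hypothetical)."""

    def __init__(self):
        self._blocks = {}  # block name -> bytearray

    def allocate(self, name, size):
        """Return the named block, creating a zero-filled one on first use."""
        if name not in self._blocks:
            self._blocks[name] = bytearray(size)
        return self._blocks[name]

    def free(self, name):
        """Release the named block; a later allocate() starts from zeros."""
        self._blocks.pop(name, None)


class CounterActor:
    """One run-time instance of an actor that keeps a counter in persistent memory."""

    def __init__(self, pmem):
        # After a restart, allocate() hands back the block left by the
        # previous instance, so the counter resumes where it stopped.
        self._block = pmem.allocate("counter", 4)

    @property
    def count(self):
        return int.from_bytes(self._block, "little")

    def handle_request(self):
        self._block[:] = (self.count + 1).to_bytes(4, "little")


pmem = PersistentMemory()
actor = CounterActor(pmem)
for _ in range(3):
    actor.handle_request()

del actor                       # simulate abnormal termination of this instance
restarted = CounterActor(pmem)  # "hot restart": state comes from pmem, not from disk
print(restarted.count)          # prints 3
```

In this model the restarted instance sees the counter value written by the failed instance because the block belongs to the memory service rather than to the actor instance. Freeing the block explicitly is what discards the state.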
ChorusOS hot restart comprises an API and run-time architecture which offer the following services:
persistent memory allocation
The hot restart API allows actors to allocate and free portions of persistent memory while they are executing. This service is available to all ChorusOS actors once hot restart is configured.
actor restart
With hot restart, the system is capable of detecting the abnormal termination of one or more actors and restarting them automatically from persistent memory. In addition, actors are organized into restart groups, enabling the simultaneous restart of all actors in a predefined group when a single actor in the group fails.
site restart
With hot restart, in addition to restarting one or more actors, the system is capable of restarting all restartable actors, plus the kernel and boot actors, for a given ChorusOS site.
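The difference between actor (group) restart and site restart described above can be sketched as a simple bookkeeping model. This Python sketch is an assumption-laden illustration, not the ChorusOS implementation: RestartSupervisor, register, actor_failed and site_restart are hypothetical names, the actor and group names are invented, and the restart of the kernel and boot actors during a site restart is not modeled.

```python
from collections import defaultdict


class RestartSupervisor:
    """Toy model of group and site restart (hypothetical, not the ChorusOS API)."""

    def __init__(self):
        self._groups = defaultdict(list)        # group name -> actor names
        self._group_of = {}                     # actor name -> group name
        self.restart_counts = defaultdict(int)  # actor name -> times restarted

    def register(self, actor, group):
        """Declare a restartable actor as a member of a restart group."""
        self._groups[group].append(actor)
        self._group_of[actor] = group

    def actor_failed(self, actor):
        """Restart every actor in the failed actor's group; return their names."""
        members = list(self._groups[self._group_of[actor]])
        for name in members:
            self.restart_counts[name] += 1
        return members

    def site_restart(self):
        """Restart all registered (restartable) actors; return their names."""
        everyone = list(self._group_of)
        for name in everyone:
            self.restart_counts[name] += 1
        return everyone


sup = RestartSupervisor()
sup.register("call_control", "telephony")
sup.register("billing", "telephony")
sup.register("logger", "maintenance")

print(sup.actor_failed("billing"))  # both telephony actors restart together
print(sup.site_restart())           # every restartable actor restarts
```

A failure of billing restarts call_control as well, because the two share a group, while logger is untouched until the site restart.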
The combination of these services provides a powerful framework for highly-available systems and applications, dramatically reducing the time it takes for a failed system or component to return to service.