7 Starting and Stopping Cluster Members
This chapter includes the following sections:
- Starting Cache Servers
  Cache servers are cluster members that are responsible for storing cached data.
- Starting Cache Clients
  Cache clients are cluster members that join the cluster to interact with the cluster's services.
- Stopping Cluster Members
  You can stop cluster members from the command line or programmatically.
- Performing a Rolling Restart
  A rolling restart is a technique for restarting cache servers in a cluster that ensures no data is lost during the restart.
Parent topic: Using Coherence Clusters
Starting Cache Servers
This section includes the following topics:
- Overview of the DefaultCacheServer Class
- Starting Cache Servers From the Command Line
- Starting Cache Servers Programmatically
Overview of the DefaultCacheServer Class
The com.tangosol.net.DefaultCacheServer class is used to start a cache server. A cache server can be started from the command line or programmatically. The following arguments are used when starting a cache server:
- The name of a cache configuration file that is found on the classpath, or the path to a Grid ARchive (GAR). If both are provided, the GAR takes precedence. A GAR includes the artifacts that comprise a Coherence application and adheres to a specific directory structure. A GAR can be left as a directory or can be archived with a .gar extension. See Building a Coherence GAR Module in Administering Oracle Coherence.
- An optional application name for the GAR. If no name is provided, the archive name is used (the directory name or the file name without the .gar extension). The name provides an application scope that is used to separate applications on a cluster.
- The number of seconds between checks for stopped services. Stopped services are only automatically started if they are set to be automatically started (as configured by an <autostart> element in the cache configuration file). The default value if no argument is provided is 5 seconds.
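The default application name rule described above (the directory name, or the file name without the .gar extension) can be illustrated with a small helper. This is only a sketch of the naming rule as stated in the text, not the code that Coherence itself runs; the GarNames class and its method are hypothetical names introduced for illustration.

```java
/** Hypothetical helper that mirrors the default GAR application name rule described in the text. */
final class GarNames {
    private GarNames() {}

    /** Returns the last path segment, with a trailing ".gar" extension removed if present. */
    static String defaultApplicationName(String garPath) {
        // Drop trailing separators so that a directory path such as "apps/MyGAR/" still works.
        String path = garPath;
        while (path.length() > 1 && (path.endsWith("/") || path.endsWith("\\"))) {
            path = path.substring(0, path.length() - 1);
        }
        // Accept both UNIX and Windows separators, regardless of the host platform.
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(cut + 1);
        return name.endsWith(".gar")
                ? name.substring(0, name.length() - ".gar".length())
                : name;
    }

    public static void main(String[] args) {
        System.out.println(defaultApplicationName("D:\\example\\MyGAR.gar")); // archived GAR
        System.out.println(defaultApplicationName("/opt/apps/MyGAR"));        // exploded GAR directory
    }
}
```

Both calls in main yield MyGAR, which is the application scope that would apply if no explicit application name argument were passed.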
Starting Cache Servers From the Command Line
Cache servers are typically started from the command line. Use the Java -cp option to indicate the location of the coherence.jar file and the location where the tangosol-coherence-override.xml and coherence-cache-config.xml files are located. The location of the configuration files must precede the coherence.jar file on the classpath; otherwise, the default configuration files that are located in the coherence.jar file are used to start the cache server instance. See Understanding Configuration.
The following example starts a cache server member, uses any configuration files that are placed in the COHERENCE_HOME\config directory, and checks for service restarts every two seconds.
java -server -Xms512m -Xmx512m -cp COHERENCE_HOME\config;COHERENCE_HOME\lib\coherence.jar com.tangosol.net.DefaultCacheServer 2
The following example starts a cache server member and uses the Coherence application artifacts that are packaged in the MyGAR.gar file. The default name (MyGAR) is used as the application name.
java -server -Xms512m -Xmx512m -cp COHERENCE_HOME\config;COHERENCE_HOME\lib\coherence.jar com.tangosol.net.DefaultCacheServer D:\example\MyGAR.gar
Note:
The cache configuration file that is packaged in a GAR file takes precedence over a cache configuration file that is located on the classpath.
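When a cache server is launched from another Java program (for example, a test harness), the same classpath ordering applies: the configuration directory must come before coherence.jar. The following sketch assembles the command shown earlier as an argument list. The CacheServerCommand class and the /opt/coherence location are assumptions made for illustration; the command is only printed, not executed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds the DefaultCacheServer launch command as an argument list. */
final class CacheServerCommand {
    private CacheServerCommand() {}

    static List<String> build(String coherenceHome, int checkSeconds) {
        String configDir = coherenceHome + File.separator + "config";
        String jar = coherenceHome + File.separator + "lib" + File.separator + "coherence.jar";

        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-server");
        cmd.add("-Xms512m");
        cmd.add("-Xmx512m");
        cmd.add("-cp");
        // The configuration directory precedes coherence.jar so that override files are found first.
        cmd.add(configDir + File.pathSeparator + jar);
        cmd.add("com.tangosol.net.DefaultCacheServer");
        // Number of seconds between checks for stopped services.
        cmd.add(String.valueOf(checkSeconds));
        return cmd;
    }

    public static void main(String[] args) {
        // "/opt/coherence" is a placeholder for COHERENCE_HOME.
        List<String> cmd = build("/opt/coherence", 2);
        System.out.println(String.join(" ", cmd));
        // To actually launch the process: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Using File.pathSeparator keeps the classpath valid on both Windows (;) and UNIX-based platforms (:).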
The COHERENCE_HOME\bin\cache-server script is provided as a convenience and can start a cache server instance. The script is available for both Windows (cache-server.cmd) and UNIX-based platforms (cache-server.sh). The script sets up a basic environment and then runs the DefaultCacheServer class. The scripts are typically modified as required for a particular cluster.
Tip:
During testing, it is sometimes useful to create multiple scripts with different names that uniquely identify each cache server. For example: cache-server-a, cache-server-b, and so on.
Lastly, a cache server can be started on the command line by using the java -jar command with the coherence.jar library. Cache servers are typically started this way for testing and demonstration purposes. For example:
java -jar COHERENCE_HOME\lib\coherence.jar
Starting Cache Servers Programmatically
An application can use or extend the DefaultCacheServer class as required when starting a cache server. For example, an application may want to do some application-specific setup or processing before starting a cache server and its services.
The following example starts a cache server using the main method:
String[] args = new String[]{"my-cache-config.xml", "5"};
DefaultCacheServer.main(args);
The DefaultCacheServer(ConfigurableCacheFactory) constructor uses a factory class to create a cache server instance that uses a specified cache configuration file. The following example uses the ExtensibleConfigurableCacheFactory implementation to create a DefaultCacheServer instance and then calls the startAndMonitor(long) method to start a cache server as in the previous example:
ExtensibleConfigurableCacheFactory.Dependencies deps =
   ExtensibleConfigurableCacheFactory.DependenciesHelper.newInstance("my-cache-config.xml");
ExtensibleConfigurableCacheFactory factory;
factory = new ExtensibleConfigurableCacheFactory(deps);
DefaultCacheServer dcs = new DefaultCacheServer(factory);
// The interval is given in milliseconds: 5000 ms corresponds to the 5-second check above.
dcs.startAndMonitor(5000);
The static startDaemon() method starts a cache server on a dedicated daemon thread and is intended for use within managed containers.
Two additional static start methods (start() and start(ConfigurableCacheFactory)) are also available to start a cache server and return control. However, the cache factory class is typically used instead of these methods, which remain for backward compatibility.
Applications that require even more fine-grained control can subclass the DefaultCacheServer class and override its methods to perform any custom processing as required.
Starting Cache Clients
This section includes the following topics:
- Disabling Local Storage
- Using the CacheFactory Class to Start a Cache Client
Disabling Local Storage
Cache clients that use the partitioned cache service (distributed caches) should not maintain any partitioned data. Cache clients that have storage disabled perform better and use fewer resources. Partitioned data should only be distributed among cache server instances.
Local storage is disabled on a per-process basis using the coherence.distributed.localstorage system property. This allows cache clients and servers to use the same configuration descriptors. For example:
java -cp COHERENCE_HOME\config;COHERENCE_HOME\lib\coherence.jar -Dcoherence.distributed.localstorage=false com.MyApp
Using the CacheFactory Class to Start a Cache Client
Applications that use the com.tangosol.net.CacheFactory class to get an instance of a cache become cluster members and are considered cache clients. The following example demonstrates the most common way of starting a cache client:
CacheFactory.ensureCluster();
NamedCache cache = CacheFactory.getCache("cache_name");
When starting an application that is a cache client, use the Java -cp option to indicate the location of the coherence.jar file and the location where the tangosol-coherence-override.xml and coherence-cache-config.xml files are located. The location of the configuration files must precede the coherence.jar file on the classpath; otherwise, the default configuration files that are located in the coherence.jar file are used to start the cache client. See Understanding Configuration.
The following example starts an application that is a cache client, uses any configuration files that are placed in the COHERENCE_HOME\config directory, and disables storage on the member.
java -cp COHERENCE_HOME\config;COHERENCE_HOME\lib\coherence.jar -Dcoherence.distributed.localstorage=false com.MyApp
The COHERENCE_HOME\bin\coherence script is provided for testing purposes and can start a cache client instance. The script is available for both Windows (coherence.cmd) and UNIX-based platforms (coherence.sh). The script sets up a basic environment, sets storage to be disabled, and then runs the CacheFactory class, which returns a prompt. The prompt is used to enter commands for interacting with a cache and a cluster. The scripts are typically modified as required for a particular cluster. The class can also be started directly from the command line instead of using the script. For example:
java -cp COHERENCE_HOME\config;COHERENCE_HOME\lib\coherence.jar -Dcoherence.distributed.localstorage=false com.tangosol.net.CacheFactory
If a Coherence application is packaged as a GAR, the GAR can be loaded by the CacheFactory instance using the server command at the prompt after the client member starts.
server [<path-to-gar>] [<app-name>]
The following example loads the Coherence application artifacts that are packaged in the MyGAR.gar file. The default name (MyGAR) is used as the application name.
Map (?) server D:\example\MyGAR.gar
Stopping Cluster Members
This section includes the following topics:
- Prerequisites for Stopping All Cluster Members
- Stopping Cluster Members From the Command Line
- Stopping Cache Servers Programmatically
Prerequisites for Stopping All Cluster Members
To ensure there is no data loss during a controlled shutdown, you can leverage the service suspend feature. A service is considered suspended only after all the data is fully written, including active persistence mode, asynchronous persistence tasks, entries in the write-behind queue of a read-write backing map, and other asynchronous operations. Outstanding operations are completed and no new operations are allowed against the suspended services.
Thus, for a controlled, complete shutdown of a cluster, Oracle recommends executing the Coherence ClusterMBean operation suspendService("Cluster"), which gracefully suspends all services, before shutting down the cluster members.
Stopping Cluster Members From the Command Line
Cluster members are most often shut down using the kill command on the UNIX platform and Ctrl+c on the Windows platform. These commands initiate the standard JVM shutdown hook, which is invoked upon normal JVM termination.
Note:
Issuing the kill -9 command triggers an abnormal JVM termination and the shutdown hook does not run. However, a graceful shutdown is generally not required if a service is known to be node-safe (as seen using JMX management) before termination.
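The difference between kill and kill -9 comes from the standard JVM shutdown hook mechanism, which can be observed with plain Java. The following is a generic sketch of a shutdown hook, not the hook that Coherence registers; the ShutdownHookDemo class name is hypothetical.

```java
/** Generic demonstration of the JVM shutdown hook mechanism (not Coherence's own hook). */
final class ShutdownHookDemo {
    private ShutdownHookDemo() {}

    /** Registers a hook that the JVM runs on normal termination and returns the hook thread. */
    static Thread register(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "demo-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        register(() -> System.out.println("Shutdown hook running: release resources here"));
        System.out.println("Main is returning; the JVM now begins its normal shutdown sequence");
        // The hook runs when main returns, on Ctrl+c, or on kill <pid> (SIGTERM).
        // It does not run on kill -9 <pid> (SIGKILL), because the JVM gets no chance to react.
    }
}
```

A process that is terminated with kill -9 never prints the hook message, which is why an abnormal termination skips any graceful shutdown logic.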
The action a cluster member takes when receiving a shutdown command is configured in the operational override file within the <shutdown-listener> element. The following options are available:
- none - Perform no explicit shutdown actions.
- force - Perform a hard-stop on the node by calling Cluster.stop().
- graceful - (Default) Perform a normal shutdown by calling Cluster.shutdown().
- true - Same as force.
- false - Same as none.
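The five accepted values reduce to three distinct behaviors, because true and false are aliases. The following sketch shows that mapping as an enum. It is an illustration of the option table above, not Coherence's parsing code: the ShutdownAction type is hypothetical, and treating values as case-insensitive and treating an empty value as the default are assumptions.

```java
import java.util.Locale;

/** Hypothetical model of the <shutdown-listener> options; not part of the Coherence API. */
enum ShutdownAction {
    NONE,      // perform no explicit shutdown actions
    FORCE,     // hard-stop, corresponding to Cluster.stop()
    GRACEFUL;  // normal shutdown, corresponding to Cluster.shutdown()

    /** Maps a configured value to an action; a missing value falls back to the documented default. */
    static ShutdownAction fromSetting(String value) {
        if (value == null || value.trim().isEmpty()) {
            return GRACEFUL; // documented default
        }
        // Case-insensitive matching is an assumption made for this sketch.
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "none":
            case "false":
                return NONE;
            case "force":
            case "true":
                return FORCE;
            case "graceful":
                return GRACEFUL;
            default:
                throw new IllegalArgumentException("Unknown shutdown-listener value: " + value);
        }
    }

    public static void main(String[] args) {
        for (String v : new String[] {"none", "force", "graceful", "true", "false"}) {
            System.out.println(v + " -> " + fromSetting(v));
        }
    }
}
```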
The coherence.shutdown.timeout system property configures the duration to wait for shutdown to complete before timing out. The default value is 2 minutes. If a Coherence application uses persistence, write-behind cache store, or any other asynchronous operation that takes longer than 2 minutes, then it is necessary to configure the timeout to be longer to allow the pending asynchronous operations to complete. The system property value is configured as a time duration such as "3s" for 3 seconds, "5m" for 5 minutes, or "1hr" for 1 hour.
When the time required to complete a graceful shutdown exceeds the coherence.shutdown.timeout value, the JVM process is considered hung and is terminated abruptly using halt. When the shutdown times out, some outstanding asynchronous operations may not have completed. Therefore, it is important to ensure that coherence.shutdown.timeout is configured to a time duration that is sufficiently long for the number of outstanding asynchronous operations an application environment may have.
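The general pattern behind such a timeout is to run the shutdown work on a separate thread, wait for a bounded time, and give up if the work has not finished. The following is a generic Java sketch of that pattern, not Coherence's implementation; the ShutdownWithTimeout class is hypothetical, and the sketch only reports the timeout instead of halting the JVM.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Generic "bounded wait for shutdown work" pattern; illustrative only. */
final class ShutdownWithTimeout {
    private ShutdownWithTimeout() {}

    /** Runs the task and returns true if it finished within the timeout, false otherwise. */
    static boolean runWithTimeout(Runnable task, long timeout, TimeUnit unit) {
        // A daemon thread is used so that an unfinished task cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "graceful-shutdown");
            t.setDaemon(true);
            return t;
        });
        Future<?> result = executor.submit(task);
        try {
            result.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // pending work is abandoned, which is where data could be lost
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Shutdown task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean completed = runWithTimeout(() -> { /* flush pending writes here */ }, 2, TimeUnit.SECONDS);
        System.out.println(completed
                ? "Shutdown work completed in time"
                : "Timed out; a real process would be halted at this point");
    }
}
```

The sketch makes the trade-off visible: any work still pending when the wait expires is abandoned, so the timeout has to cover the slowest expected asynchronous operation.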
The following example sets the shutdown hook to none.
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <shutdown-listener>
         <enabled system-property="coherence.shutdownhook">none</enabled>
      </shutdown-listener>
   </cluster-config>
</coherence>
The coherence.shutdownhook system property is used to specify the shutdown hook behavior instead of using the operational override file. For example:
-Dcoherence.shutdownhook=none
Stopping Cache Servers Programmatically
The DefaultCacheServer class provides two methods that are used to shut down a cache server:
- shutdown() - This is a static method that is used to shut down a cache server that was started on a different thread using the DefaultCacheServer.main() or DefaultCacheServer.start() methods.
- shutdownServer() - This method is called on a DefaultCacheServer instance that an application holds a reference to.
Note:
The shutdown() method is intended to be called in a standalone application, where it shuts down the instance that the DefaultCacheServer class itself maintains as a static member.
Performing a Rolling Restart
Rolling restarts are commonly performed when a cache server or its host computer must be updated or when upgrading a cache server to a new patch set release or patch set update release. However, the technique can also be used whenever you want to restart a cache server that is currently managing a portion of cached data.
Note:
When upgrading a cluster, a rolling restart can only be used to upgrade patch set releases or patch set update releases, but not major or minor releases. See Release Number Format in Administering Oracle Fusion Middleware.
This section includes the following topics:
- Prerequisites for a Rolling Restart
- Restarting Cache Servers for a Rolling Restart
- Persistence Roll Back After or During a Rolling Restart
Prerequisites for a Rolling Restart
A rolling restart requires initial considerations and setup before you restart a cluster. A rolling restart cannot be performed on a cluster that does not meet the following prerequisites:
- The cache servers in a cluster must provide enough capacity to handle the shutdown of a single cache server (n minus 1 where n is the number of cache servers in the cluster). An out-of-memory exception or data eviction can occur during a redistribution of data if the cache servers are running at capacity. See Cache Size Calculation Recommendations in Administering Oracle Coherence.
- Remote JMX management must be enabled on all cache servers and at least two cache servers must contain an operational MBean server. Ensure that you can connect to the MBean servers using an MBean browser such as JConsole. See Using JMX to Manage Oracle Coherence in Managing Oracle Coherence.
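The n minus 1 capacity requirement above can be checked with simple arithmetic before a restart is attempted. The following sketch assumes equally sized cache servers and assumes that the total data figure already includes backup copies and overhead; the RollingRestartCapacity class and the example numbers are hypothetical.

```java
/** Hypothetical back-of-the-envelope check for the "n minus 1" capacity prerequisite. */
final class RollingRestartCapacity {
    private RollingRestartCapacity() {}

    /**
     * Returns true if the data still fits when one of the equally sized servers is stopped.
     * totalDataBytes is assumed to include backup copies and per-entry overhead.
     */
    static boolean fitsWithOneServerDown(int servers, long usableBytesPerServer, long totalDataBytes) {
        if (servers < 2) {
            return false; // with a single server there is nowhere to redistribute the data
        }
        long remainingCapacity = (long) (servers - 1) * usableBytesPerServer;
        return totalDataBytes <= remainingCapacity;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        // 4 servers with 2 GB usable each: 3 remaining servers provide 6 GB.
        System.out.println(fitsWithOneServerDown(4, 2 * gb, 5 * gb)); // 5 GB fits in 6 GB
        System.out.println(fitsWithOneServerDown(4, 2 * gb, 7 * gb)); // 7 GB does not fit in 6 GB
    }
}
```

If the check fails, the cluster needs additional cache servers (or less data) before a rolling restart is safe.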
If a cache service is configured to use asynchronous backups, use the shutdown method to perform an orderly shutdown instead of the stop method or kill -9. Otherwise, a member may shut down before asynchronous backups are complete. The shutdown method guarantees that all updates are complete.
If using persistence, while a rolling restart is agnostic to the format of the data stored on disk, there are scenarios where you can take precautionary steps to ensure that there is a 'savepoint' to which the cluster can be rolled back if faced with a catastrophic event. On-disk persisted files may become unreadable when going from a later to an earlier version.
Therefore, Oracle recommends that you take a persistent snapshot of the relevant services before the roll. See Using Snapshots to Persist a Cache Service. Depending on the tolerance of the application, the snapshot can be taken with the service suspended (to ensure global consistency) or without suspending it. The snapshot offers a 'savepoint' to which the cluster can be rolled back if an issue occurs during the roll.
Note:
Coherence may change the format of the data saved on disk. When rolling from a version that includes a change in the storage format, Oracle strongly recommends creating a snapshot before the roll or upgrade.
Restarting Cache Servers for a Rolling Restart
Use these instructions to restart a cache server. If you are restarting the host computer, then make sure all cache server processes are shut down before shutting down the computer.
Note:
The instructions below assume that none of the cache servers share the same Coherence JAR or ORACLE_HOME. If some of the cache servers share the same Coherence JAR or ORACLE_HOME, then treat the servers as a logical unit and perform the steps on each of the servers at the same time.
To restart a cache server:
1. Connect to a Coherence MBean server using an MBean browser. Ensure that the MBean server is not hosted on the cache server that is being restarted.
2. From the Coherence Service MBean, select a cluster service that corresponds to a cache that is configured in the cache configuration file.
3. Check the StatusHA attribute for any cluster member to ensure that the attribute's value is MACHINE-SAFE. The MACHINE-SAFE state indicates that all the cache servers running on any given computer could be stopped without data loss. If the attribute is not MACHINE-SAFE, then additional cache servers, possibly on different computers, must be started before performing a restart. See ServiceMBean in Managing Oracle Coherence.
4. Shut down the cache server.
5. From the MBean browser, recheck the StatusHA attribute and wait for the state to return to MACHINE-SAFE.
6. Restart the cache server.
7. From the MBean browser, recheck the StatusHA attribute and wait for the state to return to MACHINE-SAFE.
8. Repeat steps 4 to 7 for additional cache servers that are to be restarted.
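The waiting in the procedure above (rechecking StatusHA until it returns to MACHINE-SAFE) is a polling loop and can be automated. The following sketch shows only the polling logic. The status source is passed in as a Supplier so that the example runs without a cluster; in a real script the supplier would read the StatusHA attribute of the Service MBean over JMX. The StatusHaWaiter class and its parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical polling helper for the "wait for MACHINE-SAFE" steps of a rolling restart. */
final class StatusHaWaiter {
    private StatusHaWaiter() {}

    /**
     * Polls the status source until it reports the expected value.
     * Returns true on success, false if maxPolls is exhausted or the thread is interrupted.
     */
    static boolean awaitStatus(Supplier<String> statusHa, String expected, int maxPolls, long pollMillis) {
        for (int i = 0; i < maxPolls; i++) {
            if (expected.equals(statusHa.get())) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated sequence of StatusHA values observed after a cache server is stopped.
        Iterator<String> simulated =
                Arrays.asList("ENDANGERED", "NODE-SAFE", "MACHINE-SAFE").iterator();
        boolean safe = awaitStatus(simulated::next, "MACHINE-SAFE", 3, 10);
        System.out.println("Safe to proceed with the next cache server: " + safe);
    }
}
```

A restart script would call awaitStatus after shutting down a cache server and again after restarting it, and stop the roll if either call returns false.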
Persistence Roll Back After or During a Rolling Restart
If an issue is observed during the rolling restart, it is advisable to roll back the entire cluster. You can roll back the cluster by rolling the nodes with the new version of the application (and/or Coherence) back to their prior versions or, in more severe cases, by completely restarting the cluster.
If the new version being migrated to includes a Coherence patch that modifies the storage format (as highlighted in the Oracle Coherence Release Notes), the persistent stores could be in a mixed state of the old and new format.
During a full cluster restart, the mixed state can result in a failure to recover, causing the failed stores to be moved into the trash. At this point, you can choose the forceRecovery operation of the PersistenceManagerMBean (in JConsole, for example, Coherence/Persistence/<servicename>/PersistenceCoordinator) to force Coherence to reinstate empty versions of those partitions that could not be recovered. If you have created a snapshot of the relevant services, it is possible to recover the snapshot, reverting the state of those services.