The Oracle Private Cloud Appliance (PCA) is designed for high availability at every level of its component architecture.
Management Node Failover
During the factory installation of an Oracle PCA, the management nodes are configured as a cluster. The cluster relies on an OCFS2 file system, exported as a LUN from the ZFS storage appliance, to perform the heartbeat function and to store a lock file that each management node attempts to take control of. The management node that holds the lock file automatically becomes the master, or active, node in the cluster.
When the Oracle PCA is first initialized, the o2cb service is started on each management node. This service is the default cluster stack for the OCFS2 file system. It includes a node manager that keeps track of the nodes in the cluster, a heartbeat agent to detect live nodes, a network agent for intra-cluster node communication, and a distributed lock manager to keep track of lock resources. All of these components run in-kernel.
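The lock-file contention that elects the active node can be sketched in Python. In this simplified model, a non-blocking exclusive flock on a local file stands in for the OCFS2 distributed lock manager, and the lock path is purely illustrative (the real lock lives on the shared LUN):

```python
import fcntl
import os

# Illustrative path; on the appliance the lock file resides on the shared OCFS2 LUN.
LOCK_FILE = "/tmp/pca_cluster.lock"

def try_acquire_cluster_lock(path=LOCK_FILE):
    """Attempt a non-blocking exclusive lock; the winner becomes the active node."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # lock held: this node is promoted to the active role
    except BlockingIOError:
        os.close(fd)
        return None  # another node holds the lock: remain standby

fd = try_acquire_cluster_lock()
print("active" if fd is not None else "standby")
```

Only one contender can hold the exclusive lock at a time; every other node's attempt fails immediately and it remains standby, which mirrors how control of the cluster lock file determines the active management node.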
Additionally, the ovca service is started on each management node. The management node that obtains control over the cluster lock is promoted to the master, or active, management node and runs the full complement of Oracle PCA services. This process also configures the Virtual IP used to access the active management node, so that it is 'up' on the active management node and 'down' on the standby management node. As a result, when you connect to the Virtual IP address configured for the management nodes, you always reach the active management node.
If the active management node fails, the cluster detects the failure and the lock is released. Since the standby management node constantly polls for control of the lock file, it detects as soon as it has obtained the lock, and the ovca service brings up all of the required Oracle PCA services. The Virtual IP is configured on the appropriate interface on the standby management node as it is promoted to the active role.
When the management node that failed comes back online, it no longer has control of the cluster lock file. It is automatically put into standby mode, and the Virtual IP is removed from the management interface. This means that one of the two management nodes in the rack is always available through the same IP address and is always correctly configured. The management node failover process takes up to 5 minutes to complete.
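The standby node's promotion can be summarized as a polling loop. The sketch below is a simplified model of that behavior; the function names, the interface name, and the polling interval are hypothetical, not the actual ovca implementation:

```python
import time

def bring_up_vip(iface="bond0"):  # hypothetical interface name
    """Stand-in for configuring the Virtual IP 'up' on the management interface."""
    print(f"VIP configured up on {iface}")

def start_pca_services():
    """Stand-in for the ovca service starting the full complement of PCA services."""
    print("ovca: starting Oracle PCA services")

def standby_loop(try_acquire_lock, poll_interval=5):
    """Poll for the cluster lock; on acquisition, assume the active role."""
    while True:
        if try_acquire_lock():
            # Promotion: bring up services and take over the Virtual IP.
            start_pca_services()
            bring_up_vip()
            return "active"
        time.sleep(poll_interval)

# Simulate the active node failing on the third poll, releasing the lock:
attempts = iter([False, False, True])
role = standby_loop(lambda: next(attempts), poll_interval=0)
print(role)  # -> active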
Oracle VM Management Database Failover
The Oracle VM Manager database files are located on a shared file system exposed by the ZFS storage appliance. The active management node runs the MySQL database server, which accesses the database files on the shared storage. In the event that the management node fails, the standby management node is promoted and the MySQL database server on the promoted node is started so that the service can resume as normal. The database contents are available to the newly running MySQL database server.
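Because both management nodes mount the same shared file system, the promoted node's MySQL server simply points at the same data directory. An illustrative my.cnf fragment (the mount path is hypothetical, not the appliance's actual layout):

```
[mysqld]
# Database files live on the ZFS storage appliance's shared file system,
# so either management node can serve the same contents after failover.
datadir=/path/to/shared_storage/mysql
```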
Compute Node Failover
High availability (HA) of compute nodes within the Oracle PCA is enabled through the clustered server pool that is created automatically in Oracle VM Manager during the compute node provisioning process. Since the server pool is configured as a cluster using an underlying OCFS2 file system, HA-enabled virtual machines running on any compute node can be migrated and restarted automatically on an alternate compute node in the event of failure.
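The effect of the clustered server pool's HA policy can be illustrated with a toy placement function. This is a deliberate simplification, not Oracle VM Manager's actual restart algorithm:

```python
from itertools import cycle

def restart_ha_vms(vms, nodes, failed_node):
    """Reassign HA-enabled VMs from a failed compute node to the survivors.

    vms: list of dicts like {"name": ..., "node": ..., "ha": bool}
    Returns a mapping of restarted VM name -> new compute node.
    """
    survivors = cycle(n for n in nodes if n != failed_node)
    placement = {}
    for vm in vms:
        if vm["node"] == failed_node and vm["ha"]:
            # Restart HA-enabled VMs round-robin on surviving pool members.
            placement[vm["name"]] = next(survivors)
        # VMs without HA enabled are not restarted automatically.
    return placement

vms = [
    {"name": "vm1", "node": "cn1", "ha": True},
    {"name": "vm2", "node": "cn1", "ha": False},
    {"name": "vm3", "node": "cn2", "ha": True},
]
print(restart_ha_vms(vms, ["cn1", "cn2", "cn3"], "cn1"))  # -> {'vm1': 'cn2'}
```

Note that only the HA-enabled VM on the failed node is restarted elsewhere; VMs on healthy nodes, and non-HA VMs, are untouched.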
Storage Redundancy
Further redundancy is provided through the use of the ZFS storage appliance to host storage. This component is configured with RAID-Z2, providing integrated redundancy with a fault tolerance of up to two failed drives with zero data loss. Furthermore, the storage appliance includes two storage heads, or controllers, interconnected in a clustered configuration. The pair of controllers operates in an active-passive configuration, meaning that continued service is guaranteed if one storage head fails. The storage heads share a single IP address in the storage subnet, but each has an individual management IP address for convenient maintenance access.
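RAID-Z2's fault tolerance comes from dedicating two drives' worth of capacity to parity, which is why any two drives can fail with zero data loss. As a rough illustration of the capacity trade-off (ignoring ZFS metadata and padding overhead; the drive counts and sizes are examples, not the appliance's actual configuration):

```python
def raidz2_usable_tb(drives, drive_tb):
    """Approximate usable capacity of a RAID-Z2 vdev: two drives' worth goes to parity."""
    assert drives >= 3, "RAID-Z2 needs at least three drives"
    return (drives - 2) * drive_tb

# An 11-drive RAID-Z2 vdev of 4 TB disks survives any two drive failures
# and offers roughly 9 * 4 = 36 TB of usable space.
print(raidz2_usable_tb(11, 4))  # -> 36
```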
Network Redundancy
All of the customer-usable networking within the Oracle PCA is configured for redundancy. Only the internal administrative Ethernet network, which is used for initialization and ILOM connectivity, is not redundant. There are two of each switch type to ensure that there is no single point of failure. Network cabling is equally duplicated, and the switches are interconnected as described in Section 1.2.4, “Network Infrastructure”.