1 System Overview
High Availability Overview
Introduction
An HA environment should have minimal or no downtime caused by unplanned outages. Outages can be caused by disk drive failures, network failures, system processing unit (SPU) failures, improper system configuration, and application software failures due to application errors or temporarily unavailable system resources.
Additionally, an HA environment should minimize the downtime required for planned system and application maintenance and upgrades. Routine system and application upgrades (such as installing a kernel or application patch, or new applications) should occur without taking critical application services offline.
Oracle highly recommends that client applications be configured so that they detect connection problems and automatically attempt to reconnect when a connection is lost.
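The reconnection behavior recommended above can be sketched as follows. This is a minimal illustrative example only, not part of Convergent Charging Controller; the host, port, and retry parameters are assumptions, and a production client would typically also resubscribe or re-authenticate after reconnecting.

```python
import socket
import time

def connect_with_retry(host, port, max_attempts=5, base_delay=1.0):
    """Attempt to (re)connect to a server, backing off exponentially
    between attempts so a briefly unavailable service is not flooded."""
    for attempt in range(max_attempts):
        try:
            # Raises OSError if the connection cannot be established
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            if attempt < max_attempts - 1:
                # Wait 1s, 2s, 4s, ... before the next attempt
                time.sleep(base_delay * (2 ** attempt))
    raise ConnectionError(
        f"could not reach {host}:{port} after {max_attempts} attempts")
```

A client would call this whenever it detects that an existing connection has been lost, rather than failing permanently on the first error.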
Key HA features
Convergent Charging Controller can remain available in various failure conditions. Convergent Charging Controller in an HA environment has the following key features:
- Distributed multiprocess, multi-node, multi-system, and multi-site deployment with application resiliency and fault tolerance
- Application service HA with automatic process recycling and failover
- Hardware HA through redundancy and configuration
Disaster recovery
Disaster recovery requires that you set up a remote instance of Convergent Charging Controller that can be activated in the event of a catastrophic failure at the production site. An HA system for Convergent Charging Controller, consisting of multiple clustered servers, is usually limited by the length of the cables connecting the shared data disk devices and the network interfaces. A remote disaster recovery site that is geographically dispersed requires access to the same resources as the production site, including:
- Network connectivity to clients
- Hardware
- Up-to-date Convergent Charging Controller configuration data
- Dynamic provisioning data
An HA environment requires regular system backups and data replication mechanisms. Data backup must be implemented independently of Convergent Charging Controller.
Hardware requirements for HA
You achieve hardware availability by using redundant backup components for each subsystem that may fail:
- Mirrored dual-port data disks to protect the application from loss of critical data
- Redundant network interfaces and networks to ensure that application clients can connect to the network
- Redundant SPUs to guard against entire system failures
Redundancy by Node Setup
Hardware redundancy on its own does not guarantee the HA of application services offered by Convergent Charging Controller. HA is achieved by ensuring that all software components in the solution are built and configured for fault tolerance. When you set up an HA environment, you must eliminate single points of failure that could prevent Convergent Charging Controller from processing orders for an extended period of time.
Each Convergent Charging Controller node type has a different redundant architecture to ensure continuous service availability beyond the availability of each hardware element.
Redundancy by SMS Node
The default installation of SMS nodes does not provide HA. The redundant SMS node setup includes the following deployment types:
- Small deployments: In small deployments, SMS is deployed as a single node, where SMS is not redundant and its service availability is based on the availability of the underlying server. In such a setup, operational integrity is maintained through a secure backup mechanism. When the SMS node is offline, the network routing offered by the SLC and VWS nodes (that is, the end-subscriber handset-based services) is unaffected.
- Highly available deployment: More typically, SMS is deployed in an HA setup with two separate servers, each able to host the SMS node, arranged in an active/active topology. To achieve this, SMS is installed on each node with a common disk array for shared application file data, and access to the HA servers is through a network switch that directs clients to one of the SMS servers. The disk array hosts a common application file system. A common SMS database instance (RAC One Node) is installed remotely from the SMS servers; RAC One Node provides HA resilience on the remote database system. With this configuration, a planned or unplanned shutdown of one active SMS node results in clients being directed by the switch to the remaining active SMS node. The failover time is mostly the time taken to restart the SMS screens and related daemons on the remaining node. Other processes already running on that node, such as ccsCDRLoader, do not need to be restarted. The remote SMS database is not affected and continues to provide service to the active SMS node.
Both of these options are based on a single site. To overcome this constraint, you can introduce Oracle Data Guard to provide an additional SMS disaster recovery option at a second site. The disaster recovery site receives database transaction updates directly from the primary node through Oracle Data Guard, which keeps the SMS disaster recovery database up to date with the primary SMS in near real time.
Note:
Activation of the disaster recovery site is typically undertaken following a total and catastrophic loss of the primary site. Activation typically takes 30 to 60 minutes, depending upon the number of connected nodes and the familiarity of the operations staff with the necessary procedures.
Redundancy by SLC Nodes
The SLC hosts the service logic and network interfaces and integrates with an external online charging system (OCS) for rating and charging services. The SLCs rely on the connected network elements to manage the distribution of traffic between nodes. These might be on a load-share or active/standby basis to one or several nodes.
Some service providers dedicate particular groups of SLC nodes to specific traffic types or to some other grouping. Other service providers configure all SLCs to handle all types of traffic. If subscribers are provisioned into Convergent Charging Controller, all SLCs will host all provisioned subscribers.
Transactions started on one SLC node continue to be serviced by that node; that is, the transaction data remains local to each SLC node and is not shared between SLC nodes. In the event of either a planned or unplanned outage of an SLC node, all active transactions on that node are lost. New transactions are then targeted to one of the other available SLC nodes.
Planned outages (for example, maintenance activities) are typically scheduled during quiet traffic periods. In this situation, it is normal for the network operator to reduce traffic for the selected SLC node so that all the new transactions target other SLC nodes. By allowing the existing transactions to complete on the chosen SLC node, maintenance activity can be undertaken with minimum service interruption.
Voice calls and data sessions have periodic commits, which further minimize the opportunity for revenue loss from a planned or unplanned outage of an SLC node. Revenue loss is limited to the amounts reserved but not yet committed, which, through configuration, is only part of a session: at most the most recent reservation chunk within each session that remains active.
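The bounding effect of periodic commits can be illustrated with a small simulation. This is a hypothetical sketch, not Convergent Charging Controller logic: a session reserves funds one chunk at a time and commits each chunk as it is consumed, so the uncommitted amount at any instant never exceeds one chunk.

```python
def worst_case_exposure(total_units, chunk_size):
    """Walk a session through reserve/commit cycles and return the
    maximum uncommitted amount at any point, i.e. the worst-case
    revenue loss if the serving node fails mid-session."""
    committed = 0
    worst_case = 0
    while committed < total_units:
        # Reserve the next chunk (the final chunk may be smaller)
        reserved = min(chunk_size, total_units - committed)
        # This chunk is at risk until the next periodic commit
        worst_case = max(worst_case, reserved)
        committed += reserved  # periodic commit settles the chunk
    return worst_case

# A 600-unit session with 60-unit reservation chunks risks at most 60 units,
# regardless of how long the session runs.
```

The chunk size here corresponds to the configurable reservation size mentioned above: the smaller the chunk, the smaller the exposure, at the cost of more frequent charging interactions.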
Geographic redundancy of the SLC nodes is achieved by locating SLC nodes on different sites. The total number of required SLC nodes depends on the number of nodes needed for the required traffic level, the number of sites, and whether the deployment must tolerate complete site failure during the busy hour or only a maintenance outage of a single site.
The worst case would typically be a dual site setup, where long-term catastrophic failure of one site is a required scenario. In this case, each site would need N+1 nodes, requiring a total of 2(N+1) SLC nodes.
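The sizing arithmetic above can be expressed as a small helper. This is an illustrative sketch under the stated assumptions (each site must carry the full busy-hour load N plus one spare); the function name and parameters are not part of any Convergent Charging Controller tooling.

```python
def required_slc_nodes(nodes_for_traffic, sites=2, spares_per_site=1):
    """Total SLC nodes for a deployment in which any single site must be
    able to carry the full traffic level (N) plus spare capacity, so that
    catastrophic loss of one site leaves service intact."""
    return sites * (nodes_for_traffic + spares_per_site)

# Dual site, N = 3 nodes needed for busy-hour traffic:
# 2 * (3 + 1) = 8 SLC nodes in total.
```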
Redundancy by VWS Node
The VWS is exclusively deployed in a 2N mated-pair architecture, where one node is active and the second node is a hot standby. Each node has its own separate database, with transaction data copied from the active to the standby node at the application layer, so that service transactions started on one node can, if a failover occurs, complete on the second node.
The client systems of the VWS are the SMS and SLC nodes. A mated pair of VWS nodes forms a logical construct termed a domain. Each VWS domain hosts one or more voucher batches. Through data mastered on the SMS and replicated to the other nodes, the target VWS domain can be identified. Each client node maintains connections to both VWS nodes within each domain and exclusively uses the connection to the active node within the required domain. The two nodes within a domain are designated the primary and secondary nodes. If the primary node is available, it is the active node. If the primary node becomes unavailable, due to either a planned or unplanned outage, the secondary node becomes the active node, and transactions initiated on the primary node are continued on the previously standby node. As such, failover between the VWS nodes within a domain is seamless and happens in real time.
When the primary VWS node returns to service, it first needs to catch up with the active secondary node. It processes the incoming synchronization files before notifying the client systems that it is now active. After it becomes active, it continues to process any inbound synchronization files. Geographic redundancy of the VWS nodes is achieved by locating the two nodes in any given domain on separate sites.
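The client-side selection rule for a VWS domain can be sketched as follows. This is a hypothetical illustration of the primary/secondary preference described above, not actual Convergent Charging Controller code; node names and the availability flag are assumptions.

```python
class VwsDomain:
    """Sketch of client-side connection selection within one VWS domain:
    the client holds connections to both nodes but sends traffic only to
    the active one, preferring the primary whenever it is available."""

    def __init__(self, primary, secondary):
        # Connections to both nodes are maintained at all times
        self.primary = primary
        self.secondary = secondary

    def active_node(self, primary_available):
        # The primary is active whenever it is available; otherwise the
        # hot-standby secondary takes over, making failover seamless
        return self.primary if primary_available else self.secondary
```

Because both connections already exist, switching the active node is a local decision on the client and requires no new connection setup at failover time.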
Redundancy by Traffic/Service Type
To assess the impact of the loss of any single node, consider the impact with respect to each traffic or service type:
- Customer care operations that come to the platform through these interfaces:
- SMS screens
- PI on the SMS
- OSD on the SLC nodes
- Subscriber self-care operations that come to the platform through these interfaces:
- WEB 2.0 through the PI to the SMS
- WEB 2.0 through OSD to the SLC(s)
- USSD request to the SLC(s)
- SMS text to the SLC(s)
- IVR session managed through the SLC(s)
- Session-based traffic services, which are characterized by a back-and-forth message exchange between the serving network element and Convergent Charging Controller, the controlling network element. For each node:
- Failure of an SMS node has a negligible impact on active or new sessions; if there are no subscriber data changes, no network-side updates occur.
- Failure of the primary VWS node has no effect on active or new sessions; that is, any transactions initiated on the primary node, but not completed, will complete on the secondary node.
- Where the SLC node serving the session is lost, the connection to the adjacent network element drops. For established bearer sessions, the controlling network element may hold the session up until it is terminated by one of the parties involved, or until the controlling element requires further direction from Convergent Charging Controller, typically an additional reservation of funds to continue the session. At this point, for voice services, the session failure is recognized and the bearer session dropped. For data services, the serving node may maintain the bearer session and attempt a new transaction to one of the other SLCs. Sessions that were being set up at the time of failure may either fail back through the network or be re-attempted to one of the other SLC nodes, as determined by the serving network element. The effect on the end subscriber is that a session attempt might fail or an established session might be dropped.
- Event-based traffic services are characterized by a request-response transaction; that is, the response concludes the message exchange between the serving network element and Convergent Charging Controller. Failure of the serving SLC node results in loss of connection to the serving network element and failure of that request after a short timeout. In that situation, serving network elements typically re-attempt the transaction to one of the other serving SLC nodes, so the end subscriber does not perceive any issue.
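The re-attempt behavior of serving network elements described above can be sketched as follows. This is an illustrative example only; the `send` callable, node names, and error handling are assumptions, not part of any real network element or Convergent Charging Controller interface.

```python
def send_event_with_failover(request, nodes, send):
    """Try each SLC node in turn until one returns a response.

    `send(node, request)` is a hypothetical callable that returns the
    response concluding the transaction, or raises ConnectionError if
    the node is unreachable or times out."""
    last_error = None
    for node in nodes:
        try:
            # A successful response concludes the request-response exchange
            return send(node, request)
        except ConnectionError as err:
            last_error = err  # node failed; re-attempt on the next one
    if last_error is None:
        raise ValueError("no SLC nodes supplied")
    raise last_error
```

Because the retry happens transparently in the serving network element, a single SLC failure costs only a short timeout rather than a failed subscriber transaction.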