Scalability and Redundancy

Unified Assurance is a highly scalable platform that can range from a single server to dozens of horizontally scalable, inter-dependent servers. A single instance of Unified Assurance supports an organization’s growing needs without requiring additional instances to be installed and hooked together in an ad-hoc fashion.

Each tier of the Unified Assurance product is scalable and redundant through several means. This guide will cover each tier and component of the software.

Presentation Tier

Internal Presentation

Figure: Presentation Redundancy (presentation-redundancy.png)

The internal presentation server(s) encompass the Web user interface, the API, the rules repository, the image repository, the package repository, dashboard integrations, and the message bus. The first server installed is considered the primary presentation server, and only one additional redundant internal presentation server can be installed. The internal presentation servers are considered stateful because the Unified Assurance management database is collocated with them.

Redundancy of the internal presentation servers is provided by several means. The Unified Assurance database is configured with multi-master MySQL replication. Unique table IDs are separated into even and odd pairs between the servers, which allows a split-brain scenario to be repaired conflict free and provides eventual consistency. RabbitMQ message bus redundancy is enabled by configuring federation to replicate exchanges between both servers. Package installation is mirrored between servers based on the configured roles but performed independently on each. Configurations outside of the database and repositories are synchronized with Unison, a bi-directional file synchronization tool based on the rsync algorithm. Unison may encounter conflicts after the repair of a split-brain scenario but will never overwrite data until the authoritative files are manually identified.
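
The even/odd ID separation described above is the standard MySQL technique of staggering auto-increment values between two masters. The my.cnf fragments below are a minimal sketch of that technique with assumed values; they are not the settings Unified Assurance ships with.

    # Internal presentation server A (odd IDs) -- illustrative values only
    [mysqld]
    server_id                = 1
    log_bin                  = mysql-bin
    log_slave_updates        = ON
    auto_increment_increment = 2    # two masters share the ID space
    auto_increment_offset    = 1    # this server generates odd IDs

    # Internal presentation server B (even IDs)
    [mysqld]
    server_id                = 2
    log_bin                  = mysql-bin
    log_slave_updates        = ON
    auto_increment_increment = 2
    auto_increment_offset    = 2    # this server generates even IDs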

Access to the internal presentation server(s) is defined with a configuration option known as the WebFQDN. The WebFQDN can be the same as the HostFQDN (as is common in single-server development instances) but can be set to any valid FQDN as a vanity URL. Care must be taken when choosing a WebFQDN separate from the HostFQDN, especially when redundancy is enabled. For true HA failover between the internal presentation servers, the WebFQDN should be defined in DNS as a Virtual IP (VIP) that can fail over between the two servers. This can be done with Keepalived within the same data center and network, but it is usually performed by a load balancer, which must run in Layer 4 mode instead of Layer 7: Unified Assurance uses TLS client certificate authentication and must terminate the secure session itself rather than having the load balancer offload this responsibility. The load balancer must also be able to perform cross-data-center failover and must itself be redundant. In environments without a VIP, the internal presentation servers can each have a separate WebFQDN, but several components (SVN, API calls, the image repository, the package repository) can only support a single destination. In the dual-WebFQDN setup, only partial presentation failover can be achieved, and changes should be minimized.
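
For the VIP option, the following keepalived.conf fragment is a minimal sketch of a Keepalived VRRP instance that floats the WebFQDN address between the two internal presentation servers. The interface name and addresses are hypothetical, and in many deployments this role is performed by a Layer 4 load balancer instead, as described above.

    # /etc/keepalived/keepalived.conf on the primary presentation server
    vrrp_instance WEB_VIP {
        state MASTER               # set to BACKUP on the redundant server
        interface eth0             # hypothetical interface name
        virtual_router_id 51
        priority 150               # use a lower priority (e.g. 100) on the redundant server
        advert_int 1
        virtual_ipaddress {
            192.0.2.10/24          # VIP that the WebFQDN resolves to in DNS
        }
    }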

External Presentation

For customers with multi-tenant configurations, or who wish to scale the presentation layer beyond two servers, one or more external presentation servers can be installed. These servers do not contain the stateful Unified Assurance management database and are a good fit for deployment in a DMZ for public exposure.

Redundancy of the external presentation servers is the same as that for internal presentation servers with the exception of the multi-master MySQL replication.

Database Tier

Event Database

Figure: Event and Analytics Redundancy (event-and-analytics-redundancy.png)

The event database is stored in the MySQL RDBMS. Scalability of the event database is achieved by sharding. Sharding events requires specific event collectors to be configured for a specific shard ID. This ID is defined for each database instance or pair of redundant instances. Mapping collectors to shards is an administrative responsibility and is usually done for reasons such as regulatory separation, regional separation, or spreading processing load across several databases.

Redundancy of the event database is provided by configuring multi-master MySQL replication. Unique event IDs are separated into even and odd pairs between the servers. This allows for a repair of a split-brain scenario to be conflict free and provide eventual consistency.

Event / Log Analytics Database

The event analytics and log analytics databases (also known as the Historical database) are stored in Elasticsearch. Elasticsearch is easily scaled by adding servers to an existing cluster. All members of a cluster must reside in the same data center to keep the interconnect latency low.
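
As an illustration of that horizontal scaling, the following elasticsearch.yml fragment is a minimal sketch of a new node joining an existing analytics cluster. The cluster, node, and host names are assumptions, not values used by the product.

    # elasticsearch.yml on an additional node joining the existing cluster
    cluster.name: assurance-analytics          # hypothetical cluster name
    node.name: analytics-node-3                # hypothetical node name
    node.roles: [ data ]
    network.host: 0.0.0.0
    discovery.seed_hosts: ["analytics-node-1", "analytics-node-2"]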

Each analytics cluster is independent and is not configured for cross-data-center redundancy. Redundancy for event analytics is provided by the real-time event database MySQL redundancy: the primary analytics cluster is a backup to the primary event database, and the backup analytics cluster is a primary to the backup event database. Logs are made redundant by configuring Filebeat on each server to send logs to both clusters at the same time. Filebeat handles outages and ensures the logs are eventually consistent.
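
A single Filebeat instance supports only one configured output, so one way to achieve the dual-cluster delivery described above is to run one Filebeat instance per destination cluster. The fragment below is a sketch under that assumption; the log path and host names are hypothetical, and the exact mechanism the product uses is not confirmed by this guide.

    # filebeat-primary.yml -- one of two Filebeat instances on a server, each
    # shipping the same logs to a different Elasticsearch cluster
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/assurance/*.log           # hypothetical log path
    output.elasticsearch:
      hosts: ["https://primary-analytics:9200"]
    # A second instance (filebeat-redundant.yml) would be identical except that
    # output.elasticsearch.hosts points at the redundant cluster.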

Metric Database

Figure: Metric Kibana Ingestion (metric-kibana-ingestion.png)

The metric database is stored in InfluxDB, a time-series database. Scalability of the metric database is achieved by sharding. Sharding by metric is done by configuring devices with a specific shard ID. This ID is defined for each database instance or pair of redundant instances. Unlike event collectors, collectors and pollers send metrics to the appropriate shard without being configured in the application directly. Each Unified Assurance server has an instance of Telegraf that not only polls the local system but also batches metrics to the appropriate InfluxDB server.

Redundancy of the metric database is provided by linking Kafka message buses to the ingestion pipeline. The Telegraf instance on each server is configured to send metrics to the Kafka message bus instead of directly to InfluxDB, and Telegraf fails over to the redundant Kafka server as needed. An additional Telegraf instance is then installed on each metric server to read from both the primary and redundant Kafka message buses and batch insert those metrics into the local InfluxDB server. This redundancy handles failovers and split-brain scenarios to provide eventual consistency of data between both InfluxDB servers.
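
The following telegraf.conf fragments are a minimal sketch of that pipeline: the first belongs to an ordinary Unified Assurance server shipping metrics to Kafka, and the second to a metric server consuming from both Kafka buses and writing into its local InfluxDB. The broker names, topic, and database name are assumptions.

    # On a collection/processing server: send local metrics to Kafka
    [[outputs.kafka]]
      brokers = ["kafka-primary:9092"]         # hypothetical broker
      topic = "metrics"

    # On each metric (InfluxDB) server: consume from both Kafka buses and
    # batch insert into the local InfluxDB
    [[inputs.kafka_consumer]]
      brokers = ["kafka-primary:9092"]
      topics = ["metrics"]
      data_format = "influx"

    [[inputs.kafka_consumer]]
      brokers = ["kafka-redundant:9092"]
      topics = ["metrics"]
      data_format = "influx"

    [[outputs.influxdb]]
      urls = ["http://127.0.0.1:8086"]
      database = "metrics"                     # hypothetical database name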

Graph Database

The graph database is stored in Neo4j, a NoSQL graph database. Scalability of the graph database is provided by Neo4j's clustered architecture, which offers flexible horizontal and vertical scaling and in-memory graph caching. This same scalability handles redundancy of up to two servers. These two redundant servers can cluster locally or across data centers. Each server has unique cluster IDs for all vertices and edges. This allows a split-brain scenario to be repaired conflict free and provides eventual consistency.

SOA Collection and Processing Tiers

The Service Oriented Applications on the collection and processing servers are managed by a local instance of the Unified Assurance Broker. The Broker is responsible for checking every minute that enabled services are running and for starting jobs based on their cron schedules. Multiple collection and processing servers can be added to a single Unified Assurance instance to horizontally scale the processing and analytics of data.

Broker General Failover

Each Broker sends a heartbeat message to the RabbitMQ message bus every second, broadcasting to all other Brokers, so each Broker knows the state of every other Broker, alive or dead. Several considerations determine when failover or clustering changes must be made. When the Broker first boots, it waits to catch up on heartbeats from other Brokers: if the Broker was cold started (from the server booting up), this wait is 60 seconds; if the Broker was warm started (for example, from an update) and the server has been running for some time, the wait is 5 seconds. Finally, while the Broker is running, if no heartbeats are heard for 30 seconds from any Broker partnering in application redundancy, failover is initiated.
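
The timing rules above can be summarized with a short sketch. This is purely illustrative pseudologic based on the values stated in this section (60-second cold-start wait, 5-second warm-start wait, 30-second heartbeat timeout); it is not product code, and the class and method names are invented.

    import time

    COLD_START_WAIT = 60    # seconds to catch up on heartbeats after a server boot
    WARM_START_WAIT = 5     # seconds to catch up after a restart (e.g. an update)
    HEARTBEAT_TIMEOUT = 30  # seconds of silence from a redundancy partner

    class BrokerMonitor:
        def __init__(self, cold_start: bool):
            self.started = time.monotonic()
            self.catch_up = COLD_START_WAIT if cold_start else WARM_START_WAIT
            self.last_heartbeat = {}            # partner Broker -> last heartbeat time

        def record_heartbeat(self, partner: str) -> None:
            self.last_heartbeat[partner] = time.monotonic()

        def should_fail_over(self, partner: str) -> bool:
            now = time.monotonic()
            if now - self.started < self.catch_up:
                return False                    # still catching up on heartbeats
            last = self.last_heartbeat.get(partner, self.started)
            return now - last >= HEARTBEAT_TIMEOUT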

Failover Jobs

Jobs that are scheduled can be configured for failover. Both jobs should be configured with the same application configuration. The primary job is configured by choosing a redundant job of the same application type, which must be located on a separate server. The primary job is always run if the Broker running that job is online. If the redundant Broker for the job cannot reach its primary Broker, the redundant job is run. In a split-brain scenario, where both Brokers may still be running but are unable to communicate with each other, both jobs will run. How conflicts are handled is up to each application, but conflicts are minimized with the default jobs.
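
The decision itself reduces to a simple rule, sketched below as illustrative pseudologic rather than product code: the primary copy runs whenever its Broker is up, and the redundant copy runs only when its Broker cannot reach the primary Broker, which is also why both copies run during a split-brain.

    def should_run_job(is_primary_copy: bool, primary_broker_reachable: bool) -> bool:
        # Primary job: always runs while its Broker is online.
        if is_primary_copy:
            return True
        # Redundant job: runs only when the primary Broker cannot be reached,
        # which covers genuine failures as well as split-brain scenarios.
        return not primary_broker_reachable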

Failover Services

Services can be configured for failover in the same way as jobs. Both services should be configured with the same application configuration. The primary service is configured by choosing a redundant service of the same application type, which must be located on a separate server. Unlike jobs, though, both services run at the same time. Each application requests its state from the local Broker and handles failover internally. The active service (usually the primary) performs all the work necessary, just like a standalone service. The standby service (usually the redundant) configures itself and holds off on processing data. If the standby service passively receives data, it logs the collection, buffers a small amount as configured, and then discards the oldest data. If the standby service actively collects data, it does not perform any action during its poll time. When the redundant Broker for the service cannot reach its primary Broker, the redundant service is told to become active, and all collection or polling proceeds as normal. When rolling maintenance or updates are performed, the primary Broker can tell the redundant service to become active immediately; only in this case will the primary Broker still be running while its primary service is stopped.
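
The standby buffering behavior can be illustrated with a short sketch. This is not product code; the buffer size and record format are assumptions, and the point is only that a bounded buffer keeps the most recent passively received data and discards the oldest.

    from collections import deque

    class StandbyBuffer:
        def __init__(self, max_items: int = 1000):   # assumed, configurable size
            self.buffer = deque(maxlen=max_items)    # oldest entries drop automatically

        def receive(self, record: dict) -> None:
            # Log the collection, then buffer it without processing.
            print(f"standby: buffered record from {record.get('source', 'unknown')}")
            self.buffer.append(record)

        def drain_on_activation(self):
            # When told to become active, process whatever is still buffered.
            while self.buffer:
                yield self.buffer.popleft()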

Cluster Services

Services can be clustered to spread the responsibility for work across multiple applications. Only applications that actively poll data can be clustered. Additionally, there is no requirement on where clustered applications run; several can be on the same server or on separate servers. The services should be configured with the same application configuration. Each service has a cluster ID, which is a numeric value appended to the name of the application binary. Multiple clusters of the same application type can be run on separate cluster IDs. Services in a cluster communicate with their local Brokers to determine the current number of applications in the cluster. This number of members determines how the work is divided among them (for example, splitting processing based on devices). When members join or leave the cluster, the division of work is recalculated for the next poll cycle.
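
One common way to divide work deterministically among a known number of members is hashing, sketched below. This is illustrative only; how each clustered application actually divides its work is not specified here beyond being based on the member count.

    import zlib

    def my_devices(devices: list[str], members: list[str], me: str) -> list[str]:
        # Every member sorts the membership list the same way, takes its own
        # index, and keeps only the devices whose hash maps to that index.
        members = sorted(members)
        index = members.index(me)
        count = len(members)
        return [d for d in devices if zlib.crc32(d.encode()) % count == index]

    # Example: three cluster members split six devices deterministically;
    # when a member joins or leaves, the split is recalculated the same way.
    devices = [f"device-{n}" for n in range(6)]
    print(my_devices(devices, ["serverA", "serverB", "serverC"], "serverB"))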

Microservice Collection and Processing Tiers

Microservices run in the Kubernetes container orchestration platform. Microservices can be configured to run as independent pods, or in deployments, stateful sets, or daemon sets. Kubernetes uses a declarative model: it attempts to run the declared count and placement of microservices based on available resources. When multiple servers are added to a cluster, failover is handled automatically in the event of a server outage by moving the microservices to another server when possible. These behaviors can be configured in great detail. For more information, refer to the Kubernetes documentation at https://kubernetes.io/docs/home/. To understand more about microservices in Unified Assurance, refer to Understanding Microservices.
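
As an example of the declarative model, the following Deployment manifest is a minimal sketch (hypothetical names and image) that asks Kubernetes for two replicas and prefers spreading them across nodes, so the loss of one server leaves a running copy while the other is rescheduled.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-collector            # hypothetical microservice name
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: example-collector
      template:
        metadata:
          labels:
            app: example-collector
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        app: example-collector
                    topologyKey: kubernetes.io/hostname
          containers:
            - name: collector
              image: example/collector:latest   # hypothetical image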