CHAPTER 1

Sun StorageTek 5800 System Overview

This chapter introduces the hardware and software that make up the Sun StorageTek 5800 system. It provides an introduction to the system components and software functions in the following sections: Product Overview, Hardware Overview, and Software Overview.


Product Overview

The Sun StorageTek 5800 system is an online storage appliance featuring a fully integrated hardware and software architecture in which the disk-based storage nodes are arranged in a symmetric cluster. Data can be associated with metadata for easy reference as discussed in Metadata.

Both data and metadata are distributed across disks and nodes. There is no dedicated metadata server or master node, and the system presents a simple, single image for client and administrator access. The clustered and redundant design provides high availability, good performance, and exceptional data integrity.

Clustered Design

The Sun StorageTek 5800 system uses a symmetric, clustered architecture (FIGURE 1-1). All storage control, data, and metadata path operations are distributed across the cluster to provide both reliability and performance scaling. Each node is independent of all other nodes, and there is complete symmetry in both hardware and software on each node.


FIGURE 1-1 Sun StorageTek 5800 System Clustered Architecture

Figure displays the clustered architecture of the Sun StorageTek 5800 System.


The Sun StorageTek 5800 system's stateless design principles ensure that there is no single point of failure or contention for resources. The design choices that lower its cost of ownership and improve its maintainability and reliability also help the cluster scale.

A Sun StorageTek 5800 system consists of a cluster of nodes in which each node contains a standard processor, memory, and storage disks. The cluster is accessed using an application programming interface (API) or a file system view. For more information, see Data Access in Sun StorageTek 5800 System Storage.

Each node in the system is capable of handling data operations using this API. A load-spreading switch balances traffic from the clients to the nodes. Some housekeeping functions are delegated to an elected master node, although those services can fail over to any other node in the system without affecting reliability or availability.

Sun StorageTek 5800 System Cell

A Sun StorageTek 5800 system, or silo, consists of a half-cell of 8 nodes or a full cell of 16 nodes. (Multicell systems are not currently available.)


FIGURE 1-2 Sun StorageTek 5800 System Half-Cell and Single Cell Silo

Figure shows both a half-cell and single cell StorageTek 5800 silo. The half-cell has 8 storage nodes, while the full cell has 16 storage nodes.


The cell is the basic building block of the system and consists of 8 or 16 Sun StorageTek 5800 system storage nodes, 2 Load Balancers, and 1 Service Node. Each node of the Sun StorageTek 5800 system is connected to both Load Balancers for high availability, forming a dual-star topology. The remaining ports of the Load Balancer are used for front-end and uplink connections and heartbeat connections for switch failover (FIGURE 1-3).


FIGURE 1-3 Sun StorageTek 5800 System Redundant Network

Figure shows the Sun StorageTek 5800 System redundant network.



Hardware Overview

The Sun StorageTek 5800 system (FIGURE 1-4) is a rack-mounted system designed to have no single point of failure and to be serviceable without disruption. Each 1U Sun StorageTek 5800 system server node runs the Solaris OS and consists of a socket 939 AMD Opteron™ processor, a server management board, and four 3.5-inch (8.89 cm) Serial ATA (SATA) drives.

A pair of Load Balancers provides failover and load spreading across nodes. Because the system features a fail-in-place, self-healing design, much of the urgency normally associated with switch, disk, Network Interface Card (NIC), Central Processing Unit (CPU), or other hardware failures is removed.

The minimum system configuration consists of a half-cell of eight storage nodes with four SATA 500-gigabyte (GB) drives per node, two Load Balancers, and one Sun Fire™ x2100 server acting as the service node. The maximum configuration consists of one cell of 16 storage nodes with four SATA 500-GB drives per node, two Load Balancers, and one Sun Fire x2100 server acting as the service node (FIGURE 1-4).


FIGURE 1-4 Sun StorageTek 5800 System View (Front)

Figure shows a StorageTek 5800 Cell. It calls out the Sun x2100 Service Node, the 16 Storage Nodes, and the 2 filler panels with the rear-facing Load Balancers behind them.


Serviceability

The Sun StorageTek 5800 system features a series of field-replaceable units (FRUs). Components can fail in place without being immediately serviced. When a component fails, alerts are sent through email. If a disk fails, a drive fault LED on the front panel (FIGURE 1-5) lights up to indicate the fault.

In addition, command-line interface (CLI) commands such as hwstat provide status information on the failed components. The CLI Command Reference contains more information on this and all of the other CLI commands. For instructions on connecting to the system to execute the CLI commands, see the Sun StorageTek 5800 System Getting Started Guide.


FIGURE 1-5 Front Panel Storage Node Controls

Figure shows the front panel LEDs and switches on the storage node.



Software Overview

The Sun StorageTek 5800 system is a unique product that combines servers, storage, networking, and distributed-systems software in a single solution. Many of its features are implemented in software rather than in specialized hardware.

The Sun StorageTek 5800 system software includes a variety of layers that all contribute to the overall product goal of reducing costs while improving data manageability. For example, the components described in the following sections work seamlessly and operate with complete transparency to end users.

Metadata

Metadata is extra information about a data object. There are two main types of metadata in the Sun StorageTek 5800 system: system metadata and user, or extended, metadata. The system metadata includes a unique identifier for each stored object, called the object ID or OID, as well as information on creation time (ctime), data length, and data hash.

The OID is returned by the API when an object is stored and is later used to retrieve the object. It is also returned when queries are made against user metadata that has been associated with the OID. For more information, see Understanding Metadata and the System Schema.
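
The following minimal Java sketch illustrates that flow. The host name is a placeholder for your data VIP, and the class and method names (NameValueObjectArchive, storeObject, retrieveObject) are assumptions based on the Java client library; verify the exact signatures in the Sun StorageTek 5800 System Client API Reference Guide.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import com.sun.honeycomb.client.NameValueObjectArchive;
    import com.sun.honeycomb.client.ObjectIdentifier;
    import com.sun.honeycomb.client.SystemRecord;

    public class StoreAndRetrieve {
        public static void main(String[] args) throws Exception {
            // Connect through the data VIP (placeholder address).
            NameValueObjectArchive archive =
                    new NameValueObjectArchive("data-vip.example.com");

            // Store a file; the returned system record carries the new OID.
            SystemRecord record;
            try (FileInputStream in = new FileInputStream("visit-notes.txt")) {
                record = archive.storeObject(in.getChannel());
            }
            ObjectIdentifier oid = record.getObjectIdentifier();

            // Later, the OID alone is enough to retrieve the object.
            try (FileOutputStream out = new FileOutputStream("notes-copy.txt")) {
                archive.retrieveObject(oid, out.getChannel());
            }
        }
    }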

The Sun StorageTek 5800 system's user, or extended, metadata provides the ability to store application-level attributes associated with data objects. User metadata also allows you to define an arbitrary schema using the Extensible Markup Language (XML). Typically, user queries are executed against application-stored user metadata. Optionally, they can be issued against system metadata as well.

You can define a set of metadata attributes associated with your stored objects. For example, in an application that stores medical records, the metadata attributes might include the patient name, doctor's name, reason for visit, deductible, medical record number, and insurance company. You can then run a query against these fields to retrieve a record, or combine fields to retrieve, say, all records for a given doctor or insurance company on a particular date. Extensible metadata provides unlimited scope when designing the application.
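
A query against those fields might look like the following sketch. As above, the client class names and the query syntax are assumptions to be checked against the Sun StorageTek 5800 System Client API Reference Guide.

    import com.sun.honeycomb.client.NameValueObjectArchive;
    import com.sun.honeycomb.client.QueryResultSet;

    public class FindRecordsByDoctor {
        public static void main(String[] args) throws Exception {
            NameValueObjectArchive archive =
                    new NameValueObjectArchive("data-vip.example.com");

            // Fetch the OIDs of up to 100 records stored for one doctor.
            QueryResultSet results =
                    archive.query("DoctorName='Hughes'", 100);
            while (results.next()) {
                System.out.println(results.getObjectIdentifier());
            }
        }
    }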

The schema defines the way that the Sun StorageTek 5800 system metadata is structured. It consists of attributes, each of which has a defined type. For example, the preceding medical record might contain the attributes shown in TABLE 1-1.


TABLE 1-1 Sample Medical Record Schema

Attribute             Type
PatientName           String
DoctorName            String
MedicalRecordNumber   Long
InsuranceCompany      String
Deductible            Double
ReasonForVisit        String

The attributes defined in the schema can be assigned values when data is stored. There is a single schema for the system. However, you do not need to specify all attributes when storing an object.
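
For example, a subset of the TABLE 1-1 attributes might be attached at store time as in the following sketch (the NameValueRecord class and its put overloads are, again, assumptions based on the Java client library):

    import java.io.FileInputStream;

    import com.sun.honeycomb.client.NameValueObjectArchive;
    import com.sun.honeycomb.client.NameValueRecord;
    import com.sun.honeycomb.client.SystemRecord;

    public class StoreMedicalRecord {
        public static void main(String[] args) throws Exception {
            NameValueObjectArchive archive =
                    new NameValueObjectArchive("data-vip.example.com");

            // Set only the attributes this record needs; the rest stay unset.
            NameValueRecord metadata = archive.createRecord();
            metadata.put("PatientName", "Moria Herson");
            metadata.put("DoctorName", "Hughes");
            metadata.put("MedicalRecordNumber", 1234567L);

            try (FileInputStream in = new FileInputStream("visit-notes.txt")) {
                SystemRecord stored =
                        archive.storeObject(in.getChannel(), metadata);
                System.out.println("Stored as " + stored.getObjectIdentifier());
            }
        }
    }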

Because metadata is stored on the Sun StorageTek 5800 system and not with the client application, it scales along with stored data, and gains the same advantages of integrity and availability as the data.

Data and Administrative System Access

The Sun StorageTek 5800 system exports two IP addresses: one for data access and one for administrative access. Data is accessed using a single IP address, called the data virtual IP (VIP) address. Your interaction with the Sun StorageTek 5800 system does not require knowledge of the underlying hardware. Instead, you access it as a single, very large storage system through the API.

You perform administrative tasks on the Sun StorageTek 5800 system using the CLI available through the administrative VIP. You access the CLI with the ssh command, and if you wish, you can script CLI commands. Administrators can monitor individual components, including disks and nodes, through the CLI and can also enable and disable individual disks and nodes.

The CLI supports standard administrative tasks such as shutting down, powering down, rebooting, and obtaining CLI help. If you wish, you can also configure the Sun StorageTek 5800 system to send email alerts. For more information on each of the specific CLI commands and their functions, see the CLI Command Reference.

Data Access in Sun StorageTek 5800 System Storage

You access data in Sun StorageTek 5800 system storage in one of two ways: through the client APIs or through file system views.

APIs

The Sun StorageTek 5800 system Java and C APIs enable you to store, retrieve, query, and delete data and metadata through Java™ and C client libraries. Sample applications and command-line routines are provided in the SDK to demonstrate the Sun StorageTek 5800 system's capabilities as well as provide good programming examples. The SDK also provides an emulator that enables you to test client applications without having to set up a Sun StorageTek 5800 system server. For more information on the SDK, see the Sun StorageTek 5800 System SDK Developer's Guide. For more information on the Java and C client APIs, see the Sun StorageTek 5800 System Client API Reference Guide.
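
To round out the store, retrieve, and query sketches shown earlier, deleting an object by its OID is a single call (the delete method and the ObjectIdentifier string constructor are assumptions; verify them in the Client API Reference Guide):

    import com.sun.honeycomb.client.NameValueObjectArchive;
    import com.sun.honeycomb.client.ObjectIdentifier;

    public class DeleteObject {
        public static void main(String[] args) throws Exception {
            NameValueObjectArchive archive =
                    new NameValueObjectArchive("data-vip.example.com");

            // The OID string would normally come from an earlier store or query.
            archive.delete(new ObjectIdentifier(args[0]));
        }
    }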

File System Views

The Sun StorageTek 5800 system contains no internal hierarchical path structure. Virtual views are queries against metadata that are expressed externally as file system paths and file names. A virtual file system view is defined using the metadata attributes defined in the schema.

For example, using the medical record schema shown in TABLE 1-1, you can define a view that is organized by doctor name at the top-level directory, then by patient name at the second level, and so on. Opening the top-level view for a given doctor shows the list of patients for that doctor (FIGURE 1-6).


FIGURE 1-6 File System View Example

Figure displays an example of a top-level filesystem view.


The system enables you to browse through virtual file system views using the Web-based Distributed Authoring and Versioning (WebDAV) protocol. WebDAV is a set of extensions to the HTTP protocol that provides a network protocol for creating interoperable, collaborative applications. For more information on WebDAV, see Using WebDAV for File Browsing.
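
For instance, a WebDAV PROPFIND request with a Depth: 1 header lists the immediate children of a view directory. The following sketch uses only the JDK's java.net.http client; the data VIP host, port, and view path are placeholders rather than documented values.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ListViewDirectory {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // PROPFIND with Depth: 1 asks for the entries one level down.
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://data-vip.example.com:8080/webdav/byDoctor/Hughes/"))
                .method("PROPFIND", HttpRequest.BodyPublishers.noBody())
                .header("Depth", "1")
                .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // multistatus XML listing the entries
        }
    }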

Data Reliability and Availability

To provide reliability, files are protected in the Sun StorageTek 5800 system using the Reed-Solomon (RS) encoding algorithm. The RS algorithm is part of a family of codes that efficiently build redundancy into a file to guarantee reliability in the face of multiple failures in the storage system.


The Sun StorageTek 5800 system stores fragments of files across multiple disks and nodes using 5+2 encoding. Thus, when an object of any type (for example, an MP3 binary or a text file) is stored in the Sun StorageTek 5800 system, it is divided into five data fragments and two corresponding parity fragments (FIGURE 1-7).

FIGURE 1-7 5+2 Encoding Example With 5 Data Fragments and 2 Parity Fragments

Figure shows a 5+2 encoding example.
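
The arithmetic behind the encoding is straightforward (a worked example derived from the 5+2 scheme, not a quoted specification). For an object of size S:

    size of each data fragment       = S / 5
    total stored (5 data + 2 parity) = 7 x (S / 5) = 1.4 x S
    fragments needed to reconstruct  = any 5 of the 7

In other words, the system pays a 40 percent storage overhead in exchange for surviving the loss of any two fragments of an object.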



These fragments are stored on different disks in the system (FIGURE 1-9). The 5+2 encoding means that the Sun StorageTek 5800 system can tolerate up to two missing data or parity fragments. For example, if D2 and P1 are not accessible, the Sun StorageTek 5800 system can still reconstruct the object using the remaining fragments displayed in FIGURE 1-8.

FIGURE 1-8 Decoding Example With Missing Fragments

Figure shows a decoding example with missing fragments.




Note - The placement of each fragment is designed with the physical hardware in mind and is done in such a way that it maximizes the separation of fragments from one another and minimizes the chance of a component failure leading to a loss of more than one fragment.



Placement Algorithms

The Sun StorageTek 5800 system uses internal placement algorithms to determine where data and parity fragments are placed in the cluster. The algorithms aim to spread the fragments of an object as widely as possible across disks and nodes, and to make placement computable so that fragment locations do not have to be kept in a lookup table. Such a table would require constant updating as locations change in response to failures or changes in cluster membership.
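
To make the idea concrete, the following is an illustrative sketch only, not the 5800's actual internal algorithm. It shows how a layout computed deterministically from the OID lets any node recompute fragment locations without consulting a stored table.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;
    import java.util.UUID;

    public class FragmentPlacement {
        static final int NODES = 16;         // full cell
        static final int DISKS_PER_NODE = 4;
        static final int FRAGMENTS = 7;      // 5 data + 2 parity

        // Returns one (node, disk) slot per fragment, never two on one node.
        static List<int[]> place(UUID oid) {
            List<Integer> nodes = new ArrayList<>();
            for (int n = 0; n < NODES; n++) {
                nodes.add(n);
            }
            // A shuffle seeded by the OID is deterministic: the same OID
            // always yields the same layout, so no lookup table is needed.
            Collections.shuffle(nodes, new Random(oid.getMostSignificantBits()));

            List<int[]> slots = new ArrayList<>();
            for (int f = 0; f < FRAGMENTS; f++) {
                int node = nodes.get(f);  // a distinct node for each fragment
                int disk = Math.floorMod(oid.hashCode() + f, DISKS_PER_NODE);
                slots.add(new int[] { node, disk });
            }
            return slots;
        }

        public static void main(String[] args) {
            for (int[] slot : place(UUID.randomUUID())) {
                System.out.printf("fragment -> node %d, disk %d%n", slot[0], slot[1]);
            }
        }
    }

A production algorithm must also rebalance as nodes fail, join, or leave; the point of the sketch is only that placement can be computed rather than recorded.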

Stored Data Object Fragment Placement

The fact that an object is divided into data and parity fragments is invisible to the user. The system gathers and decodes the appropriate fragments to reconstruct stored data objects as read requests are made. Any node can encode and store data, and any node can decode and return it.

FIGURE 1-9 provides an example of how the fragments may be distributed in a Sun StorageTek 5800 system cluster.


FIGURE 1-9 Data and Parity Fragment Storage Example

Figure displays an example of how fragments might be distributed in a Sun StorageTek 5800 System cluster.


In FIGURE 1-9, two different object files are shown as divided into five data fragments each that are then placed on different disks in different nodes. The two parity fragments that were generated for each of the two different objects are also placed on different disks on different nodes. Because the parity fragments are generated based on the contents of the data fragments, they can be decoded in combination with data fragments to recreate missing data fragments.

The data encoding and reconstruction process means that any two disks on different nodes, or even more than two disks on the same node, can be lost without losing data. Because the fragments are distributed across nodes and no Sun StorageTek 5800 system node contains more than one fragment of a given file, even one or two simultaneous node failures remove at most two fragments of any object, which the 5+2 encoding can tolerate.

The objective here is to encode and distribute data and parity fragments in such a manner that the loss of a disk or even multiple disks does not result in a loss of data. Thus, if you lose a disk, the system regenerates the lost data elsewhere, and when you replace disks, they are automatically populated with data through background healing tasks. See the section that follows for more information.

Node Failure and Recovery

FIGURE 1-10 shows a graphical representation of a node failure. It first shows a failure in node 4 (X) and then displays the fragments on that failed node being relocated to different disks in different nodes. Specifically, the parity fragment of Object 1 is relocated to node 13, disk 1, while the data fragment of Object 2 is relocated to node 12, disk 3.


FIGURE 1-10 Node Failure and Fragment Relocation Example

Figure shows a graphical representation of a node failure and fragment relocation.


As shown in FIGURE 1-10, the recovery process restores reliability by re-creating the fragments from the failed node on other nodes in the system. Once the recovery process is complete, the system is again at full reliability. Thus, the urgency to replace the failed node is reduced, allowing for deferred maintenance.

A design that allows for deferred maintenance is a major advantage the Sun StorageTek 5800 system possesses over more traditional storage systems. When a Sun StorageTek 5800 system cell heals itself following the failure of a disk or server component, lost fragments are fully and reliably recovered without the use of dedicated, hot-spare components. As long as spare capacity remains in the system, recovery can occur over and over again without requiring the replacement of failed components. Indeed, for normal component failures, the Sun StorageTek 5800 system is intended to go four to six months without servicing.