Sun StorageTek 5800 System Client API Reference Manual

Chapter 1 Sun StorageTek 5800 System Client API

The Sun^TM StorageTek^TM 5800 system client API provides programmatic access to a 5800 system server to store, retrieve, query, and delete object data and metadata. Synchronous versions are provided in C and Java^TM languages. A future release will implement a non-blocking C API for use with POSIX operations.

This chapter provides a summary of the changes for the Sun StorageTek 5800 System 1.1 release, and overviews of the client APIs and query language.

The following topics are discussed:

Changes in Version 1.1

The following general changes have been made in Version 1.1.

Handling is added for storing, retrieving, and querying the following metadata types:
- char — for Latin 1 character set
- string — for Unicode character set
- binary
- date
- time
- timestamp
Query and queryplus are merged.
Prepared statements (pstmts) are introduced to handle the values of queries that cannot be placed inline, and a new query is introduced to handle them.
The handling of strings that are longer than the string length of the associated field has changed.

In 5800 system version 1.1, an attempt to store a value that is longer than the associated field generates an immediate error.

5800 System Overview

This section provides an overviews of the 5800 system, the 5800 system history, and a summaries of the key points of the 5800 system usage model.

The following topics are discussed:

5800 System Summary

The 5800 system is an object-based storage archive appliance for fixed-content data and metadata. The 5800 system is designed from the ground up to be reliable, affordable, and scalable, and to integrate data storage with intelligent data retrieval. It is designed to store huge amounts of data for decades at a time. At that scale, issues of how and where the data is stored — and how that changes over time — can be quite cumbersome. The 5800 system usage model is designed to manage those issues for you, so that your application can deal with just the data.

A custom Application Programming Interface (the 5800 Client API) is provided so that your applications can take advantage of all the features in the 5800 system usage model. The API provides the following capabilities:

Store a new object into the archive (storeObject)
Associate a new metadata record with stored object data (storeMetadata)
Retrieve the data from an object that was previously stored (retrieveData)
Retrieve the metadata from an object that was previously stored (retrieveMetadata)
Delete an object (delete)
Query for matching objects given a query expression of specific object characteristics (query)

The 5800 system API Release 1.1 provides two APIs:

The Java API is described in Chapter 2, Sun StorageTek 5800 System Java Client API
The C API is described in Chapter 3, Sun StorageTek 5800 System C Client API

This chapter provides a summary of key points of the 5800 system usage model that are useful for understanding either API.

In the following sections, the terms from the Java API are used as an aid to exposition. In all cases, a simple equivalent using the C API is available.

Chapter 4, Sun StorageTek 5800 System Query Language provides a detailed description of query capabilties and query syntax.
Chapter 5, Programming Considerations and Best Practices provides programming considerations and best practices that can help you create efficient 5800 system applications.

The 5800 System and Honeycomb

The original code name for the project that grew into the 5800 system was Project Honeycomb. The Honeycomb name lives on as the name of an Open Solaris community that is bringing the Honeycomb software stack into the world of Open Source. The first realization of the Honeycomb storage model as a real product is the 5800 system as described in this guide and related guides.

As a model for programmable storage systems, however, the Honeycomb API has a much broader reach than just the 5800 system. The programming model is designed to scale both up and down to any storage archive system that needs to abstract and separate the issues of how data is stored from how it is used. In recognition of both the past and the future, the string “honeycomb” and the initials “hc“ still live on in certain aspects of the API described in this guide. When the 5800 system API is used in contexts outside of the 5800 system, the API is referred to as the Project Honeycomb API.

The 5800 System Data Model

The 5800 system stores two types of data: arbitrary object data and structured metadata records. Every metadata record is associated with exactly one data object. Every data object has at least one metadata record. A unique object identifier (OID) is returned when a metadata record is stored. This OID can later be used to retrieve the metadata record or its data object. In addition, metadata records can be retrieved by a query:

OID ↔ Metadata Record -> Underlying Object Data

There are two types of metadata, system metadata and user metadata. You cannot override the names and types of system metadata.

Each object in the 5800 system archive consists of some arbitrary bytes of data together with associated metadata that describes the data. Once an object is stored, it is immutable. The 5800 system programming model does not allow the data or the metadata associated with an object to be changed once the object has been stored, in other words the system is a Write-Once Read-Multiple (WORM) archive. Each object corresponds to a single stream of data and a single set of metadata; there are no “grouped objects” or “compound objects” other than by application convention.

Each object corresponds to a single stream of data and a single set of metadata. There are no “grouped objects” or “compound objects” other than by application convention. Similarly, there are no “links” or “associations“ from one object to another. The customer application is shielded from all details of how or where the object is stored. Internally, the actual location of an object might change over time, or several objects might even share the same underlying storage. The customer application can retrieve the object without knowing these details.

A stream of data is stored in the object archive using storeObject. Once stored, each such object is associated with an object identifier or objectid (OID). The storeObject operation takes both a stream of data and an optional set of user metadata information and returns an OID. The OID can be remembered outside of the 5800 system and may later be used to retrieve the data associated with that object using the retrieveObject operation.

Every object has metadata whether or not user metadata was supplied at the time of the store. For example each object has system metadata that is system assigned and can never be modified by the user. The OID is associated with the metadata record that represents this object as a whole; the metadata record is then associated with the underlying data:

OID ↔ Metadata Record -> Underlying Object Data

The retrieveObject operation takes an OID as input and returns a stream of bytes as output that are identical to the bytes stored during the storeObject operation. Both the storeObject and retrieveObject operations handle the data in a streaming manner. Not all of the data need be present in client memory or in server memory at the same time, which is a crucial point for working with large objects.

For the 5800 system Release 1.1, data sizes up to 400 GBytes are tested and supported. Using sizes even smaller than this may be appropriate as a best practice. For more information, see Chapter 5, Programming Considerations and Best Practices.

From within a customer application, the storing of an object into the archive is an all-or-nothing event. Either the object is stored or it is not; there are no partial stores. If a store operation is interrupted, the entire storeObject call fails. Once an OID is returned to the customer application, the object is known to be durable. In the event of an outage that causes some data loss, the system should be no more likely to lose a newly stored object than any other object. There is no way to tie together two different store operations so that both either succeed together or fail together.

Note –

A stored object may or may not immediately be queryable. For more information, see The 5800 System Query Integrity Model.

The 5800 System Metadata Model

Metadata means “data about the data”; it describes the data and helps to determine how the data should be interpreted. In addition, metadata can be used to facilitate querying the 5800 system for objects that match a particular set of search criteria.

For the 5800 system, the supported metadata option is in the form of name-value fields stored with each object. The set of possible fields is defined in the metadata schema. Setting up a metadata schema is an important system administration task that is described in the 5800 System Administration Guide, and is analogous to the process of database design that goes into creating a data management application. The metadata schema determines what field names, types, and lengths may be used with the metadata stored with each object. In addition, the layout of fields into tables within the schema, together with the definition of views that speed certain searches, determine which kinds of queries about that metadata will be both possible and effective. As such, the metadata schema should match the characteristics of the expected range of applications that will deal with the stored data. The underlying software is designed to support multiple different kinds of metadata to aid in searching. For example, eventually there might be a specialized index to facilitate full-text search within the data objects. This document describes only the API for dealing with the name-value metadata type.

Fields in the schema can be either queryable or non-queryable. The values for non-queryable fields may be retrieved later but may not be used in queries. The 5800 system supports only single-valued fields. Each object can have only a single name-value pair of a given name. There is no built-in support for multiple-valued fields, such as a list of authors of a book in the form of multiple fields named 'author'.

Each data object is associated with a set of name-value pairs at the time the object is stored. Some metadata (system metadata) is assigned by the5800 system as each object is stored. For example, each object contains an “object creation time” (system.object_ctime) and an OID (system.object_id), both of which are assigned by the system at the time an object is created. Some metadata (the computed metadata) is implicit in the stored data, and is made explicit at the time of the object store. For example, the system exposes the object data length as a metadata field (system.object_size). In addition, the 5800 system computes a Secure Hash Algorithm (SHA1) hash of the stored data as the data is stored and stores the hash as a metadata field (system.object_hash). There is also an associated field (system.object_hash_alg) to specify which hash algorithm was used in computing the system.object_hash. It is currently always set to “sha1.”

Finally, some metadata (the user metadata) is supplied by the customer application in the API call at the time an object is stored. Each store operation is allowed to include a NameValueRecord that indicates a set of name-value pairs to be associated with the data object as metadata. Each name in the name-value record must match a field name from the metadata schema; in addition, the data value supplied for each field must match the type and length for the field as specified in the schema. If the names or values supplied for the user metadata do not match the active schema, then an exception is generated and the object is not stored.

The metadata associated with an object is immutable. There is no operation to modify the metadata associated with an object after the object has been stored. Instead, the storeMetadata operation can be used to create a completely new object by associating new user metadata with the underlying data and system-metadata of an existing object. The storeMetadata operation does not merge the new metadata in with the metadata from the original OID; instead, the storeMetadata operation creates a new metadata record pointing to the same data object. To accomplish a merge of new field values into existing metadata, the customer application must manually retrieve the existing metadata from the original object, perform the merge into a single NameValueRecord on the client side, and then call storeMetadata to create a new object with the merged metadata.

When creating a new object using storeMetadata, a new system.object_id and new system.object_ctime are generated, to indicate that a new object has been created. The metadata computed from the object data itself (system.object_length, system.object_hash_alg, and system.object_hash) does not change. Both the storeObject and the storeMetadata operations return a SystemRecord value that includes all of the system-assigned fields.

While retrieving the OID is the most common use of the SystemRecord, the other system fields can also be helpful. For example, the customer application might use the system.object_length, the system.object_hash_alg and the system.object_hash fields to verify that the data as stored matches the data as present in the customer application. If a hash independently computed on the client matches the hash stored on the 5800 system, then the data store has been validated.

The metadata values associated with an object can be retrieved using the retrieveMetadata operation. The retrieveMetadata operation takes an OID as input, and returns the entire set of user, system, and system-computed metadata. The retrieved metadata is in the form of a NameValueRecord that contains the value of each field as originally stored. The system fields occur using their field names, for example. the field system.object_ctime contains the object creation time. There is no operation to retrieve just a single field or a subset of fields by supplying a list of field names. The retrieveMetadata operation retrieves the values of both queryable and non-queryable fields.

The 5800 System Query Model

One of the primary methods for retrieving data is to specify the characteristics of the desired data and then let the system find it for you. In the 5800 system, a query expression specifies a set of conditions on metadata field values. The system then returns a list of all the objects whose metadata values match the query conditions. Each object is considered individually without reference to any other objects. There are no queries that compare fields in one object with fields in a different object.

Query expressions can use much of the power of Structured Query Language (SQL). Each query expression combines SQL functions and operators, field names from the metadata schema, and literal values. There are no query expressions that select objects based on the data stored in the object itself; all queries apply only to the metadata fields associated with the object. Only queryable fields can be used in query expressions. For an object to show up in a query result set, the object must have a value for each of the fields mentioned in the query; in other words, there is an implicit INNER JOIN between the fields in the query.

A query may optionally specify that the result set should include not just the OID of each matching object, but also the values from a set of selected fields of each matched object . The value retrieved by Query With Select for some field may be a canonical equivalent of the value originally stored in that field. For example, values in numeric fields may have been converted to standard numeric format. Trailing spaces at the end of string fields will have been truncated (The value that is returned will be some value that would match the original data as stored, in the SQL sense.) To be included in the result set, an object must include values for all queried fields and all selected fields. In other words, there is an implicit INNER JOIN between all the fields in the query and in the select list.

There are significant limitations on which queries may be executed efficiently, or at all. See Chapter 4, Sun StorageTek 5800 System Query Language and Chapter 5, Programming Considerations and Best Practicess for details of these limitations.

There are no ordering guarantees between queries and store operations that are proceeding at the same time. If an object is added to the 5800 system while a query is being performed, and the object matches the query, then the object may or may not show up in the query result set.

For a detailed description of query syntax and query semantics, including a description of exactly what it means for an object to match a query, see Chapter 4, Sun StorageTek 5800 System Query Language.

The 5800 System Query Integrity Model

The result set of any query will only return results that match the query. But will it return ALL the matching results? That is the concept of query completeness, referred to here as query integrity. 100% query integrity for a result set is defined as a state in which the result set contains all the objects in the 5800 system that match that particular query. The 5800 system is not always in a state of 100% query integrity. Various system events can induce a state in which the set of objects that are available for query is smaller than the total set of objects stored in the archive. Each query result set supports an operation (isQueryComplete) whereby the customer application can ask, once all the results from the query result set have been processed, whether that set of results constitutes a complete set.

Note –

The format of records as stored in the reliable and scalable object archive is not suitable for fast query. To enable searching, the queryable fields from the metadata are indexed in a query engine that can provide fast and flexible query services. The query engine is basically an SQL database. This is why the 5800 system's query language can borrow so heavily from SQL. At various times, the data as indexed in the query engine can get out of date compared to what is stored in the archive. When this happens, query result sets are not known to be complete until the contents of the query engine can be brought back up to date with the actual contents of the archive again.

The 5800 system concept of query integrity as actually implemented is somewhat looser than that of 100% query integrity. Even if a query result set indicates the result set is complete, the 5800 system allows certain objects, known as store index exceptions, to be missing from the query result set, as long as those exceptions were communicated to the customer application at the time the object was stored.

A store index exception is an object for which the original store of the object into the archive succeeded, but at least some part of the insert into the query engine (database) did not succeed. The object may or may not show up in all of the queries that it matches. A store index exception is communicated to the customer application at the time of store by means of a method SystemRecord.isIndexed. A value of false from isIndexed means that the object is not immediately available for query.

A store index exception is said to be resolved when the object becomes available for query. The checkIndexed method can be used to attempt to resolve a store index exception under program control. The checkIndexed operation checks if the object has been added to the query engine, and attempts to insert it if the object has not been added. If the insert into the query engine succeeds, the object is thereby restored to full queryability.

All store index exceptions will also eventually be resolved automatically by ongoing system healing. Each query result set also exports a method getQueryIntegrityTime that can be used to get detailed status on which store index exceptions might still be unresolved. The query integrity time is a time such that all store index exceptions from before that time have been resolved. There is an “ideal” query integrity time, which is the time of the oldest still-unresolved store index exception: an ideal implementation when asked for the query integrity time would always report this ideal value. In actual implementation, the reported query integrity time might be hours or even days earlier than the ideal query integrity time, depending on how far the ongoing system healing has progressed.

Deleting Objects from the 5800 System

The 5800 system client API exports an operation to delete a specific object as specified by its OID. Once a delete operation completes normally, subsequent attempts to retrieve that object will fail with an exception. In addition, the object will stop showing up in query result sets that match the original object metadata. There are no transactional guarantees regarding ordering of queries and delete operations that are occurring at the same time. If an object is being deleted at the same time that a query that matches that object is being performed, then that object may or may not show up in the query result set, with no guarantee either way.

Note –

When all objects that share an underlying block of data storage been deleted, the underlying block of data storage will itself be scavenged and returned to the supply of free disk space. But all details of how objects are stored, and how and whether they ever share data — or ever are scavenged — are outside of the scope of this API.

Delete operations are all-or-nothing,with some caveats. Specifically, if a delete operation fails with an error, it is possible that the object is not fully deleted but is temporarily not queryable. Such an object is in an analogous state to a store index exception (see The 5800 System Query Integrity Model). The queryability of such an object will eventually be resolved by automatic system healing. In addition, the queryability of such an object can be resolved under program control by using the checkIndexed method. Alternatively, the customer application may choose to re-execute the delete operation until it succeeds, or until it fails with an error that indicates the object is already deleted.