C H A P T E R 4 - Configuring Metadata and File System Views

This chapter explains metadata and the system schema, discusses upgrading the schema configuration, and then describes file system views and their definition. It contains the following sections:

Understanding Metadata and the System Schema

Upgrading the Schema Configuration

Understanding File System Principles

Understanding Metadata and the System Schema

Metadata is extra information about a data object. Each object stored in the Sun StorageTek 5800 system can have one or more metadata descriptions attached to it. A description can contain any kind of information about the data.

Metadata can come from one of three sources:

The data itself. For example, metadata might involve automatically extracting ID3 tags from an MP3 file, or extracting appropriate strings to later perform a text search.

Note - Automated metadata extraction is not supported in release v1.0 of the Sun StorageTek 5800 system.

Information explicitly provided at the time an object is stored. For example, metadata might include the patient name, social security number, or body part that you enter when storing a medical image. (This is not necessarily derived from the data).

Metadata created by the Sun StorageTek 5800 system itself. Client applications cannot write these attributes.

The Sun StorageTek 5800 system's metadata space is partitioned into namespaces (a namespace prefix is a string followed by a period) for greater naming flexibility. The system reserves the system namespace for storing the metadata that the Sun StorageTek 5800 system itself creates.

The system namespace includes a unique identifier for each stored object, called the object ID or OID. Namespaces are collections of names, identified by a uniform resource identifier (URI), that are used in XML documents as element types and attribute names. Using namespaces is completely optional for user metadata. However, the Sun StorageTek 5800 system reserves the system and filesystem namespaces.

Specifically, the filesystem namespace affects how the file system layer presents files. For example, it may include the user identifier (UID) and group identifier (GID). For more information on namespaces, see Metadata Configuration and the XML File Structure, while Understanding File System Principles provides more detail on file systems.

Relational Metadata

The only supported metadata format in the Sun StorageTek 5800 system release 1.0 is a set of typed name-value pairs. Because the values are typed, the Sun StorageTek 5800 system must be preconfigured with the names and their value types.

TABLE 4-1 lists the supported metadata types for version 1 of the Sun StorageTek 5800 system.

TABLE 4-1 Valid Metadata Types
Valid Types	Description
Long	64 bits Maximum Value: 9223372036854775807 Minimum Value: -9223372036854775808
Double	64 bits Maximum Value: 1.7976931348623157E308d Minimum Positive Value: 4.9E-324d
String	512 characters
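
The limits in TABLE 4-1 can be checked on the client side before a value is stored. The following Python sketch is illustrative only: the function name is_valid_value and the lowercase type names are assumptions made for this example, not part of the Sun StorageTek 5800 system API.

```python
# Hypothetical client-side check of the three value types in TABLE 4-1.
LONG_MIN = -(2 ** 63)      # -9223372036854775808
LONG_MAX = 2 ** 63 - 1     #  9223372036854775807
STRING_MAX_CHARS = 512


def is_valid_value(type_name, value):
    """Return True if value fits the named metadata type."""
    if type_name == "long":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and LONG_MIN <= value <= LONG_MAX)
    if type_name == "double":
        # Python floats are 64-bit IEEE 754 doubles.
        return isinstance(value, float)
    if type_name == "string":
        return isinstance(value, str) and len(value) <= STRING_MAX_CHARS
    raise ValueError("unknown metadata type: " + type_name)
```

For example, is_valid_value("string", "x" * 513) returns False because the string exceeds 512 characters.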

Note that when performing a storage operation, you always store some metadata. If no name-value pairs are specified, the metadata contains only the fields populated by the Sun StorageTek 5800 system, such as creation time (ctime), data length, and data hash. The returned OID provides a reference to both the data you just stored and its metadata.

You can then use an API function call (addMetadata) to attach an extra metadata description to an existing piece of data. As shown in FIGURE 4-1, this operation returns a new OID that designates the newly-stored metadata, as well as the data that was already in the system.

FIGURE 4-1 The addMetadata Operation

Figure graphically shows the addMetadata operation and how it attaches an extra metadata description to an existing piece of data.

Metadata Structure and the Schema

The Sun StorageTek 5800 system uses an embedded database to perform its search functions. (This database is also referred to as the metadata index.) To make this index as efficient as possible, the metadata attributes and their types are predefined in a schema.

The schema specifies what the Sun StorageTek 5800 system metadata permits and how it is structured. It consists of the list of expected fields, their types, and a designation of whether or not they should be included in the index. A predefined schema is necessary to create internal indices of the actual metadata. In the Sun StorageTek 5800 system, the predefined schema contains the minimum set of attributes (that is, only the system and filesystem namespaces).

Metadata Configuration and the XML File Structure

All metadata-specific configuration for the Sun StorageTek 5800 system is stored in an XML file. XML offers a widely adopted, standard way of representing text and data in a format that can be processed with relatively little human intervention and exchanged across diverse hardware, operating systems and applications.

XML is designed to describe information and its composition, while the HyperText Markup Language (HTML) is designed to display information. The tags you use to mark up HTML documents and the document's structure are predefined, so that you can only use tags that are defined in the HTML standard. By contrast, XML is extensible and allows you to define your own tags and your own document structure.

The metadata-specific configuration file for the Sun StorageTek 5800 system uses a standard XML file format and is structured as follows:

<?xml version="1.0" encoding="UTF-8"?>

<metadataConfig>

<schema>

Schema definition

</schema>

<fsViews>

File system views specification

</fsViews>

</metadataConfig>

Note - Only one <schema> and one <fsViews> subsection are permitted here.
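
The one-schema, one-fsViews rule can be checked locally before a file is submitted. The following Python sketch uses only the standard library. The function name check_structure and the sample document are assumptions made for illustration; the system's own validation is performed by the mdconfig command described in Upgrading the Schema Configuration.

```python
import xml.etree.ElementTree as ET

# Illustrative configuration: one <schema> and one <fsViews> subsection.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadataConfig>
  <schema>
    <namespace name="mp3">
      <field name="artist" type="string"/>
    </namespace>
  </schema>
  <fsViews/>
</metadataConfig>
"""


def check_structure(xml_bytes):
    """Return a list of structural problems; an empty list means none found."""
    root = ET.fromstring(xml_bytes)
    problems = []
    if root.tag != "metadataConfig":
        problems.append("root element must be metadataConfig")
    for tag in ("schema", "fsViews"):
        count = len(root.findall(tag))
        if count != 1:
            problems.append("expected exactly one <%s>, found %d" % (tag, count))
    return problems
```

This check covers only the top-level layout. It does not validate namespaces, fields, or views.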

The general schema subsection format is then organized as follows:

<schema>

<namespace name="namespace" [writable="true"] [extensible="true"]>

</namespace>

...

</schema>

The namespace referenced here is a collection of names, identified by a URI, that are used in XML documents as element types and attribute names. Namespaces exist simply to keep names from separate sources from colliding unintentionally. You can have as many namespaces as desired, and there is no limit on the number of namespaces that can be nested within a given namespace.

When defining a namespace, you can define two optional properties:

Writable

A writable namespace applies to the file system interface using WebDAV. Specifically, it means that you can specify any field in that domain at the time an object is stored. Once a domain (namespace) is not writable, its subdomains cannot override the value and become writable. The default value is true.

Extensible

You can extend an extensible domain in a subsequent configuration update by adding more attributes or more subdomains to it. Once a domain is not extensible, its subdomains cannot override the value and become extensible. The default value is true.
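
The inheritance rule for both properties can be sketched as follows: a namespace is effectively writable (or extensible) only if it and every enclosing namespace carry the value true. This Python example is a minimal illustration under that reading of the rule. The function name effective_flags and the namespace names used in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET


def effective_flags(element, parent_writable=True, parent_extensible=True,
                    prefix=""):
    """Yield (qualified name, writable, extensible) for each nested namespace."""
    for ns in element.findall("namespace"):
        name = prefix + ns.get("name")
        # Both properties default to true when the attribute is omitted.
        writable = parent_writable and ns.get("writable", "true") == "true"
        extensible = parent_extensible and ns.get("extensible", "true") == "true"
        yield name, writable, extensible
        # A false value on a parent cannot be overridden by a child.
        yield from effective_flags(ns, writable, extensible, name + ".")
```

Given a namespace archive with writable="false" that contains a namespace photos with writable="true", the sketch reports archive.photos as not writable.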

Fully-Qualified Names

When using an attribute name in an application, such as when storing metadata or querying, the form of the attribute name that is used is always the fully-qualified name. The fully-qualified name consists of all the enclosed namespace names from the broadest to the narrowest, separated by dots, followed by the attribute name itself. For example, the fully-qualified name of the attribute in the general schema subsection format shown previously is namespace.subnamespace.fieldName.
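
As an illustration of how fully-qualified names are built, the following Python sketch walks a schema fragment and joins the enclosing namespace names with dots. The function name qualified_field_names is an assumption made for this example; the field element and its name attribute follow the schema DTD shown in Schema Definition.

```python
import xml.etree.ElementTree as ET


def qualified_field_names(element, prefix=""):
    """Return the fully-qualified name of every field under element."""
    names = [prefix + field.get("name") for field in element.findall("field")]
    for ns in element.findall("namespace"):
        # Descend from the broadest namespace to the narrowest.
        names.extend(qualified_field_names(ns, prefix + ns.get("name") + "."))
    return names


EXAMPLE = ET.fromstring(
    '<schema>'
    '<namespace name="namespace">'
    '<namespace name="subnamespace">'
    '<field name="fieldName" type="string"/>'
    '</namespace>'
    '</namespace>'
    '</schema>')
```

Calling qualified_field_names(EXAMPLE) returns a list containing the single name namespace.subnamespace.fieldName.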

DNS Namespaces

Namespaces are arbitrary. However, as a convention it is recommended that you ensure that the namespaces relate directly to Domain Name Service (DNS) names, in a way similar to how Java classes are named.

DNS is a hierarchical naming system provided for computer networks. It is broken up hierarchically into domains and subdomains, which are logical organizations of computers that exist in a larger network. These domains exist at different levels and connect in a hierarchy that resembles the root structure of a tree.

Domain names are assigned by a well-known process through the Internet Assigned Numbers Authority (IANA), in conjunction with naming authorities within each commercial or government entity. As a result, DNS names from different organizations are already guaranteed not to conflict. This uniqueness is the only property of DNS names that the Sun StorageTek 5800 system relies on.

For example, it is recommended that an organization whose DNS name is sample.company.com create a set of Sun StorageTek 5800 system namespaces under the namespace heading com.company.sample. If you follow this convention, you can guarantee that storage items from separate sources that are combined into a single Sun StorageTek 5800 system cell will not conflict in their metadata namespaces, even if the applications were not originally designed to work together. Note that this is the same convention by which Java class names are assigned.
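
The convention amounts to reversing the labels of the DNS name. A minimal Python sketch follows; the function name namespace_from_dns is hypothetical, and lowercasing the result is an assumption made for this example.

```python
def namespace_from_dns(dns_name):
    """Reverse the labels of a DNS name to form a namespace heading."""
    labels = dns_name.strip(".").lower().split(".")
    return ".".join(reversed(labels))
```

For example, namespace_from_dns("sample.company.com") returns com.company.sample.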

Reserved Domains

TABLE 4-2 lists the domains that Sun StorageTek 5800 system reserves.

TABLE 4-2 Sun StorageTek 5800 System Reserved Domains
Name	Writable	Extensible
system	false	false
filesystem	true	false
filesystem.ro	false	false

TABLE 4-3 lists the content of the reserved system domain.

TABLE 4-3 Contents of the Reserved system Domain
Attribute Name	Definition
system.object_id	The object identifier
system.object_ctime	Creation time
system.object_layoutMapId	Layout map used to store the object
system.object_size	Data size
system.object_hash	Hash value for the data
system.object_hash_alg	Algorithm used to compute the hash (for example, SHA1)

Schema Definition

The purpose of a Document Type Definition (DTD) is to define the legal building blocks of an XML document. The DTD defines the document structure with a list of legal elements, thus providing an application-independent way of sharing data.

For example, independent groups of people can agree to use a common DTD for interchanging data. The DTD allows you to check the logical structure of the data that you receive from the outside world and also verify your own data.

In the Sun StorageTek 5800 system, you define a schema using the schema DTD. The XML syntax you need to follow here is as follows:

<!DOCTYPE HCMETADATACONFIG [

<!ELEMENT metadataConfig (schema, fsViews)>

<!ELEMENT schema (namespace*, field*)>

<!ELEMENT namespace (namespace*, field*)>

<!ELEMENT field EMPTY>

<!ELEMENT fsViews (fsView*)>

<!ELEMENT fsView (attribute+)>

<!ELEMENT attribute EMPTY>

<!ATTLIST namespace name CDATA #REQUIRED>

<!ATTLIST namespace writable (true|false) "true">

<!ATTLIST namespace extensible (true|false) "true">

<!ATTLIST field name CDATA #REQUIRED>

<!ATTLIST field type (long|double|string|date|time|blob) #REQUIRED>

<!ATTLIST field indexable (true|false) "true">

<!ATTLIST fsView name CDATA #REQUIRED>

<!ATTLIST fsView filename CDATA #REQUIRED>

<!ATTLIST fsView namespace CDATA #IMPLIED>

<!ATTLIST attribute name CDATA #REQUIRED>

<!ATTLIST attribute unset CDATA #IMPLIED>

]>

Note that the only schema changes you can make in the Sun StorageTek 5800 system are:

Adding new attributes in an existing namespace (if the namespace is extensible)

Creating a new namespace (if the parent namespace is extensible)

Creating new file system views

Upgrading the Schema Configuration

To upgrade the schema configuration:

1. Create a schema overlay to extend an existing schema.

A schema overlay is an XML file that follows the specification shown in Metadata Configuration and the XML File Structure. It contains only the new namespaces and fields that you wish to add.

Note - For version 1 of the Sun StorageTek 5800 system, you must edit the XML file manually.

If you wish, you can use mdconfig followed by the -t (--template) option. This returns an XML template file that you can use as a starting point to create that overlay.

Once a version of the overlay is available, you can validate it through the CLI. The validation ensures that the XML syntax is correct and provides an overview of the operation that will be performed if the overlay is committed.

2. Use the mdconfig command with no arguments to perform a validation.

For example, to validate the local overlay.xml file, type the following:

hc $ cat overlay.xml | ssh admin@<ADMIN IP> mdconfig

Once you are satisfied with the overlay, you must commit it so the Sun StorageTek 5800 system can go ahead and execute it.

3. Use mdconfig followed by the -c (--commit) option to commit the overlay.xml file.

For example, to continue the previous example, enter:

hc $ cat overlay.xml | ssh admin@<ADMIN IP> mdconfig -c

Note - The commit option first runs a validation before performing the commit operation. If the XML syntax is not correct, an error is produced.
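
The validate-then-commit sequence above can be scripted. The following Python sketch builds the same ssh command lines shown in steps 2 and 3 and pipes the overlay file to standard input. The function names and the address used in the usage note are hypothetical. The sketch returns the raw exit status without interpreting it, because this chapter does not document the exit codes of mdconfig.

```python
import subprocess


def mdconfig_command(admin_ip, commit=False):
    """Build the ssh command line; with commit=True, add the -c option."""
    command = ["ssh", "admin@" + admin_ip, "mdconfig"]
    if commit:
        command.append("-c")
    return command


def send_overlay(path, admin_ip, commit=False):
    """Pipe the overlay file to mdconfig, as 'cat overlay.xml | ssh ...' does."""
    with open(path, "rb") as overlay:
        result = subprocess.run(mdconfig_command(admin_ip, commit),
                                stdin=overlay, check=False)
    # The meaning of the exit status is not documented here; return it as-is.
    return result.returncode
```

For example, send_overlay("overlay.xml", "192.0.2.10") runs a validation, and send_overlay("overlay.xml", "192.0.2.10", commit=True) commits the overlay.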

Understanding File System Principles

The Sun StorageTek 5800 system's file system is built on top of the metadata. Thus, a file system, or virtual view, specifies the metadata attributes you wish to use to browse the Sun StorageTek 5800 system.

Basically, a view transforms the simple collection of name-value pairs to a path (that is, a hierarchical directory structure). For the version 1.0 release, the file system view is accessible only through Web-based Distributed Authoring and Versioning (WebDAV).

Using WebDAV for File Browsing

WebDAV is a set of extensions to the HTTP protocol that enable you to read and delete files on remote web servers. Specifically, DAV is an extension of the HTTP/1.1 protocol that adds new HTTP methods and headers.

Each WebDAV request initiates a separate WebDAV operation. GUI and CLI clients then enable you to browse WebDAV collections as you would on any web site. Thus, you can browse WebDAV collections with a standard web browser, such as Internet Explorer; a graphical file browser, such as Konqueror or Nautilus (UNIX) or Finder (OS X); or a CLI client, such as Cadaver (UNIX). Go to the top-level webdav directory to see the file system views.

For example, you can define the following sample file system view for MP3s on a cluster:

<fsView name="byArtist" namespace="mp3" filename="${title}.${type}">

<attribute name="artist"/>

<attribute name="album"/>

</fsView>

Note the following about this example:

Each fsView entry specifies a new file system abstraction.

The fsView element takes the following XML attributes:

name is the top-level directory name of the abstraction.

namespace defines a schema namespace prefix for all of the names used (that is, title is parsed as mp3.title).

filename defines the form of the files (leaves) that the file system exposes in that view.

In addition to these XML attributes, each fsView contains an ordered list of attribute elements that defines the view. Each attribute element corresponds to one directory level, in the order listed.

The following path:

/webdav/byArtist/Charlie Parker/Bird and Diz/Bloomdido.ogg

represents the query

mp3.artist='Charlie Parker' & mp3.album='Bird and Diz' & mp3.title='Bloomdido' & mp3.type='ogg'
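
The mapping from a path to a query can be sketched in Python as follows. This is an illustration of the principle, not the system's implementation: the function names are hypothetical, the view definition is passed in as plain values rather than read from the configuration file, and quote characters inside values are not escaped.

```python
import re


def filename_regex(pattern):
    """Turn a pattern such as "${title}.${type}" into a regex with named groups."""
    # re.split with a capturing group alternates literal text and variable names.
    tokens = re.split(r"\$\{(\w+)\}", pattern)
    parts = []
    for index, token in enumerate(tokens):
        parts.append("(?P<%s>.+)" % token if index % 2 else re.escape(token))
    return re.compile("".join(parts)), tokens[1::2]


def path_to_query(path, namespace, attributes, filename):
    """Map /webdav/<view>/<dir>/.../<leaf> to the equivalent metadata query."""
    parts = path.strip("/").split("/")
    # parts[0] is "webdav" and parts[1] is the view name.
    directories, leaf = parts[2:-1], parts[-1]
    if len(directories) != len(attributes):
        raise ValueError("path depth does not match the view definition")
    regex, names = filename_regex(filename)
    match = regex.fullmatch(leaf)
    if match is None:
        raise ValueError("file name does not match the view's filename pattern")
    pairs = list(zip(attributes, directories))
    pairs += [(name, match.group(name)) for name in names]
    return " & ".join("%s.%s='%s'" % (namespace, n, v) for n, v in pairs)
```

Called with the path above, the namespace mp3, the attribute list ["artist", "album"], and the filename pattern "${title}.${type}", the function returns the query shown above.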

In the browser's address bar, you type the following to display a page of album names:

webdav://dev321-data:8080/webdav/byArtist/U2

You can then click a particular album name, and a web page with album tracks appears as with any ordinary web page.

Metadata Attributes and WebDAV Properties

Each file in the Sun StorageTek 5800 system archive appears as a file in the exported file system. The file attributes (stat data) are exported as WebDAV properties. TABLE 4-4 lists the WebDAV property names and user metadata attributes. Be aware that these attributes are regular metadata values accessible through API queries.

TABLE 4-4 WebDAV Property Names and User Metadata Attributes
	WebDAV Property	Metadata Attribute	Description
Pre-defined properties	DAV:getlastmodified	filesystem.mtime	Last modification time
	DAV:getcontentlength	system.object_size	Size of file
	DAV:creationdate	system.object_ctime	File creation time
	DAV:getcontenttype	filesystem.mimetype	MIME type
	DAV:displayname	<filename>	Name presented to user
New properties	HCFS:mode	filesystem.mode	File mode (permissions etc.)
	HCFS:uid	filesystem.uid	Owner ID
	HCFS:gid	filesystem.gid	Group ID

Note - The timestamps are all 64-bit signed offsets, in milliseconds, from the epoch (00:00:00 1/1/1970 Coordinated Universal Time (UTC)). The resulting range is approximately 300 million years.

The file size, uid and gid are numeric, unsigned 64-bit integers, while the creationdate property is returned as an ISO 8601 localized string. The format for getlastmodified is not defined by the protocol, so a value similar to that of date(1) is returned.
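
The timestamp arithmetic can be illustrated with a short Python sketch. The function names are hypothetical. The sketch formats values in UTC for simplicity, whereas the system returns a localized string for creationdate.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_iso8601(milliseconds):
    """Convert a millisecond offset from the epoch to an ISO 8601 string (UTC)."""
    return (EPOCH + timedelta(milliseconds=milliseconds)).isoformat()


def signed_64bit_range_years():
    """Approximate span, in years, of a signed 64-bit millisecond offset."""
    ms_per_year = 365.25 * 24 * 60 * 60 * 1000
    return 2 ** 63 / ms_per_year
```

With these definitions, signed_64bit_range_years() evaluates to roughly 292 million, which is consistent with the approximate range of 300 million years given in the note.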

Virtual View Definition

A view presents stored data and metadata as a virtual, read-only file system. For example, for an MP3 file system view, you may present a view with a directory for the artist, a subdirectory for the album, and file names based on the title of the music files.

File system view declarations tell the file system protocol layers, such as WebDAV, what to export. To create a new virtual view, you add an entry in the <fsViews> section of the XML overlay file shown in Metadata Configuration and the XML File Structure.