Sun StorageTek 5800 System Version 1.1 Release Notes
This document contains important information about the Sun StorageTek 5800 System, Version 1.1. Read this document so that you are aware of issues or requirements that can affect the installation and operation of the 5800 system.
This document contains the following sections:
This section includes a description of the major enhancements available with Version 1.1.
Version 1.1 provides support for the following configurations:
Users can start with a half-cell configuration, add 8 nodes to make a full-cell configuration, and then add another full cell for a configuration with two full cells.
When the cell is expanded from 8 to 16 nodes, the 5800 system rebalances data previously stored in the 8-node cell across all 16 nodes.
When an additional full-cell system is added, new data can be stored to any cell in the configuration (it is usually stored on the cell with the most capacity).
As part of multicell support, some CLI commands were changed. The ifconfig command is no longer available and has been replaced with the cellcfg and hivecfg commands.
Version 1.1 includes a Graphical User Interface (GUI) for performing configuration and management functions.
Version 1.1 allows for complete restoration of data in the event of a catastrophic loss of the system using the Network Data Management Protocol (NDMP) with NetVault, Version 7.4.5, from BakBone Software.
Version 1.1 includes hot-pluggable disks that can be removed and replaced easily after issuing the hwcfg command to disable the disk. After the disk is replaced, you issue the hwcfg command again to enable the disk.
The process of upgrading from Version 1.0.1 to Version 1.1 should be performed only by Sun service personnel. Before the service person performs the upgrade, please keep these considerations in mind:
OID conversion is not required if you access data exclusively via Web-based Distributed Authoring and Versioning (WebDAV). OID conversion is also not required if you store data with metadata, and access the data by querying the metadata. The query returns an OID, which the clients then retrieve. The clients never store or remember OIDs.
Consult Sun service personnel for complete information about the upgrade process.
If your 5800 system uses the Sun Fire X2100 Server for a service node, you may notice some issues with the power LED.
To determine if your system uses the Sun Fire X2100 Server for a service node, compare the service node on your system with the front and back views of the Sun Fire X2100 Server, as shown in FIGURE 1 and FIGURE 2.
FIGURE 1 Front View of the Sun Fire X2100 Server
FIGURE 2 Rear View of the Sun Fire X2100 Server
If your system uses the Sun Fire X2100 Server and the power LED is lit, you can determine whether the service node is powered on or off by pushing in the CD/DVD drive eject button. If the LED on the caddy starts to blink, the service node is powered on. If the LED on the caddy does not light, the service node is powered off.
If you receive email alerts and syslog messages indicating that one or more nodes are offline, inspect the 5800 system to determine if the secondary switch (the top switch on the system) is active. FIGURE 3 shows the components on the back of the switch. If the port connection status LEDs are mostly lit for the top switch, that secondary switch is active.
If the secondary (top) switch is active, contact Sun service immediately to arrange for a replacement of the primary (bottom) switch.
Until the primary switch is replaced, the system might encounter the following issues while operating from the secondary switch:
This section lists the client operating systems from which you can run applications that store, retrieve, and query data on the 5800 system. The applications can be written in either the C or Java programming languages, using the 5800 system application programming interface (API):
You can access the 5800 system GUI using the following browsers:
The browser must be running Version 1.5 or 1.6 of the Java Runtime Environment.
You can access the data on the 5800 system using Web-based Distributed Authoring and Versioning (WebDAV).
You can read the data on the system using WebDAV from any Hypertext Transfer Protocol (HTTP) browser running on any system that is on the same network as the 5800 system.
For full read and write access to data on the 5800 system, you can use the free software cadaver (a command-line WebDAV client for Unix) or neon (an HTTP and WebDAV client library with a C interface). Consult the following URL for more information about cadaver and neon:
Mac OS X allows you to mount the 5800 system as a network share and gain read and write access via WebDAV to the data on the 5800 system.
The WebDAV implementation on the 5800 system has also been tested with the KDE Konqueror, Version 3, browser, which provides full read/write WebDAV access.
The use of WebDAV in multicell configurations is not supported.
This section provides information about functional limitations and bugs that were described in the Version 1.0.1 release notes, and which have been resolved in this Version 1.1 product release.
This section provides information about functional limitations and bugs in this version of the product release. Note that if a recommended workaround is available for a bug, it follows the bug description.
Bug 6331523 - If transient node failures have caused the system to fall below quorum, data services might remain unavailable even after quorum is regained.
Workaround - Reboot the system to bring data services back online.
Bug 6403951 - The emulator supports the Delete Record operation of NameValueObjectArchive.delete and hc_delete_ez. However, the emulator does not remove the underlying data file when the last metadata record is deleted. The semantics are correct, but the underlying space is not reclaimed.
Bug 6406170 - When you make a configuration change, certain properties require a reboot to take effect. Once the change is entered, however, you can no longer determine the current value, since the cellcfg command shows the new (pending) value instead. You also cannot tell that the displayed value is a pending value and that a reboot is still required.
Bug 6407787 - Even after the system has healed a disk, that disk may still be included in the disks unrecovered count displayed by the sysstat command.
Workaround - When the system is rebooted, the disks unrecovered count is reset to an accurate number.
Bug 6450745 - In rare cases, the query engine might hang in the starting or stopped state.
Workaround - Try rebooting the system to create the query engine and repopulate it with metadata. The process could take 12 to 48 hours.
Bug 6451150 - Sometimes when you issue the CLI commands shutdown or reboot, the system returns the messages “It is not safe to shut down the system” or “It is not safe to reboot the system.” These messages indicate that the system is in the process of initializing the query engine.
Workaround - Although you can continue with the shutdown or reboot process, for best performance, wait until the query engine is fully initialized before proceeding.
Bug 6458160 - The use of some characters in the file name specification for a virtual file system view might cause parsing errors.
Workaround - Do not use the - character or other Unicode characters to specify the file name of a virtual file system view.
Bug 6458653 - To ensure the integrity of data on the 5800 system, the system must be operated only on a secure, internal network.
Bug 6464055 - In the schema definition file, you can specify a metadata field as queryable = false. If you later change the schema definition file to indicate that queryable = true for that field, any data that you add to the system after the change includes that field as a queryable field. However, data that was previously stored on the system is not updated and is not queryable with that field.
Bug 6464866 - It is not possible to clear metadata schema after it is configured.
Workaround - If you need to clear fields from the metadata schema, either wipe all hive data (which clears the schema as a side effect), or contact Sun support for assistance.
Bug 6481476 - The system may respond to some queries with an out-of-memory error message.
Workaround - When developing queries using the Java API, set maxFetchsize in the range of 2000 to 5000.
Bug 6489627 - When the system first starts up, data operations may fail, even if the CLI reports that data services are online.
Workaround - Wait for all disks to come online (run the sysstat or hwstat -v commands to determine the number of disks online). If the problem persists after all working disks are online, retry the operations from the client following the best practices described in the Sun StorageTek 5800 System Client API Reference Guide. If necessary, contact Sun services to replace bad drives.
Bug 6491877 - If clients attempt a large number of concurrent deletes, the system might go offline.
Workaround - Avoid large numbers of concurrent deletes; if the system goes offline, reboot to bring it back online.
Bug 6495883 - After a disk or node fails, delete operations to the system might fail for up to three minutes.
Workaround - Retry the deletions after three minutes.
Bug 6500528 - If a record is deleted from the 5800 system emulator using DeleteRecord, a WebDAV view might still show a link to the data, although not the data itself.
Workaround - Stop and restart the emulator.
Bug 6501640 - Time stamps on stores, retrieves, and queries in the SDK example programs might seem inconsistent.
Workaround - When planning stores, retrieves, and queries using the SDK Java example programs, be aware of the following:
Bug 6502605 - The system erroneously allows you to change attributes such as queryable for nonextensible namespaces.
Workaround - Do not change attributes for nonextensible namespaces.
Bug 6507353 - C API core dumps if a query result is freed after the session is freed.
Workaround - Do not call hc_session_free() before the resultset is freed with hc_qrs_free().
Bug 6516036 - The first attempt to do a restore operation after the 5800 system is rebooted might fail with a message of Connection Refused.
Workaround - Retry the restore operation; it is expected to work on the second try.
Bug 6518738 - While the system is performing a backup operation, it might generate multiple alert messages about nodes joining and leaving the system.
Workaround - You can safely ignore these messages.
Bug 6520374 - If you stop and then restart the system emulator soon after, it might fail with Java errors.
Workaround - Try restarting the emulator again.
Bug 6522009 - After you delete a file in a WebDAV view, the file might still appear to be present.
Workaround - Wait approximately five minutes and the file should no longer be displayed.
Bug 6531153 - You might not be able to access the CLI on a 5800 system from a system running Linux with a kernel version greater than 2.6.17.
Workaround - Disable window scaling on the Linux system using the following command:
echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
Or, use sysctl to turn off window scaling.
Bug 6533145 - A query to the data on the 5800 system that includes metadata fields that are stored in more than one table might fail.
Workaround - Make sure that fields that are queried together are grouped in the same table.
Bug 6535947 - A query to the data on the 5800 system that includes a high number (more than 40, for example) of large string metadata attributes might cause the query to fail.
Workaround - Limit the number of large string metadata attributes in a query to fewer than 40.
Bug 6538378 - The 5800 system emulator might display a number of WARN!! EOF error messages.
Workaround - These innocuous messages can be safely ignored.
Bug 6539494 - The CLI sensors command and the GUI Environmental Status Panel might erroneously indicate that nodes 1, 3, and 13 are disabled.
Workaround - Check the CLI hwstat command and the GUI Cell Summary Panel to determine if the nodes are actually disabled.
Bug 6539500 - For the first 5 or 10 minutes after a node goes offline, the CLI sensors command and the GUI Environmental Status Panel might erroneously report active voltages, temperatures, and fan speeds for the node.
Workaround - Wait a few minutes and the CLI and GUI should report sensor data as disabled for the offline node.
Bug 6541837 - Rarely, an add or delete metadata operation for an object might fail if the system has not released the lock put on that object during a previous operation.
Workaround - If the symptom persists for more than 30 minutes, reboot the system.
Bug 6542247 - You cannot use an SMTP server that requires authentication to receive system alert emails.
Workaround - Configure the 5800 system with an SMTP server that does not require authentication.
Bug 6554457 - In some cases, switch 1 might fail over to switch 2, but no email alert might be sent to indicate this.
Bug 6557612 - If the network cable connection on a node is experiencing transient failures, the 5800 system might log missing heartbeat messages as well as messages indicating that the switch has failed over.
Workaround - Report the symptom to Sun service, and schedule a replacement of the node or network cable.
Bug 6558322 - If a client stores a large (greater than 1000 MB) object as the 5800 system is nearing capacity, the system might generate a warning message about being unable to store the object. Also, the healing processes on the system might not be able to remove any fragments of the object that were successfully stored.
Workaround - Do not store objects larger than 1000 MB when the system is nearing capacity. A cell has reached capacity when any one of its disks has reached 80% raw utilization. To display the raw utilization of the disks in a cell, issue the CLI command df -p.
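The capacity rule in this workaround can be expressed as a small check. The following is a minimal sketch, assuming you have already collected the raw utilization percentage of each disk in the cell (for example, by reading the output of df -p). The function name and the list-of-percentages input are illustrative assumptions and are not part of the 5800 system CLI or API.

```python
# Illustrative sketch only: the function name and input format are assumptions,
# not part of the 5800 system CLI or API.

CAPACITY_THRESHOLD_PERCENT = 80.0  # from the workaround: 80% raw utilization


def cell_at_capacity(disk_utilization_percent):
    """Return True if any one disk has reached the raw utilization threshold.

    disk_utilization_percent is an iterable of per-disk percentages (0-100)
    that the caller has gathered separately.
    """
    return any(u >= CAPACITY_THRESHOLD_PERCENT for u in disk_utilization_percent)
```

An application might call such a check before attempting to store an object larger than 1000 MB and defer the store if the check returns True.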
Bug 6562925 - The system does not reject the metadata schema file when the metadata name field contains one or more Unicode supplemental characters.
Workaround - Do not use Unicode supplemental characters for the metadata name field.
Bug 6566083 - If the wipe command fails and then you try to issue it again immediately, it might fail again.
Workaround - If the wipe command fails, reboot the system and then try the command again.
Bug 6570304 - Some hardware or software failures can cause a node to reboot repeatedly. Such a situation will be accompanied by email alerts and/or external syslog messages indicating that the node is leaving and joining.
Workaround - Power the node down by holding down the power button on the node that is exhibiting this behavior. Call Sun service to arrange to have the node replaced.
Bug 6570324 - The reboot -all command fails if the system is running on the secondary switch. The command requires both switches to be online.
Bug 6573144 - During the process of expanding a cell from 8 to 16 nodes, store operations to the system may time out.
Workaround - When programming applications, use retry loops within API calls to handle timeouts during cell expansion. One immediate retry should be sufficient in the great majority of cases.
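The retry approach described in this workaround can be sketched as follows. This is a generic illustration, not code from the 5800 system SDK: the operation argument stands in for whatever store call your application makes, and the use of TimeoutError as the timeout signal is an assumption that you would replace with the exception your client library actually raises.

```python
import time


def call_with_retry(operation, retries=1, delay_seconds=0.0,
                    timeout_errors=(TimeoutError,)):
    """Call operation() and retry it when it raises a timeout error.

    operation      -- hypothetical zero-argument callable wrapping a store call
    retries        -- number of additional attempts after the first failure
    delay_seconds  -- pause between attempts (0 means retry immediately)
    timeout_errors -- exception types treated as timeouts (an assumption)
    """
    attempt = 0
    while True:
        try:
            return operation()
        except timeout_errors:
            if attempt >= retries:
                raise  # out of retries: surface the timeout to the caller
            attempt += 1
            if delay_seconds > 0:
                time.sleep(delay_seconds)
```

The default of one immediate retry mirrors the guidance above. Errors other than timeouts are not retried and propagate unchanged.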
Bug 6577783 - The 5800 system does not recognize a lower-case “e” as a symbol for exponent in queries.
Workaround - Use an upper-case “E” to symbolize exponents in queries.
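If an application builds numeric literals for queries programmatically, it can produce the upper-case exponent marker at the point where the literal is formatted. The following sketch shows two ways to do this in Python. The helper names are hypothetical, and the sketch does not verify which other aspects of the number format (such as the sign or the padding of the exponent) the query engine accepts.

```python
# Illustrative sketch only: helper names are hypothetical.


def format_exponent(value):
    """Format a float in scientific notation with an upper-case 'E'."""
    return format(value, "E")


def normalize_exponent(literal):
    """Upper-case the exponent marker in an existing numeric literal."""
    float(literal)  # raises ValueError if the text is not a valid number
    return literal.replace("e", "E")
```

For example, normalize_exponent("1.5e10") returns "1.5E10", and format_exponent(1500.0) returns "1.500000E+03".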
Bug 6580181 - You cannot use a backup made of an 8-node system to restore data to a 16-node system until at least one backup session is created from the 16-node system.
Workaround - After you expand the system from 8 nodes to 16 nodes, perform one backup of any length from the 16-node system. You can then use this and any previous backups of the 8-node system to restore data to the 16-node system.
Bug 6582274 - When multiple system parameters are changed using the cellcfg command, some of the changes might not issue alerts.
Workaround - If you receive an alert indicating that a parameter has been changed using cellcfg, keep in mind that other parameters might also have been changed. Use the cellcfg command to check the current settings of all parameters.
Bug 6582486 - A connection attempt from a client to the 5800 system might fail with a java.net.ConnectionException error.
Workaround - Retry the connection.
Bug 6584310 - If you issue the wipe command and then retry the command without waiting for the wipe to be complete, the system might disable disks.
Workaround - Wait for one wipe action to complete before retrying the command. If the system has already started disabling disks, reboot the system, and then reenable the disabled disks.
Bug 6584329 - A restore operation might not work correctly if the system has not been wiped and then rebooted before the restore begins.
Bug 6585878 - If a disk fails, or if a Sun service technician disables a disk, you might see severe error messages in the external syslog host.
Workaround - You can ignore these messages; they simply reflect the fact that the disk has failed or has been disabled.
Bug 6588218 - Some valid C API queries to the 5800 system might return the error code HCERR_BAD_REQUEST, which seems to indicate that the query is not valid.
Workaround - Use hc_session_get_status() to determine if the error string from the query contains the substring Relalg server involved in current operation failed. If so, retry the query.
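The control flow of this workaround is shown below as a language-neutral sketch written in Python. In a C application, the error string would come from hc_session_get_status(); here, run_query is a hypothetical callable that returns a (results, error_string) pair, and the limit of two attempts is an assumption rather than a documented value.

```python
# Illustrative sketch only: run_query and max_attempts are assumptions.

RETRYABLE_SUBSTRING = "Relalg server involved in current operation failed"


def is_retryable(error_string):
    """Return True if the error string identifies the transient failure."""
    return RETRYABLE_SUBSTRING in (error_string or "")


def query_with_retry(run_query, max_attempts=2):
    """Run a query, retrying only when the error string marks it as transient.

    run_query -- hypothetical callable returning (results, error_string),
                 where error_string is None on success
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        results, error = run_query()
        if error is None:
            return results
        last_attempt = attempt == max_attempts - 1
        if last_attempt or not is_retryable(error):
            raise RuntimeError(error)
```

Errors that do not contain the substring are reported immediately, so a query that is genuinely invalid is not retried.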
Bug 6589653 - If you are running the 5800 system emulator on a system running Red Hat Version 4, you might not be able to shut down the emulator via the browser.
Workaround - Kill the process manually. The simplest way to do this is to start the emulator from a dedicated command prompt without running it in the background, and use Ctrl-C when it is idle to exit the program.
Bug 6595040 - If a restore operation fails, you may have to wait approximately 10 minutes while the system reclaims socket resources.
Bug 6601977 - When a system starts up or shuts down, it might not send a complete set of email and log alerts for all nodes and disks.
Bug 6603323 - Issuing the reboot -all command might cause a switch “split-brain” situation, in which neither switch is fully functioning as the primary switch and both switches are performing some of the primary switch’s duties.
Workaround - Call Sun service to help you troubleshoot and correct the problem.
Bug 6604018 - After issuing the shutdown command from the CLI, you might have to wait up to two hours when you restart the system before all disks come online.
Bug 6609313 - When you store an object with the storeObject API function, the object_ctime reported for the object might not match the object_ctime that is actually stored with the object.
Workaround - To determine the object_ctime that was actually stored with the object, retrieve the metadata for the object after the store operation is completed. The system metadata retrieved will include the object_ctime that was actually stored with the object and inserted in the query engine.
Bug 6612017 - If you issue a query on a metadata field of type binary, entries in the query engine that include the first bits specified in the query are returned as matches, even if the entry includes more bits than were specified in the query. For example, suppose an entry for binary field, bfield, contains the value ABCDEFGHIJ. A query on bfield = “ABCD” will return a match to that entry.
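The matching behavior described in this bug amounts to a prefix match rather than an equality match. The following sketch models that behavior and shows one possible client-side mitigation: after the query returns, the application retrieves the stored value of the binary field for each result and keeps only exact matches. This mitigation is an assumption on the editor's part and is not a documented workaround; the function names are hypothetical.

```python
# Illustrative sketch only: function names are hypothetical, and the
# client-side filter is not a documented workaround.


def engine_matches(stored, queried):
    """Model of the behavior in Bug 6612017: a prefix counts as a match."""
    return stored.startswith(queried)


def exact_matches(candidate_values, queried):
    """Keep only the candidate values that equal the queried value exactly."""
    return [value for value in candidate_values if value == queried]
```

With the values from the example above, engine_matches(b"ABCDEFGHIJ", b"ABCD") is True, while exact_matches([b"ABCDEFGHIJ", b"ABCD"], b"ABCD") keeps only b"ABCD".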
Bug 6612146 - You cannot start a restore on a cell that has been wiped in preparation for the restore.
Workaround - After you wipe the cell, reboot it.
Bug 6612244 - During the process of expanding a cell from 8 to 16 nodes, while the celladm expand command is running, you cannot back up data from the cell.
Bug 6613234 - Because the Files Only at Leaf Level checkbox on the Set Up Virtual File System panel in the GUI does not work correctly, it is not possible to use the GUI to define schemas that include fsViews.
Workaround - Use the CLI mdconfig command to define schemas that include fsViews.
Bug 6613735 - If the length attribute for one or more directory fields is less than the length attribute for a filename field, WebDAV GETs may fail for file names that are longer than directory names.
Workaround - Specify equal lengths for directory and filename fields. For example, instead of specifying:
<namespace name="space1" writable="true" extensible="true">
  <field name="dir1" type="string" length="2" />
  <field name="dir2" type="string" length="2" />
  <field name="fname" type="string" length="128" />
</namespace>
<fsView name="HashDirs" filename="${ofoto.fname}" filesonlyatleaflevel="false">
  <attribute name="space1.dir1" unset="unk" />
  <attribute name="space1.dir2" unset="unk" />
</fsView>
specify:
<namespace name="space1" writable="true" extensible="true">
  <field name="dir1" type="string" length="128" />
  <field name="dir2" type="string" length="128" />
  <field name="fname" type="string" length="128" />
</namespace>
<fsView name="HashDirs" filename="${ofoto.fname}" filesonlyatleaflevel="false">
  <attribute name="space1.dir1" unset="unk" />
  <attribute name="space1.dir2" unset="unk" />
</fsView>
Bug 6615347 - Storing a very large file (greater than 400 GB) might fail with an error similar to the following:
com.sun.honeycomb.common.ArchiveException: Failed to get system metadata from the fragments
Workaround - Break the file into smaller pieces and then retry the store operation.
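One way to plan the split is to compute the byte ranges of the pieces first and then read and store each range in turn, so that the whole file is never held in memory. The following is a minimal sketch of that planning step. The function name is hypothetical, the piece size is a choice left to the application, and how the pieces are stored and later reassembled (for example, by recording the piece order in metadata) is outside the scope of the sketch.

```python
# Illustrative sketch only: the function name is hypothetical and the piece
# size is an application choice, not a documented 5800 system limit.


def piece_ranges(total_size, piece_size):
    """Yield (offset, length) pairs that cover total_size bytes in order."""
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    offset = 0
    while offset < total_size:
        length = min(piece_size, total_size - offset)
        yield offset, length
        offset += length
```

For example, list(piece_ranges(10, 4)) produces [(0, 4), (4, 4), (8, 2)]: two full pieces followed by a shorter final piece.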
Bug 6616306 - If a restore operation fails and you initiate a new restore operation, the new restore operation might fail if it is started too soon.
Workaround - Wait at least 20 minutes before initiating a new restore operation.
Bug 6619221 - In the C API, time stamps supplied to hc_nvr_add_timestamp and dates supplied to hc_nvr_add_date should be in the range 00:00:00 January 1, 1970 UTC to 00:00:00 January 1, 2038 UTC. This is due to limitations in converting “seconds since the epoch” to the human-readable dates used for metadata storage. The Java interface is not as restrictive; however, dates outside of these limits stored in Java might not be retrieved properly by the C API.
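An application that stores dates through the Java interface but might later read them through the C API can check values against this range before storing them. The following sketch shows such a check. The function and constant names are hypothetical, and treating both end points as inclusive is an assumption, because the text above does not say whether the limits themselves are accepted.

```python
from datetime import datetime, timezone

# Illustrative sketch only: names are hypothetical, and the inclusive
# treatment of both end points is an assumption.
C_API_MIN = datetime(1970, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
C_API_MAX = datetime(2038, 1, 1, 0, 0, 0, tzinfo=timezone.utc)


def in_c_api_range(moment):
    """Return True if a timezone-aware datetime falls within the stated range."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return C_API_MIN <= moment <= C_API_MAX
```

The check requires timezone-aware values so that the comparison against the UTC limits is unambiguous.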
Bug 6621320 - In the Java API and the SDK RetrieveMetadata program, the SystemRecord.isIndexed() method always returns false.
Workaround - Ignore the SystemRecord.isIndexed() value.
Bug 6624848 - After you have restored data to a system following a disaster, you might not be able to resume backup of the system.
Workaround - Contact Sun service for assistance.
Bug 6625515 - The usage message displayed by the system for the SDK Java example application CheckIndexed is actually the usage message for the RetrieveMetadata example application.
Workaround - Refer to the Sun StorageTek 5800 System SDK Developer’s Guide for the correct usage for CheckIndexed.
Bug 6627590 - The signature for hc_query_ez() in the C API hcclient.h has an int variable named max_records. This int variable should be renamed results_per_fetch to make its function more apparent. The Sun StorageTek 5800 System Client API Reference Guide refers to the variable as results_per_fetch, but in the code it is named max_records. Functionality is not affected.
Bug 6628840 - Occasionally, when you attempt to reboot the 5800 system using the CLI command reboot, the system returns the error message Connection Refused.
Workaround - The error message indicates that one or more storage nodes did not reboot. Wait at least 10 minutes and then issue the CLI command reboot --all.
The following table lists the documents for this product. The online documentation is available at:
http://docs.sun.com/app/docs/prod/stortek.5800#hic
Sun StorageTek 5800 System Regulatory and Safety Compliance Manual
If you need help installing or using this product, go to: