This appendix describes some ways to set up your configuration, and the trade-offs involved in choosing among them.
When planning a configuration, the main point to keep in mind is that for any given application there are trade-offs in performance, availability, and hardware costs. Experimenting with the different variables is necessary to figure out what works best for your configuration.
What are the performance trade-offs?
Striping generally has the best performance, but it offers no data protection. For write-intensive applications, mirroring generally has better performance than RAID5.
What are the availability trade-offs?
Mirroring and RAID5 metadevices both increase data availability, but they both generally have lower performance, especially for write operations. Mirroring does improve random read performance.
What are the hardware cost trade-offs?
RAID5 metadevices have a lower hardware cost than mirroring. Both striped metadevices and concatenated metadevices have no additional hardware cost.
This section provides a list of guidelines for working with concatenations, stripes, mirrors, RAID5 metadevices, state database replicas, and file systems constructed on metadevices.
A concatenated metadevice uses less CPU time than a striped metadevice.
Concatenation works well for small random I/O.
Avoid using physical disks with different disk geometries.
Disk geometry refers to how sectors and tracks are organized for each cylinder in a disk drive. The UFS organizes itself to use disk geometry efficiently. If slices in a concatenated metadevice have different disk geometries, DiskSuite uses the geometry of the first slice, which can reduce UFS file system efficiency.
Disk geometry differences do not matter with disks that use Zone Bit Recording (ZBR), because the amount of data on any given cylinder varies with the distance from the spindle. Most disks now use ZBR.
When constructing a concatenation, distribute slices across different controllers and busses. Cross-controller and cross-bus slice distribution can help balance the overall I/O load.
Set the stripe's interlace value correctly.
The more physical disks in a striped metadevice, the greater the I/O performance. (The MTBF, however, will be reduced, so consider mirroring striped metadevices.)
Don't mix differently sized slices in a striped metadevice. A striped metadevice's capacity is limited by its smallest slice; each slice can contribute no more space than the smallest one.
Avoid using physical disks with different disk geometries.
Distribute the striped metadevice across different controllers and busses.
Striping cannot be used to encapsulate existing file systems.
Striping performs well for large sequential I/O and for random I/O distributions.
Striping uses more CPU cycles than concatenation. However, it is usually worth it.
Striping does not provide any redundancy of data.
Mirroring may improve read performance; write performance is always degraded.
Mirroring improves read performance only in threaded or asynchronous I/O situations; if there is just a single thread reading from the metadevice, performance will not improve.
Mirroring degrades write performance by about 15-50 percent, because two copies of the data must be written to disk to complete a single logical write. If an application is write intensive, mirroring will degrade overall performance. However, the write degradation with mirroring is substantially less than the typical RAID5 write penalty (which can be as much as 70 percent). Refer to Figure 7-1.
Note that the UNIX operating system implements a file system cache. Since read requests frequently can be satisfied from this cache, the read/write ratio for physical I/O through the file system can be significantly biased toward writing.
For example, an application I/O mix might be 80 percent reads and 20 percent writes. But, if many read requests can be satisfied from the file system cache, the physical I/O mix might be quite different, perhaps only 60 percent reads and 40 percent writes. In fact, if there is a large amount of memory to be used as a buffer cache, the physical I/O mix can even go the other direction: 80 percent reads and 20 percent writes might turn out to be 40 percent reads and 60 percent writes.
RAID5 can withstand only a single device failure.
A mirrored metadevice can withstand multiple device failures in some cases (for example, if the multiple failed devices are all on the same submirror). A RAID5 metadevice can only withstand a single device failure. Striped and concatenated metadevices cannot withstand any device failures.
RAID5 provides good read performance when there are no error conditions, and poor read performance under error conditions.
When a device fails in a RAID5 metadevice, read performance suffers because multiple I/O operations are required to regenerate the data from the data and parity on the existing drives. Mirrored metadevices do not suffer the same degradation in performance when a device fails.
RAID5 can cause poor write performance.
In a RAID5 metadevice, parity must be calculated and both data and parity must be stored for each write operation. Because of the multiple I/O operations required to do this, RAID5 write performance is generally reduced. In mirrored metadevices, the data must be written to multiple mirrors, but mirrored performance in write-intensive applications is still much better than in RAID5 metadevices.
RAID5 involves a lower hardware cost than mirroring.
RAID5 metadevices have a lower hardware cost than mirroring. Mirroring requires twice the disk storage (for a two-way mirror). In a RAID5 metadevice, the fraction of total capacity used to store parity is 1/#-disks; for example, one-fourth of the capacity in a four-disk RAID5 metadevice.
RAID5 can't be used for existing file systems.
You can't encapsulate an existing file system in a RAID5 metadevice (you must back up and restore).
All replicas are written when the configuration changes.
Only two replicas (per mirror) are updated for mirror dirty region bitmaps.
A good average is two replicas per three mirrors.
Use two replicas per mirror for write-intensive applications.
Use two replicas per 10 mirrors for read-intensive applications.
The default inode density value (-i option) for the newfs(1M) command is not optimal for large file systems. When creating a new file system with the newfs command, you should set the inode density to 1 inode per 8 Kbyte of file space (-i 8192), rather than the default 1 inode per 2 Kbyte. Typical files today are approaching 64 Kbyte or larger in size, rather than the 1 Kbyte which typified files in 1980.
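For example, a command along the following lines creates a file system with 1 inode per 8 Kbyte of file space (the metadevice name d10 is just a placeholder):
# newfs -i 8192 /dev/md/rdsk/d10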
For large metadevices (greater than 8 Gbyte), it may be necessary to increase the size of a cylinder group to as many as 256 cylinders, as in:
# newfs -c 256 /dev/md/rdsk/d114
(The newfs(1M) man page in Solaris 2.3 and 2.4 incorrectly states that the maximum size is 32 cylinders.)
If possible, set your file system cluster size equal to an integer multiple of the stripe width.
For example, try the following parameters for sequential I/O:
maxcontig = 16 (16 * 8 Kbyte blocks = 128 Kbyte clusters)
Using a four-way stripe with a 32 Kbyte interlace value results in a 128 Kbyte stripe width, which is a good performance match:
interlace size = 32 Kbyte (32 Kbyte stripe unit size * 4 disks = 128 Kbyte stripe width)
You can set the maxcontig parameter for a file system to control the file system I/O cluster size. This parameter specifies the maximum number of blocks, belonging to one file, that will be allocated contiguously before inserting a rotational delay.
Performance may be improved if the file system I/O cluster size is an integer multiple of the stripe width. For example, setting the maxcontig parameter to 16 results in 128 Kbyte clusters (16 blocks * 8 Kbyte file system block size).
The options to the mkfs(1M) command can be used to modify the default minfree, inode density, cylinders/cylinder group, and maxcontig settings. You can also use the tunefs(1M) command to modify the maxcontig and minfree settings.
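For example, assuming an existing file system on a hypothetical metadevice d10, the maxcontig setting could be changed to 16 blocks with tunefs(1M):
# tunefs -a 16 /dev/md/rdsk/d10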
See the man pages for mkfs(1M), tunefs(1M), and newfs(1M) for more information.
Assign data to physical drives to evenly balance the I/O load among the available disk drives.
Identify the most frequently accessed data, and increase access bandwidth to that data with mirroring or striping.
Both striped metadevices and RAID5 metadevices distribute data across multiple disk drives and help balance the I/O load. In addition, mirroring can also be used to help balance the I/O load.
Use DiskSuite Tool's performance monitoring capabilities, and generic OS tools such as iostat(1M), to identify the most frequently accessed data. Once identified, the "access bandwidth" to this data can be increased using mirroring, striping, or RAID5.
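For example, the following iostat(1M) command reports extended disk statistics every 30 seconds, which can be used to spot the busiest drives:
# iostat -x 30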
This section compares performance issues for RAID5 metadevices and striped metadevices.
How does I/O for a RAID5 metadevice and a striped metadevice compare?
Striped metadevice performance is better than RAID5 metadevice performance, but striping doesn't provide data protection (redundancy).
RAID5 metadevice performance is lower than striped metadevice performance for write operations, because the RAID5 metadevice requires multiple I/O operations to calculate and store the parity.
For raw random I/O reads, the striped metadevice and the RAID5 metadevice are comparable. Both the striped metadevice and RAID5 metadevice split the data across multiple disks, and the RAID5 metadevice parity calculations aren't a factor in reads except after a slice failure.
For raw random I/O writes, the striped metadevice performs better, since the RAID5 metadevice requires multiple I/O operations to calculate and store the parity.
For raw sequential I/O operations, the striped metadevice performs best. The RAID5 metadevice performs lower than the striped metadevice for raw sequential writes, because of the multiple I/O operations required to calculate and store the parity for the RAID5 metadevice.
This section explains the differences between random I/O and sequential I/O, and DiskSuite strategies for optimizing your particular configuration.
What is random I/O?
Databases and general-purpose file servers are examples of random I/O environments. In random I/O, the time spent waiting for disk seeks and rotational latency dominates I/O service time.
Why do I need to know about random I/O?
You can optimize the performance of your configuration to take advantage of a random I/O environment.
What is the general strategy for configuring for a random I/O environment?
You want all disk spindles to be busy most of the time servicing I/O requests. Random I/O requests are small (typically 2-8 Kbytes), so it's not efficient to split an individual request of this kind onto multiple disk drives.
The interlace size doesn't matter much, because the goal is simply to spread the data across all the disks; any interlace value greater than the typical I/O request will do.
For example, assume you have 4.2 Gbytes of DBMS table space. If you stripe across four 1.05-Gbyte disk spindles, and if the I/O load is truly random and evenly dispersed across the entire range of the table space, then each of the four spindles will tend to be equally busy.
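As a sketch of such a layout (the metadevice and slice names are hypothetical), the following metainit(1M) command creates a four-way stripe with a 32 Kbyte interlace, comfortably larger than a typical 2-8 Kbyte random I/O request:
# metainit d30 1 4 c1t1d0s2 c2t1d0s2 c3t1d0s2 c4t1d0s2 -i 32k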
The target for maximum random I/O performance is 35 percent or lower disk utilization, as reported by DiskSuite Tool's performance monitor or by iostat(1M). Disk use in excess of 65 percent on a typical basis is a problem. Disk use in excess of 90 percent is a major problem.
If you have a disk running at 100 percent and you stripe the data across four disks, you might expect the result to be four disks each running at 25 percent (100/4 = 25 percent). However, you will probably get all four disks running at greater than 35 percent since there won't be an artificial limitation to the throughput (of 100 percent of one disk).
What is sequential I/O?
While most people think of disk I/O in terms of sequential performance figures, only a few servers (DBMS servers dominated by full table scans and NFS servers in very data-intensive environments) will normally experience sequential I/O.
Why do I need to know about sequential I/O?
You can optimize the performance of your configuration to take advantage of a sequential I/O environment.
What is the general strategy for configuring for a sequential I/O environment?
You want to get greater sequential performance from an array than you can get from a single disk. To achieve this, set the interlace value small relative to the size of the typical I/O request, so that each request is spread across multiple disk spindles, increasing the sequential bandwidth. A good starting point for the interlace value is:
max-io-size / #-disks-in-stripe
Example:
Assume a typical I/O request size of 256 Kbyte and striping across 4 spindles. A good choice for stripe unit size in this example would be:
256 Kbyte / 4 = 64 Kbyte, or smaller
Seek time and rotational latency are practically negligible in the sequential case. When optimizing sequential I/O, the internal transfer rate of a disk is most important.
The most useful recommendation is: max-io-size / #-disks. Note that for UFS file systems, the maxcontig parameter controls the file system cluster size, which defaults to 56 Kbyte. It may be useful to configure this to larger sizes for some sequential applications. For example, using a maxcontig value of 12 results in 96 Kbyte file system clusters (12 * 8 Kbyte blocks = 96 Kbyte clusters). Using a 4-wide stripe with a 24 Kbyte interlace size results in a 96 Kbyte stripe width (4 * 24 Kbyte = 96 Kbyte) which is a good performance match.
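As a sketch of that match (the metadevice and slice names are hypothetical), the stripe and the file system cluster size described above could be set up as follows:
# metainit d20 1 4 c1t2d0s2 c2t2d0s2 c3t2d0s2 c4t2d0s2 -i 24k
# newfs /dev/md/rdsk/d20
# tunefs -a 12 /dev/md/rdsk/d20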
Example: In sequential applications, typical I/O size is usually large (greater than 128 Kbyte, often greater than 1 Mbyte). Assume an application with a typical I/O request size of 256 Kbyte and assume striping across 4 disk spindles. Do the arithmetic: 256 Kbyte / 4 = 64 Kbyte. So, a good choice for the interlace size would be 32 to 64 Kbyte.
Number of stripes: Another way of looking at striping is to first determine the performance requirements. For example, you may need 10.4 Mbyte/sec performance for a selected application, and each disk may deliver approximately 4 Mbyte/sec. Based on this, then determine how many disk spindles you need to stripe across:
10.4 Mbyte/sec / 4 Mbyte/sec = 2.6
Therefore, 3 disks would be needed.
To summarize the trade-offs: striping delivers good performance, particularly for large sequential I/O and for uneven I/O distributions, but it does not provide any redundancy of data, it uses more CPU cycles than concatenation (a trade-off that is usually worth it), and it cannot be used to encapsulate existing file systems.
Write-intensive applications: Because of the read-modify-write nature of RAID5, metadevices with greater than about 20 percent writes should probably not be RAID5 metadevices. If data protection is required, consider mirroring.
RAID5 writes will never be as fast as mirrored writes, which in turn will never be as fast as unprotected writes. The NVRAM cache on the SPARCstorage Array closes the gap between RAID5 and mirrored configurations.
Full Stripe Writes: RAID5 read performance is always good (unless the metadevice has suffered a disk failure and is operating in degraded mode), but write performance suffers because of the read-modify-write nature of RAID5.
In particular, when writes are smaller than a full stripe width or do not align with a stripe, multiple I/O operations (a read-modify-write sequence) are required. First, the old data and parity are read into buffers. Next, the new parity is calculated with XOR operations: the old data is logically subtracted from the parity and the new data is logically added to it. The new parity and data are then stored to a log. Finally, the new parity and new data are written to the data stripe units.
Full stripe width writes have the advantage of not requiring the read-modify-write sequence, and thus performance is not degraded as much. With full stripe writes, all new data stripes are XORed together to generate parity, and the new data and parity are stored to a log. Then, the new parity and new data are written to the data stripe units in a single write.
Full stripe writes are used when the I/O request aligns with the stripe and the I/O size exactly matches:
interlace_size * (num_of_columns - 1)
For example, if a RAID5 configuration is striped over 4 columns, in any one stripe, 3 chunks are used to store data, and 1 chunk is used to store the corresponding parity. In this example, full stripe writes are used when the I/O request starts at the beginning of the stripe and the I/O size is equal to interlace_size * 3. For example, if the interlace size is 16 Kbyte, full stripe writes would be used for aligned I/O requests of size 48 Kbyte.
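As a sketch of such a configuration (the metadevice and slice names are hypothetical), a four-column RAID5 metadevice with a 16 Kbyte interlace could be created with metainit(1M):
# metainit d45 -r c1t3d0s2 c2t3d0s2 c3t3d0s2 c4t3d0s2 -i 16k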
Performance in degraded mode: When a slice of a RAID5 metadevice fails, the parity is used to reconstruct the data; this requires reading from every column of the RAID5 metadevice. The more slices assigned to the RAID5 metadevice, the longer read and write operations (including resyncing the RAID5 metadevice) will take when I/O maps to the failed device.
Logs (logging devices) are typically accessed frequently. For best performance, avoid placing them on heavily-used disks. You may also want to place logs in the middle of a disk, to minimize the average seek times when accessing the log.
The log device and the master device of the same trans metadevice should be located on separate drives and possibly separate controllers to help balance the I/O load.
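For example (the metadevice names are hypothetical), the following metainit(1M) command sets up a trans metadevice d6 whose master device d60 and log device d61 would be built on separate drives and, ideally, separate controllers:
# metainit d6 -t d60 d61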
Sharing logs: trans metadevices can share log devices. However, if a file system is heavily used, it should have a separate log. The disadvantage to sharing a logging device is that certain errors require that all file systems sharing the logging device must be checked with the fsck(1M) command.
The larger the log size, the better the performance. Larger logs allow for greater concurrency (more simultaneous file system operations per second).
The absolute minimum size for a logging device is 1 Mbyte. A good average for performance is 1 Mbyte of log space for every 100 Mbyte of file system space. A recommended minimum is 1 Mbyte of log for every 1 Gbyte of file system space.
Assume you have a 4 Gbyte file system. What are the recommended log sizes?
For good performance, a size of 40 Mbyte is recommended (1 Mbyte log / 100 Mbyte file system).
A recommended minimum is 4 Mbyte (1 Mbyte log/1 Gbyte file system).
The absolute minimum is 1 Mbyte.
It is strongly recommended that you mirror all logs. It is possible to lose the data in a log because of device errors. If the data in a log is lost, it can leave a file system in an inconsistent state which fsck may not be able to repair without user intervention.
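As a sketch of one way to mirror a log (all metadevice and slice names are hypothetical), the log could be built as a mirror of two small slices on different drives:
# metainit d64 1 1 c2t0d0s1
# metainit d65 1 1 c3t0d0s1
# metainit d63 -m d64
# metattach d63 d65
The resulting mirror (d63) could then be given as the log device when the trans metadevice is created.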
State database replicas contain configuration and status information for all metadevices and hot spares. Multiple copies (replicas) are maintained to provide redundancy. Multiple copies also prevent the database from being corrupted during a system crash (at most, only one copy of the database will be corrupted).
State database replicas are also used for mirror resync regions. Too few state database replicas relative to the number of mirrors may cause replica I/O to impact mirror performance.
At least three replicas are recommended. DiskSuite allows a maximum of 50 replicas. The following guidelines are recommended:
For a system with only a single drive: put all three replicas on one slice.
For a system with two to four drives: put two replicas on each drive.
For a system with five or more drives: put one replica on each drive.
In general, it is best to distribute state database replicas across slices, drives, and controllers, to avoid single points-of-failure.
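For example, on a system with two drives (the slice names are hypothetical), the initial state database replicas could be created with two replicas on each drive using metadb(1M):
# metadb -a -f -c 2 c0t0d0s3 c1t0d0s3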
Each state database replica occupies 517 Kbyte (1034 disk sectors) of disk storage by default. Replicas can be stored on: a dedicated disk partition, a partition which will be part of a metadevice, or a partition which will be part of a logging device.
Replicas cannot be stored on the root (/), swap, or /usr slices, or on slices containing existing file systems or data.
Why do I need at least three state database replicas?
Three or more replicas are required. You want a majority of replicas to survive a single component failure. If you lose a replica (for example, due to a device failure), it may cause problems running DiskSuite or when rebooting the system.
How does DiskSuite handle failed replicas?
The system will stay running with exactly half or more of the replicas available. To prevent data corruption, the system will panic when fewer than half the replicas are available.
The system will not reboot without one more than half the total replicas. In this case, you must reboot single-user and delete the bad replicas (using the metadb command).
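For example (the slice name is hypothetical), the replicas on a failed drive could be deleted with:
# metadb -d c0t1d0s3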
As an example, assume you have four replicas. The system will stay running as long as two replicas (half the total number) are available. However, in order for the system to reboot, three replicas (half the total plus 1) must be available.
In a two-disk configuration, you should always create two replicas on each disk. For example, assume you have a configuration with two disks and you only created three replicas (two on the first disk and one on the second disk). If the disk with two replicas fails, DiskSuite will stop functioning because the remaining disk only has one replica and this is less than half the total number of replicas.
If you created two replicas on each disk in a two-disk configuration, DiskSuite will still function if one disk fails. But because you must have one more than half of the total replicas available in order for the system to reboot, you will be unable to reboot in this state.
Where should I place replicas?
If multiple controllers exist, replicas should be distributed as evenly as possible across all controllers. This provides redundancy in case a controller fails and also helps balance the load. If multiple disks exist on a controller, at least two of the disks on each controller should store a replica.
Replicated databases have an inherent problem in determining which database has valid and correct data. To solve this problem, DiskSuite uses a majority consensus algorithm. This algorithm requires that a majority of the database replicas agree with each other before any of them are declared valid. This algorithm requires the presence of at least three initial replicas which you create. A consensus can then be reached as long as at least two of the three replicas are available. If there is only one replica and the system crashes, it is possible that all metadevice configuration data may be lost.
The majority consensus algorithm is conservative in the sense that it will fail if a majority consensus cannot be reached, even if one replica actually does contain the most up-to-date data. This approach guarantees that stale data will not be accidentally used, regardless of the failure scenario. The majority consensus algorithm accounts for the following: the system will stay running with exactly half or more replicas; the system will panic when fewer than half the replicas are available; the system will not reboot without one more than half the total replicas.