The I/O subsystem is a vital component of an Oracle database. This chapter introduces fundamental I/O concepts, discusses the I/O requirements of different parts of the database, and provides sample configurations for I/O subsystem design.
Whenever Oracle Database reads or writes data on disk, it generates disk I/O. The performance of many software applications is inherently limited by disk I/O. Applications that spend the majority of CPU time waiting for I/O activity to complete are said to be I/O-bound.
Oracle Database is designed so that if an application is well written, its performance should not be limited by I/O. Tuning I/O can enhance the performance of the application if the I/O system is operating at or near capacity and is not able to service the I/O requests within an acceptable time. However, tuning I/O cannot help performance if the application is not I/O-bound (for example, when CPU is the limiting factor).
Consider the following database requirements when designing an I/O system:
Storage, such as minimum disk capacity
Availability, such as continuous (24 x 7) or business hours only
Performance, such as I/O throughput and application response times
Many I/O designs plan for storage and availability requirements with the assumption that performance will not be an issue. This is not always the case. Optimally, the number of disks and controllers to be configured should be determined by I/O throughput and redundancy requirements. The size of disks can then be determined by the storage requirements.
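As a rough illustration of sizing by throughput first and capacity second, the following sketch works through the arithmetic. The function name, the per-disk IOPS figure, and the workload numbers are hypothetical and are not taken from this guide. Redundancy is modeled as simple mirroring, which is only one possible choice.

```python
import math

def size_io_system(required_iops, iops_per_disk, required_gb, mirrored=True):
    """Return (total_disks, min_disk_size_gb) for a hypothetical workload."""
    # Throughput decides how many disks are needed to carry the load.
    data_disks = math.ceil(required_iops / iops_per_disk)
    # Redundancy (simple mirroring assumed here) doubles the disk count.
    total_disks = data_disks * 2 if mirrored else data_disks
    # Storage requirements then decide how large each data disk must be.
    min_disk_size_gb = math.ceil(required_gb / data_disks)
    return total_disks, min_disk_size_gb

# Hypothetical workload: 5,000 IOPS, 150 IOPS per disk, 1 TB of data.
print(size_io_system(required_iops=5000, iops_per_disk=150, required_gb=1024))
# prints (68, 31)
```

In this made-up case the disk count comes from the I/O rate, not from capacity: 34 data disks of 31 GB each would hold the data, while far fewer large disks could not sustain the load.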
When developing an I/O design plan, consider using Automatic Storage Management (ASM). ASM is an integrated, high-performance database file system and disk manager that is based on the principle that the database should manage storage instead of requiring an administrator to do it.
Oracle recommends that you use ASM for your database file storage, instead of raw devices or the operating system file system. ASM provides the following key benefits:
Online storage reconfiguration and dynamic rebalancing
Managed file creation and deletion
See Also: Oracle Database Storage Administrator's Guide for additional information about ASM
The I/O calibration feature of Oracle Database enables you to assess the performance of the storage subsystem, and determine whether I/O performance problems are caused by the database or the storage subsystem. Unlike other external I/O calibration tools that issue I/Os sequentially, the I/O calibration feature of Oracle Database issues I/Os randomly using Oracle datafiles to access the storage media, producing results that more closely match the actual performance of the database.
This section describes how to use the I/O calibration feature of Oracle Database.
Before running I/O calibration, ensure that the following requirements are met:
The user must be granted the SYSDBA privilege
TIMED_STATISTICS must be set to TRUE
Asynchronous I/O must be enabled
When using file systems, asynchronous I/O can be enabled by setting the FILESYSTEMIO_OPTIONS initialization parameter to
Ensure that asynchronous I/O is enabled for datafiles by running the following query:
COL NAME FORMAT A50
SELECT NAME, ASYNCH_IO
FROM V$DATAFILE F, V$IOSTAT_FILE I
WHERE F.FILE# = I.FILE_NO
AND UPPER(FILETYPE_NAME) = 'DATA FILE';
Additionally, only one calibration can be performed on a database instance at a time.
The I/O calibration feature of Oracle Database is accessed using the DBMS_RESOURCE_MANAGER.CALIBRATE_IO procedure. This procedure issues an I/O-intensive, read-only workload (made up of random one-megabyte I/Os) to the database files to determine the maximum IOPS (I/O requests per second) and MBPS (megabytes of I/O per second) that the storage subsystem can sustain. Because of the overhead of running the I/O workload, perform I/O calibration only when the database is idle or during off-peak hours, to minimize the impact on the normal database workload.
To run I/O calibration and assess the I/O capability of the storage subsystem used by Oracle Database, use the DBMS_RESOURCE_MANAGER.CALIBRATE_IO procedure:
SET SERVEROUTPUT ON
DECLARE
  lat  INTEGER;
  iops INTEGER;
  mbps INTEGER;
BEGIN
  -- DBMS_RESOURCE_MANAGER.CALIBRATE_IO (<DISKS>, <MAX_LATENCY>, iops, mbps, lat);
  -- Here: 2 physical disks and a maximum tolerated latency of 10 milliseconds.
  DBMS_RESOURCE_MANAGER.CALIBRATE_IO (2, 10, iops, mbps, lat);
  DBMS_OUTPUT.PUT_LINE ('max_iops = ' || iops);
  DBMS_OUTPUT.PUT_LINE ('latency = ' || lat);
  DBMS_OUTPUT.PUT_LINE ('max_mbps = ' || mbps);
END;
/
When running the DBMS_RESOURCE_MANAGER.CALIBRATE_IO procedure, consider the following:
Do not run the procedure multiple times across separate databases that use the same storage subsystem
Quiesce the database to minimize I/O on the instance
For Oracle Real Application Clusters (RAC) configurations, ensure that all instances are opened to calibrate the storage subsystem across nodes
The execution time of the procedure is dependent on the number of disks in the storage subsystem and increases with the number of nodes in the database
In some cases, asynchronous I/O is permitted for datafiles, but the I/O subsystem for submitting asynchronous I/O may already be at its maximum limit, so I/O calibration cannot continue. In such cases, refer to the port-specific documentation for information about checking the maximum limit for asynchronous I/O on the system.
At any time during the I/O calibration process, you can query the calibration status in the V$IO_CALIBRATION_STATUS view. After I/O calibration is successfully completed, you can view the results in the
See Also: Oracle Database PL/SQL Packages and Types Reference for more information about running the DBMS_RESOURCE_MANAGER.CALIBRATE_IO procedure
Oracle Database Reference for information about the V$IO_CALIBRATION_STATUS view and
This section describes the basic information to be gathered and decisions to be made when defining a system's I/O configuration. You want to keep the configuration as simple as possible, while maintaining the required availability, recoverability, and performance. The more complex a configuration becomes, the more difficult it is to administer, maintain, and tune.
If your operating system has logical volume manager (LVM) software or hardware-based striping, then it is possible to distribute I/O using these tools. Decisions to be made when using an LVM or hardware striping include stripe depth and stripe width.
Stripe depth is the size of the stripe, sometimes called stripe unit.
Stripe width is the product of the stripe depth and the number of drives in the striped set.
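Choose these values wisely so that the system is capable of sustaining the required throughput. For an Oracle database, reasonable stripe depths range from 256 KB to 1 MB. Different types of applications benefit from different stripe depths. The optimal stripe depth and stripe width depend on the requested I/O size, the concurrency of I/O requests, the alignment of physical stripe boundaries with block size boundaries, and the manageability of the proposed system.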
Table 8-1 lists the Oracle and operating system parameters that you can use to set I/O size:
The size of single-block I/O requests. This parameter is also used in combination with multiblock parameters to determine multiblock I/O request size.
OS block size
Determines I/O size for redo log and archive log operations.
Maximum OS I/O size
Places an upper bound on the size of a single I/O request.
The maximum I/O size for full table scans is computed by multiplying this parameter by the database block size (DB_BLOCK_SIZE).
Determines I/O sizes and concurrency for sort operations.
Determines the I/O size for hash operations.
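To make the relationship between these settings concrete, the following sketch computes the largest read that a full table scan could issue. It assumes that the multiblock parameter is DB_FILE_MULTIBLOCK_READ_COUNT, a name that does not appear in the table text above. The function name and the sample values are hypothetical.

```python
def max_scan_io_bytes(db_block_size, multiblock_read_count, max_os_io_bytes):
    """Largest single read a full table scan can issue, in bytes."""
    # Block size multiplied by the multiblock read count gives the requested size.
    requested = db_block_size * multiblock_read_count
    # The maximum OS I/O size places an upper bound on any single request.
    return min(requested, max_os_io_bytes)

ONE_MB = 1024 * 1024
print(max_scan_io_bytes(8192, 16, ONE_MB))    # 131072 (128 KB)
print(max_scan_io_bytes(8192, 256, ONE_MB))   # 1048576 (capped at 1 MB)
```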
In addition to I/O size, the degree of concurrency also helps in determining the ideal stripe depth. Consider the following when choosing stripe width and stripe depth:
On low-concurrency (sequential) systems, ensure that no single I/O visits the same disk twice. For example, assume that the stripe width is four disks, and the stripe depth is 32 KB. If a single 1 MB I/O request (for example, for a full table scan) is issued by an Oracle server process, then each disk in the stripe must perform eight I/Os to return the requested data. To avoid this situation, the size of the average I/O should be smaller than the stripe width multiplied by the stripe depth. If this is not the case, then a single I/O request made by Oracle Database to the operating system results in multiple physical I/O requests to the same disk.
On high-concurrency (random) systems, ensure that no single I/O request is broken up into more than one physical I/O call. Failing to do this multiplies the number of physical I/O requests performed in your system, which in turn can severely degrade the I/O response times.
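The arithmetic behind the low-concurrency example can be checked with a short sketch. The function names are hypothetical, and the calculation assumes that the request starts on a stripe-unit boundary.

```python
import math

KB = 1024
MB = 1024 * KB

def physical_ios_per_disk(io_size, stripe_depth, disks):
    """Worst-case physical I/Os one disk performs for a single request."""
    # Number of stripe units the request is broken into.
    stripe_units = math.ceil(io_size / stripe_depth)
    # The units are spread round-robin over the disks in the stripe.
    return math.ceil(stripe_units / disks)

def visits_a_disk_twice(io_size, stripe_depth, disks):
    """True when the request is larger than one full stripe across the set."""
    return io_size > stripe_depth * disks

print(physical_ios_per_disk(1 * MB, 32 * KB, 4))    # 8
print(visits_a_disk_twice(1 * MB, 32 * KB, 4))      # True
print(physical_ios_per_disk(1 * MB, 256 * KB, 4))   # 1
```

With a 256 KB stripe depth, the same 1 MB request is served by one physical I/O on each of the four disks.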
In a system with a high degree of concurrent small I/O requests, such as in a traditional OLTP environment, it is beneficial to keep the stripe depth large. Using stripe depths larger than the I/O size is called coarse grain striping. In high-concurrency systems, the stripe depth can be n * DB_BLOCK_SIZE, where n is greater than 1.
Coarse grain striping allows a disk in the array to service several I/O requests. In this way, a large number of concurrent I/O requests can be serviced by a set of striped disks with minimal I/O setup costs. Coarse grain striping strives to maximize overall I/O throughput. Multiblock reads, as in full table scans, benefit when stripe depths are large and the reads can be serviced from one drive. Parallel query in a DSS environment is also a candidate for coarse grain striping, because there are many individual processes, each issuing separate I/Os. If coarse grain striping is used in systems that do not have high concurrent requests, then hot spots could result.
In a system with a few large I/O requests, such as in a traditional DSS environment or a low-concurrency OLTP system, it is beneficial to keep the stripe depth small. This is called fine grain striping. In such systems, the stripe depth is
Fine grain striping allows a single I/O request to be serviced by multiple disks. Fine grain striping strives to maximize performance for individual I/O requests or response time.
On some Oracle ports, an Oracle block boundary may not align with the stripe. If your stripe depth is the same size as the Oracle block, then a single I/O issued by Oracle might result in two physical I/O operations.
This is not optimal in an OLTP environment. To ensure a higher probability of one logical I/O resulting in no more than one physical I/O, the minimum stripe depth should be at least twice the Oracle block size. Table 8-2 shows recommended minimum stripe depth for random access and for sequential reads.
|Disk Access|Minimum Stripe Depth|
|Random reads and writes|The minimum stripe depth is twice the Oracle block size.|
See Also: The specific documentation for your platform
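The effect of misaligned block and stripe boundaries can be illustrated with a small sketch. The block size, the 512-byte misalignment, and the function names are hypothetical values chosen for illustration only.

```python
def stripe_units_touched(offset, io_size, stripe_depth):
    """Count the stripe units, and so the physical I/Os, that one request covers."""
    first = offset // stripe_depth
    last = (offset + io_size - 1) // stripe_depth
    return last - first + 1

def split_blocks(block_size, stripe_depth, misalignment, blocks=8):
    """How many of the first `blocks` Oracle blocks straddle a stripe boundary."""
    return sum(
        stripe_units_touched(misalignment + i * block_size, block_size, stripe_depth) > 1
        for i in range(blocks)
    )

BLOCK = 8192   # hypothetical Oracle block size, in bytes
SHIFT = 512    # hypothetical offset between block and stripe boundaries

print(split_blocks(BLOCK, BLOCK, SHIFT))       # 8: every block needs two physical I/Os
print(split_blocks(BLOCK, 2 * BLOCK, SHIFT))   # 4: half of the blocks are split
print(split_blocks(BLOCK, 64 * 1024, SHIFT))   # 1: only the block crossing the boundary
```

Under these assumptions, doubling the stripe depth halves the number of split I/Os, and larger stripe depths reduce it further.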
With an LVM, the simplest configuration to manage is one with a single striped volume over all available disks. In this case, the stripe width encompasses all available disks. All database files reside within that volume, effectively distributing the load evenly. This single-volume layout provides adequate performance in most situations.
A single-volume configuration is viable only when used in conjunction with RAID technology that allows easy recoverability, such as RAID 1. Otherwise, losing a single disk means losing all files concurrently and, hence, performing a full database restore and recovery.
In addition to performance, there is a manageability concern: the design of the system must allow disks to be added simply, to allow for database growth. The challenge is to do so while keeping the load balanced evenly.
For example, an initial configuration can involve the creation of a single striped volume over 64 disks, each disk being 16 GB. This is total disk space of 1 terabyte (TB) for the primary data. Sometime after the system is operational, an additional 80 GB (that is, five disks) must be added to account for future database growth.
The options for making this space available to the database include creating a second volume that includes the five new disks. However, an I/O bottleneck might develop, if these new disks are unable to sustain the I/O throughput required for the files placed on them.
Another option is to increase the size of the original volume. LVMs are becoming sophisticated enough to allow dynamic reconfiguration of the stripe width, which allows disks to be added while the system is online. This begins to make the placement of all files on a single striped volume feasible in a production environment.
If your LVM is unable to support dynamically adding disks to the stripe, then it is likely that you need to choose a smaller, more manageable stripe width. Then, when new disks are added, the system can grow by a stripe width.
In the preceding example, eight disks might be a more manageable stripe width. This is only feasible if eight disks are capable of sustaining the required number of I/Os each second. Thus, when extra disk space is required, another eight-disk stripe can be added, keeping the I/O balanced across the volumes.
Note: The smaller the stripe width becomes, the more likely it is that you will need to spend time distributing the files on the volumes, and the closer the procedure becomes to manually distributing I/O.
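The growth arithmetic in the preceding example can be sketched as follows. The function name is hypothetical. A stripe width of one disk stands in for an LVM that can add disks to the stripe dynamically.

```python
import math

def disks_to_add(extra_gb, disk_gb, stripe_width_disks):
    """Disks to add when the volume can only grow by whole stripes."""
    # Disks needed purely for capacity.
    needed = math.ceil(extra_gb / disk_gb)
    # Round up to a whole number of stripes.
    stripes = math.ceil(needed / stripe_width_disks)
    return stripes * stripe_width_disks

print(64 * 16)                   # 1024 GB, the initial 1 TB volume
print(disks_to_add(80, 16, 1))   # 5: disks can be added one at a time
print(disks_to_add(80, 16, 8))   # 8: growth by one eight-disk stripe
```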
If your system does not have an LVM or hardware striping, then I/O must be manually balanced across the available disks by distributing the files according to each file's I/O requirements. In order to make decisions on file placement, you should be familiar with the I/O requirements of the database files and the capabilities of the I/O system. If you are not familiar with this data and do not have a representative workload to analyze, you can make a first guess and then tune the layout as the usage becomes known.
Evaluate database disk-storage requirements by checking the size of the files and the disks.
Identify the expected I/O throughput for each file. Determine which files have the highest I/O rate and which do not have many I/Os. Lay out the files on all the available disks so as to even out the I/O rate.
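One simple way to even out the I/O rate is a greedy placement: take the files in order of decreasing I/O rate and put each one on the disk that currently carries the least load. The sketch below is an illustration only, not a method prescribed by this guide. The file names and I/O rates are made up, and the sketch ignores disk capacity and recoverability constraints.

```python
import heapq

def distribute_files(file_io_rates, disk_count):
    """Greedy placement: busiest file first, always onto the least-loaded disk."""
    # Heap entries are (current I/O rate, disk index, files on that disk).
    heap = [(0, disk, []) for disk in range(disk_count)]
    heapq.heapify(heap)
    for name, rate in sorted(file_io_rates.items(), key=lambda kv: -kv[1]):
        load, disk, files = heapq.heappop(heap)
        files.append(name)
        heapq.heappush(heap, (load + rate, disk, files))
    return {disk: (load, files) for load, disk, files in heap}

# Hypothetical I/O rates (I/Os per second) for six datafiles.
rates = {
    "users01.dbf": 120, "orders01.dbf": 90, "index01.dbf": 60,
    "temp01.dbf": 50, "system01.dbf": 20, "hist01.dbf": 10,
}
for disk, (load, files) in sorted(distribute_files(rates, 3).items()):
    print(disk, load, files)
```

With these sample rates, the three disks end up carrying 120, 120, and 110 I/Os per second.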
One popular approach to manual I/O distribution suggests separating a frequently used table from its index. This is not correct. During the course of a transaction, the index is read first, and then the table is read. Because these I/Os occur sequentially, the table and index can be stored on the same disk without contention. It is not sufficient to separate a datafile simply because the datafile contains indexes or table data. The decision to segregate a file should be made only when the I/O rate for that file affects database performance.
Regardless of whether you use operating system striping or manual I/O distribution, if the I/O system or I/O layout is not able to support the I/O rate required, then you need to separate files with high I/O rates from the remaining files. You can identify such files either at the planning stage or after the system is live.
The decision to segregate files should only be driven by I/O rates, recoverability concerns, or manageability issues. (For example, if your LVM does not support dynamic reconfiguration of stripe width, then you might need to create smaller stripe widths to be able to add n disks at a time to create a new stripe of identical configuration.)
Before segregating files, verify that the bottleneck is truly an I/O issue. The data produced from investigating the bottleneck identifies which files have the highest I/O rates.
The following sections describe how to segregate datafiles for tables, indexes, and the TEMP tablespace, as well as redo log files and archived redo logs.
See Also: "Identifying High-Load SQL"
If the files with high I/O are datafiles belonging to tablespaces that contain tables and indexes, then identify whether the I/O for those files can be reduced by tuning SQL or application code.
If the files with high I/O are datafiles that belong to the TEMP tablespace, then investigate whether to tune the SQL statements performing disk sorts to avoid this activity, or to tune the sorting.
After the application has been tuned to avoid unnecessary I/O, if the I/O layout is still not able to sustain the required throughput, then consider segregating the high-I/O files.
See Also: "Identifying High-Load SQL"
Placing all redo logs on one disk without any other files. Also consider availability; members of the same group should be on different physical disks and controllers for recoverability purposes.
Placing each redo log group on a separate disk that does not store any other files.
Striping the redo log files across several disks, using an operating system striping tool. (Manual striping is not possible in this situation.)
Avoiding the use of RAID 5 for redo logs.
Redo log files are written sequentially by the Log Writer (LGWR) process. This operation can be made faster if there is no concurrent activity on the same disk. Dedicating a separate disk to redo log files usually ensures that LGWR runs smoothly with no further tuning necessary. If your system supports asynchronous I/O but this feature is not currently configured, then test to see if using this feature is beneficial. Performance bottlenecks related to LGWR are rare.
If the archiver is slow, then it might be prudent to prevent I/O contention between the archiver process and LGWR by ensuring that archiver reads and LGWR writes are separated. This is achieved by placing logs on alternating drives.
For example, suppose a system has four redo log groups, each group with two members. To create separate-disk access, the eight log files should be labeled 1a, 1b, 2a, 2b, 3a, 3b, 4a, and 4b. This requires at least four disks, plus one disk for archived files.
Figure 8-1 illustrates how redo members should be distributed across disks to minimize contention.
In this example, LGWR switches out of log group 1 (members 1a and 1b) and writes to log group 2 (members 2a and 2b). Concurrently, the archiver process reads from group 1 and writes to its archive destination. Note how the redo log files are isolated from contention.
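One layout that is consistent with this description can be sketched as follows. The exact disk numbering is an assumption made for illustration: each member letter gets its own pair of disks, and odd and even groups alternate within the pair. The function name is hypothetical.

```python
def place_redo_members(groups, members=("a", "b")):
    """Map each redo log member, such as '1a', to a 1-based disk number."""
    layout = {}
    for group in range(1, groups + 1):
        for member_index, letter in enumerate(members):
            # Odd groups use the first disk of the pair, even groups the second.
            disk = 2 * member_index + (1 if group % 2 == 1 else 2)
            layout[f"{group}{letter}"] = disk
    return layout

layout = place_redo_members(4)
print(layout)

def disks_of(group):
    return {disk for name, disk in layout.items() if name.startswith(str(group))}

# LGWR writing group 2 never touches a disk the archiver reads for group 1.
print(disks_of(1).isdisjoint(disks_of(2)))   # True
```

The archive destination would sit on a fifth disk, which this sketch does not model.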
Note: Mirroring redo log files, or maintaining multiple copies of each redo log file on separate disks, does not slow LGWR considerably. LGWR writes to each disk in parallel and waits until each part of the parallel write is complete. Hence, a parallel write does not take longer than the longest possible single-disk write.
Because redo logs are written serially, drives dedicated to redo log activity generally require limited head movement. This significantly accelerates log writing.
This section contains three high-level examples of configuring I/O systems. These examples include sample calculations that define the disk topology, stripe depths, and so on.
The simplest approach to I/O configuration is to build one giant volume, striped across all available disks. To account for recoverability, the volume is mirrored (RAID 1). The striping unit for each disk should be larger than the maximum I/O size for the frequent I/O operations. This provides adequate performance for most cases.
If archive logs are striped on the same set of disks as other files, then any I/O requests on those disks could suffer when redo logs are being archived. Moving archive logs to separate disks provides the following benefits:
The archive can be performed at very high rate (using sequential I/O).
Nothing else is affected by the degraded response time on the archive destination disks.
The number of disks for archive logs is determined by the rate of archive log generation and the amount of archive storage required.
In high-update OLTP systems, the redo logs are write-intensive. Moving the redo log files to disks that are separate from other disks and from archived redo log files has the following benefits:
Writing redo logs is performed at the highest possible rate. Hence, transaction processing performance is at its best.
Writing of the redo logs is not impaired by any other I/O.
The number of disks for redo logs is mostly determined by the redo log size, which is generally small compared to current technology disk sizes. Typically, a configuration with two disks (possibly mirrored to four disks for fault tolerance) is adequate. In particular, by having the redo log files alternating on two disks, writing redo log information to one file does not interfere with reading a completed redo log for archiving.
For systems where a file system can be used to contain all Oracle data, database administration is simplified by using Oracle-managed files. Oracle internally uses standard file system interfaces to create and delete files as needed for tablespaces, temp files, online logs, and control files. Administrators only specify the file system directory to be used for a particular type of file. You can specify one default location for datafiles and up to five multiplexed locations for the control and online redo log files.
Oracle ensures that a unique file is created and then deleted when it is no longer needed. This reduces corruption caused by administrators specifying the wrong file, reduces wasted disk space consumed by obsolete files, and simplifies creation of test and development databases. It also makes development of portable third-party tools easier, because it eliminates the need to put operating system-specific file names in SQL scripts.
New files can be created as managed files, while old ones are administered in the old way. Thus, a database can have a mixture of Oracle-managed and manually managed files.
Note: Oracle-managed files cannot be used with raw devices.
Several points should be considered when tuning Oracle-managed files.
Because Oracle-managed files require the use of a file system, DBAs give up control over how the data is laid out. Therefore, it is important to correctly configure the file system.
The Oracle-managed file system should be built on top of an LVM that supports striping. For load balancing and improved throughput, the disks in the Oracle-managed file system should be striped.
Oracle-managed files work best if used on an LVM that supports dynamically extensible logical volumes. Otherwise, the logical volumes should be configured as large as possible.
Oracle-managed files work best if the file system provides large extensible files.
See Also: Oracle Database Administrator's Guide for detailed information on using Oracle-managed files
A block size of 8 KB is optimal for most systems. However, OLTP systems occasionally use smaller block sizes and DSS systems occasionally use larger block sizes. This section discusses considerations when choosing database block size for optimal performance.
Note: The use of multiple block sizes in a single database instance is not encouraged because of manageability issues.
Regardless of the size of the data, the goal is to minimize the number of reads required to retrieve the desired data.
If the rows are small and access is predominantly random, then choose a smaller block size.
If the rows are small and access is predominantly sequential, then choose a larger block size.
If the rows are small and access is both random and sequential, then it might be effective to choose a larger block size.
If the rows are large, such as rows containing large object (LOB) data, then choose a larger block size.
For high-concurrency OLTP systems, consider appropriate values for FREELISTS when using a larger block size. These parameters affect the degree of update concurrency allowed within a block. However, you do not need to specify the value for FREELISTS when using automatic segment-space management.
If you are uncertain about which block size to choose, then try a database block size of 8 KB for most systems that process a large number of transactions. This represents a good compromise and is usually effective. Only systems processing LOB data need more than 8 KB.
See Also: The Oracle documentation specific to your operating system for information on the minimum and maximum block size on your platform
Table 8-3 lists the advantages and disadvantages of different block sizes.
|Block Size|Advantages|Disadvantages|
|Smaller|Good for small rows with lots of random access. Reduces block contention.|Has relatively large space overhead due to metadata (that is, block header). Not recommended for large rows. There might only be a few rows stored for each block, or worse, row chaining if a single row does not fit into a block.|
|Larger|Has lower overhead, so there is more room to store data. Permits reading a number of rows into the buffer cache with a single I/O (depending on row size and block size). Good for sequential access or very large rows (such as LOB data).|Wastes space in the buffer cache, if you are doing random access to small rows and have a large block size. For example, with an 8 KB block size and 50 byte row size, you waste 8,142 bytes (8,192 minus 50) in the buffer cache when doing random access. Not good for index blocks used in an OLTP environment, because they increase block contention on the index leaf blocks.|
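The trade-offs listed for each block size can be made concrete with a little arithmetic. The sketch below counts 8 KB as exactly 8,192 bytes and, as a simplifying assumption, ignores the block header unless a hypothetical header_bytes value is supplied. The function name is also hypothetical.

```python
import math

def block_usage(block_size, row_size, header_bytes=0):
    """Return (rows_per_block, wasted_on_random_read, blocks_per_row)."""
    usable = block_size - header_bytes
    # Rows that fit whole into one block.
    rows_per_block = usable // row_size
    # A random lookup caches the whole block but wants only one row.
    wasted_on_random_read = block_size - min(row_size, block_size)
    # A row larger than the usable space is chained across several blocks.
    blocks_per_row = max(1, math.ceil(row_size / usable))
    return rows_per_block, wasted_on_random_read, blocks_per_row

print(block_usage(8192, 50))     # (163, 8142, 1)
print(block_usage(2048, 5000))   # (0, 0, 3)
```

The first case shows the buffer cache cost of random access to small rows in a large block. The second shows a large row chained across three small blocks.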