8.4. Storage

8.4.1. Sizing Guidelines for Storage Servers
8.4.2. About ZFS Storage Caches
8.4.3. About Block Alignment
8.4.4. Oracle VDI Global Settings for Storage
8.4.5. Managing the ZIL on Oracle Solaris Platforms

8.4.1. Sizing Guidelines for Storage Servers

The recommended disk layout is RAID 10, mirrored sets in a striped set, with ZFS striping the data automatically between multiple sets. This layout is called "mirrored" by the 7000 series. While this disk layout uses 50% of the available disk capacity for redundancy, it is faster than RAID 5 for intense small random read/writes, which is the typical access characteristic for iSCSI.

The storage servers provide the virtual disks that are accessed by Oracle VM VirtualBox through iSCSI. Because iSCSI is a CPU-intensive protocol, the number of cores of the storage server is a decisive factor for its performance. Other important factors are the memory size (cache), the number of disks, and the available network bandwidth.

The network bandwidth is very volatile and determined by the relation of desktops starting up (*peak network bandwidth*) and desktops that have cached the applications in use (*average network bandwidth*). Starting a virtual machine (XP guest) creates a network load of 150 MB which needs to be satisfied in around 30 seconds. If many desktops are started at the same time, the requested network bandwidth may exceed 1 Gb/s if the CPUs of the storage can handle the load created by the iSCSI traffic. This scenario is typical for shift-work companies. In such a case, set the Pool, Cloning, or Machine State option to Running, which always keeps the desktops running and therefore decouples the OS boot from the login of a user. Another option is to trunk several interfaces to provide more than 1 Gb/s bandwidth through one IP. You can also use Jumbo Frames to speed up iSCSI connections. Jumbo Frames need to be configured for all participants of the network: storage servers, Oracle VM VirtualBox servers, and switches. Note that Jumbo Frames are not standardized so there is a risk of incompatibilities.
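The boot figures above can be turned into a quick estimate. The following is a minimal sketch, assuming the link is the only bottleneck and using the 150 MB and 30 second values from the text. The function names are illustrative and not part of Oracle VDI.

```python
# Figures taken from the text: 150 MB per Windows XP boot, served in about 30 s.
BOOT_DATA_MB = 150
BOOT_WINDOW_S = 30


def per_boot_mbps(data_mb=BOOT_DATA_MB, window_s=BOOT_WINDOW_S):
    """Bandwidth one booting desktop needs, in megabits per second."""
    return data_mb * 8 / window_s


def boot_storm_seconds(desktops, link_mbps=1000, data_mb=BOOT_DATA_MB):
    """Lower bound on the time needed to feed all boots through one link.

    Assumption: the link is the only bottleneck (no CPU, disk, or
    protocol overhead is modeled).
    """
    return desktops * data_mb * 8 / link_mbps


print(per_boot_mbps())           # 40.0
print(boot_storm_seconds(200))   # 240.0
```

Under these assumptions, 200 desktops starting at once on a single Gigabit Ethernet interface need at least four minutes of transfer time, which is why keeping desktops running or trunking interfaces helps.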

Oracle VDI, in combination with Oracle VM VirtualBox, uses the Sparse Volume feature of ZFS, which enables it to allocate more disk space for volumes than is physically available as long as the actual data written does not exceed the capacity of the storage. This feature, in combination with the fact that cloned desktops reuse unchanged data of their templates, results in a very effective usage of the available disk space. Therefore, the calculation for disk space below is a worst-case scenario assuming that all volumes are completely used by data which differs from the template.
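To illustrate the difference between the worst case and a typical case, the following sketch compares both. The changed_fraction parameter is a hypothetical estimate of how much of each clone differs from its template. It is not a value reported by Oracle VDI or ZFS.

```python
def worst_case_gb(desktops, disk_gb):
    """Space needed if every clone diverges completely from its template."""
    return desktops * disk_gb


def expected_gb(desktops, disk_gb, changed_fraction):
    """One shared template plus the changed part of each clone.

    changed_fraction is a hypothetical estimate between 0.0 and 1.0.
    """
    if not 0.0 <= changed_fraction <= 1.0:
        raise ValueError("changed_fraction must be between 0 and 1")
    return disk_gb + desktops * disk_gb * changed_fraction


print(worst_case_gb(1000, 8))       # 8000
print(expected_gb(1000, 8, 0.25))   # 2008.0
```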

  • Number of cores = number of virtual disks in use / 200

    Example: An x7210 storage with 2 CPUs and 4 cores per CPU can serve up to 2 * 4 * 200 = 1600 virtual disks

  • Memory size - The more the better. The free memory can be used as a disk cache, which reduces the access time.

  • Number of disks = number of desktops / 10

  • Average Network bandwidth [Mb/s] = number of virtual disks in use * 0.032 Mb/s

    Example: An x7210 storage with one Gigabit Ethernet interface can serve up to 1000 / 0.032 = 31250 virtual disks

  • Peak Network bandwidth [Mb/s] = number of virtual disks in use * 40 Mb/s

    Example: An x7210 storage with one Gigabit Ethernet interface can serve up to 1000 / 40 = 25 virtual disks

  • Disk space [GB] = number of desktops * size of the virtual disk [GB]

    Example: An x7210 storage with a capacity of 46 TB can support 46 * 1024 GB / 2 / 8 GB = 2944 8 GB disks in a RAID 10 configuration
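The guidelines above can be combined into one small calculator. This is a sketch only, assuming one virtual disk per desktop and the RAID 10 layout recommended earlier. The function and constant names are illustrative.

```python
import math

DISKS_PER_CORE = 200        # virtual disks one core can serve
DESKTOPS_PER_DISK = 10      # desktops per physical disk
AVG_MBPS_PER_DISK = 0.032   # average bandwidth per virtual disk in use
PEAK_MBPS_PER_DISK = 40     # bandwidth per virtual disk while booting


def size_storage(desktops, disk_gb):
    """Worst-case requirements, assuming one virtual disk per desktop."""
    return {
        "cores": math.ceil(desktops / DISKS_PER_CORE),
        "disks": math.ceil(desktops / DESKTOPS_PER_DISK),
        "avg_mbps": desktops * AVG_MBPS_PER_DISK,
        "peak_mbps": desktops * PEAK_MBPS_PER_DISK,
        "usable_gb": desktops * disk_gb,
        # RAID 10 uses half of the raw capacity for redundancy
        "raw_gb_raid10": desktops * disk_gb * 2,
    }


def max_disks_for_capacity(raw_tb, disk_gb):
    """Number of virtual disks that fit on raw_tb of RAID 10 storage."""
    return int(raw_tb * 1024 / 2 // disk_gb)


print(size_storage(1600, 8)["cores"])    # 8
print(max_disks_for_capacity(46, 8))     # 2944
```

The results match the examples in the list: 1600 virtual disks need 8 cores, and 46 TB of raw capacity holds 2944 disks of 8 GB each.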

Note

For details about how to improve desktop performance, see the sections on optimizing desktop images in Section 6.5, “Creating Desktop Images”.

8.4.2. About ZFS Storage Caches

This section provides a brief overview of the cache structure and performance of ZFS, and how it maps to the hardware of the Sun Storage 7000 series Unified Storage Systems.

Background

The Zettabyte File System (ZFS) is the underlying file system on the supported Solaris and Sun Storage 7000 series Unified Storage Systems storage platforms.

The Adaptive Replacement Cache (ARC) is the ZFS read cache in the main memory (DRAM).

The Second Level Adaptive Replacement Cache (L2ARC) is used to store read cache data outside of the main memory. Sun Storage 7000 series Unified Storage Systems use read-optimized SSDs (known as Readzillas) for the L2ARC. SSDs are slower than DRAM but still much faster than hard disks. The L2ARC allows for a very large cache, which improves the read performance.

The ZFS Intent Log (ZIL) satisfies the POSIX requirements for synchronous writes and crash recovery. It is not used for asynchronous writes. The ZFS system calls are logged by the ZIL and contain sufficient information to play them back in the event of a system crash. Sun Storage 7000 series Unified Storage Systems use write-optimized SSDs (known as Writezillas or Logzillas) for the ZIL. If Logzillas are not available the hard disks are used.

The write cache is used to store data in volatile (not battery-backed) DRAM for faster writes. There are no system calls logged in the ZIL if the Sun Storage 7000 series Unified Storage Systems write cache is enabled.

Performance Considerations

Size the read cache to hold as much data as possible to improve performance. Maximize the ARC first (DRAM), then add L2ARC (Readzillas).

Oracle VDI enables the write cache by default for every iSCSI volume used by Oracle VDI. This configuration is very fast and does not make use of Logzillas, as the ZIL is not used. Without ZIL, data might be at risk if the Sun Storage 7000 series Unified Storage System reboots or experiences a power loss while desktops are active. However, it does not cause corruption in ZFS itself.

Disable the write cache in Oracle VDI to minimize the risk of data loss. Without Logzillas, the ZIL is backed by the available hard disks and performance suffers noticeably. Use Logzillas to speed up the ZIL. If you have two or four Logzillas, use the 'striped' profile to further improve performance.

To switch off the in-memory write cache, select a storage in Oracle VDI Manager, click Edit to open the Edit Storage wizard and unselect the Cache check box. The change will be applied to newly created desktops for Oracle VDI Hypervisors and to newly started desktops for Microsoft Hyper-V virtualization platforms.

8.4.3. About Block Alignment

Classic hard disks have a block size of 512 bytes. Oracle Solaris and Sun Unified Storage use the ZFS file system, which has a default block size of 8 kilobytes. Depending on the guest operating system of the virtual machine, one logical block of the guest file system can use two ZFS blocks on the storage. This is known as block misalignment, as shown in Figure 8.2, “Examples of Misaligned and Aligned Blocks”. It is best to avoid block misalignment because it doubles the IO on the storage to access a block of the guest OS file system (assuming a complete random access pattern and no caching).

Figure 8.2. Examples of Misaligned and Aligned Blocks

The diagram shows two examples of a virtual disk, one is aligned correctly and one is misaligned.

Windows XP is a particular example of where block misalignment can happen. Typically a single partition on a disk starts at disk sector 63. To check the alignment of a Windows partition, use the following command:

wmic partition get StartingOffset, Name, Index

The following is an example of the output from this command:

Index Name                   StartingOffset 
0     Disk #0, Partition #0  32256

To find the starting sector, divide the StartingOffset value by 512:

32256 ÷ 512 = 63

An NTFS cluster is typically 4 kilobytes in size. So the first NTFS cluster starts at disk sector 63 and ends at disk sector 70. On the storage, the fourth ZFS block maps to disk sectors 48 to 63, and the fifth ZFS block maps to disk sectors 64 to 79. A misalignment occurs because both ZFS blocks must be accessed to access the first NTFS cluster, as shown in Figure 8.2, “Examples of Misaligned and Aligned Blocks”.
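The sector arithmetic above can be checked with a short sketch. It assumes 512-byte sectors, 4-kilobyte NTFS clusters, and the default 8-kilobyte ZFS block size. The function name is illustrative.

```python
SECTOR_BYTES = 512
ZFS_BLOCK_BYTES = 8192
NTFS_CLUSTER_BYTES = 4096


def zfs_blocks_for_cluster(start_sector, cluster_index=0):
    """Return the 1-based ZFS block numbers that one NTFS cluster touches.

    start_sector is the first disk sector of the partition and
    cluster_index counts clusters from the start of the partition.
    """
    first_byte = start_sector * SECTOR_BYTES + cluster_index * NTFS_CLUSTER_BYTES
    last_byte = first_byte + NTFS_CLUSTER_BYTES - 1
    first_block = first_byte // ZFS_BLOCK_BYTES + 1
    last_block = last_byte // ZFS_BLOCK_BYTES + 1
    return list(range(first_block, last_block + 1))


print(zfs_blocks_for_cluster(63))   # [4, 5]
print(zfs_blocks_for_cluster(64))   # [5]
```

With a partition that starts at sector 63, the first cluster touches the fourth and fifth ZFS blocks. With a start at sector 64, it fits in the fifth block alone.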

For a correct block alignment, the StartingOffset value must be exactly divisible by 8192 (the default block size of the underlying ZFS storage).

In the following example, the blocks are misaligned:

wmic partition get StartingOffset, Name, Index
Index Name                   StartingOffset 
0     Disk #0, Partition #0  32256

32256 ÷ 8192 = 3.9375

In the following example, the blocks are aligned:

wmic partition get StartingOffset, Name, Index
Index Name                   StartingOffset 
0     Disk #0, Partition #0  32768

32768 ÷ 8192 = 4
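The same check can be scripted. The following sketch parses text in the layout of the wmic output shown in this section and tests each StartingOffset against the ZFS block size. It assumes the three columns Index, Name, and StartingOffset in that order, and the function name is illustrative.

```python
ZFS_BLOCK_BYTES = 8192


def check_alignment(wmic_output, block_bytes=ZFS_BLOCK_BYTES):
    """Map each partition name to True (aligned) or False (misaligned)."""
    results = {}
    for line in wmic_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric Index and end with the
        # numeric StartingOffset; the header row is skipped.
        if len(fields) < 3 or not (fields[0].isdigit() and fields[-1].isdigit()):
            continue
        name = " ".join(fields[1:-1])
        results[name] = int(fields[-1]) % block_bytes == 0
    return results


sample = (
    "Index Name                   StartingOffset\n"
    "0     Disk #0, Partition #0  32256\n"
)
print(check_alignment(sample))   # {'Disk #0, Partition #0': False}
```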

On Windows 2003 SP1 and later, the diskpart.exe utility has an Align option to specify the block alignment of partitions. For Windows XP, use a third-party disk partitioning tool such as parted to create partitions with a defined start sector, as shown in the example that follows. For other operating systems, refer to your system documentation for details of how to align partitions.

Example of How to Prepare a Disk with Correct Block Alignment for Windows XP

In this example, the disk utilities on a bootable live Linux system, such as Knoppix, are used to create a disk partition with the blocks aligned correctly.

  1. Create a new virtual machine.

  2. Assign the ISO image of the live Linux system to the CD/DVD-ROM drive of the virtual machine.

  3. Boot the virtual machine.

  4. Open a command shell and become root.

  5. Obtain the total number of sectors of the disk.

    Use the fdisk -ul command to obtain information about the disk.

    In the following example, the disk has 20971520 sectors:

    # fdisk -ul
    Disk /dev/sda doesn't contain a valid partition table
    
    Disk /dev/sda: 10.7 GB, 10737418240 bytes
    255 heads, 63 sectors/track, 1305 cylinders, total 20971520 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000
  6. Create an MS-DOS partition table on the disk.

    Use the parted <disk> mklabel msdos command to create the partition table.

    In the following example, a partition table is created on the /dev/sda disk:

    # parted /dev/sda mklabel msdos
  7. Create a new partition, specifying the start and end sectors of the partition.

    Use the parted <disk> mkpartfs primary fat32 64s <end-sector>s command to create the partition. The <end-sector> is the total number of sectors of the disk minus one. For example, if the disk has 20971520 sectors, the <end-sector> is 20971519.

    Depending on the version of parted used, you might see a warning that the partition is not properly aligned for best performance. You can safely ignore this warning.

    In the following example, a partition is created on the /dev/sda disk:

    # parted /dev/sda mkpartfs primary fat32 64s 20971519s                          
  8. Check that the partition is created.

    Use the parted <disk> print command to check the partition.

    In the following example, the /dev/sda disk is checked for partitions:

    # parted /dev/sda print
    Model: ATA VBOX HARDDISK (scsi)
    Disk /dev/sda: 10.7GB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos
    
    Number  Start   End     Size    Type     File system  Flags
     1      32.8kB  10.7GB  10.7GB  primary  fat32        lba
  9. Shut down the virtual machine and unassign the ISO image.

  10. Assign the Windows XP installation ISO image to the CD/DVD-ROM drive of the virtual machine.

  11. Boot the virtual machine and install Windows XP.

  12. When prompted, select the newly created partition.

  13. (Optional) When prompted, change the file system from FAT32 to NTFS.

  14. Complete the installation.

  15. Log in to the Windows XP guest as an administrator.

  16. Check that the StartingOffset is 32768.

    wmic partition get StartingOffset, Name, Index
    Index Name                   StartingOffset 
    0     Disk #0, Partition #0  32768
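The sector values used in steps 5 to 7 can be derived with a short sketch. It only builds the command string from step 7 and does not run it. The function name is illustrative, and the default start sector of 64 is the value used in this procedure.

```python
SECTOR_BYTES = 512
ZFS_BLOCK_BYTES = 8192


def aligned_partition_command(disk, total_sectors, start_sector=64):
    """Build the parted command from step 7 for a disk of total_sectors."""
    if (start_sector * SECTOR_BYTES) % ZFS_BLOCK_BYTES != 0:
        raise ValueError("start sector is not on a ZFS block boundary")
    # The end sector is the total number of sectors minus one
    end_sector = total_sectors - 1
    return f"parted {disk} mkpartfs primary fat32 {start_sector}s {end_sector}s"


print(aligned_partition_command("/dev/sda", 20971520))
# parted /dev/sda mkpartfs primary fat32 64s 20971519s
```

A start sector of 64 corresponds to a StartingOffset of 64 * 512 = 32768 bytes, which is the value verified in step 16.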

8.4.4. Oracle VDI Global Settings for Storage

This section provides information about the Oracle VDI global settings that apply to storage. Use the vda settings-getprops and vda settings-setprops commands to list and edit these settings.

Global Setting
    Description

storage.max.commands
    The number of commands executed on a storage in parallel.

    The default is 25.

    Changing this setting requires a restart of the Oracle VDI service.

    The setting is global for an Oracle VDI installation and applies to a physical storage determined by its IP or DNS name.

    The number of Oracle VDI hosts does not influence the maximum number of parallel storage actions executed by Oracle VDI on a physical storage. Reduce the number in case of intermittent "unresponsive storage" messages to reduce the storage load. Doing so impacts cloning and recycling performance.

    This option works even if the Oracle VDI Center Agent is no longer running on the host.

storage.query.size.interval
    The interval in seconds at which the Oracle VDI service queries the storage for its total and available disk space.

    The default is 180 seconds.

    As there is only one Oracle VDI host which does this, there is typically no need to change this setting.

storage.watchdog.interval
    The interval in seconds at which the Oracle VDI service queries the storage for its availability.

    The default is 30 seconds.

    As there is only one Oracle VDI host which does this, there is typically no need to change this setting.

storage.fast.command.duration
    The time in seconds after which the Oracle VDI service considers a fast storage command to have failed.

    The default is 75 seconds.

    Changing this setting requires a restart of the Oracle VDI service.

    The only Oracle VDI functionality which uses this command duration is the storage watchdog which periodically pings the storage for its availability.

storage.medium.command.duration
    The time in seconds after which the Oracle VDI service considers a medium storage command to have failed.

    The default is 1800 seconds (30 minutes).

    Changing this setting requires a restart of the Oracle VDI service.

    The majority of the storage commands used by Oracle VDI use this command duration.

storage.slow.command.duration
    The time in seconds after which the Oracle VDI service considers a slow storage command to have failed.

    The default is 10800 seconds (3 hours).

    Changing this setting requires a restart of the Oracle VDI service.

    Only a few complex storage scripts used by Oracle VDI use this command duration. Such scripts are not run very often, typically once per day.

The storage.max.commands setting is the setting that is most often changed. By default, Sun Storage 7000 series Unified Storage Systems can only execute four commands in parallel, and the remaining commands are queued. To achieve better performance, Oracle VDI intentionally overcommits the storage queue. If your storage becomes slow, for example because of a heavy load, it can take too long for queued commands to be executed, and if the commands take longer than the duration specified in the duration settings, the storage might be marked incorrectly as unresponsive. If this happens regularly, you can decrease the value of the storage.max.commands setting, but this might result in a decrease in performance when the storage is not so busy.
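The interaction between the queue depth and the duration settings can be estimated with a simplified model. The sketch below assumes that commands run in batches of four, that every command takes the same time, and that the duration is measured from the moment a command is submitted. These are illustrative assumptions, not documented Oracle VDI behavior.

```python
import math


def worst_case_wait(max_commands, parallel_slots, seconds_per_command):
    """Seconds until the last of max_commands queued commands completes.

    Simplified model: commands run in batches of parallel_slots and all
    take the same time.
    """
    batches = math.ceil(max_commands / parallel_slots)
    return batches * seconds_per_command


def would_time_out(max_commands, parallel_slots, seconds_per_command, duration):
    """True if the last queued command finishes after the duration."""
    wait = worst_case_wait(max_commands, parallel_slots, seconds_per_command)
    return wait > duration


# Default of 25 commands, 4 parallel slots, medium duration of 1800 seconds
print(would_time_out(25, 4, 300, 1800))   # True
print(would_time_out(20, 4, 300, 1800))   # False
```

In this model, a loaded storage that needs 300 seconds per command exceeds the medium duration with the default of 25 commands, while a storage.max.commands value of 20 stays within it.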

The interval settings rarely need to be changed because the commands are performed only by the primary host in an Oracle VDI Center. Decreasing the value of these settings results in more up-to-date information about the storage disk space and a quicker detection of unresponsive storage hosts, but also increases the load on the storage hosts. It is best to keep these settings at their defaults.

The duration settings include a good safety margin. Only change the duration settings if the storage is not able to execute the commands in the allotted time.

8.4.5. Managing the ZIL on Oracle Solaris Platforms

Disabling the ZFS Intent Log (ZIL) is a way to speed up Oracle Solaris 10 10/09 (and later) storage platforms. There are several ways to do it, but be aware that disabling the ZIL is dangerous when synchronous disk I/O and data consistency during storage failures are important.

The command to immediately disable the ZIL:

        echo zil_disable/W0t1 | mdb -kw
      

The command to immediately enable the ZIL:

        echo zil_disable/W0t0 | mdb -kw
      

To keep the ZIL disabled across reboots, edit the /etc/system file and add the following line:

        set zfs:zil_disable=1
      

Changing the ZIL state takes effect for a particular ZFS pool only when the pool is mounted, so the ZFS pool must be created, remounted, or imported after the setting is changed (a reboot does this implicitly).
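To review whether a host is configured to disable the ZIL at boot, the contents of /etc/system can be inspected. The following sketch works on the file contents as text and assumes that comment lines start with '*' or '#' and that a later entry overrides an earlier one. The function name is illustrative.

```python
def zil_disabled_at_boot(etc_system_text):
    """Return True if the last zfs:zil_disable entry sets a non-zero value."""
    disabled = False
    for raw in etc_system_text.splitlines():
        line = raw.strip()
        # Assumption: comment lines in /etc/system begin with '*' or '#'
        if not line or line[0] in "*#":
            continue
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "set":
            continue
        name, sep, value = parts[1].partition("=")
        if sep and name.strip() == "zfs:zil_disable":
            # Assumption: a later entry overrides an earlier one
            disabled = int(value.strip(), 0) != 0
    return disabled


sample = "* tuning for VDI storage\nset zfs:zil_disable=1\n"
print(zil_disabled_at_boot(sample))   # True
```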

Since the ZIL setting is global for a storage and disables the ZIL for all ZFS pools of a storage after a reboot, a system's root volume served by ZFS might show undesired behavior because the synchronous semantics are gone.

The best practice to avoid such a conflict of interests is to use a server with at least two disks. The first disk hosts the system slices of the OS using the old UFS file system. The remaining disks are ZFS formatted and used as Oracle VDI storage. By doing this, the ZIL can be disabled and the UFS disk will still offer synchronous semantics since ZIL is ZFS only.

A reference page for ZFS and ZIL:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29