APPENDIX B
SPARC Behavior and Implementation
This appendix discusses issues related to the floating-point units used in SPARC® based workstations and describes a way to determine which code generation flags are best suited for a particular workstation.
This section lists a number of SPARC floating-point units and describes the instruction sets and exception handling features they support. See the SPARC Architecture Manual Version 8 Appendix N, "SPARC IEEE 754 Implementation Recommendations", and Version 9 Appendix B, "IEEE Std 754-1985 Requirements for SPARC-V9", for brief descriptions of what happens when a floating-point trap is taken, the distinction between trapped and untrapped underflow, and recommended courses of action for SPARC implementations that provide a non-IEEE (nonstandard) arithmetic mode.
TABLE B-1 lists the hardware floating-point implementations used by SPARC workstations. Many early SPARC based systems have floating-point units derived from cores developed by TI or Weitek:
These two families of FPUs have been licensed to other workstation vendors, so chips from other semiconductor manufacturers may be found in some SPARC based workstations. Some of these other chips are also shown in the table.
The last column in the preceding table shows the compiler flags to use to obtain the fastest code for each FPU. These flags control two independent attributes of code generation: the -xarch flag determines the instruction set the compiler may use, and the -xchip flag determines the assumptions the compiler will make about a processor's performance characteristics in scheduling the code. Because all SPARC floating-point units implement at least the floating-point instruction set defined in the SPARC Architecture Manual Version 7, a program compiled with -xarch=v7 will run on any SPARC based system, although it may not take full advantage of the features of later processors. Likewise, a program compiled with a particular -xchip value will run on any SPARC based system that supports the instruction set specified with -xarch, but it may run more slowly on systems with processors other than the one specified.
The floating-point units listed in the table preceding the microSPARC-I implement the floating-point instruction set defined in the SPARC Architecture Manual Version 7. Programs that must run on systems with these FPUs should be compiled with -xarch=v7. The compilers make no special assumptions regarding the performance characteristics of these processors, so they all share the single -xchip option -xchip=old. (Not all of the systems listed in TABLE B-1 are still supported by the compilers; they are listed solely for historical purposes. Refer to the appropriate version of the Numerical Computation Guide for the code generation flags to use with compilers supporting these systems.)
The microSPARC-I and microSPARC-II floating-point units implement the floating-point instruction set defined in the SPARC Architecture Manual Version 8 except for the FsMULd and quad precision instructions. Programs compiled with -xarch=v8 will run on systems with these processors, but because unimplemented floating-point instructions must be emulated by the system kernel, programs that use FsMULd extensively (such as Fortran programs that perform a lot of single precision complex arithmetic) may encounter severe performance degradation. To avoid this, compile programs for systems with these processors with -xarch=v8a.
The SuperSPARC-I, SuperSPARC-II, hyperSPARC, and TurboSPARC floating-point units implement the floating-point instruction set defined in the SPARC Architecture Manual Version 8 except for the quad precision instructions. To get the best performance on systems with these processors, compile with -xarch=v8.
The UltraSPARC I, UltraSPARC II, UltraSPARC IIe, UltraSPARC IIi, UltraSPARC III, UltraSPARC IIIi, and UltraSPARC IV floating-point units implement the floating-point instruction set defined in the SPARC Architecture Manual Version 9 except for the quad precision instructions; in particular, they provide 32 double precision floating-point registers. To allow the compiler to use these registers, compile with -xarch=v8plus (for programs that run under a 32-bit OS) or -xarch=v9 (for programs that run under a 64-bit OS). These processors also provide extensions to the standard instruction set. The additional instructions, known as the Visual Instruction Set or VIS, are rarely generated automatically by the compilers, but they may be used in assembly code. Therefore, to take full advantage of the instruction set these processors support, use -xarch=v8plusa (32-bit) or -xarch=v9a (64-bit).
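The recommendations in the preceding paragraphs can be summarized as a small lookup. The following C sketch is illustrative only: the enum, its member names, and the function xarch_for() are hypothetical and are not part of any compiler interface. The returned strings are the -xarch values named in the text.

```c
#include <stddef.h>

/* FPU families as grouped in the text (enum and names are hypothetical). */
enum fpu_family {
    FPU_PRE_MICROSPARC,  /* FPUs listed before the microSPARC-I: V7 only  */
    FPU_MICROSPARC,      /* microSPARC-I, microSPARC-II: V8 minus FsMULd  */
    FPU_SUPERSPARC,      /* SuperSPARC-I/II, hyperSPARC, TurboSPARC: V8   */
    FPU_ULTRASPARC       /* UltraSPARC I through IV: V9 plus VIS          */
};

/*
 * Return the -xarch value the text suggests for a given FPU family.
 * is_64bit chooses between the 32-bit and 64-bit UltraSPARC values;
 * use_vis chooses the variants that permit VIS instructions.
 * Both arguments are ignored for the older families.
 */
const char *xarch_for(enum fpu_family family, int is_64bit, int use_vis)
{
    switch (family) {
    case FPU_PRE_MICROSPARC: return "v7";
    case FPU_MICROSPARC:     return "v8a";
    case FPU_SUPERSPARC:     return "v8";
    case FPU_ULTRASPARC:
        if (is_64bit)
            return use_vis ? "v9a" : "v9";
        return use_vis ? "v8plusa" : "v8plus";
    }
    return NULL; /* unknown family */
}
```

For example, xarch_for(FPU_ULTRASPARC, 0, 1) yields "v8plusa", the value suggested for a 32-bit program that uses VIS instructions in assembly code.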
The -xarch and -xchip options can be specified simultaneously using the -xtarget macro option. (That is, the -xtarget flag simply expands to a suitable combination of -xarch, -xchip, and -xcache flags.) The default code generation option is -xtarget=generic. See the cc(1), CC(1), and f95(1) man pages and the compiler manuals for more information including a complete list of -xarch, -xchip, and -xtarget values. Additional -xarch information is provided in the Fortran User's Guide, C User's Guide, and C++ User's Guide.
All SPARC floating-point units, regardless of which version of the SPARC architecture they implement, provide a floating-point status register (FSR) that contains status and control bits associated with the FPU. All SPARC FPUs that implement deferred floating-point traps provide a floating-point queue (FQ) that contains information about currently executing floating-point instructions. The FSR can be accessed by user software to detect floating-point exceptions that have occurred and to control rounding direction, trapping, and nonstandard arithmetic modes. The FQ is used by the operating system kernel to process floating-point traps and is normally invisible to user software.
Software accesses the floating-point status register via STFSR and LDFSR instructions that store the FSR in memory and load it from memory, respectively. In SPARC assembly language, these instructions are written as follows:
The inline template file libm.il located in the directory containing the libraries supplied with the Sun Studio compilers contains examples showing the use of STFSR and LDFSR instructions.
FIGURE B-1 shows the layout of bit fields in the floating-point status register.
In versions 7 and 8 of the SPARC architecture, the FSR occupies 32 bits as shown. In version 9, the FSR is extended to 64 bits, of which the lower 32 match the figure; the upper 32 are largely unused, containing only three additional floating-point condition code fields.
Here res refers to bits that are reserved, ver is a read-only field that identifies the version of the FPU, and ftt and qne are used by the system when it processes floating-point traps. The remaining fields are described in the following table.
The RM field holds two bits that specify the rounding direction for floating-point operations. The NS bit enables nonstandard arithmetic mode on SPARC FPUs that implement it; on others, this bit is ignored. The fcc field holds floating-point condition codes generated by floating-point compare instructions and used by branch and conditional move operations. Finally, the TEM, aexc, and cexc fields each contain five bits, one for each of the five IEEE 754 floating-point exceptions: TEM controls trapping, aexc records accrued exception flags, and cexc records current exception flags. These fields are subdivided as shown in TABLE B-3.
(The symbols NV, OF, UF, DZ, and NX above stand for the invalid operation, overflow, underflow, division-by-zero, and inexact exceptions respectively.)
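As an illustration of how these fields might be unpacked from a stored FSR value, the following C sketch extracts them with shifts and masks. It assumes the 32-bit layout given in the SPARC Architecture Manual Version 8 (RM in bits 31:30, TEM in 27:23, NS in bit 22, fcc in 11:10, aexc in 9:5, cexc in 4:0, with NV, OF, UF, DZ, and NX ordered from high bit to low within each five-bit field). The structure and function names are hypothetical.

```c
#include <stdint.h>

/* Bit masks within each 5-bit TEM, aexc, and cexc field (assumed V8 order). */
enum {
    FSR_NX = 1 << 0,   /* inexact           */
    FSR_DZ = 1 << 1,   /* division by zero  */
    FSR_UF = 1 << 2,   /* underflow         */
    FSR_OF = 1 << 3,   /* overflow          */
    FSR_NV = 1 << 4    /* invalid operation */
};

/* User-visible FSR fields; res, ver, ftt, and qne are omitted. */
struct fsr_fields {
    unsigned rm;    /* rounding direction (2 bits)        */
    unsigned tem;   /* trap enable mask (5 bits)          */
    unsigned ns;    /* nonstandard mode (1 bit)           */
    unsigned fcc;   /* condition codes (2 bits)           */
    unsigned aexc;  /* accrued exception flags (5 bits)   */
    unsigned cexc;  /* current exception flags (5 bits)   */
};

/* Split the low 32 bits of an FSR image into its user-visible fields. */
struct fsr_fields fsr_decode(uint32_t fsr)
{
    struct fsr_fields f;
    f.rm   = (fsr >> 30) & 0x3;
    f.tem  = (fsr >> 23) & 0x1f;
    f.ns   = (fsr >> 22) & 0x1;
    f.fcc  = (fsr >> 10) & 0x3;
    f.aexc = (fsr >> 5)  & 0x1f;
    f.cexc =  fsr        & 0x1f;
    return f;
}
```

A caller could then test, for example, (fsr_decode(value).aexc & FSR_UF) != 0 to see whether an underflow has accrued.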
In most cases, SPARC floating-point units execute instructions completely in hardware without requiring software support. There are four situations, however, when the hardware will not successfully complete a floating-point instruction:
In each situation, the initial response is the same: the process "traps" to the system kernel, which determines the cause of the trap and takes the appropriate action. (The term "trap" refers to an interruption of the normal flow of control.) In the first three situations, the kernel emulates the trapping instruction in software. Note that the emulated instruction can also incur an exception whose trap is enabled.
In the first three situations above, if the emulated instruction does not incur an IEEE floating-point exception whose trap is enabled, the kernel completes the instruction. If the instruction is a floating-point compare, the kernel updates the condition codes to reflect the result; if the instruction is an arithmetic operation, it delivers the appropriate result to the destination register. It also updates the current exception flags to reflect any (untrapped) exceptions raised by the instruction, and it "or"s those exceptions into the accrued exception flags. It then arranges to continue execution of the process at the point at which the trap was taken.
When an instruction executed by hardware or emulated by the kernel software incurs an IEEE floating-point exception whose trap is enabled, the instruction is not completed. The destination register, floating-point condition codes, and accrued exception flags are unchanged, the current exception flags are set to reflect the particular exception that caused the trap, and the kernel sends a SIGFPE signal to the process.
The following pseudo-code summarizes the handling of floating-point traps. Note that the aexc field can normally only be cleared by software.
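The behavior described in the preceding paragraphs can also be modeled in C. The sketch below is a simplified model, not kernel code: the types, the function handle_fp_trap(), and the enum names are hypothetical, and when several exceptions are raised at once the model simply records the enabled subset in cexc, which glosses over the hardware's priority rules.

```c
/* One bit per IEEE 754 exception, in the order used by TEM, aexc, and cexc. */
enum { FP_NX = 1, FP_DZ = 2, FP_UF = 4, FP_OF = 8, FP_NV = 16 };

/* Hypothetical names for the trap types discussed in the text. */
enum fp_trap_cause {
    TRAP_FP_DISABLED,
    TRAP_UNIMPLEMENTED_FPOP,
    TRAP_UNFINISHED_FPOP,
    TRAP_IEEE_EXCEPTION
};

/* The parts of the FPU state that the model tracks. */
struct fp_state {
    unsigned tem;        /* trap enable mask            */
    unsigned aexc;       /* accrued exception flags     */
    unsigned cexc;       /* current exception flags     */
    int fpu_enabled;     /* nonzero once the FPU is on  */
};

/*
 * Model the kernel's response to a floating-point trap.
 * raised is the set of exceptions the (possibly emulated) instruction incurs.
 * Returns 1 if SIGFPE would be sent, 0 if the instruction completes.
 */
int handle_fp_trap(struct fp_state *s, enum fp_trap_cause cause,
                   unsigned raised)
{
    if (cause == TRAP_FP_DISABLED)
        s->fpu_enabled = 1;          /* the kernel enables the FPU */

    /* Whether executed in hardware or emulated, an exception whose
       trap is enabled prevents the instruction from completing. */
    if (raised & s->tem) {
        s->cexc = raised & s->tem;   /* simplification: enabled subset */
        return 1;                    /* aexc is left unchanged */
    }

    /* Otherwise the instruction completes: record the untrapped
       exceptions and "or" them into the accrued flags. */
    s->cexc = raised;
    s->aexc |= raised;
    return 0;
}
```

Note that nothing in the model ever clears aexc, which mirrors the point that the accrued flags are normally cleared only by software.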
A program will encounter severe performance degradation when many floating-point instructions must be emulated by the kernel. The relative frequency with which this happens can depend on several factors including, of course, the type of trap.
Under normal circumstances, the fp_disabled trap should occur only once per process. The system kernel disables the floating-point unit when a process is first started, so the first floating-point operation executed by the process will cause a trap. After processing the trap, the kernel enables the floating-point unit, and it remains enabled for the duration of the process. (It is possible to disable the floating-point unit for the entire system, but this is not recommended and is done only for kernel or hardware debugging purposes.)
An unimplemented_FPop trap occurs whenever the floating-point unit encounters an instruction it does not implement. Since most current SPARC floating-point units implement at least the instruction set defined by the SPARC Architecture Manual Version 8 except for the quad precision instructions, and the Sun Studio compilers do not generate quad precision instructions, this type of trap should not occur on most systems. As mentioned above, two notable exceptions are the microSPARC-I and microSPARC-II processors, which do not implement the FsMULd instruction. To avoid unimplemented_FPop traps on these processors, compile programs with the -xarch=v8a option.
The remaining two trap types, unfinished_FPop and trapped IEEE exceptions, are usually associated with special computational situations involving NaNs, infinities, and subnormal numbers.
When a floating-point instruction encounters an IEEE floating-point exception whose trap is enabled, the instruction is not completed; instead the system delivers a SIGFPE signal to the process. If the process has established a SIGFPE signal handler, that handler is invoked; otherwise, the process aborts. Since trapping is most often enabled for the purpose of aborting the program when an exception occurs, either by invoking a signal handler that prints a message and terminates the program or by resorting to the system default behavior when no signal handler is installed, most programs do not incur many trapped IEEE floating-point exceptions. As described in Chapter 4, however, it is possible to arrange for a signal handler to supply a result for the trapping instruction and continue execution. Note that severe performance degradation can result if many floating-point exceptions are trapped and handled in this way.
Most SPARC floating-point units will also trap on at least some cases involving infinite or NaN operands or IEEE floating-point exceptions even when trapping is disabled or an instruction would not cause an exception whose trap is enabled. This happens when the hardware does not support such special cases; instead it generates an unfinished_FPop trap and leaves the kernel emulation software to complete the instruction. Different SPARC FPUs vary as to the conditions that result in an unfinished_FPop trap: for example, most early SPARC FPUs as well as the hyperSPARC FPU trap on all IEEE floating-point exceptions regardless of whether trapping is enabled, while UltraSPARC FPUs can trap "pessimistically" when a floating-point exception's trap is enabled and the hardware is unable to determine whether or not an instruction would raise that exception. On the other hand, the SuperSPARC-I, SuperSPARC-II, TurboSPARC, microSPARC-I, and microSPARC-II FPUs handle all exceptional cases in hardware and never generate unfinished_FPop traps.
Since most unfinished_FPop traps occur in conjunction with floating-point exceptions, a program can avoid incurring an excessive number of these traps by employing exception handling (i.e., testing the exception flags, trapping and substituting results, or aborting on exceptions). Of course, care must be taken to balance the cost of handling exceptions with that of allowing exceptions to result in unfinished_FPop traps.
The most common situations in which some SPARC floating-point units will trap with an unfinished_FPop involve subnormal numbers. Many SPARC FPUs will trap whenever a floating-point operation involves subnormal operands or must generate a nonzero subnormal result (i.e., a result that incurs gradual underflow). Because underflow is somewhat rare but difficult to program around, and because the accuracy of underflowed intermediate results often has little effect on the overall accuracy of the final result of a computation, the SPARC architecture includes a nonstandard arithmetic mode that provides a way for a user to avoid the performance degradation associated with unfinished_FPop traps involving subnormal numbers.
The SPARC architecture does not precisely define nonstandard arithmetic mode; it merely states that when this mode is enabled, processors that support it may produce results that do not conform to the IEEE 754 standard. However, all existing SPARC implementations that support this mode use it to disable gradual underflow, replacing all subnormal operands and results with zero. (There is one exception: Weitek 1164/1165 FPUs only flush subnormal results to zero in nonstandard mode; they do not treat subnormal operands as zero.)
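To make the difference concrete, the following portable C sketch emulates in software the effect that nonstandard mode typically has on a single subtraction. It does not touch the FSR or any hardware mode; the function names are hypothetical, and preserving the sign of a flushed value is an assumption of the sketch.

```c
#include <float.h>
#include <math.h>

/* Emulate nonstandard mode on one value: a subnormal becomes zero. */
double flush_subnormal(double x)
{
    if (fpclassify(x) == FP_SUBNORMAL)
        return x < 0 ? -0.0 : 0.0;   /* keeping the sign is an assumption */
    return x;
}

/* Compute x - y with subnormal operands and results replaced by zero. */
double nonstandard_sub(double x, double y)
{
    return flush_subnormal(flush_subnormal(x) - flush_subnormal(y));
}
```

With x = 1.5 * DBL_MIN and y = DBL_MIN, the IEEE difference x - y is the nonzero subnormal 0.5 * DBL_MIN, whereas nonstandard_sub(x, y) returns zero even though x and y differ. A program that divides by such a difference after testing x != y is one example of code that depends on gradual underflow.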
Not all SPARC implementations provide a nonstandard mode. Specifically, the SuperSPARC-I, SuperSPARC-II, TurboSPARC, microSPARC-I, and microSPARC-II floating-point units handle subnormal operands and generate subnormal results entirely in hardware, so they do not need to support nonstandard arithmetic. (Any attempt to enable nonstandard mode on these processors is ignored.) Therefore, gradual underflow incurs no performance loss on these processors.
To determine whether gradual underflows are affecting the performance of a program, you should first determine whether underflows are occurring at all and then check how much system time is used by the program. To determine whether underflows are occurring, you can use the math library function ieee_retrospective() to see if the underflow exception flag is raised when the program exits. Fortran programs call ieee_retrospective() by default. C and C++ programs need to call ieee_retrospective() explicitly prior to exit. If any underflows have occurred, ieee_retrospective() prints a message similar to the following:
If the program encounters underflows, you might want to determine how much system time the program is using by timing the program execution with the time command.
If the system time (the third figure shown above) is unusually high, multiple underflows might be the cause. If so, and if the program does not depend on the accuracy of gradual underflow, you can enable nonstandard mode for better performance. There are two ways to do this. First, you can compile with the -fns flag (which is implied as part of the macros -fast and -fnonstd) to enable nonstandard mode at program startup. Second, the value-added math library libsunmath provides two functions to enable and disable nonstandard mode, respectively: calling nonstandard_arithmetic() enables nonstandard mode (if it is supported), while calling standard_arithmetic() restores IEEE behavior. The C and Fortran syntax for calling these functions is as follows:
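In a C program, the calls might be wrapped as in the following sketch. The wrapper functions and the HAVE_SUNMATH_NS macro are hypothetical; the sketch assumes that <sunmath.h> declares nonstandard_arithmetic() and standard_arithmetic(), that the program is linked with libsunmath, and that the __SUNPRO_C and __sparc macros identify the Sun Studio C compiler on SPARC. On other platforms the wrappers do nothing.

```c
#if defined(__SUNPRO_C) && defined(__sparc)
#include <sunmath.h>   /* nonstandard_arithmetic(), standard_arithmetic() */
#define HAVE_SUNMATH_NS 1
#else
#define HAVE_SUNMATH_NS 0
#endif

/* Request nonstandard mode. Returns 1 if the call was made, 0 otherwise.
   Even when the call is made, FPUs without a nonstandard mode ignore it. */
int request_nonstandard_mode(void)
{
#if HAVE_SUNMATH_NS
    nonstandard_arithmetic();
    return 1;
#else
    return 0;   /* libsunmath not available; arithmetic is unchanged */
#endif
}

/* Restore IEEE behavior. Returns 1 if the call was made, 0 otherwise. */
int restore_standard_mode(void)
{
#if HAVE_SUNMATH_NS
    standard_arithmetic();
    return 1;
#else
    return 0;
#endif
}
```

Bracketing only the performance-critical region with these two calls limits the part of the program that gives up gradual underflow.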
Caution - Since nonstandard arithmetic mode defeats the accuracy benefits of gradual underflow, you should use it with caution. For more information about gradual underflow, see Chapter 2.
On SPARC floating-point units that implement nonstandard mode, enabling this mode causes the hardware to treat subnormal operands as zero and flush subnormal results to zero. The kernel software that is used to emulate trapped floating-point instructions, however, does not implement nonstandard mode, in part because the effect of this mode is undefined and implementation-dependent, and in part because the added cost of handling gradual underflow is negligible compared to the cost of emulating a floating-point operation in software.
If a floating-point operation that would be affected by nonstandard mode is interrupted (for example, it has been issued but not completed when a context switch occurs or another floating-point instruction causes a trap), it will be emulated by kernel software using standard IEEE arithmetic. Thus, under unusual circumstances, a program running in nonstandard mode might produce slightly varying results depending on system load. This behavior has not been observed in practice. It would affect only those programs that are very sensitive to whether one particular operation out of millions is executed with gradual underflow or with abrupt underflow.
The fpversion utility distributed with the compilers identifies the installed CPU and estimates the processor and system bus clock speeds. fpversion determines the CPU and FPU types by interpreting the identification information stored by the CPU and FPU. It estimates their clock speeds by timing a loop that executes simple instructions that run in a predictable amount of time. The loop is executed many times to increase the accuracy of the timing measurements. For this reason, fpversion is not instantaneous; it can take several seconds to run.
fpversion also reports the best -xtarget code generation option to use for the host system.
On an Ultra 4 workstation, fpversion displays information similar to the following. (There may be variations due to differences in timing or machine configuration.)
See the fpversion(1) manual page for more information.