This part of the guide deals with advanced topics regarding ChorusOS systems. It includes a list of differences between the application programming interfaces available on ChorusOS systems and those available in the POSIX standard. It also extends the design discussion begun in the ChorusOS Design and Performance Guide to explain some of the core workings of the POSIX compatible part of a ChorusOS system. Finally, it includes a description of the performance profiling facilities available on ChorusOS systems.
Compatibility for the sake of application source code portability is important for highly scalable applications. This chapter discusses compatibility of ChorusOS tools, devices, APIs and protocols with those described in the POSIX standards. This chapter provides a study of what is compatible, organized for reference, rather than an exhaustive demonstration of how to port applications. After reading this chapter, you will be aware of differences in tools, devices, APIs and protocols that require your attention as you create portable, scalable applications.
The ChorusOS operating system provides an almost complete POSIX API, compliant with the IEEE Standard 1003.1b-1993. There are, however, some functions that have not been implemented in this version of the ChorusOS operating system, and some functions whose implementation differs in some way. This section highlights the system and library calls whose implementation or semantics differ from the POSIX standard. You are advised to consult the relevant ChorusOS man page for information on how these calls are implemented.
For ease of use, this section is divided into the same categories as are found in the POSIX IEEE Standard documentation.
The posix_spawn() call has been added to the ChorusOS operating system. The implementation of this call differs to some extent from the POSIX standard. See the posix_spawn(2POSIX) man page for a complete description of the ChorusOS implementation.
The implementation of the setsid() call differs from the POSIX standard. Refer to the setsid(2POSIX) man page for a complete description of the ChorusOS implementation.
ChorusOS 5.0 provides the POSIX signal APIs, enabling applications to post and catch signals. Signals may be either synchronous or asynchronous. Signals may also be posted between applications or by the system to applications.
There are, however, limitations on the ChorusOS signal API. The ChorusOS operating system does not support the complete range of signals, as defined by POSIX.
There are also limitations in APIs such as sigaction() with respect to the POSIX specification. For example, SA_ONSTACK is not supported as a part of the sigaction() API. Refer to the following man pages for more information:
There is no support for asynchronous I/O in the ChorusOS operating system.
The fdatasync() call is not implemented in the ChorusOS operating system:
The following system database calls are not implemented in the ChorusOS operating system:
Synchronization in the ChorusOS operating system has the following limitations with regard to the POSIX standard:
Priority inversion handling is not supported.
Mutexes are not supported with signals.
Mutexes cannot be shared between user and supervisor applications.
Reader/writer locks are not supported.
The implementation of the following memory management calls is substantially different in the ChorusOS operating system:
See the mmap(2POSIX) man page for more information.
There is a limitation in the implementation of this call. See the munmap(2POSIX) man page for more information.
The following memory management calls are not implemented in the ChorusOS operating system:
Creation and cancellation of threads in the ChorusOS operating system are compatible with the POSIX standard. There are two restrictions regarding the implementation of threads:
The pthread_detach() call is not implemented.
The pthread_kill() call can only be used by user applications, since signal handling is not implemented in supervisor mode.
The ChorusOS operating system does not provide support for the POSIX realtime files and system calls.
The majority of the process management, files and directory, and I/O calls are derived from FreeBSD version 4.1. Calls in these sections have been implemented in the same way as the corresponding FreeBSD calls.
The ChorusOS implementation of a few calls is very different from the FreeBSD implementation. The following list highlights these differences.
Refer to the fork(2POSIX) man page for more information.
Refer to the sigaction(2POSIX) man page for more information.
The way in which signals have been implemented also differs substantially from the FreeBSD implementation. Pay particular attention to the following calls:
Refer to the sigaction(2POSIX) man page for more information.
Refer to the sigprocmask(2POSIX) man page for more information.
Refer to the sigqueue(2POSIX) man page for more information.
Refer to the sigtimedwait(2POSIX) man page for more information.
Refer to the sigwaitinfo(2POSIX) man page for more information.
This chapter explains how to analyze the performance of a ChorusOS system (and its applications) by generating a performance profile report. It includes the following sections:
Introduction to Performance Profiling -- explains why a performance profile is useful, and how it can be used.
Preparing to Create a Performance Profile -- explains how to configure your system to generate a performance profile.
Running a Performance Profiling Session -- explains how to create a performance profile.
Analyzing Performance Profiling Reports -- explains how to analyze the performance profile.
Performance Profiler Description -- provides additional information on how the performance profiler works.
The ChorusOS operating system performance profiling system contains a set of tools that facilitate the analysis and optimization of the performance of the ChorusOS operating system and applications. These tools relate only to system components sharing the system address space, that is, the ChorusOS operating system components and supervisor application processes. This set of tools is composed of a profiling server, libraries for building profiled processes, a target controller, and a host utility.
Software performance profiling consists of collecting data about the dynamic behavior of the software to identify the time distribution. For example, the performance profiling system is able to report the time spent within each procedure, as well as providing a dynamically constructed call graph.
Typical steps in a ChorusOS optimization are:
Benchmark a set of typical applications, with the ChorusOS operating system and applications running at peak performance. The selection of these applications is essential because the system will eventually be tuned for this specific type of application.
Evaluate and record the output of the benchmarks.
Use the performance profiling system to collect raw data about the dynamic behavior of the applications.
Generate, evaluate, and record the performance profiling reports.
Plan and implement optimizations, for example, rewriting certain time-critical routines in assembly language, using in-line functions, or tuning algorithms.
The performance profiling tools provide two different classes of service, depending on the method used to prepare the software being measured:
The performance profiling system is applied to software generated in the standard way (the same version as used for benchmarking). In this case, the performance profiling system reports only minimal information, consisting mainly of the percentage of time spent in each routine of the software. The corresponding performance profiling report is called simple form.
The performance profiling system is applied to software regenerated exclusively for performance profiling; software is completely recompiled, using the performance profiling C compiler option (usually the -p option). This enables the performance profiling system to report more information by dynamically counting routine invocations and building a complete call graph. The corresponding performance profiling report is called full form.
The standard (binary) version of the ChorusOS operating system is not compiled with the performance profiling option. Profiling a system will only generate a simple form. Non-profiled components (or components for which a simple report form is sufficient) need not be compiled with the performance profiling option.
To obtain a full form for ChorusOS operating system components, a source product distribution is needed. In this case, it is necessary to regenerate the system components with the performance profiling option set.
To perform system performance profiling using the ChorusOS Profiler, a ChorusOS target system must include the NFS_CLIENT feature option.
Launch the performance profiling server (the PROF process) dynamically, using:
% rsh -n target arun PROF &
If you require full report forms, the profiled components must be compiled using the performance profiling compiler options (usually, the -p option).
If you are using the imake environment provided with the ChorusOS operating system:
Set the profiling option in the Project.tmpl file to profile the whole project hierarchy
or
Set the profiling option in each Imakefile of the directories to be profiled, to profile only a subset of your project hierarchy.
FPROF=ON
The performance profiling option can be added dynamically by calling make with the compiler profiling option:
% make PROF=-p
The preceding call must be made in the directory of the program that is to be performance profiled.
In this section, it is assumed that the application consists of a single supervisor process, the_process. It is also assumed that the target system is named trumpet, and that the target tree is mounted under the $CHORUS_ROOT host directory.
An application being performance profiled can be either:
launched at system boot time, as part of the system image, or
launched dynamically using the arun command with the -k option:
% rsh trumpet arun -k "the_process"
The -k option enables the debugger to access the symbol table of the process. This option is ignored for user processes.
Although the previous example was performed on a supervisor process, a user process can also be profiled, using the same method.
Running a performance profiling session includes starting and stopping the session, and generating the required performance profiling reports. These processes are described in the following sections.
Performance profiling is initiated by running the profctl utility on the target system, with the -start option. This utility takes the components to be profiled as arguments.
If the_process was part of the system image, use the following command to initiate the performance profiling session:
% rsh trumpet arun profctl -start -b the_process
If the_process was loaded dynamically, use the following command:
% rsh trumpet arun profctl -start -a the_process pid
Where pid is the numeric identifier of the process (as returned by the aps command).
Run the application.
Several components can be specified as arguments to the profctl utility.
Performance profiling is stopped by running the profctl utility on the target system, with the -stop option:
% rsh trumpet arun profctl -stop
When performance profiling is stopped, a raw data file is generated for each profiled component within the /tmp directory of the target file system. The name of the file consists of the component name, with the suffix .prof added. For example, if only the_process was profiled, the file $CHORUS_ROOT/tmp/the_process.prof would be created.
Performance profiling reports are generated by the profrpg host utility.
Use the report generator to produce a report for each profiled component as follows:
% cd $CHORUS_ROOT/tmp
% profrpg the_process > the_process.rpg
Reports should be archived to track the benefits of optimization.
Performance profiling can be applied to a user-selected set of components. The result of the performance profiling is a report on each profiled component.
A performance profiling report consists of two parts:
A global report that provides general information about the profiling session, including clock attributes, CPU attributes, the distribution of CPU time between idle threads, user processes, non-profiled supervisor components, and each of the profiled supervisor components.
A component-based function table that indicates the distribution of CPU time within the profiled component.
For each function, the performance profile report displays the information listed in the following sections.
The function header contains the following fields:
Function number. This field indicates the function number in the current report, and is provided to facilitate analysis of the report using a text editor.
Function name. This field indicates the name of the function.
Size. This field indicates the size of the function (in bytes).
Time spent in function. This field indicates the flat time spent in the body of the function (the number of profiling ticks that occurred while an instruction was being executed within the function). This value is followed by the percentage of the total component time it represents. This is probably the most valuable information -- a report can be sorted by this key if desired.
Total time spent in function. This field indicates the aggregated time spent within the function and called functions. The value is expressed as a percentage of total process time. By default, the report generator sorts the table by the total time key. This field is computed by the report generator and assumes that each call to a given routine takes the same period of time. This information is only provided in the full profiling form. In the simple form, the information is the same as the flat time information.
Recursion indicator. This information is provided in the full profiling form only. The recursion indicator field indicates that the procedure was found in a recursive loop. Because the profiling system is not completely set up for multithreading, this indicator might be erroneously set.
The call graph description contains the following fields:
List of callers. This field details a list of the functions calling the profiled function. For each caller, the report provides:
Caller's function number
Number of calls
Caller's name and call offset in the caller's body. When a function calls another function from several locations, several entries are made in the list of callers.
List of called functions. For each called function, the report provides:
Callee's function number
Number of calls
Percentage of the total function time that is charged to the callee
Name of the function
The following is a sample profiling report.
overhead=2.468
memcpy 4 K=18.834
memcpy 16 K=51.936
memcpy 64 K=185.579
memcpy 256 K=801.300
sysTime=2.576
threadSelf=2.210
thread switch=5.777
threadCreate (active)=8.062
threadCreate (active, preempt)=10.071
threadPriority (self)=3.789
threadPriority (self, high)=3.195
threadResume (preempt)=6.999
threadResume (awake)=4.014
...
ipcCall (null, timeout)=35.732
ipcSend (null, funcmode)=7.723
ipcCall (null, funcmode)=31.762
ipcSend (null, funcumode)=7.924
ipcCall (null, funcumode)=31.864
ipcSend (annex)=8.294
ipcReceive (annex)=7.086
ipcCall (annex)=33.708
ipcSend (body 4b)=8.020
ipcReceive (body 4b)=6.822
ipcCall (body 4b)=32.558
ipcSend (annex, body 4b)=8.684
ipcReceive (annex, body 4b)=7.495
ipcCall (annex, body 4b)=34.849
This section provides information on the design of the performance profiling system. This information can help you understand the sequence of events that occur before the generation of a performance profiling report.
The performance profiling tool set consists of:
The PROF profiler server (a supervisor process). This process first interprets the performance profiling requests issued by the profctl utility, and then executes the performance profiling function at a selected profiling clock rate on the target. See the PROF(1CC) man page for more details.
The profctl target utility. This utility sends performance profiling requests to the profiler server on the target. See the profctl(1CC) man page for more information.
The profrpg host utility. This command interprets profiling data and produces coherent profiling reports on the development host. See the profrpg(1CC) man page for more information.
When the performance profiling compiler option (-p) is used, the compiler provides each function entry point with a call to a routine, usually known as mcount. For each function, the compiler also sets up a static counter and passes the address of this counter to mcount. The counter is initialized at zero.
The scope of the action performed by mcount is defined by the application. Low-end performance profilers count the number of times the routine is called, and do not do much more than that. The ChorusOS profiler supports an advanced mcount routine within the profiled library (for constructing the runtime call graph).
You can supply your own mcount routine, to assert predicates when debugging a component, for example.
The profiler server, PROF, is a supervisor process that can locate and modify static data within the memory context of the profiled processes (using the embedded symbol tables). The profiler server also dynamically creates and deletes the memory regions that are used to construct the call graph and count the profiling ticks (see the following section).
While the performance profiler is active, the system is regularly interrupted by the profiling clock (which by default is the system clock). At each clock tick, the instruction pointer is sampled, the active procedure is located, and a counter associated with the interrupted procedure is incremented. A high rate performance profiling clock can use a significant amount of system time, which may lead to the system appearing to run more slowly. A rapid sampling clock could jeopardize the system's real time requirements.
Significant disruptions in the real time capabilities of the profiled programs must be expected because performance profiling is implemented with software (rather than by hardware with an external bus analyzer or equivalent device). Performance profiling using software slows down the processor. An application can behave differently when being profiled, compared to when running at full processor speed.
When profiling, a processor can spend more than fifty percent of its processing time servicing profiling clock interrupts. Similarly, the time spent recording the call graph is significant and can bias the profiling results in a non-linear manner.
The accuracy of the reported percentage of time spent is about five percent when the number of profiling ticks is on the order of ten times the number of bytes in the profiled programs. For example, to profile a program of 1 million bytes with any degree of accuracy, at least 10 million ticks should be used. This level of accuracy is usually sufficient to plan code optimizations (which is the primary goal of the profiler). However, avoid relying on all the fractional digits of the reported figures.
If greater accuracy is required, experiment with different combinations of the profiling clock rate, the type of profiling clock, and the time spent profiling.