Inline Function Templates in C and C++ - SPARC Assembly Language Reference Manual

7.1 Inline Function Templates in C and C++

The following are examples where inline templates are particularly useful:

Hand-coded mutex locks using atomic instructions.
System-level access to a hardware device or to specific hardware registers.
Precise implementations of algorithms whose optimal hand-coded form the compiler is unable to replicate.

Inline templates appear as normal function calls in the C/C++ source code. When the source file and the file containing the inline template that defines the function are compiled together (for example, with cc -O prog.c code.il), the compiler inserts the code from the inline template in place of the function call in the code generated from the C/C++ source.

7.1.1 Compiling C/C++ with Inline Templates

Inline template files have the .il file extension. Compile inline templates along with the source file that calls them. The code is inlined during the code-generator stage of compilation.

cc -O prog.c code.il

The preceding example will compile prog.c and inline the code from code.il wherever the function defined by code.il is called in prog.c.

7.1.2 Layout of Code in Inline Templates

A single inline template file can define more than one inline template. Each template definition starts with a declaration, and ends with an end statement:

.inline identifier
  ...assembler code...
.end

identifier is the name of the template function. Multiple template definitions with the same name can appear in the file, but the compiler will use only the first definition.

Since the template code will be inlined directly, without a call, into the code generated by the compiler, there is no need for a return instruction.

The template requires a prototype declaration in C/C++ source code to ensure that the compiler assigns correct types for all the parameters and recognizes the template name as a function.

For example, the following prototype declares the template function:

void do_nothing();

And the associated template definition of this function might look like the following:

/* The do_nothing() template does nothing*/
.inline do_nothing,0
  nop
.end

The inline template definition would appear in a separate .il file and would be compiled along with the source code file containing the call.

7.1.3 Guidelines for Coding Inline Templates

SPARC inline assembly code can use only integer registers %o0 to %o5 and floating point registers %f0 to %f31 for temporary values. These registers are referred to as the caller-saved registers. Other registers should not be used. Calls can be made to other routines from the inline template, but these calls are subject to the same constraint.

The compiler will handle most of the SPARC instruction set. If the template utilises only those instructions that the compiler normally generates, it will be early inlined (see Late and Early Inlining), and the code will be scheduled optimally. However, if the template utilises instructions that the compiler accepts but does not typically generate (such as VIS instructions or atomics), the code might be late inlined. Consequently, the code might not be optimally scheduled by the compiler, resulting in a possible performance loss.

7.1.3.1 Parameter Passing

Passing parameters between the C/C++ caller program and the assembly language template code must obey the parameter passing rules defined by the target architecture, which are different for 32-bit and 64-bit code. Parameter passing is described by the SPARC ABI. See https://sparc.org/technical-documents/. SCD 2.3 describes Version 8 (32-bit code) and SCD 2.4.1 describes Version 9 (64-bit code).

On entry to the template code, arguments are passed in %o0 to %o5 and continue on the stack. For 32-bit code, the first argument passed on the stack (the seventh argument) is at [%sp+0x5c] and %sp is guaranteed to be 8-byte aligned; for 64-bit code, it is at [%sp+0x8af]. (For 64-bit code, the stack bias is 2047, and %sp+2047 is aligned on a 16-byte boundary.)

For example (function prototype in C followed by assembler template equivalent):

int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);

/*Add up 7 integer parameters; last one will be passed on stack*/
.inline add_up,28
  add %o0,%o1,%o0
  ld [%sp+0x5c],%o1
  add %o2,%o3,%o2
  add %o4,%o5,%o4
  add %o0,%o1,%o0
  add %o2,%o4,%o2
  add %o0,%o2,%o0
.end

The same example for 64-bit code, but note that when a 32-bit int register is passed on the stack, the full 64 bits of the register are saved:

int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);

/*Add up 7 integer parameters; last one will be passed on stack*/
.inline add_up,28
  add %o0,%o1,%o0
  ldx [%sp+0x8af],%o1
  add %o2,%o3,%o2
  add %o4,%o5,%o4
  add %o0,%o1,%o0
  add %o2,%o4,%o2
  add %o0,%o2,%o0
.end
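The stack offsets used in the two add_up templates can be reproduced with a little arithmetic. The following C sketch assumes the usual SPARC frame layout: a 16-register window save area at the bottom of the frame, a one-word structure-return slot in 32-bit code, then one slot per argument, and a 2047-byte stack bias in 64-bit code. The constant and function names are hypothetical and are not part of any compiler interface.

```c
/* Assumed frame layout constants; names are illustrative only. */
enum {
    V8_WINDOW_SAVE = 64,    /* 16 registers x 4 bytes */
    V8_STRUCT_RET  = 4,     /* hidden structure-return word */
    V8_SLOT        = 4,     /* one word per argument slot */
    V9_STACK_BIAS  = 2047,  /* bias applied to %sp in 64-bit code */
    V9_WINDOW_SAVE = 128,   /* 16 registers x 8 bytes */
    V9_SLOT        = 8      /* one extended word per argument slot */
};

/* Offset from %sp of argument slot n (0-based) in 32-bit code. */
int arg_offset_v8(int n)
{
    return V8_WINDOW_SAVE + V8_STRUCT_RET + n * V8_SLOT;
}

/* Offset from %sp of argument slot n (0-based) in 64-bit code. */
int arg_offset_v9(int n)
{
    return V9_STACK_BIAS + V9_WINDOW_SAVE + n * V9_SLOT;
}
```

Under these assumptions, slot 6 (the seventh argument) evaluates to 0x5c for 32-bit code and 0x8af for 64-bit code, which matches the offsets used in the templates.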

For 32-bit code, floating-point values are passed in the integer registers. For 64-bit code, they are passed in the floating-point registers.

32-bit floating-point passing by value example:

double sum_val(double a, double b);

/*sum of two doubles by value*/
.inline sum_val,16
  st   %o0,[%sp+0x48]
  st   %o1,[%sp+0x4c]
  ldd  [%sp+0x48],%f0
  st   %o2,[%sp+0x48]
  st   %o3,[%sp+0x4c]
  ldd  [%sp+0x48],%f2
  faddd %f0,%f2,%f0
.end

64-bit floating-point passing by value example:

double sum(double a, double b);

/*sum of two doubles 64-bit calling convention*/
.inline sum,16
  faddd %f0,%f2,%f0
.end

Values passed in memory, single-precision floating point values, and integers are guaranteed to be 4-byte aligned. Double-precision floating point values will be 8-byte aligned if their offset in the parameters is a multiple of 8 bytes.

Integer return values are passed in %o0. Floating point return values are passed in %f0/%f1 (single-precision values in %f0, double-precision values in the register pair %f0,%f1).

For 32-bit code, there are two ways of passing floating point parameters. The first way is to pass them by value, and the second is to pass them by reference. Either way, the compiler will do its best to optimize out the load and store instructions. It is often more successful at doing this if the floating point parameters are passed by reference.

Here is an example of 32-bit by reference parameter passing:

double sum_ref(double *a, double *b);

/*sum of two doubles by reference*/
.inline sum_ref,16
  ldd [%o0],%f0
  ldd [%o1],%f2
  faddd %f0,%f2,%f0
.end

7.1.3.2 Stack Space

Sometimes, it is necessary to store variables to the stack in order to load them back later; this is the case for moving between the int and fp registers. The best way of doing this is to use the space already set aside for parameters that are passed into the function.

For example, in the preceding 32-bit floating-point pass-by-value example, the location %sp+0x48 is 8-byte aligned (%sp is 8-byte aligned), and it corresponds to the place where the second and third 4-byte integer parameters would be stored if they were passed on the stack. (Note that the first parameter would be stored at an address that is not on an 8-byte boundary.)
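The round trip through the stack only changes which register file holds the bits; the bit pattern of the double is unchanged. The following C sketch models that idea portably. It assumes that the first register of each pair holds the most significant word, as the store order in the sum_val template implies on big-endian SPARC. All function names here are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Most significant 32 bits of a double's bit pattern (models %o0 or %o2). */
uint32_t high_word(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits >> 32);
}

/* Least significant 32 bits of a double's bit pattern (models %o1 or %o3). */
uint32_t low_word(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits & 0xffffffffu);
}

/* Models the two st instructions followed by one ldd: rebuild the double. */
double join_words(uint32_t hi, uint32_t lo)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Models the whole sum_val template given the four incoming words. */
double sum_val_model(uint32_t o0, uint32_t o1, uint32_t o2, uint32_t o3)
{
    return join_words(o0, o1) + join_words(o2, o3);
}
```

A model like this can be useful for checking the word order of a template on a development host before running it on SPARC hardware.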

7.1.3.3 Branches and Calls

Branches and calls within template code are allowed. Every branch or call must be followed by a nop instruction to fill the branch delay slot. It is possible to put instructions in the delay slot of branches, which can be useful if you wish to use the processor support for annulled instructions, but doing so will cause the code to be late-inlined (described in Late and Early Inlining) and may result in sub-optimal performance.

Call instructions must have an extra last argument that indicates the number of registers used to pass arguments in the call parameters. In general, you should avoid inlining call instructions.

The destinations of branches must be indicated with a number, and the branch instructions should use this number to indicate the appropriate destination together with an f for a forward branch or a b for a backward branch.

Here is an example of using branches in an inline template:

int is_true(int i);
/*return whether true*/
.inline is_true,4
   cmp  %o0,%g0
   be   1f
   nop
   mov  1,%o0
   ba   2f
   nop
1:
   mov  0,%o0
2:
.end
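The numbered labels and forward branches map naturally onto C labels and goto statements. The following sketch is a C model of the behavior that the prototype's name and comment describe (1 for a nonzero argument, 0 for zero). It is not generated from the template, and the function and label names are hypothetical.

```c
/* C model of a two-label template: label_1 and label_2 stand in for the
   numeric labels 1: and 2:, and each goto stands in for a forward branch. */
int is_true_model(int i)
{
    if (i == 0)
        goto label_1;   /* conditional forward branch to 1: when i is zero */
    i = 1;              /* fall-through path: result is 1 */
    goto label_2;       /* unconditional forward branch to 2: */
label_1:
    i = 0;              /* branch-taken path: result is 0 */
label_2:
    return i;           /* the result is left in the return register */
}
```

Comparing a template's results with a model like this on a handful of inputs is a quick way to catch an inverted branch condition.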

7.1.4 Late and Early Inlining

The code generator of the compiler processes template inlining. There are two opportunities for inlining: before and after optimization. If the inline template is complicated, the compiler may choose to do the inlining after optimization (late inlining), which means that the generated code will appear more or less as it is written in the template. Otherwise, the code is inlined before optimization (early inlining) and will be merged and optimized with the rest of the code around the call site.

Early inlining leads to better performance. Things that will cause late inlining are:

Use of instructions that the compiler cannot generate
Instructions in the delay slots of branches
Call instructions

View the compiler commentary generated with -g to see if a routine is late inlined. The following example shows a template that fails early inlining because it uses the frame pointer (%fp) rather than the stack pointer (%sp).

.inline sum_val,16
  st   %o0,[%fp+0x48]
  st   %o1,[%fp+0x4c]
  ldd  [%fp+0x48],%f0
  st   %o2,[%fp+0x48]
  st   %o3,[%fp+0x4c]
  ldd  [%fp+0x48],%f2
  faddd %f0,%f2,%f0
.end

The compiler will still inline the code, but it is unable to early inline the code and the code will not participate in the compiler's optimization.

The following example compiles a 32-bit executable with compiler commentary information and displays it using the Oracle Developer Studio er_src command. The debug information is stored in the .o files by default, so it is necessary to keep these files available.

$ cc -g -O inline32.il driver32.c
$ er_src a.out main
Source file: /home/jdoe/code/inline/driver32.c
Object file: /home/jdoe/code/inline/driver32.o
Load Object: a.out

     1. #include <stdio.h>
     2.
     3. void do_nothing();
     4. int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
     5. double sum_val(double a, double b);
     6. double sum_ref(double *a, double *b);
     7. int is_true(int i);
     8.
     9.
     10. void main()
     11. {
     12.   double a=3.11,b=7.22;
     13.   do_nothing();
     14.   printf("add_up  %i\n",add_up(1,2,3,4,5,6,7));

   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
     15.   printf("sum_val %f\n",sum_val(a,b));
     16.   printf("sum_ref %f\n",sum_ref(&a,&b));
     17.   printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
     18. }

Use the Oracle Developer Studio er_src command to examine the compiler commentary for a particular file. It takes two parameters: the name of the executable and the name of the function to examine. In this case, the template that cannot be early inlined is sum_val. Each time the compiler comes across the %fp register, it inserts a debug message, so you can tell that there are six references to %fp in the template.
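The count of six messages can be cross-checked against the template text itself. The following C sketch simply counts occurrences of a substring in a copy of the %fp-based sum_val template. The helper name and the string constant are hypothetical, and this is not how the compiler produces its commentary.

```c
#include <stddef.h>
#include <string.h>

/* Copy of the %fp-based sum_val template, held as text for inspection. */
const char sum_val_fp_template[] =
    ".inline sum_val,16\n"
    "  st   %o0,[%fp+0x48]\n"
    "  st   %o1,[%fp+0x4c]\n"
    "  ldd  [%fp+0x48],%f0\n"
    "  st   %o2,[%fp+0x48]\n"
    "  st   %o3,[%fp+0x4c]\n"
    "  ldd  [%fp+0x48],%f2\n"
    "  faddd %f0,%f2,%f0\n"
    ".end\n";

/* Count non-overlapping occurrences of needle in text. */
int count_occurrences(const char *text, const char *needle)
{
    size_t len = strlen(needle);
    int n = 0;
    if (len == 0)
        return 0;
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + len, needle))
        n++;
    return n;
}
```

Counting "%fp" in the template text gives one reference per load or store, which agrees with the number of commentary lines.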

7.1.5 Compiler Calling Convention

The calling convention differs for each architecture. You can see this by examining the assembler code generated by the compiler for a simple test function.

The following example is compiled for a 32-bit platform:

% more fptest.c

double sum(double d1,double d2, double d3, double d4)
{
  return d1 + d2 + d3 + d4;
}

% cc -O -xarch=sparc -m32 -S fptest.c
% more fptest.s
....
                        .global sum
                       sum:
/* 000000          2 */         st      %o0,[%sp+68]
/* 0x0004            */         st      %o2,[%sp+76]
/* 0x0008            */         st      %o1,[%sp+72]
/* 0x000c            */         st      %o3,[%sp+80]
/* 0x0010            */         st      %o4,[%sp+84]
/* 0x0014            */         st      %o5,[%sp+88]

!    3                !  return d1 + d2 + d3 + d4;

/* 0x0018          3 */         ld      [%sp+68],%f2
/* 0x001c            */         ld      [%sp+72],%f3
/* 0x0020            */         ld      [%sp+76],%f10
/* 0x0024            */         ld      [%sp+80],%f11
/* 0x0028            */         ld      [%sp+84],%f4
/* 0x002c            */         faddd   %f2,%f10,%f12
/* 0x0030            */         ld      [%sp+88],%f5
/* 0x0034            */         ld      [%sp+92],%f6
/* 0x0038            */         ld      [%sp+96],%f7
/* 0x003c            */         faddd   %f12,%f4,%f14
/* 0x0040            */         retl    ! Result =  %f0
/* 0x0044            */         faddd   %f14,%f6,%f0
....

In the example code, you can see that the first three floating-point parameters are passed in %o0-%o5, and the fourth is passed on the stack at locations %sp+92 and %sp+96. Note that this location is 4-byte aligned but not 8-byte aligned, so it is not possible to use a single floating point load double instruction to load it.
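Whether a double on the stack can be fetched with one load double comes down to a remainder check on its offset. The following C sketch assumes that %sp itself is 8-byte aligned and that the first parameter slot is at %sp+68, as in the listing. The function names are hypothetical.

```c
/* Offset from %sp of the k-th (0-based) double when every parameter is a
   double, assuming 32-bit code with the first parameter slot at %sp+68. */
int double_param_offset_v8(int k)
{
    return 68 + 8 * k;
}

/* Nonzero if a value at this offset from an 8-byte-aligned %sp can be
   loaded with a single load double instruction. */
int can_use_ldd(int offset)
{
    return offset % 8 == 0;
}
```

For the fourth double, the offset is 92, which leaves a remainder of 4, so two single-word loads are needed. By contrast, 0x48 (72) is a multiple of 8, which is why it works as scratch space for ldd.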

Here is an example for 64-bit code.

$ more inttest.c
long sum(long v1,long v2, long v3, long v4, long v5, long v6, long v7)
{
return v1 + v2 + v3 + v4 + v5 + v6 + v7;
}

$ cc -O -xarch=sparc -m64 -S inttest.c
$ more inttest.s
...
/* 000000          2 */         ldx     [%sp+2223],%g2
/* 0x0004          3 */         add     %o0,%o1,%g1
/* 0x0008            */         add     %o3,%o2,%g3
/* 0x000c            */         add     %g3,%g1,%g4
/* 0x0010            */         add     %o5,%o4,%g5
/* 0x0014            */         add     %g5,%g4,%o1
/* 0x0018            */         retl    ! Result =  %o0
/* 0x001c            */         add     %o1,%g2,%o0
...

In the preceding code, you can see that the first action is to load the seventh integer parameter from the stack.

7.1.6 Improving Efficiency of Inlined Functions

In the following example, when we examine the code the compiler generated we see a number of unnecessary loads and stores when all the data could be held in registers.

Calling C program:

int lzd(int);

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

The program is intended to use the Leading Zero Detect (LZD) instruction on the SPARC T4 to do a count of the number of leading zero bits in an integer register. The inline template lzd.il might look like this:

.inline lzd
  lzd %o0,%o0
.end

Compiling the code with optimization gives the resulting code:

$ cc -O -xtarget=T4 -S lzd.c lzd.il
$ more lzd.s
...
                        .L77000018:
/* 0x001c         11 */         lzd     %o0,%o0
/* 0x0020          9 */         ld      [%i1],%i3
/* 0x0024         11 */         st      %o0,[%i2]
/* 0x0028          9 */         add     %i3,1,%i0
/* 0x002c            */         cmp     %i0,999
/* 0x0030            */         ble,pt  %icc,.L77000018
/* 0x0034            */         st      %i0,[%i1]
...

Clearly everything could be held in registers, but the compiler is adding unnecessary loads and stores because it sees the inline template as a call to a function and must load and save registers around a function call it knows nothing about.

But we can insert a #pragma directive to tell the compiler that the routine lzd() has no side effects - meaning that it does not read or write to memory:

#pragma no_side_effect(routine_name)

The pragma must be placed after the declaration of the function. The new C code might look like:

int lzd(int);
#pragma no_side_effect(lzd)

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

Now the generated assembler code for the loop looks much neater:

/* 0x0014         10 */         add     %i1,1,%i1

!   11                !  {
!   12                !    c=lzd(c);

/* 0x0018         12 */         lzd     %o0,%o0
/* 0x001c         10 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000018
/* 0x0024            */         nop
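When experimenting with a template like lzd, it can help to have a portable reference to compare against. The following C sketch counts leading zero bits in software. It assumes that the instruction operates on the full 64-bit register and returns 64 for a zero input; that assumption should be checked against the processor documentation. The function name is hypothetical.

```c
#include <stdint.h>

/* Portable reference: number of leading zero bits in a 64-bit value.
   Returns 64 when the value is zero. */
int lzd_ref(uint64_t x)
{
    int n = 0;
    if (x == 0)
        return 64;
    /* Shift left until the most significant bit is set. */
    while ((x & (UINT64_C(1) << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}
```

Under that assumption, feeding the result back in as the test loop does would give 64, then 57, and then settle at 58, because 58 has exactly 58 leading zero bits in a 64-bit register.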

7.1.7 Inline Templates in C++

To prevent linker errors, the declarations of inline template functions called from C++ must be enclosed in an extern "C" declaration. For example:

extern "C"
  {
    void nothing();
  }

int main()
{
  nothing();
}

Inline template function:

.inline nothing
  nop
.end

7.1.7.1 C++ Inline Templates and Exceptions

In C++, #pragma no_side_effect cannot be combined with exceptions. If you know that the template code cannot throw exceptions, the compiler might be able to produce even better code if you add the throw() exception specification to the template declaration:

extern "C"
  {
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)
  }