x86 Assembly Language Reference Manual

Source Files in Assembly Language Format

This section covers file organization, statements, values and symbol types, expressions, and machine instruction syntax.

File Organization

Input to the assembler is a text file consisting of a sequence of statements. Each statement ends with the first occurrence of a newline character (ASCII LF), or of a semicolon (;) that is not within a string operand or between a slash and a newline character (that is, inside a comment). Thus, it is possible to have several statements on one line.

To keep programs easy to read, understand, and maintain, however, it is good programming practice to place only one statement on each line. If several statements do appear on a line, they must be separated by semicolons (;).
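The statement-termination rule can be modeled in a few lines of Python. This is only a sketch of the rule as described above, not the assembler's actual scanner. The function name split_statements is hypothetical, and the sketch assumes that string operands are delimited by double quotes, that a backslash protects the character after it, and that an unprotected slash begins a comment.

```python
def split_statements(line):
    """Split one source line into statements at semicolons.

    Assumptions of this sketch: string operands are delimited by double
    quotes, a backslash protects the character after it, and an unprotected
    slash starts a comment that runs to the end of the line.
    """
    statements = []
    current = []
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Copy the backslash and the protected character unchanged.
            current.append(line[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_string = not in_string
            current.append(ch)
        elif in_string:
            # Semicolons and slashes inside a string are ordinary text.
            current.append(ch)
        elif ch == "/":
            # Comment: nothing after the slash belongs to a statement.
            break
        elif ch == ";":
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    statements.append("".join(current).strip())
    # Dropping empty statements is a choice made by this sketch.
    return [s for s in statements if s]
```

For example, split_statements('movl $1, %eax; movl $2, %ebx / load; constants') yields the two statements before the comment, and the semicolon inside the comment is ignored.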

Statements

This section outlines the types of statements that apply to assembly language. Each statement must be one of the following types:

Furthermore, any statement remains a statement even if it is modified in either or both of the following ways:

Values and Symbol Types

This section presents the values and symbol types that the assembler uses.

Values

Values are represented in the assembler by numerals which can be faithfully represented in standard two's complement binary positional notation using 32 bits. All integer arithmetic is performed using 32 bits of precision. Note, however, that the values used in an x86 instruction may require 8, 16, or 32 bits.
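As an illustration of 32-bit two's complement values, the following Python sketch reduces an arbitrary integer to the 32-bit range and reports the narrowest of the 8-, 16-, and 32-bit fields that can hold a value. The helper names wrap32 and smallest_width are hypothetical. Wrapping on overflow and accepting both the signed and unsigned range for a field are assumptions of the sketch; the text above does not state how the assembler treats either case.

```python
def wrap32(value):
    """Reduce an integer to the signed 32-bit two's complement range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def smallest_width(value):
    """Return 8, 16, or 32: the narrowest field that can hold the value.

    Assumption: a field of n bits accepts both the signed range and the
    unsigned range, i.e. -2**(n-1) <= value < 2**n.
    """
    for bits in (8, 16, 32):
        if -(1 << (bits - 1)) <= value < (1 << bits):
            return bits
    raise ValueError(f"{value} does not fit in 32 bits")
```

Under these assumptions, wrap32(0xFFFFFFFF) is -1, and smallest_width(255) is 8 while smallest_width(-129) is 16.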

Symbols

A symbol has a value and a symbol type, each of which is either specified explicitly by an assignment statement or implicitly from context. Refer to the next section for the regular definition of the expressions of a symbol.

The following symbols are reserved by the assembler:

.

Commonly referred to as dot. This is the location counter while assembling a program. It takes on the current location in the text, data, or bss section.


.text

This symbol is of type text. It is used to label the beginning of a .text section in the program being assembled.


.data

This symbol is of type data. It is used to label the beginning of a .data section in the program being assembled.


.bss

This symbol is of type bss. It is used to label the beginning of a .bss section in the program being assembled.


.init

This is used with C++ programs which require constructors.


.fini

This is used with C++ programs which require destructors.

Symbol Types

Symbol type is one of the following:


undefined

A value is of undefined symbol type if it has not yet been defined. Example instances of undefined symbol types are forward references and externals.


absolute

A value is of absolute symbol type if it does not change with relocation. Example instances of absolute symbol types are numeric constants and expressions whose proper sub-expressions are themselves all absolute.


text

A value is of text symbol type if it is relative to the .text section.


data

A value is of data symbol type if it is relative to the .data section.


bss

A value is of bss symbol type if it is relative to the .bss section.

You can give any of these symbol types the attribute EXTERNAL.

Sections

Three of the symbol types (text, data, and bss) are defined with respect to certain sections of the object file into which the assembler translates the source file. This section describes those sections.

If the assembler translates a particular assembly language statement into a machine language instruction or into a data allocation, the translation is associated with one of the following five sections of the object file into which the assembler is translating the source file:

Table 1-1 Translations and their Associations

Section   Purpose
text      This is an initialized section. Normally, it is read-only and contains code from a program. It may also contain read-only tables.
data      This is an initialized section. Normally, it is readable and writable. It contains initialized data. These can be scalars or tables.
bss       This is an uninitialized section. Space is not allocated for this section in the object file.
init      This is used with C++ programs that require constructors.
fini      This is used with C++ programs that require destructors.

An optional section, .comment, may also be produced.

The section associated with the translated statement is .text unless the original statement occurs after a section control pseudo operation has directed the assembler to associate the statement with another section.

Expressions

The expressions accepted by the x86 assembler are defined by their syntax and semantics. The following are the operators supported by the assembler:

Table 1-2 Operators Supported by the Assembler

Operator   Action
+          Addition
-          Subtraction
\*         Multiplication
\/         Division
&          Bitwise logical and
|          Bitwise logical or
>>         Right shift
<<         Left shift
\%         Remainder operator
!          Bitwise logical and not
^          Bitwise logical XOR

Expression Syntax

Table 1-3 shows the syntactic rules. Nonterminals are written in lowercase letters, named terminal symbols in uppercase letters, and symbols enclosed in double quotes are literal terminal symbols. No precedence is assigned to the operators; you must use square brackets to establish precedence.

Table 1-3 Syntactical Rules of Expressions

	expr	: term
		| expr "+" term
		| expr "-" term
		| expr "\*" term
		| expr "\/" term
		| expr "&" term
		| expr "|" term
		| expr ">>" term
		| expr "<<" term
		| expr "\%" term
		| expr "!" term
		| expr "^" term
		;

	term	: id
		| number
		| "-" term
		| "[" expr "]"
		| "<o>" term
		| "<s>" term
		;

	id	: LABEL
		;

	number	: DEC_VAL
		| HEX_VAL
		| OCT_VAL
		| BIN_VAL
		;

The terminal nodes are given by the following regular expressions:

	LABEL   = [a-zA-Z_][a-zA-Z0-9_]*:
	DEC_VAL = [1-9][0-9]*
	HEX_VAL = 0[Xx][0-9a-fA-F][0-9a-fA-F]*
	OCT_VAL = 0[0-7]*
	BIN_VAL = 0[Bb][0-1][0-1]*

In the above regular expressions, choices are enclosed in square brackets; a range of choices is indicated by letters or numbers separated by a dash (-); and the asterisk (*) indicates zero or more instances of the preceding character or bracketed choice.
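To make the lack of operator precedence concrete, here is a small Python evaluator for absolute expressions. It is a sketch under stated assumptions rather than the assembler's parser: it handles numeric literals, unary minus, square brackets, and the operators +, -, \*, &, |, >>, <<, ! and ^, and it leaves out labels, \/, \%, <o>, <s>, and the 32-bit limit on values. Reading ! as "left AND NOT right" is this sketch's interpretation of "bitwise logical and not". All function names are hypothetical.

```python
import re

# Numeric terminals (hex, binary, octal, decimal), operators, and brackets.
TOKEN = re.compile(
    r"\s*(0[Xx][0-9a-fA-F]+|0[Bb][01]+|0[0-7]*|[1-9][0-9]*"
    r"|>>|<<|\\\*|[-+&|!^\[\]])"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "\\*": lambda a, b: a * b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    ">>": lambda a, b: a >> b,
    "<<": lambda a, b: a << b,
    "!": lambda a, b: a & ~b,   # assumed meaning of "and not"
    "^": lambda a, b: a ^ b,
}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def number(tok):
    low = tok.lower()
    if low.startswith("0x"):
        return int(low[2:], 16)
    if low.startswith("0b"):
        return int(low[2:], 2)
    if low.startswith("0"):
        return int(low, 8)      # a leading zero marks an octal value
    return int(low, 10)


def parse_term(tokens):
    if not tokens:
        raise ValueError("missing term")
    tok, rest = tokens[0], tokens[1:]
    if tok == "-":
        value, rest = parse_term(rest)
        return -value, rest
    if tok == "[":
        value, rest = parse_expr(rest)
        if not rest or rest[0] != "]":
            raise ValueError("missing ]")
        return value, rest[1:]
    if not tok[0].isdigit():
        raise ValueError(f"unexpected token {tok!r}")
    return number(tok), rest


def parse_expr(tokens):
    # expr : term | expr op term  -- applied strictly left to right.
    value, tokens = parse_term(tokens)
    while tokens and tokens[0] in OPS:
        op = tokens[0]
        rhs, tokens = parse_term(tokens[1:])
        value = OPS[op](value, rhs)
    return value, tokens


def evaluate(text):
    value, rest = parse_expr(tokenize(text))
    if rest:
        raise ValueError(f"unexpected token {rest[0]!r}")
    return value
```

With this model, evaluate(r"2 + 3 \* 4") gives 20 because the addition is applied first, while evaluate(r"2 + [3 \* 4]") gives 14.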

Expression Semantics (Absolute vs. Relocatable)

Semantically, the expressions fall into two groups, absolute and relocatable. The equations later in this section show the legal combinations of absolute and relocatable operands for the addition and subtraction operators. All other operations are only legal on absolute-valued expressions.

All numbers have the absolute attribute. Symbols used to reference storage, text, or data are relocatable. In an assignment statement, symbols on the left side inherit their relocation attributes from the right side.

In the equations below, a is an absolute-valued expression and r is a relocatable-valued expression. The resulting type of the operation is shown to the right of the equal sign.

	a + a = a
	r + a = r
	a - a = a
	r - a = r
	r - r = a

In the last example, you must declare the relocatable expressions before taking their difference.
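The type equations can be written as a lookup table. The Python sketch below is one reading of the rules in this section: the five listed combinations are accepted, any other operator is accepted only when both operands are absolute, and everything else is rejected. Rejecting combinations that are not listed, such as a + r, is an assumption of the sketch; the function name result_type is hypothetical.

```python
ABS, REL = "a", "r"

# The five equations from the text, keyed by (operator, left, right).
RULES = {
    ("+", ABS, ABS): ABS,
    ("+", REL, ABS): REL,
    ("-", ABS, ABS): ABS,
    ("-", REL, ABS): REL,
    ("-", REL, REL): ABS,
}


def result_type(op, left, right):
    """Return the type of `left op right`, or raise ValueError if illegal."""
    if (op, left, right) in RULES:
        return RULES[(op, left, right)]
    if op not in ("+", "-") and left == ABS and right == ABS:
        # All other operations are legal only on absolute operands.
        return ABS
    raise ValueError(f"illegal combination: {left} {op} {right}")
```

For instance, result_type("-", REL, REL) returns "a", matching the difference of two labels, while result_type("\\*", REL, ABS) raises ValueError, matching the rule that multiplication is legal only on absolute operands.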

Following are some examples of valid expressions:

	label

	$label

	[label + 0x100]

	[label1 - label2]

	$[label1 - label2]

Following are some examples of invalid expressions:

	[$label - $label]

	[label1 * 5]

	(label + 0x20)

Machine Instruction Syntax

This section describes the instructions that the assembler accepts. The detailed specification of how the particular instructions operate is not included; for this, see Intel's 80386 Programmer's Reference Manual.

The following list describes the three main aspects of the SunOS x86 assembler:

Operands

Three kinds of operands are generally available to the instructions: register, memory, and immediate. Indirect operands are available only to jump and call instructions.

The assembler always assumes it is generating code for a 32-bit segment. When 16-bit data is called for (e.g., movw %ax, %bx), the assembler automatically generates the 16-bit data prefix byte.

Byte, word, and long registers are available on the x86 processor. The instruction pointer (%eip) and flag register (%efl) are not available as explicit operands to the instructions. The code segment (%cs) may be used as a source operand but not as a destination operand.

The names of the byte, word, and long registers available as operands and a brief description of each follow. The segment registers are also listed.

Table 1-4 8-bit (byte) General Registers

%al   Low byte of %ax register
%ah   High byte of %ax register
%cl   Low byte of %cx register
%ch   High byte of %cx register
%dl   Low byte of %dx register
%dh   High byte of %dx register
%bl   Low byte of %bx register
%bh   High byte of %bx register

Table 1-5 16-bit (word) General Registers

%ax   Low 16 bits of %eax register
%cx   Low 16 bits of %ecx register
%dx   Low 16 bits of %edx register
%bx   Low 16 bits of %ebx register
%sp   Low 16 bits of the stack pointer
%bp   Low 16 bits of the frame pointer
%si   Low 16 bits of the source index register
%di   Low 16 bits of the destination index register

Table 1-6 32-bit (long) General Registers

%eax   32-bit general register
%ecx   32-bit general register
%edx   32-bit general register
%ebx   32-bit general register
%esp   32-bit stack pointer
%ebp   32-bit frame pointer
%esi   32-bit source index register
%edi   32-bit destination index register

Table 1-7 Description of Segment Registers

%cs   Code segment register; all references to the instruction space use this register
%ds   Data segment register, the default segment register for most references to memory operands
%ss   Stack segment register, the default segment register for memory operands in the stack (i.e., default segment register for %bp, %sp, %esp, and %ebp)
%es   General-purpose segment register; some string instructions use this extra segment as their default segment
%fs   General-purpose segment register
%gs   General-purpose segment register

Instruction Description

This section describes the SunOS x86 instruction syntax.

Because the assembler assumes it is generating code for a 32-bit segment, it also assumes 32-bit addresses and automatically precedes word operations with a 16-bit data prefix byte.
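The effect of the data prefix byte can be seen by encoding a register-to-register move by hand. The Python sketch below is illustrative only: it covers movw and movl between general registers, uses the standard x86 operand-size prefix 0x66 and the 0x89 form of MOV, and the function name encode_mov is hypothetical. A real assembler may choose the equivalent 0x8B form instead.

```python
OPERAND_SIZE_PREFIX = 0x66  # selects 16-bit operands in a 32-bit segment
MOV_RM_R = 0x89             # MOV r/m, r (source in the reg field)

# x86 register numbers 0..7 for the general registers.
NAMES = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
WORD_REGS = {"%" + name: i for i, name in enumerate(NAMES)}
LONG_REGS = {"%e" + name: i for i, name in enumerate(NAMES)}


def encode_mov(mnemonic, src, dst):
    """Encode `mnemonic src, dst` for two general registers."""
    if mnemonic == "movw":
        prefix, regs = [OPERAND_SIZE_PREFIX], WORD_REGS
    elif mnemonic == "movl":
        prefix, regs = [], LONG_REGS
    else:
        raise ValueError(f"unsupported mnemonic: {mnemonic}")
    # ModR/M byte: mod=11 (register direct), reg=source, r/m=destination.
    modrm = 0xC0 | (regs[src] << 3) | regs[dst]
    return bytes(prefix + [MOV_RM_R, modrm])
```

Here encode_mov("movl", "%eax", "%ebx") produces the bytes 89 C3, and encode_mov("movw", "%ax", "%bx") produces 66 89 C3: the same instruction with the prefix byte in front.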

Addressing Modes

Addressing modes are represented by the following:

[sreg:][offset][([base][,index][,scale])]

Following are some examples of addresses:


movl var, %eax

Move the contents of memory location var into %eax.


movl %cs:var, %eax

Move the contents of the memory location var in the code segment into %eax.


movl $var, %eax

Move the address of var into %eax.


movl array_base(%esi), %eax

Add the address of memory location array_base to the contents of %esi to get an address in memory. Move the contents of this address into %eax.


movl (%ebx, %esi, 4), %eax

Multiply the contents of %esi by 4 and add this to the contents of %ebx to produce a memory reference. Move the contents of this memory location into %eax.


movl struct_base(%ebx, %esi, 4), %eax

Multiply the contents of %esi by 4, add this to the contents of %ebx, and add this to the address of struct_base to produce an address. Move the contents of this address into %eax.
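The address arithmetic in these examples follows one formula: offset + base + index * scale. The Python sketch below computes it for given register contents. The function name effective_address is hypothetical, the segment part is ignored, and the result is reduced to 32 bits on the assumption of 32-bit addressing.

```python
def effective_address(offset=0, base=0, index=0, scale=1):
    """Compute offset + base + index * scale as a 32-bit address.

    `base` and `index` are the current contents of the base and index
    registers; `offset` is the address or constant written before the
    parentheses. x86 allows scale factors of 1, 2, 4, and 8 only.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError(f"invalid scale factor: {scale}")
    return (offset + base + index * scale) & 0xFFFFFFFF
```

For struct_base(%ebx, %esi, 4) with struct_base at 0x1000, %ebx holding 0x20, and %esi holding 3, effective_address(0x1000, 0x20, 3, 4) is 0x102C.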

Expressions and Immediate Values

An immediate value is an expression preceded by a dollar sign:


immediate: "$" expr

Immediate values carry the absolute or relocatable attributes of their expression component. Immediate values cannot be used in an expression, and should be considered as another form of address, i.e., the immediate form of address.


immediate: "$" expr "," "$" expr

In this two-part form, the first expr is the 16-bit segment and the second expr is the 32-bit offset.