This section details the following:
file organization
statements
values and symbols
expressions
machine instruction syntax
Input to the assembler is a text file consisting of a sequence of statements. Each statement ends with the first occurrence of a newline character (ASCII LF), or of a semicolon (;) that is not within a string operand or between a slash and a newline character. Thus, it is possible to have several statements on one line.
To keep programs easy to read, understand, and maintain, however, it is good programming practice to write no more than one statement per line. If several statements do appear on a line, they must be separated by semicolons (;).
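The statement-termination rule above can be modelled with a short Python sketch. It is an illustration only, not the assembler's actual scanner: the function name split_statements is hypothetical, strings are assumed to be delimited by double quotes, and a slash preceded by a backslash is assumed to be the division operator rather than the start of a comment.

```python
def split_statements(source):
    """Split assembler source text into statements (illustrative model only)."""
    statements, current = [], []
    in_string = in_comment = False
    prev = ""
    for ch in source:
        if ch == "\n":
            # A newline always ends the statement, and any comment with it.
            statements.append("".join(current).strip())
            current, in_string, in_comment = [], False, False
        elif ch == ";" and not in_string and not in_comment:
            # A semicolon outside a string or comment also ends the statement.
            statements.append("".join(current).strip())
            current = []
        else:
            if not in_comment:
                if ch == '"' and prev != "\\":
                    in_string = not in_string      # assumed string delimiter
                elif ch == "/" and not in_string and prev != "\\":
                    in_comment = True              # comment runs to the newline
            current.append(ch)
        prev = ch
    if current:
        statements.append("".join(current).strip())
    # Empty statements are legal but carry no meaning, so drop them here.
    return [s for s in statements if s]


source = 'movl $1, %eax; movl $2, %ebx / set; up\n.string "a;b"; ret\n'
for statement in split_statements(source):
    print(statement)
```

The semicolon inside the comment and the one inside the string do not split anything, so the sample text yields four statements.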
This section outlines the types of statements that apply to assembly language. Each statement must be one of the following types:
An empty statement is one that contains nothing other than spaces, tabs, or formfeed characters.
Empty statements have no meaning to the assembler. They can be inserted freely to improve the appearance of a source file or of a listing generated from it.
An assignment statement is one that gives a value to a symbol. It consists of a symbol, followed by an equal sign (=), followed by an expression.
The expression is evaluated and the result is assigned to the symbol. Assignment statements do not generate any code. They are used only to assign assembly time values to symbols.
A pseudo operation statement is a directive to the assembler that does not necessarily generate any code. It consists of a pseudo operation code, optionally followed by operands. Every pseudo operation code begins with a period (.).
A machine operation statement is a mnemonic representation of an executable machine language instruction to which it is translated by the assembler. It consists of an operation code, optionally followed by operands.
Furthermore, any statement remains a statement even if it is modified in either or both of the following ways:
Prefixing a label at the beginning of the statement.
A label consists of a symbol followed by a colon (:). When the assembler encounters a label, it assigns the value of the location counter to the label.
Appending a comment at the end of the statement by preceding the comment with a slash (/).
The assembler ignores all characters following a slash up to the next occurrence of newline. This facility allows insertion of internal program documentation into the source file for a program.
This section presents the values and symbol types that the assembler uses.
Values are represented in the assembler by numerals that can be faithfully represented in standard two's complement binary positional notation using 32 bits. All integer arithmetic is performed using 32 bits of precision. Note, however, that the values used in an IA-32 instruction may require 8, 16, or 32 bits.
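As a rough illustration of 32-bit two's complement values, the following Python sketch reduces an integer to 32 bits and checks whether a value fits in an 8-, 16-, or 32-bit field. The helper names to_int32 and fits are hypothetical, and silently wrapping on overflow is an assumption; the text does not say whether the assembler wraps or reports an error.

```python
def to_int32(value):
    """Reduce an integer to its signed 32-bit two's complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def fits(value, bits):
    """True if value is representable in `bits` bits, signed or unsigned."""
    return -(1 << (bits - 1)) <= value < (1 << bits)


print(to_int32(0x7FFFFFFF + 1))   # wraps to the most negative 32-bit value
print(fits(255, 8), fits(256, 8))
```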
A symbol has a value and a symbol type, each of which is either specified explicitly by an assignment statement or implicitly from context. Refer to the next section for the regular definition of the expressions of a symbol.
The following symbols are reserved by the assembler:
.
Commonly referred to as dot. This is the location counter while assembling a program. It takes on the current location in the text, data, or bss section.
.text
This symbol is of type text. It is used to label the beginning of a .text section in the program being assembled.
.data
This symbol is of type data. It is used to label the beginning of a .data section in the program being assembled.
.bss
This symbol is of type bss. It is used to label the beginning of a .bss section in the program being assembled.
.init
.fini
Symbol type is one of the following:
undefined
A value is of undefined symbol type if it has not yet been defined. Example instances of undefined symbol types are forward references and externals.
absolute
A value is of absolute symbol type if it does not change with relocation. Example instances of absolute symbol types are numeric constants and expressions whose proper sub-expressions are themselves all absolute.
text
A value is of text symbol type if it is relative to the .text section.
data
A value is of data symbol type if it is relative to the .data section.
bss
A value is of bss symbol type if it is relative to the .bss section.
You can give any of these symbol types the attribute EXTERNAL.
Five of the symbol types are defined with respect to certain sections of the object file into which the assembler translates the source file. This section describes symbol types.
If the assembler translates a particular assembly language statement into a machine language instruction or into a data allocation, the translation is associated with one of the following five sections of the object file into which the assembler is translating the source file:
Table 1-1 Translations and their Associations
Section | Purpose |
---|---|
text | This is an initialized section. Normally, it is read-only and contains code from a program. It may also contain read-only tables. |
data | This is an initialized section. Normally, it is readable and writable. It contains initialized data. These can be scalars or tables. |
bss | This is an uninitialized section. Space is not allocated for this segment in the object file. |
init | This is used with C++ programs that require constructors. |
fini | This is used by C++ programs that require destructors. |
An optional section, .comment, may also be produced.
The section associated with the translated statement is .text unless the original statement occurs after a section control pseudo operation has directed the assembler to associate the statement with another section.
The expressions accepted by the assembler are defined by their syntax and semantics. The following are the operators supported by the assembler:
Table 1-2 Operators Supported by the Assembler
Operator | Action |
---|---|
+ | Addition |
- | Subtraction |
\* | Multiplication |
\/ | Division |
& | Bitwise logical and |
| | Bitwise logical or |
>> | Right shift |
<< | Left shift |
\% | Remainder operator |
! | Bitwise logical and not |
^ | Bitwise logical XOR |
Table 1-3 shows the syntactic rules. Nonterminals are written in lowercase letters, named terminal symbols are written in uppercase letters, and literal terminal symbols are enclosed in double quotes. There is no precedence assigned to the operators. You must use square brackets to establish precedence.
Table 1-3 Syntactical Rules of Expressions
expr : term
     | expr "+" term
     | expr "-" term
     | expr "\*" term
     | expr "\/" term
     | expr "&" term
     | expr "|" term
     | expr ">>" term
     | expr "<<" term
     | expr "\%" term
     | expr "!" term
     | expr "^" term
     ;
term : id
     | number
     | "-" term
     | "[" expr "]"
     | "<o>" term
     | "<s>" term
     ;
id : LABEL
     ;
number : DEC_VAL
     | HEX_VAL
     | OCT_VAL
     | BIN_VAL
     ;
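To make the absence of operator precedence concrete, here is a minimal Python sketch of an evaluator for absolute expressions. It assumes that the left-recursive expr rule means operators are applied strictly left to right, that results wrap to 32 bits, and that >> is an arithmetic shift; none of these is confirmed by the text. It handles decimal numerals only and omits division, remainder, labels, and the "<o>" and "<s>" forms. All function names are hypothetical.

```python
import re

# Decimal numerals, the escaped multiplication operator, shifts, single-character tokens.
TOKEN = re.compile(r"\s*(0|[1-9][0-9]*|\\\*|>>|<<|[-+&|!^\[\]])")

BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "\\*": lambda a, b: a * b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
    "!": lambda a, b: a & ~b,       # "bitwise logical and not"
    "<<": lambda a, b: a << b,
    ">>": lambda a, b: a >> b,      # assumed arithmetic shift
}


def to_int32(v):
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v & 0x80000000 else v


def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at column {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_term(tokens):
    if not tokens:
        raise ValueError("unexpected end of expression")
    tok, rest = tokens[0], tokens[1:]
    if tok == "-":                  # unary minus applies to one term only
        value, rest = parse_term(rest)
        return to_int32(-value), rest
    if tok == "[":                  # square brackets group a sub-expression
        value, rest = parse_expr(rest)
        if not rest or rest[0] != "]":
            raise ValueError("missing ]")
        return value, rest[1:]
    if tok.isdigit():
        return int(tok), rest
    raise ValueError(f"unexpected token {tok!r}")


def parse_expr(tokens):
    value, rest = parse_term(tokens)
    while rest and rest[0] in BINARY:   # fold operators left to right
        rhs, rest = parse_term(rest[1:])
        value = to_int32(BINARY[tokens[len(tokens) - len(rest) - 2]](value, rhs)) \
            if False else to_int32(BINARY[op_before(tokens, rest)](value, rhs))
    return value, rest


def op_before(tokens, rest):
    """Return the operator token that precedes the term just parsed."""
    consumed = len(tokens) - len(rest)
    depth = 0
    for i in range(consumed - 1, -1, -1):   # walk back over the parsed term
        if tokens[i] == "]":
            depth += 1
        elif tokens[i] == "[":
            depth -= 1
        elif depth == 0 and tokens[i] in BINARY and i > 0 \
                and (tokens[i - 1] == "]" or tokens[i - 1].isdigit()):
            return tokens[i]
    raise ValueError("operator not found")


def evaluate(text):
    value, rest = parse_expr(tokenize(text))
    if rest:
        raise ValueError(f"unexpected token {rest[0]!r}")
    return value


print(evaluate(r"2 + 3 \* 4"))     # evaluated as [2 + 3] \* 4
print(evaluate(r"2 + [3 \* 4]"))   # brackets force the multiplication first
```

Under these assumptions the two expressions give different results, which is why square brackets are needed whenever the intended grouping is not left to right.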
The terminal nodes are given by the following regular expressions:
LABEL = [a-zA-Z_][a-zA-Z0-9_]*:
DEC_VAL = [1-9][0-9]*
HEX_VAL = 0[Xx][0-9a-fA-F][0-9a-fA-F]*
OCT_VAL = 0[0-7]*
BIN_VAL = 0[Bb][0-1][0-1]*
In the above regular expressions, choices are enclosed in square brackets; a range of choices is indicated by letters or numbers separated by a dash (-); and the asterisk (*) indicates zero or more instances of the previous character.
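The numeral patterns above translate directly into Python regular expressions. The following sketch, with a hypothetical classify helper, reports which terminal a piece of text matches and the value it denotes; it is meant only to show how the four patterns partition numerals.

```python
import re

# (terminal name, pattern copied from the definitions above, numeric base)
NUMERALS = [
    ("DEC_VAL", re.compile(r"[1-9][0-9]*"), 10),
    ("HEX_VAL", re.compile(r"0[Xx][0-9a-fA-F][0-9a-fA-F]*"), 16),
    ("OCT_VAL", re.compile(r"0[0-7]*"), 8),
    ("BIN_VAL", re.compile(r"0[Bb][0-1][0-1]*"), 2),
]


def classify(text):
    """Return (terminal name, value) for a numeral, or None if nothing matches."""
    for name, pattern, base in NUMERALS:
        if pattern.fullmatch(text):
            # Strip the 0x / 0b prefix before converting hex and binary digits.
            digits = text[2:] if name in ("HEX_VAL", "BIN_VAL") else text
            return name, int(digits, base)
    return None


for sample in ["42", "0x1F", "017", "0b101", "0", "09"]:
    print(sample, classify(sample))
```

Note that a lone 0 matches OCT_VAL, and that 09 matches none of the patterns because 9 is not an octal digit.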
Semantically, the expressions fall into two groups, absolute and relocatable. The equations later in this section show the legal combinations of absolute and relocatable operands for the addition and subtraction operators. All other operations are only legal on absolute-valued expressions.
All numbers have the absolute attribute. Symbols used to reference storage, text, or data are relocatable. In an assignment statement, symbols on the left side inherit their relocation attributes from the right side.
In the equations below, a is an absolute-valued expression and r is a relocatable-valued expression. The resulting type of the operation is shown to the right of the equal sign.
a + a = a
r + a = r
a - a = a
r - a = r
r - r = a
In the last example, you must declare the relocatable expressions before taking their difference.
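The combination rules above can be expressed as a small lookup, sketched here in Python. The function name result_type is hypothetical. The sketch accepts only the five combinations listed; a + r is not among them, so it is rejected here, which may be stricter than the actual assembler.

```python
# 'a' = absolute-valued expression, 'r' = relocatable-valued expression
LEGAL = {
    ("a", "+", "a"): "a",
    ("r", "+", "a"): "r",
    ("a", "-", "a"): "a",
    ("r", "-", "a"): "r",
    ("r", "-", "r"): "a",
}


def result_type(left, op, right):
    """Return the type of `left op right`, or raise ValueError if illegal."""
    if (left, op, right) in LEGAL:
        return LEGAL[(left, op, right)]
    if left == "a" and right == "a":
        return "a"      # every other operator is legal on absolute operands only
    raise ValueError(f"illegal combination: {left} {op} {right}")


print(result_type("r", "-", "r"))    # difference of two relocatable values
print(result_type("a", "\\*", "a"))  # multiplication of absolute values
```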
Following are some examples of valid expressions:
label
$label
[label + 0x100]
[label1 - label2]
$[label1 - label2]
Following are some examples of invalid expressions:
[$label - $label]
[label1 * 5]
(label + 0x20)
This section describes the instructions that the assembler accepts. The detailed specification of how the particular instructions operate is not included; for this, see Intel's 80386 Programmer's Reference Manual.
The following list describes the three main aspects of the IA-32 Assembler:
All register names use the percent sign (%) as a prefix to distinguish them from symbol names.
Instructions with two operands use the left one as the source and the right one as the destination. This is reversed from Intel's notation.
Most instructions that can operate on a byte, word, or long may have b, w, or l appended to them. When an opcode is specified with no type suffix, it usually defaults to long. In general, the IA-32 Assembler derives its type information from the opcode, whereas the Intel assembler can derive its type information from the operand types. This difference is the reason for the b, w, and l suffixes used in the IA-32 Assembler. For example, in the instruction movw $1,%ax the w suffix indicates that the operand is a word.
Three kinds of operands are generally available to the instructions: register, memory, and immediate. Indirect operands are available only to jump and call instructions.
The assembler always assumes it is generating code for a 32-bit segment. When 16-bit data is called for (e.g., movw %ax, %bx), the assembler automatically generates the 16-bit data prefix byte.
Byte, word, and long registers are available on the IA-32 processor. The instruction pointer (%eip) and flag register (%efl) are not available as explicit operands to the instructions. The code segment (%cs) may be used as a source operand but not as a destination operand.
The names of the byte, word, and long registers available as operands and a brief description of each follow. The segment registers are also listed.
Table 1-4 8-bit (byte) General Registers
%al | Low byte of %ax register |
%ah | High byte of %ax register |
%cl | Low byte of %cx register |
%ch | High byte of %cx register |
%dl | Low byte of %dx register |
%dh | High byte of %dx register |
%bl | Low byte of %bx register |
%bh | High byte of %bx register |
Table 1-5 16-bit (word) General Registers
%ax | Low 16 bits of %eax register |
%cx | Low 16 bits of %ecx register |
%dx | Low 16 bits of %edx register |
%bx | Low 16 bits of %ebx register |
%sp | Low 16 bits of the stack pointer |
%bp | Low 16 bits of the frame pointer |
%si | Low 16 bits of the source index register |
%di | Low 16 bits of the destination index register |
Table 1-6 32-bit (long) General Registers
%eax | 32-bit general register |
%ecx | 32-bit general register |
%edx | 32-bit general register |
%ebx | 32-bit general register |
%esp | 32-bit stack pointer |
%ebp | 32-bit frame pointer |
%esi | 32-bit source index register |
%edi | 32-bit destination index register |
Table 1-7 Description of Segment Registers
%cs | Code segment register; all references to the instruction space use this register |
%ds | Data segment register; the default segment register for most references to memory operands |
%ss | Stack segment register; the default segment register for memory operands in the stack (i.e., default segment register for %bp, %sp, %esp, and %ebp) |
%es | General-purpose segment register; some string instructions use this extra segment as their default segment |
%fs | General-purpose segment register |
%gs | General-purpose segment register |
This section describes the IA-32 instruction syntax.
The assembler assumes it is generating code for a 32-bit segment; it therefore also assumes a 32-bit address and automatically precedes word operations with a 16-bit data prefix byte.
Addressing modes are represented by the following:
[sreg:][offset][([base][,index][,scale])]
All the items in the square brackets are optional, but at least one is necessary. If you use any of the items inside the parentheses, the parentheses are mandatory.
sreg is a segment register override prefix. It may be any segment register. If a segment override prefix is present, you must follow it by a colon before the offset component of the address. sreg does not represent an address by itself. An address must contain an offset component.
offset is a displacement from a segment base. It may be absolute or relocatable. A label is an example of a relocatable offset. A number is an example of an absolute offset.
base and index can be any 32-bit register. scale is a multiplication factor for the index register field. Its value may be 1, 2, 4, or 8; the contents of the index register are multiplied by this factor.
Refer to Intel's 80386 Programmer's Reference Manual for more details on IA-32 addressing modes.
Following are some examples of addresses:
movl var, %eax
Move the contents of memory location var into %eax.
movl %cs:var, %eax
Move the contents of the memory location var in the code segment into %eax.
movl $var, %eax
Move the address of var into %eax.
movl array_base(%esi), %eax
Add the address of memory location array_base to the contents of %esi to get an address in memory. Move the contents of this address into %eax.
movl (%ebx, %esi, 4), %eax
Multiply the contents of %esi by 4 and add this to the contents of %ebx to produce a memory reference. Move the contents of this memory location into %eax.
movl struct_base(%ebx, %esi, 4), %eax
Multiply the contents of %esi by 4, add this to the contents of %ebx, and add this to the address of struct_base to produce an address. Move the contents of this address into %eax.
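The address arithmetic in the examples above can be modelled with a short Python sketch. The function name effective_address and the sample register and symbol values are hypothetical, and the segment override is ignored, which amounts to assuming a flat memory model.

```python
def effective_address(offset=0, base=0, index=0, scale=1):
    """Compute offset + base + index * scale as a 32-bit address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (offset + base + index * scale) & 0xFFFFFFFF


# Hypothetical values: struct_base at 0x8000, %ebx = 0x100, %esi = 3
struct_base, ebx, esi = 0x8000, 0x100, 3

# movl (%ebx, %esi, 4), %eax
print(hex(effective_address(base=ebx, index=esi, scale=4)))

# movl struct_base(%ebx, %esi, 4), %eax
print(hex(effective_address(struct_base, ebx, esi, 4)))
```

With these sample values the first form addresses 0x10c and the second 0x810c; the instruction then loads the long stored at that address into %eax.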
An immediate value is an expression preceded by a dollar sign:
immediate: "$" expr
Immediate values carry the absolute or relocatable attributes of their expression component. Immediate values cannot be used in an expression, and should be considered as another form of address, i.e., the immediate form of address.
immediate: "$" expr "," "$" expr
The first expr is 16 bits of segment. The second expr is 32 bits of offset.