Formalized Intel Syntax for x86

LIU Hao edited this page Aug 14, 2024 · 33 revisions

The Motivation

The assembly language for x86 and x86-64 comes in two major syntax variations: the Microsoft assembler (MASM) syntax and the GNU assembler (GAS) syntax. The MASM syntax, also known as the Intel syntax, is prescribed by the Intel Software Developer Manual, and is used extensively by many non-GNU tools. The GAS syntax, also known as the AT&T syntax, derives from the PDP-11 assembly language that was used to create Unix, and is the default and dominant syntax in the post-Unix world.

The advantages of the MASM syntax are:

  1. It looks more modern and is closer to many other assembly languages, such as those for ARM, MIPS and RISC-V.
  2. It is the syntax in Intel and AMD documentation.

The disadvantages of the MASM syntax are:

  1. MASM is proprietary software, but it defines the de facto standard.
  2. It does not match some mnemonics well. For example, cvtsi2sd reads 'ConVerT Scalar Integer TO Scalar Double', i.e. source precedes destination, but it is actually written cvtsi2sd xmm0, rax, i.e. destination precedes source.
  3. The syntax has not been formally described, which occasionally causes ambiguity.

For instance, the Intel Software Developer Manual contains

MOV EBX, RAM_START

This is ambiguous in two ways. First, it could be interpreted as either of

MOV EBX, OFFSET RAM_START         ; `movl $RAM_START, %ebx`
MOV EBX, DWORD PTR [RAM_START]    ; `movl RAM_START, %ebx`

Second, on x86-64 the address might be RIP-relative or absolute, as in

MOV EBX, DWORD PTR [RAM_START]
          ; x86    absolute       ; 8B 1D    RAM_START   ; `movl RAM_START, %ebx`
          ; x86-64 RIP-relative   ; 8B 1D    RAM_START   ; `movl RAM_START(%rip), %ebx`
          ; x86-64 absolute       ; 8B 1C 25 RAM_START   ; `movl RAM_START, %ebx`
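
The three byte sequences above differ only in how the ModRM byte, and for the last one the SIB byte, is interpreted. As a rough illustration, the following Python sketch decodes those bytes for the `8B /r` form of MOV. The function name `describe_modrm` and its return strings are made up for this example, and only the cases listed above are covered.

```python
def describe_modrm(code: bytes, long_mode: bool) -> str:
    """Describe the addressing form of `8B /r` (MOV r32, r/m32).

    Hypothetical helper for illustration; it recognizes only the
    displacement-only forms discussed in the text.
    """
    if code[0] != 0x8B:
        raise ValueError("this sketch only handles MOV r32, r/m32")
    modrm = code[1]
    mod, rm = modrm >> 6, modrm & 7
    if mod == 0b00 and rm == 0b101:
        # No base register: a bare disp32 in 32-bit mode,
        # but RIP + disp32 in 64-bit mode.
        return "rip-relative disp32" if long_mode else "absolute disp32"
    if mod == 0b00 and rm == 0b100:
        # r/m = 100 means that a SIB byte follows (it must be present).
        sib = code[2]
        index, base = (sib >> 3) & 7, sib & 7
        if index == 0b100 and base == 0b101:
            # No index and no base: an absolute disp32 in either mode.
            return "absolute disp32"
    return "other"


print(describe_modrm(bytes([0x8B, 0x1D]), long_mode=False))
print(describe_modrm(bytes([0x8B, 0x1D]), long_mode=True))
print(describe_modrm(bytes([0x8B, 0x1C, 0x25]), long_mode=True))
```

The same `8B 1D` prefix therefore means different things in 32-bit and 64-bit mode, which is why the source text has to say which form is intended.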

The first issue here is solved by interpreting it as a memory reference, but the ambiguity may still arise if the symbol results from a high-level language, such as C.

When targeting x86, the Microsoft compiler decorates C identifiers: External names that denote objects or functions with the __cdecl or __stdcall calling convention are prefixed with an underscore _; external names that denote functions with the __fastcall calling convention are prefixed with an at symbol @; and external names that denote functions with the __vectorcall calling convention are suffixed with @@ and the size of their arguments. This technique prevents symbols from conflicting with keywords in assembly.

However, this is no longer the case for x86-64, nor for ARM and ARM64. If a user declares an external variable with the name RSI, the compiler may generate the ambiguous and incorrect

MOV EAX, DWORD PTR [RSI]    ; parsed as `movl (%rsi), %eax`
                            ; should have been `movl RSI, %eax`

This RFC proposes a formalization of the Intel syntax that resolves such ambiguity by disallowing certain constructions.

The Proposal

  1. If an indirect reference contains a symbol, the symbol shall always follow a mode specifier (PTR or BCST) or OFFSET. In other words, only registers and numeric displacements are enclosed within brackets. This idea is shared with the GAS syntax.

    MOV EAX, DWORD PTR [RCX]               ; valid, complete: `movl (%rcx), %eax`
    MOV EAX, DWORD [RCX]                   ; valid, abbreviated: `movl (%rcx), %eax`
    MOV EAX, [RCX]                         ; valid, symbolless: `movl (%rcx), %eax`
    VMULPD ZMM0, ZMM1, QWORD BCST [RCX]    ; valid, complete: `vmulpd (%rcx){1to8}, %zmm1, %zmm0`
  2. An overriding segment register shall follow the operand size and mode specifier, if any; when there is no such specifier, it shall occur at the beginning of the operand.

    MOV EAX, DWORD PTR CS:[RCX]            ; valid: `movl %cs:(%rcx), %eax`
    MOV EAX, DWORD CS:[RCX]                ; valid: `movl %cs:(%rcx), %eax`
    MOV EAX, CS:[RCX]                      ; valid: `movl %cs:(%rcx), %eax`
  3. If a valid symbol name follows PTR, BCST or OFFSET, after an overriding segment register if any, then it is always treated as a symbol, even when it is a keyword.

    LEA RAX, bx[RIP]                       ; invalid: `bx` is parsed as the register due to lack
                                           ;          of a mode specifier
    LEA RAX, BYTE PTR bx[RIP]              ; valid: `leaq bx(%rip), %rax`
    MOV EAX, printf                        ; invalid: `printf` is not a known register
    MOV EAX, OFFSET printf                 ; valid: `movl $printf, %eax`
    MOV EAX, RCX                           ; invalid: operand size mismatch
    MOV EAX, OFFSET RCX                    ; valid: `movl $RCX, %eax`
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    MOV EAX, DWORD PTR RCX                 ; valid: `movl RCX, %eax`
    MOV EAX, DWORD PTR RCX[RIP+12]         ; valid: `movl RCX+12(%rip), %eax`
  4. For instructions with a dummy memory operand (LEA, NOP, etc.) and those with an uncommon size (FXSAVE/FXRSTOR, FNSAVE/FNRSTOR, etc.), BYTE PTR should be used.

    NOP DWORD PTR [RAX], EAX               ; warning: `BYTE PTR` should be used
    NOP BYTE PTR [RAX], EAX                ; valid: 0F 1F 00
  5. An RIP-relative operand shall have RIP as its base register.

    MOV EBX, DWORD PTR foo                 ; valid: `movl foo, %ebx`; might cause linker errors
                                           ;        on x86-64
    MOV EBX, DWORD PTR foo[RIP]            ; valid: `movl foo(%rip), %ebx`
  6. The base, index, scale and displacement parts of a memory operand shall appear in a uniform order. The displacement comes first, immediately following the mode specifier and overriding segment register. If there is a base or index register, the registers and the scale are all placed in one pair of square brackets. This idea is also shared with the GAS syntax.

    MOV ECX, DWORD PTR [RSI+RDI*4+field]   ; warning: `field` is not a known register and is 
                                           ;          assumed to be a symbol
    MOV ECX, DWORD PTR field[RSI+RDI*4]    ; valid: `movl field(%rsi,%rdi,4), %ecx`
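
To make the interplay of rules 1, 2, 3 and 6 more concrete, here is a minimal Python sketch that translates a single operand written in the proposed form into AT&T syntax. It is not part of the proposal and not how any real assembler works: the function name `to_att` and the register tables are hypothetical and deliberately incomplete, BCST operands and immediates are not handled, and a symbol inside brackets is rejected outright rather than merely warned about.

```python
import re

# Partial, illustrative tables; a real assembler knows every register name.
REGISTERS = {"RAX", "RBX", "RCX", "RDX", "RSI", "RDI", "RBP", "RSP", "RIP",
             "EAX", "EBX", "ECX", "EDX", "BX"}
SEGMENTS = {"CS", "DS", "ES", "FS", "GS", "SS"}
SIZES = {"BYTE", "WORD", "DWORD", "QWORD"}


def to_att(operand: str) -> str:
    """Translate one operand in the proposed form to AT&T syntax (sketch)."""
    text = operand.strip()
    words = text.split()
    if not words:
        raise ValueError("empty operand")

    # Rule 3: whatever follows OFFSET is a symbol, even a register name.
    if words[0].upper() == "OFFSET":
        return "$" + text[len(words[0]):].strip()

    # Rule 1: a symbol is only accepted after a mode specifier (PTR here).
    symbol_allowed = False
    if words[0].upper() in SIZES:
        text = text[len(words[0]):].lstrip()
        ptr = re.match(r"PTR\b\s*", text, re.IGNORECASE)
        if ptr:
            symbol_allowed = True
            text = text[ptr.end():]

    # Rule 2: an overriding segment register comes after the specifier.
    segment = ""
    seg = re.match(r"([A-Za-z]{2})\s*:\s*", text)
    if seg and seg.group(1).upper() in SEGMENTS:
        segment = "%" + seg.group(1).lower() + ":"
        text = text[seg.end():]

    # Rule 6: displacement first, then the registers in one pair of brackets.
    parts = re.fullmatch(r"([^\[\]]*)(?:\[([^\[\]]*)\])?", text)
    if parts is None:
        raise ValueError(f"malformed operand: {operand!r}")
    symbol, inner = parts.group(1).strip(), parts.group(2)

    if symbol and not symbol_allowed:
        if inner is None and not segment and symbol.upper() in REGISTERS:
            return "%" + symbol.lower()  # a plain register operand
        raise ValueError(f"{symbol!r} needs PTR or OFFSET to be a symbol")

    base = index = None
    scale, number = 1, 0
    terms = (inner or "").replace(" ", "").replace("-", "+-").split("+")
    for term in terms:
        if not term:
            continue
        name, star, factor = term.partition("*")
        if name.upper() in REGISTERS:
            if star:
                if index is not None:
                    raise ValueError("more than one index register")
                index, scale = name, int(factor)
            elif base is None:
                base = name
            elif index is None:
                index = name
            else:
                raise ValueError("too many registers")
        else:
            try:
                number += int(term, 0)
            except ValueError:
                raise ValueError(
                    f"{term!r} is neither a register nor a number") from None

    displacement = symbol
    if number:
        displacement += f"{number:+d}" if symbol else str(number)
    address = ""
    if base or index:
        address = "(" + ("%" + base.lower() if base else "")
        if index:
            address += f",%{index.lower()},{scale}"
        address += ")"
    if not displacement and not address:
        raise ValueError(f"nothing to address in {operand!r}")
    return segment + displacement + address


print(to_att("DWORD PTR RCX[RIP+12]"))
print(to_att("DWORD PTR field[RSI+RDI*4]"))
```

Because the parser decides between "register" and "symbol" purely from position (before or inside the brackets, after PTR or OFFSET or not), it never needs to consult the register table to classify a name that follows a specifier. That is the property that lets a C variable called RSI or bx be referenced safely.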

External Links

  1. GCC Bug 53929 - [meta-bug] -masm=intel with global symbol