Formalized Intel Syntax for x86

LIU Hao edited this page Jan 18, 2024 · 33 revisions

The Motivation

Assembly language for x86 and x86-64 comes in two major syntax variants: the Microsoft assembler (MASM) syntax and the GNU assembler (GAS) syntax. The MASM syntax, also known as the Intel syntax, is the one prescribed by the Intel Software Developer's Manual, and is used extensively by many non-GNU tools. The GAS syntax, also known as the AT&T syntax, derives from the PDP-11 assembly language that was used to create Unix, and is the default and dominant syntax in the post-Unix world.

The advantages of the MASM syntax are:

  1. It looks more modern and is closer to many other assembly languages, such as those for ARM, MIPS and RISC-V.
  2. It is the syntax in Intel and AMD documentation.

The disadvantages of the MASM syntax are:

  1. MASM is proprietary software.
  2. The syntax has not been formally defined, which sometimes causes ambiguity.

For instance, the Intel Software Developer's Manual contains this line:

MOV EBX, RAM_START

This is ambiguous in two ways. First, it could be interpreted as either of

MOV EBX, OFFSET RAM_START         ; `movl $RAM_START, %ebx`
MOV EBX, DWORD PTR [RAM_START]    ; `movl RAM_START, %ebx`

Second, on x86-64 the address might be RIP-relative or absolute, as in

MOV EBX, DWORD PTR [RAM_START]
          ; x86    absolute       ; 8B 1D    RAM_START   ; `movl RAM_START, %ebx`
          ; x86-64 RIP-relative   ; 8B 1D    RAM_START   ; `movl RAM_START(%rip), %ebx`
          ; x86-64 absolute       ; 8B 1C 25 RAM_START   ; `movl RAM_START, %ebx`
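
The opcode bytes in the listing above follow from the ModRM and SIB bit layouts. The following Python sketch recomputes them; the helper names (`modrm`, `sib`, `encode_mov_ebx_mem`) are made up for this illustration, and the 32-bit displacement that follows the bytes is left out.

```python
# Sketch only: rebuilds the leading bytes of `MOV EBX, DWORD PTR [RAM_START]`.
# Helper names are hypothetical; the trailing disp32 is not emitted.

EBX = 0b011          # register number of EBX in the ModRM.reg field
MOV_R32_RM32 = 0x8B  # opcode for MOV r32, r/m32


def modrm(mod, reg, rm):
    """Pack the three ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def sib(scale, index, base):
    """Pack the three SIB fields into one byte."""
    return (scale << 6) | (index << 3) | base


def encode_mov_ebx_mem(form):
    """Return the opcode, ModRM and (if any) SIB bytes for the given form."""
    if form in ("x86 absolute", "x86-64 RIP-relative"):
        # mod=00, rm=101: disp32 on x86, RIP+disp32 on x86-64.
        return bytes([MOV_R32_RM32, modrm(0b00, EBX, 0b101)])
    if form == "x86-64 absolute":
        # rm=100 selects a SIB byte; index=100 means no index register,
        # base=101 with mod=00 means disp32 without a base register.
        return bytes([MOV_R32_RM32, modrm(0b00, EBX, 0b100),
                      sib(0b00, 0b100, 0b101)])
    raise ValueError("unknown form: %r" % form)


def as_hex(data):
    return " ".join("%02X" % b for b in data)


for form in ("x86 absolute", "x86-64 RIP-relative", "x86-64 absolute"):
    print("%-20s %s" % (form, as_hex(encode_mov_ebx_mem(form))))
```

The first two forms produce identical bytes, which is why the same source line can silently change meaning between x86 and x86-64.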

The first issue here is solved by interpreting it as a memory reference, but ambiguity can still arise if the symbol originates from a high-level language, such as C.

When targeting x86, the Microsoft compiler decorates C identifiers: external names that denote objects, or functions with the __cdecl or __stdcall calling convention, are prefixed with an underscore _; external names that denote functions with the __fastcall calling convention are prefixed with an at sign @, and those with the __vectorcall calling convention are suffixed with @@ followed by the size of their arguments. This decoration prevents symbols from conflicting with keywords in assembly.

This is no longer the case for x86-64 (nor for ARM and ARM64), where C identifiers are not decorated. If a user declares an external variable with the name RSI, the compiler may generate the ambiguous and incorrect

MOV EAX, DWORD PTR [RSI]    ; parsed as `movl (%rsi), %eax`
                            ; should have been `movl rsi, %eax`
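
How decoration avoids this clash on x86, and why the clash appears on x86-64, can be sketched in Python. This is a simplified model, not the compiler's actual algorithm: the names `assembly_name` and `clashes_with_register` are hypothetical, only the leading prefix character is modelled (size suffixes are ignored), and the register list is a small sample.

```python
# Sketch only: models just the leading prefix of decorated C names.
# All names here are hypothetical and exist for illustration.

X86_PREFIX = {"cdecl": "_", "stdcall": "_", "fastcall": "@"}

# A small sample of register names that an assembler treats as keywords.
REGISTERS = {"RAX", "RBX", "RCX", "RDX", "RSI", "RDI", "RBP", "RSP", "RIP"}


def assembly_name(c_name, target, convention="cdecl"):
    """Return the symbol as it would appear in assembly (prefix only)."""
    if target == "x86":
        return X86_PREFIX[convention] + c_name
    # x86-64: the C identifier is emitted unchanged.
    return c_name


def clashes_with_register(symbol):
    """True if the symbol would be read as a register name."""
    return symbol.upper() in REGISTERS


for target in ("x86", "x86-64"):
    symbol = assembly_name("RSI", target)
    print(target, symbol, clashes_with_register(symbol))
```

Under these assumptions the x86 symbol becomes `_RSI`, which cannot be mistaken for a register, while the x86-64 symbol stays `RSI` and collides with the register of the same name.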

This RFC proposes a formalization of the Intel syntax that resolves such ambiguity by disallowing certain constructions.

The Proposal

  1. Indirect references shall always contain a mode specifier. Plain brackets are no longer allowed.

    MOV EAX, [RCX]                         ; invalid: operand size and mode specifier are required
    MOV EAX, DWORD [RCX]                   ; invalid: mode specifier is required
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    VMULPD ZMM0, ZMM1, QWORD BCST [RCX]    ; valid: `vmulpd (%rcx){1to8}, %zmm1, %zmm0`
    LEA RAX, bx[RIP]                       ; invalid: operand size and mode specifier are required
    LEA RAX, BYTE PTR bx[RIP]              ; valid: `leaq bx(%rip), %rax`
  2. Segment override prefixes shall appear before the operand size and mode specifier.

    MOV EAX, DWORD PTR CS:[RCX]            ; maybe invalid: symbol name cannot contain `:`
    MOV EAX, CS:DWORD PTR [RCX]            ; valid: `movl %cs:(%rcx), %eax`
  3. If an identifier follows PTR, BCST or OFFSET, then it is always treated as a symbol, even when it is a keyword. In other words, only registers are enclosed within brackets. This idea is shared with GAS syntax.

    MOV EAX, printf                        ; invalid: `printf` is not a known register
    MOV EAX, OFFSET printf                 ; valid: `movl $printf, %eax`
    MOV EAX, RCX                           ; invalid: operand size mismatch
    MOV EAX, OFFSET RCX                    ; valid: `movl $RCX, %eax`
    MOV EAX, DWORD PTR [RCX]               ; valid: `movl (%rcx), %eax`
    MOV EAX, DWORD PTR RCX                 ; valid: `movl rcx, %eax`
    MOV EAX, DWORD PTR RCX[RIP+10]         ; valid: `movl rcx+10(%rip), %eax`
  4. For instructions with a dummy memory operand (LEA, NOP, etc.) and those with an uncommon size (FXSAVE/FXRSTOR, FNSAVE/FNRSTOR, etc.), BYTE PTR shall be used.

    NOP DWORD PTR [RAX], EAX               ; invalid: `BYTE PTR` is required
    NOP BYTE PTR [RAX], EAX                ; valid: 0F 1F 00
  5. RIP-relative operands must have RIP as the base register.

    MOV EBX, DWORD PTR foo                 ; valid: `movl foo, %ebx`
                                           ; note: might cause linker errors on x86-64
    MOV EBX, DWORD PTR foo[RIP]            ; valid: `movl foo(%rip), %ebx`
  6. The base, index, scale and displacement parts of a memory operand shall appear in a uniform order. The displacement comes first, immediately following the mode specifier. If there is a base or index register, all registers are placed within a single pair of square brackets. This idea is also shared with GAS syntax.

    MOV ECX, DWORD PTR [RSI+RDI*4+field]   ; invalid: `field` is not a known register
    MOV ECX, DWORD PTR field[RSI+RDI*4]    ; valid: `movl field(%rsi,%rdi,4), %ecx`
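
To show that the operand rules above are mechanical enough to check, here is a rough Python sketch of a validator for a single memory operand. It is not part of the proposal: the function name `check_memory_operand`, the error messages, and the reduced sets of sizes and registers are assumptions made for illustration. It covers rules 1, 2, 3 and 6 only, treats the "maybe invalid" case of rule 2 as invalid, and accepts integer constants inside brackets because the example `RCX[RIP+10]` does.

```python
import re

# Assumed subsets, for illustration only.
SIZES = {"BYTE", "WORD", "DWORD", "QWORD"}
SEGMENTS = {"CS", "DS", "ES", "FS", "GS", "SS"}
REGISTERS = {"RAX", "RBX", "RCX", "RDX", "RSI", "RDI", "RBP", "RSP", "RIP"}
REGISTERS |= {"R%d" % n for n in range(8, 16)}

OPERAND = re.compile(
    r"""^(?:(?P<seg>\w+):)?                         # rule 2: segment override first
        (?P<size>[A-Za-z]+)\s+(?P<mode>PTR|BCST)\b\s*  # rule 1: size and mode specifier
        (?P<disp>[^\[\]]*?)\s*                      # rule 6: displacement comes next
        (?:\[(?P<regs>[^\[\]]*)\])?$                # rule 3: registers in brackets
    """,
    re.VERBOSE | re.IGNORECASE,
)

# A symbol or an integer, optionally followed by +/- integer offsets.
DISPLACEMENT = re.compile(r"^(?:[A-Za-z_.$@?][\w.$@?]*|\d+)(?:\s*[+-]\s*\d+)*$")


def check_memory_operand(text):
    """Return None if `text` fits the proposed form, else a short reason."""
    m = OPERAND.match(text.strip())
    if m is None:
        return "operand size and mode specifier are required"
    if m["seg"] is not None and m["seg"].upper() not in SEGMENTS:
        return "`%s` is not a segment register" % m["seg"]
    if m["size"].upper() not in SIZES:
        return "`%s` is not an operand size" % m["size"]
    disp, regs = m["disp"].strip(), m["regs"]
    if not disp and regs is None:
        return "a displacement or a register is required"
    if disp and not DISPLACEMENT.match(disp):
        return "`%s` is not a valid displacement" % disp
    for term in (regs.split("+") if regs is not None else []):
        name, _, scale = (part.strip() for part in term.partition("*"))
        if name.isdigit() and not scale:
            continue  # integer constant, as in `RCX[RIP+10]`
        if name.upper() not in REGISTERS:
            return "`%s` is not a known register" % name
        if scale and scale not in ("1", "2", "4", "8"):
            return "`%s` is not a valid scale" % scale
    return None


for operand in ("DWORD PTR [RSI+RDI*4+field]", "DWORD PTR field[RSI+RDI*4]"):
    print(operand, "->", check_memory_operand(operand) or "valid")
```

Because a keyword that follows PTR is always taken as a symbol, the checker never needs a symbol table to classify `DWORD PTR RCX` versus `DWORD PTR [RCX]`; the position alone decides.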

External Links

  1. GCC Bug 53929 - [meta-bug] -masm=intel with global symbol