A walk in x64 land

x64 has been around for a long time, and this is not new stuff. But I just thought I’d write a little on the intricacies of x64. When moving to the 64-bit world, some things remain the same and some things change.

The 64-bit architecture we’re referring to here is the Intel 64 / Windows 64-bit architecture, or simply referred to as x64 in this article. To be doubly clear, we are discussing how Microsoft implements the specification for x64, including the way they handle calling conventions, and so on. The details are not going to apply to Linux, and even when using a non-Microsoft compiler to compile a 64-bit binary in Windows, it will be different, though maybe a bit less different. An example is using GCC on Windows.

That said, in many respects, the x64 architecture to the x86, but some things are worth a closer look. In particular:

  • File Format

  • Registers

  • Calling Conventions

  • Function Conventions

  • Stack Unwinding

File Format

The file format for Win64 is called PE32+, and it is almost the same as the original x86 (Win32) PE file format.

A couple of fields are changed, such as ImageBase (expanded to 64 bits)

Header Field            Change
---------------------------------------------------------
Magic                   Set to 0x20b instead of 0x10b
BaseOfData              Deleted
ImageBase               Widened to 64 bits
SizeOfStackReserve      Widened
SizeOfStackCommit       Widened
SizeOfHeapReserve       Widened
SizeOfHeapCommit        Widened
---------------------------------------------------------

There is also a new .pdata section, which houses the Exception information, which we will discuss in “Stack Unwinding”.

Registers

The are 16 registers in x64: RAX, RCX, RDX, R8, R9, R10, R11 (volatile) and RBX, RSP, RBP, RSI, RDI, R12, R13, R14, R15 (non-volatile).

Volatile registers simply mean that a function can use them and overwrite them, and no function should trust that their values are preserved across function calls. Those that are non-volatile are the opposite, and functions are required to restore their original values when they exit.

In addition to the general purpose registers, there are the usual floating point registers, but there are 16 of them: XMM0-XMM5 (volatile) and XMM6-XMM15 (non-volatile).

Calling Conventions

The previous calling conventions used in x86 included the stdcall, fastcall, thiscall and cdecl calling conventions, each of which has their own pecularities such as whose responsibility it is to clean up the stack. In x64, there is only one calling convention described in the ABI (Application Binary Interface). The full MSDN documentation on the calling conventions can be found at http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx.

Argument Passing

First of all, arguments are passed fastcall style. Not exactly fastcall, but somewhat close. The first four arguments of any function call are passed in registers RCX, RDX, R8 and R9, respectively. If the function has less than four arguments, then less registers are used. If there are more, the fifth argument and beyond are passed on the stack (called stack spillover). If any argument is a floating point argument, the corresponding XMM register is used.

In this case, corresponding means that, if we number the registers RCX, RDX, R8 and R9 as first, second, third and fourth, and the registers XMM0, XMM1, XMM2, XMM3 as first, second, third and fourth, then the third argument passed will either be in R8 or XMM2, regardless of which is used, and the fourth argument passed will either be in R9 or XMM2, again regardless of which is used.

Hence, the function: fn(int a, double b, int c, float d) will have a, b, c, d passed in RCX, XMM1, R8 and XMM3.

Similar to x86, the RAX register is used to house the return value, if the return value is 64-bits or less. If it is greater, a pointer is used. There is no more return spillover as in the x86 case. For instance, in x86, larger return values could be returned via EDX:EAX. No more for x64.

Finally, for functions that do not call any other functions (leaf functions), they are allowed to use custom calling conventions, as long as the stack pointer RSP is not changed.

Stack Usage

The second change is how the stack is used for stack frames. In x64, the caller sets up its stack frame as follows:

  • It saves all the non-volatile registers on the stack

  • It allocates space for its local variables (of which the total size must be known statically, except for the case of alloca)

  • It allocates space for the function that takes the largest total size of arguments, that it (the caller) calls. This is called the “home space”.

We need to spend some time describing this “home space”. Note that there are several terms that describe the home space. You may see home address space, home addresses, shadow addresses, shadow space, and so on. Here, we refer to it has the home space. This home space is quite peculiar, in that it is the size of the largest total argument space required among all the calls that the function in question (caller) makes. You will also notice that this space is pre-allocated. The function in question simply allocates this space (in the function prologue). However, the minimum space required to be is that of four register values, i.e. 32 bytes. This home space is used against every call the function makes, and after the callee returns, the home space is usually left uncleaned, because there are more callees coming down the line. Of course, when all callees are done the caller still cleans up the stack (the caller is always responsible for cleaning up).

The traditional intention, so to speak, of the home space, is to store, as some kind of “backup” (though not necessarily “backup”, if you dig into the details), the first four arguments that are passed to the callee. It is the callee’s responsibility to place those arguments (say, in RCX, RDX, R8, R9) into the home space, but it is not enforced. Also, the callee essentially owns the home space, so the caller cannot and should not use that space, nor trust values in that space. Also, if the callee wants to leave the space blank, or re-purpose the space to store other things, or anything in between, it is perfectly legitimate.

The total stack space requirement based on the three points above can be statically computed, and the compiler does just that. The aim for “standardizing” the stack will be explained subsequently, when we discuss stack unwinding. This has the effect that the stack pointer RSP is virtually static throughout the lifetime of the function. The largest possible home space is already reserved, all local variable space can be computed and reserved. We call this the “static” portion of the stack frame.

This leaves the only exception and unknown being memory directly allocated on the stack (through alloca()). If this happens, this space goes above the home space, and there is a need to know where the static portion ends and the dynamic portion begins. This is described later in stack unwinding.

A visual representation of the stack can also be found in the stack unwinding segment.

Function Conventions

Some function conventions have been altered. The caller always cleans up the stack (as mentioned), and the prologue of a function is now made responsible for deciding if it wants to copy the register-passed arguments to the home space (also mentioned). Beyond that, stricter requirements have been imposed on the form of function prologues and epilogues.

Function Prologues and Epilogues

Function prologues and epilogues are constrained to very well-defined forms. A typical prologue looks like:

mov        [rsp`+8],rcx     ; store parameter in shadow space if necessary
push       r14              ; save any non-volatile registers to be used
push       r13              ;
sub        rsp,size         ; allocate stack for local variables if needed
lea        r13,[bias+rsp]   ; use `r13` as a frame pointer with an offset

And a typical epilogue looks like:

lea         rsp,[r13-bias]  ; this is not part of the official epilogue
add         rsp,size        ; the official epilogue starts here
pop         r13
pop         r14
ret

When a frame pointer is needed the programmer can choose which register is used. Although it does not have to be used for access to the allocated space, it must be assigned in the prologue and remain unchanged during the execution of the body of the function. In such a case, the epilogue looks like:

lea         rsp,[r13+size-bias]
pop         r13
pop         r14
ret

These are the only forms allowed.

Stack Unwinding

One of the benefits of x64 is that stack unwinding is extremely structured. This has a number of trickle down effects – SEH is more structured and (somewhat) less exploitable, and debugger stack traces are much cleaner and more reliable, even without the aid of symbols.

The key reason is that a lot more information on the functions themselves are stored, and the functions use the stack in much stricter (or static) ways. As we already saw above, the way the stack is used is fairly static, the only dynamic bit being the alloca() issue.

We draw a visual representation of the stack here:

0x0
------------------
|
| ..and so on..
------------------ Function C
| Dynamic stack
| (alloca)
|-----------------
| Home space
| Home space
| Home space
| Home space
|-----------------
| Local variables
|-----------------
| Saved registers
|-----------------
| Return addr
------------------ Function B
| Dynamic stack
| (alloca)
|-----------------
| Home space
| Home space
| Home space
| Home space
|-----------------
| Local variables
|-----------------
| Saved registers
|-----------------
| Return addr
------------------   Function A
0xFF

Let’s assume that the stack use is static for now, and let’s also assume that the way the stack is used (how much space is allocated for local variables, how many registers are saved, the size of the home space allocated, etc) are all recorded somewhere. As such, given the current stack pointer, RSP, and assuming that RSP is not corrupted, it is easy to see how one may easily traverse the stack, or walk the stack, as it is commonly called.

Basically, starting at (a good) RP, say function C’s RSP, if we subtract the size of the allocated region, we get to the return address. Correcting for that return address, we can again subtract the size of that (function B’s) allocated region, we get to the return address of function A. Thus, we can easily “walk” through the stack, without frame pointers (EBP in x86 land), nor knowing how many parameters the function takes (symbols!!). Hence, we basically land up with perfect stack unwinding without the help of symbols or frame pointers.

Now suppose that we include the “problematic” case whereby there is a dynamic region on the stack – the alloca() we were talking about. How this is resolved is by having a non-volatile register which is used to mark the end of the static part of the stack (and hence the beginning of the dynamic part of the stack). That register, being non-volatile, must be saved in the function prologue. Using that register, which basically acts as some kind of frame pointer, we know where the dynamic portion ends, and hence can easily bypass the dynamic portion. All this is done by the compiler/assembler.

With that, basically we are guaranteed to be able to unwind the stack, not by following chains of EBP (as in x86) and so on, but by using data-driven semantics to follow through and reverse the stack creation process.

The picture however is incomplete, because we have assumed that the data for all that stack creation information is magically stored somewhere. We know the compiler does it (which brings into question the issue of dynamically generated code, but more on that later), but we don’t know where it’s stored, or how it looks like. First of all, since the compiler computes and stores this information, a convenient location would be to store it statically in the PE (portable executable) itself. This is what is done, and it is stored in the .exception section. The following screenshot illustrates this:

PEBrowse_ExceptionSection

The next question is, how is that data stored. To keep the description simply, a record is created for every function that exists in the code. This record is called the _RUNTIME_FUNCTION record, and as its name describes, it stores all information for the functions used at runtime. In particular, the record stores the:

  • Start address of the function

  • End address of the function

  • A pointer (actually it’s an offset) to an UNWIND_INFO structure

and the exact structure is as follows:

typedef struct _RUNTIME_FUNCTION {
  ULONG BeginAddress;
  ULONG EndAddress;
  ULONG UnwindData;
}

UnwindData is the offset to the UNWIND_INFO structure, and the UNWIND_INFO structure stores information relevant to that function. The UNWIND_INFO structure contains a union, which allows it to store exception handler information, unwind handler information, and chained handler information.

However, suffice it to understand that the UNWIND_INFO (and it’s sub-structures) hold the critical information for that piece of function we are interested in.

We can actually view this information in windbg. The .fnent (I read it as “function entry”) command dumps that information, starting with the _RUNTIME_FUNCTION information, and then the UNWIND_INFO information.

0:000> .fnent msvcrt!read
Debugger function entry 00000000`04fbd740 for:
(000007fe`fdc7b504)     msvcrt!read     |   (000007fe`fdc7e550)     msvcrt!tell
Exact matches:
    msvcrt!read = <no type information>

BeginAddress           = 00000000`0000b504
EndAddress               = 00000000`0000b5c4
UnwindInfoAddress = 00000000`0008db08     <-------

Unwind info at 000007fe`fdcfdb08, 20 bytes
  version 1, flags 2, prolog 1b, codes a
  handler routine: msvcrt!_C_specific_handler (000007fe`fdcb631c), data 2
  00: offs 1b, unwind op 4, op info 6         UWOP_SAVE_NONVOL FrameOffset: 70 reg: `rsi`.
  02: offs 1b, unwind op 4, op info 3         UWOP_SAVE_NONVOL FrameOffset: 68 reg: `rbx`.
  04: offs 1b, unwind op 2, op info 5         UWOP_ALLOC_SMALL.
  05: offs 17, unwind op 0, op info f         UWOP_PUSH_NONVOL reg: `r15`.
  06: offs 15, unwind op 0, op info e         UWOP_PUSH_NONVOL reg: `r14`.
  07: offs 13, unwind op 0, op info d         UWOP_PUSH_NONVOL reg: `r13`.
  08: offs 11, unwind op 0, op info c         UWOP_PUSH_NONVOL reg: `r12`.
  09: offs f,  unwind op 0, op info 7         UWOP_PUSH_NONVOL reg: `rdi.

The unwind info here describes exactly what the function used the stack for, and how. The way this is described is via UNWIND_CODE structures. In other words, the UNWIND_INFO (logically) houses a bunch of UNWIND_CODE structures. Each UNWIND_CODE refers to a operation in the prologue, on the stack.

UNWIND_CODE has the following format:

typedef struct UNWIND_CODE {
  UBYTE offset;                                 // Offset in the prologue
  UBYTE operation_code[4];           // The unwind operation code
  UBYTE operation_info[4];           // The unwind operation info
}

For instance, the UWOP_PUSH_NONVOL means that the prologue pushed a bunch of non-volatile registers, which we expect it to, and those registers are exactly RDI, R12, R13, R14 and R15. There are 5, so it must have used 5 * 8 bytes of stack for that. The UWOP_SAVE_NONVOL simply means that the prologue MOVed registers rather than PUSHed them. There are exactly two: RSI, RBX.

UWOP_ALLOC_SMALL means that some amount of stack memory was reserved, between 8 and 128 bytes. The size of the stack memory reserved is stored in the operation_info field, as follows: size = (operation_info * 8) + 8. It isn’t here, but UWOP_ALLOC_LARGE follows the same logic.

Prolog 1b means that the prolog is exactly 0x1b bytes long, and hence, msvcrt!read to msvcrt!read+20 is the prologue. As seen here:

0:000> u msvcrt!read msvcrt!read+1b
msvcrt!read:
000007fe`fdc7b504 48895c2410           mov         qword ptr [`rsp`+10h],`rbx`
000007fe`fdc7b509 4889742418           mov         qword ptr [`rsp`+18h],`rsi`
000007fe`fdc7b50e 894c2408               mov         dword ptr [`rsp`+8],ecx
000007fe`fdc7b512 57                           push       `rdi
000007fe`fdc7b513 4154                       push       `r12`
000007fe`fdc7b515 4155                       push       `r13`
000007fe`fdc7b517 4156                       push       `r14`
000007fe`fdc7b519 4157                       push       `r15`
000007fe`fdc7b51b 4883ec30               sub         `rsp`,30h

Hence, it is quite clear that we have all of the information we need in UNWIND_INFO and UNWIND_CODE[] to know how to walk the stack. What we do not know is the division between home space and local variable space. That is only found in the private symbols (PDB files).

Note: There are other codes – UWOP_PUSH_NONVOL, UWOP_ALLOC_LARGE, UWOP_ALLOC_SMALL, UWOP_SET_FPREG, UWOP_SAVE_NONVOL, UWOP_SAVE_NONVOL_FAR, UWOP_SAVE_XMM128, UWOP_SAVE_XMM128_FAR, UWOP_PUSH_MACHFRAME. For a list of all the codes, and their usage, see here.

The Issue of Dynamic Code

Finally, we get to the question of how dynamic code (that the compiler cannot build the stack information for and put into the .Exception section) is handled.The solution is really quite simple. Basically, code that is dynamically generated must, obviously, have unwind information too. Else we cannot walk the stack. This is handled via APIs which are provided, namely:

RtlAddFunctionTable
RtlInstallFunctionCallback

The prototypes are as follows:

BOOLEAN WINAPI RtlAddFunctionTable(
  __in   PRUNTIME_FUNCTION FunctionTable,
  __in   DWORD EntryCount,
  __in   DWORD64 BaseAddress,
  __in   ULONGLONG TargetGp
);

BOOLEAN WINAPI RtlInstallFunctionTableCallback(
  __in   DWORD64 TableIdentifier,
  __in   DWORD64 BaseAddress,
  __ in   DWORD Length,
  __in   PGET_RUNTIME_FUNCTION_CALLBACK Callback,
  __in   PVOID Context,
  __in   PCWSTR OutOfProcessCallbackDll
);

These APIs create a dynamic table, in memory, and a pointer to this table is stored in a variable in NTDLL, called RtlpDynamicFunctionTable. This is a linked list, and it conforms to the standard (albeit undocumented) way in which Windows implements linked lists.

RtlAddFunctionTable “inserts” a RUNTIME_FUNCTION record into the dynamic table, while RtlInstallFunctionTableCallback specifies the memory range that belongs to the generated code. It is somewhat like a “static” vs. “dynamic” difference. The difference is that RtlInstallFunctionTableCallback does not build the dynamic table until it is needed by an unwind request. At that time, the system calls the callback function with the Context and the control address, and the callback function is responsible for returning the runtime function entry instead.

In short, the dynamic table contains the RUNTIME_FUNCTION records that we need. Other APIs exist to allow the searching of the dynamic table based on a given instruction pointer. One such API is SymFunctionTableAccess64(), as follows:

PVOID WINAPI SymFunctionTableAccess64(
  __in   HANDLE hProcess,
  __in   DWORD64 AddrBase
);

where the return value is a pointer to the function table entry. Finally, from a run-time point of view, all these dynamic tables and callbacks are perfectly good. However, when the process is suspended (like in the debugger), we can’t execute code to query dynamic tables or call callback functions. This is where the RtlInstallFunctionTableCallback accepts a 6th parameter, called OutOfProcessCallbackDll. MSDN’s description is quite clear on this (rephrased lightly):

OutOfProcessCallbackDll is an optional pointer to a string that specifies the path of a DLL that provides function table entries that are outside the process. When a debugger unwinds to a function in the range of addresses managed by the callback function, it loads this DLL and calls the OUT_OF_PROCESS_FUNCTION_TABLE_CALLBACK_EXPORT_NAME function, whose type is a pointer to OUT_OF_PROCESS_FUNCTION_TABLE_CALLBACK.