This is the first post in a series I am writing on C++ exceptions.

When people say “C++ exceptions are slow”, what they really mean is that throwing an exception in C++ is slow. C++ exception handling has zero cost when no exception is thrown. Let’s look at a very simple example.

class Foo {
    public:
        ~Foo() {}
};

class Bar {
    public:
        ~Bar() {}
};

void func(bool b) {
    Bar bar{}; // <--- Bar::~Bar() should be called on stack unwinding
    if (b) {
        throw 1;
    }
}

int main() {
    try {
        Foo foo{}; // <--- Foo::~Foo() should be called on stack unwinding
        func(false);
    } catch (...) {
    }
    return 0;
}

I put throw 1; in the code even though it never throws, so that we can see the generated machine code (compiled with clang 13, -std=c++17, and no optimization). Here’s the assembly (in Intel syntax), which we will read one piece at a time.

func(bool): # @func(bool)
        push rbp
        mov rbp, rsp
        sub rsp, 32
        mov al, dil
        and al, 1
        mov byte ptr [rbp - 1], al
        test byte ptr [rbp - 1], 1
        je .LBB0_3
        mov edi, 4
        call __cxa_allocate_exception
        mov rdi, rax
        mov dword ptr [rdi], 1
        mov esi, offset typeinfo for int
        xor eax, eax
        mov edx, eax
        call __cxa_throw
        jmp .LBB0_5
        mov rcx, rax
        mov eax, edx
        mov qword ptr [rbp - 16], rcx
        mov dword ptr [rbp - 20], eax
        lea rdi, [rbp - 8]
        call Bar::~Bar() [base object destructor]
        jmp .LBB0_4
.LBB0_3:
        lea rdi, [rbp - 8]
        call Bar::~Bar() [base object destructor]
        add rsp, 32
        pop rbp
        ret
.LBB0_4:
        mov rdi, qword ptr [rbp - 16]
        call _Unwind_Resume@PLT
.LBB0_5:
Bar::~Bar() [base object destructor]: # @Bar::~Bar() [base object destructor]
        push rbp
        mov rbp, rsp
        mov qword ptr [rbp - 8], rdi
        pop rbp
        ret
main: # @main
        push rbp
        mov rbp, rsp
        sub rsp, 32
        mov dword ptr [rbp - 4], 0
        xor edi, edi
        call func(bool)
        jmp .LBB2_1
.LBB2_1:
        lea rdi, [rbp - 8]
        call Foo::~Foo() [base object destructor]
        jmp .LBB2_4
        mov rcx, rax
        mov eax, edx
        mov qword ptr [rbp - 16], rcx
        mov dword ptr [rbp - 20], eax
        lea rdi, [rbp - 8]
        call Foo::~Foo() [base object destructor]
        mov rdi, qword ptr [rbp - 16]
        call __cxa_begin_catch
        call __cxa_end_catch
.LBB2_4:
        xor eax, eax
        add rsp, 32
        pop rbp
        ret
Foo::~Foo() [base object destructor]: # @Foo::~Foo() [base object destructor]
        push rbp
        mov rbp, rsp
        mov qword ptr [rbp - 8], rdi
        pop rbp
        ret

Zero-cost exception handling

Let’s start with the simplest pieces.

.LBB0_5:
Bar::~Bar() [base object destructor]: # @Bar::~Bar() [base object destructor]
        push rbp # store the caller frame pointer on stack
        mov rbp, rsp # set the frame pointer for the callee
        mov qword ptr [rbp - 8], rdi # store the `this` pointer on the stack (passed in RDI, as dictated by the AMD64 ABI)
        pop rbp # restore the frame pointer
        ret

Both the .LBB0_5 label and the Bar::~Bar() symbol mark the same code: .LBB0_5 is an empty label that falls straight through into Bar::~Bar(). We see the common x86 prologue and epilogue. The code for Foo::~Foo() is identical. Now let’s take a look at func.

func(bool): # @func(bool)
        push rbp
        mov rbp, rsp
        sub rsp, 32
        mov al, dil # dil is the lower 8 bits of RDI (http://www.tortall.net/projects/yasm/manual/html/arch-x86-registers.html); here it passes the `bool` parameter into AL
        and al, 1
        mov byte ptr [rbp - 1], al
        test byte ptr [rbp - 1], 1
        je .LBB0_3 # jump to .LBB0_3 if `b` is not 1/true
        mov edi, 4
        call __cxa_allocate_exception
        mov rdi, rax
        mov dword ptr [rdi], 1
        mov esi, offset typeinfo for int
        xor eax, eax
        mov edx, eax
        call __cxa_throw
        jmp .LBB0_5
        mov rcx, rax
        mov eax, edx
        mov qword ptr [rbp - 16], rcx
        mov dword ptr [rbp - 20], eax
        lea rdi, [rbp - 8]
        call Bar::~Bar() [base object destructor]
        jmp .LBB0_4

After the initialization of bar, if b is false, the code simply jumps to .LBB0_3, which calls Bar’s destructor, unwinds the stack pointer rsp, and returns.

.LBB0_3:
        lea rdi, [rbp - 8] # store `this` in rdi (dictated by AMD64 ABI)
        call Bar::~Bar() [base object destructor] # call the dtor
        add rsp, 32 # unwind the stack
        pop rbp # restore caller's frame pointer
        ret # return

Now let’s take a look at main.

main: # @main
        push rbp
        mov rbp, rsp
        sub rsp, 32
        mov dword ptr [rbp - 4], 0 # looks useless
        xor edi, edi # set edi to zero (again dictated by AMD64 ABI)
        call func(bool)
        jmp .LBB2_1
.LBB2_1:
        lea rdi, [rbp - 8] # load foo's address into rdi (the `this` argument for Foo::~Foo())
        call Foo::~Foo() [base object destructor]
        jmp .LBB2_4
        ...
        
.LBB2_4:
        xor eax, eax # set eax (return value) to zero
        add rsp, 32 # unwind the stack
        pop rbp # restore caller's frame pointer
        ret        

The reason it uses xor eax, eax to set a register to 0 is efficiency: the xor form has a shorter encoding than mov eax, 0 and is recognized as a zeroing idiom, which the processor can handle during register renaming without depending on the old value of eax. Notice the mov dword ptr [rbp - 4], 0 line in main. It zero-initializes four bytes on the stack that are never actually used. If you switch from clang to gcc, this line disappears (with the same flags).
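
As a quick illustration (a minimal sketch, not part of the listing above): compile a function that just returns 0 with optimization enabled and you will typically see the xor form.

int zero() {
    // With -O1 or higher, clang and gcc typically emit
    //     xor eax, eax
    //     ret
    // here, rather than mov eax, 0.
    return 0;
}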

If you follow the labels from main, to .LBB2_1, to .LBB2_4, all they do is call func, destruct foo, and return 0. When no exception is thrown, the code runs as if the exception-handling logic were not there to begin with. This is known as zero-cost exception handling. It is an implementation strategy rather than a language requirement: C++ and Java implementations use it, and CPython is adding zero-cost exception handling in 3.11.
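
To see the same effect in a smaller setting, here is a minimal sketch (may_throw is a hypothetical external function, not something from the listing above). The exact output depends on the compiler and flags, but the non-throwing path of with_try is essentially just the call and the return; the catch handler and the unwind tables sit off the hot path and are only consulted when something actually throws.

int may_throw(int x); // hypothetical: defined elsewhere, may throw

int without_try(int x) {
    return may_throw(x);
}

int with_try(int x) {
    // The try/catch adds unwind tables and an out-of-line landing pad,
    // but the non-throwing path is still just the call and the return.
    try {
        return may_throw(x);
    } catch (...) {
        return -1;
    }
}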

You might be thinking: we are not even handling an exception in this code, so of course it’s zero-cost. How can we call it exception handling to begin with? Well, we actually did set up the exception handling; the exception just never got thrown. We can rewrite the code with error codes instead:

int func(bool b) {
  if (b) { return -1; } // exceptional case
  return 0;
}

int main() {
  int ret = func(false);
  if (ret != 0) {
    // handle the error
  }
  return 0;
}

Notice that even though func never returns -1 here, main still needs to check the return value (that check is the exception handling). A smart compiler can optimize the check away in this particular case, but in practice func will sometimes error out, so main always needs to check func’s return value because it might have to handle the error. This check, a form of exception handling, is not free, even though 99% of the time there are no errors. This is essentially how CPython implements its exception handling: it checks whether an exception (PyErr) has been raised after each operation and invokes the handling logic if needed. CPython’s exception handling (before 3.11) is therefore not zero-cost. C++ exceptions, on the other hand, are free (zero-cost) when nothing is thrown, if you discount secondary effects such as iTLB misses caused by the extra code. Binary layout optimizers such as BOLT can further optimize the physical layout of the code to minimize those iTLB misses.
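
To make the comparison concrete, here is a hedged sketch of that “check a flag after every call” pattern. This is not CPython’s actual code; the names (g_error, do_work) are made up for illustration.

#include <optional>

struct Error { int code; };

// Global error indicator, loosely analogous to CPython's per-thread error state.
static std::optional<Error> g_error;

int do_work(bool fail) {
    if (fail) {
        g_error = Error{1}; // "raise" by setting the indicator
        return 0;
    }
    return 42;
}

int caller() {
    int v = do_work(false);
    if (g_error) { // this check runs on every call, even though errors are rare
        return -1;
    }
    return v;
}

Every caller pays for that branch even on the happy path, which is exactly the cost the table-driven C++ approach avoids.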

This strategy is the equivalent of optimistic concurrency control in databases. As you might have guessed, it comes with a cost: when the optimistic bet is wrong, that is, when an exception is actually thrown, it is much more expensive.