This is my first blog post in a long time, so let's try something simple.

Let’s explore some loops. We will mostly be using C++ and looking at the assembly the compiler generates. Will it be fast? Not sure: how fast a loop runs depends mostly on what its body does. All code samples here are C++20 unless otherwise stated.

Normal Loops

C-Style Simple Loops

Let’s start simple. Here’s a simple for loop wrapped in a function.

C++ Code

// gcc 15.2: -std=c++2a -O3
#include <cstddef>  // size_t
#include <cstdint>

using namespace std;

int simple_c_style_for_loop(int* arr, const size_t n) {
    int sum = 0;
    for(int i = 0;i < n;i++) {
        sum += arr[i] + sum / 10;
    }
    return sum;
}

Assembly (x86-64)

simple_c_style_for_loop(int*, unsigned long):
        test    rsi, rsi
        je      .L4
        lea     rsi, [rdi+rsi*4]
        xor     edx, edx
.L3:
        movsx   rax, edx
        mov     ecx, edx
        add     rdi, 4
        imul    rax, rax, 1717986919
        sar     ecx, 31
        sar     rax, 34
        sub     eax, ecx
        add     eax, DWORD PTR [rdi-4]
        add     edx, eax
        cmp     rsi, rdi
        jne     .L3
        mov     eax, edx
        ret
.L4:
        xor     edx, edx
        mov     eax, edx
        ret
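
If the imul by 1717986919 looks odd: that is the compiler replacing sum / 10 with a multiply and a few shifts, since 1717986919 is ceil(2^34 / 10). Below is a minimal sketch of the same arithmetic in C++. The names divide_by_10_magic and check_divide_by_10 are made up for illustration, and the sketch assumes >> on a negative value is an arithmetic shift, which C++20 guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the movsx/imul/sar/sub sequence from the listing above:
// signed division by 10 without a div instruction.
// 1717986919 == 0x66666667 == ceil(2^34 / 10).
int divide_by_10_magic(int x) {
    const std::int64_t product = static_cast<std::int64_t>(x) * 1717986919LL; // movsx + imul
    const int quotient = static_cast<int>(product >> 34);                     // sar rax, 34
    const int sign = x >> 31;               // sar ecx, 31: -1 if x is negative, else 0
    return quotient - sign;                 // sub eax, ecx: rounds toward zero
}

// Hypothetical helper: compares the trick against plain division.
void check_divide_by_10(int x) {
    assert(divide_by_10_magic(x) == x / 10);
}
```

Subtracting the sign bit is what turns the floor result of the shift into the round-toward-zero result that C++ integer division requires for negative inputs.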

What about a while loop?

C++ Code

// gcc 15.2: -std=c++2a -O3
#include <cstddef>  // size_t
#include <cstdint>

using namespace std;

int simple_c_style_while_loop(int* arr, const size_t n) {
    int loop_counter = 0;
    int sum = 0;
    while (loop_counter < n) {
        sum += arr[loop_counter++] + sum / 10;
    }
    return sum;
}

Assembly (x86-64)

simple_c_style_while_loop(int*, unsigned long):
        test    rsi, rsi
        je      .L4
        lea     rsi, [rdi+rsi*4]
        xor     edx, edx
.L3:
        movsx   rax, edx
        mov     ecx, edx
        add     rdi, 4
        imul    rax, rax, 1717986919
        sar     ecx, 31
        sar     rax, 34
        sub     eax, ecx
        add     eax, DWORD PTR [rdi-4]
        add     edx, eax
        cmp     rsi, rdi
        jne     .L3
        mov     eax, edx
        ret
.L4:
        xor     edx, edx
        mov     eax, edx
        ret

Well, the assembly is identical apart from the function label, so there is not much to say here, I guess.

Finally, a do-while loop for completeness’s sake.

C++ Code

// gcc 15.2: -std=c++2a -O3
#include <cstddef>  // size_t
#include <cstdint>

using namespace std;

int simple_c_style_do_while_loop(int* arr, const size_t n) {
    int loop_counter = 0;
    int sum = 0;
    do {
        sum += arr[loop_counter++] + sum / 10;
    } while (loop_counter < n);
    return sum;
}

Assembly (x86-64)

simple_c_style_do_while_loop(int*, unsigned long):
        mov     r8, rdi
        xor     ecx, ecx
        mov     rdi, rsi
        xor     edx, edx
.L2:
        movsx   rax, edx
        mov     esi, edx
        imul    rax, rax, 1717986919
        sar     esi, 31
        sar     rax, 34
        sub     eax, esi
        add     eax, DWORD PTR [r8+rcx*4]
        add     rcx, 1
        add     edx, eax
        cmp     rcx, rdi
        jb      .L2
        mov     eax, edx
        ret

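Well, the assembly here is certainly shorter. A do-while loop always runs its body at least once, so the up-front n == 0 check (test rsi, rsi followed by je) is gone; note that this also means the function reads arr[0] even when n is 0. The loop itself now indexes the array with a counter (DWORD PTR [r8+rcx*4]) instead of bumping a pointer, and it ends with jb, an unsigned less-than comparison of the counter against n, instead of jne against an end pointer.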
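
The one behavioural difference between these loops shows up at n == 0. Here is a small sketch of that, using my own names (sum_with_while, sum_with_do_while, kSample, compare_loops) and size_t counters rather than the int counters from the listings, so treat it as an illustration rather than the exact functions above.

```cpp
#include <cassert>
#include <cstddef>

int sum_with_while(const int* arr, std::size_t n) {
    std::size_t i = 0;
    int sum = 0;
    while (i < n) {                  // condition checked before the first pass
        sum += arr[i++] + sum / 10;
    }
    return sum;
}

int sum_with_do_while(const int* arr, std::size_t n) {
    std::size_t i = 0;
    int sum = 0;
    do {                             // body runs once before any check
        sum += arr[i++] + sum / 10;
    } while (i < n);
    return sum;
}

// Hypothetical sample data, small enough that sum cannot overflow.
const int kSample[] = {7, 3, 5};

void compare_loops() {
    assert(sum_with_while(kSample, 3) == sum_with_do_while(kSample, 3)); // same for n >= 1
    assert(sum_with_while(kSample, 0) == 0);    // body never runs
    assert(sum_with_do_while(kSample, 0) == 7); // body runs once and reads kSample[0]
}
```

So the do-while version is only a drop-in replacement if the caller guarantees n >= 1 and a valid first element.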

Sooo… Is there any actual performance difference??

Performance Comparison

To see if there’s any real-world difference, I ran some benchmarks using Google Benchmark on my personal computer. As expected, there’s not really any difference to be seen. Maybe on your machine or with your compiler version the do-while loop will behave differently?

Benchmark   Iterations    Time (ns)   Throughput
Do-While         1,000        1,965   1.90 Gi/s
Do-While        10,000       19,716   1.89 Gi/s
Do-While       100,000      197,121   1.89 Gi/s
Do-While    10,000,000   19,809,072   1.88 Gi/s
While            1,000        1,963   1.90 Gi/s
While           10,000       19,711   1.89 Gi/s
While          100,000      197,262   1.89 Gi/s
While       10,000,000   19,817,713   1.88 Gi/s
For              1,000        1,971   1.89 Gi/s
For             10,000       19,709   1.89 Gi/s
For            100,000      197,470   1.89 Gi/s
For         10,000,000   19,846,456   1.88 Gi/s
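
If you want a rough sanity check without setting up Google Benchmark, here is a minimal sketch that uses std::chrono instead. It is not the harness behind the numbers above: best_time_ns and throughput_gib_per_s are hypothetical helpers, the input data is my own choice, and a volatile sink is only a crude stand-in for what benchmark::DoNotOptimize does. The throughput helper assumes the Gi/s column counts bytes of int data (4 bytes per element on x86-64), which is consistent with the times in the table.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same loop body as the for-loop version, with a size_t counter.
int simple_c_style_for_loop(const int* arr, std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        sum += arr[i] + sum / 10;
    }
    return sum;
}

// Best-of-`repeats` wall-clock time in nanoseconds for one pass over n elements.
long long best_time_ns(std::size_t n, int repeats) {
    assert(repeats > 0);
    // Alternating 1, -1 keeps sum in {0, 1}, so the int never overflows.
    std::vector<int> data(n);
    for (std::size_t i = 0; i < n; i++) {
        data[i] = (i % 2 == 0) ? 1 : -1;
    }
    long long best = -1;
    for (int r = 0; r < repeats; r++) {
        const auto start = std::chrono::steady_clock::now();
        volatile int sink = simple_c_style_for_loop(data.data(), n); // keep the result alive
        const auto stop = std::chrono::steady_clock::now();
        (void)sink;
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (best < 0 || ns < best) {
            best = ns;
        }
    }
    return best;
}

// Converts an element count and a time in nanoseconds to GiB of int data per second.
double throughput_gib_per_s(std::size_t elements, double time_ns) {
    assert(time_ns > 0.0);
    const double bytes = static_cast<double>(elements) * sizeof(int);
    return bytes / (time_ns * 1e-9) / (1024.0 * 1024.0 * 1024.0);
}
```

For example, throughput_gib_per_s(1000, 1965.0) comes out at roughly 1.90, which lines up with the first Do-While row.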

In the next part, I will continue looking at the other types of loops in C++.