Micro-optimizations in .NET (x86/x64)

This is the second post in the "Micro-optimizations in .NET" series. If you missed the first one, check out Conversion from Boolean to Integer.

Example 2: Integer Comparison (Branchless Design)

As discussed previously, conditional branches are costly for modern processors when the branch predictor fails. Branch misprediction occurs frequently unless the data is homogeneous (e.g., all values are identical). The core idea of optimizing such algorithms is to "straighten" the execution flow by removing conditional jumps.

Let's examine a standard integer comparison function. It returns -1 if a < b, 1 if a > b, and 0 if a = b.

public static int CompareIntegers(int a, int b)
{
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

In the worst-case scenario (when operands are equal), the CPU must evaluate two conditional jumps. The assembly listing confirms this:

; Examples.CompareIntegers(Int32, Int32)
    L0000: cmp ecx, edx        ; Compare a and b
    L0002: jge short L000a     ; Jump if a >= b
    L0004: mov eax, 0xffffffff ; Result = -1
    L0009: ret
    L000a: cmp ecx, edx        ; Compare again
    L000c: jle short L0014     ; Jump if a <= b
    L000e: mov eax, 1          ; Result = 1
    L0013: ret
    L0014: xor eax, eax        ; Result = 0
    L0016: ret

The Optimized Branchless Approach

We can eliminate branches by recognizing that the result is essentially the difference between two boolean states. Instead of if statements, we use the setg (set if greater) and setl (set if less) instructions, which don't trigger a jump.

public static int CompareIntegersOptimized(int a, int b)
{
    bool isGreater = a > b;
    bool isLess = a < b;

    // Using the trick from Part 1
    return ConvertBooleanToIntOptimized(isGreater) - 
           ConvertBooleanToIntOptimized(isLess);
}

The generated machine code is now linear:

; Examples.CompareIntegersOptimized(Int32, Int32)
    L0000: xor eax, eax   ; Clear EAX
    L0002: cmp ecx, edx   ; Compare a and b
    L0004: setg al        ; AL = (a > b) ? 1 : 0
    L0007: cmp ecx, edx   ; Compare again
    L0009: setl dl        ; DL = (a < b) ? 1 : 0
    L000c: movzx edx, dl  ; Zero-extend DL to EDX
    L000f: sub eax, edx   ; EAX = isGreater - isLess
    L0011: ret

Performance Benchmarks

The impact of branch prediction is clearly visible when comparing random data vs. constant data.

1. Random Data (Unpredictable):

The branchless version is ~3x faster because it avoids the massive penalty of mispredictions.

Method	Mean	Error	StdDev
CompareIntegers	3.855 ms	0.0654 ms	0.0580 ms
CompareIntegersOptimized	1.203 ms	0.0342 ms	0.0970 ms

2. Constant Data (Predictable Zeroes):

Interestingly, the branchless version is ~20% slower here. When the CPU can perfectly predict the branch, the standard if version wins because it executes only 6 instructions versus the 8 instructions in our optimized version.

Method	Mean	Error	StdDev
CompareIntegers	885.9 μs	15.62 μs	13.84 μs
CompareIntegersOptimized	1,070.7 μs	14.58 μs	12.93 μs

Generic Implementation

Thanks to Static Abstract Members in Interfaces, we can make this logic generic for any type that supports comparison operators:

public static int CompareGeneric<T>(T left, T right) 
    where T : IComparisonOperators<T, T, bool>
{
    return ConvertBooleanToIntOptimized(left > right) - 
           ConvertBooleanToIntOptimized(left < right);
}

Conclusion

Branchless code is a powerful tool for processing random or "noisy" data. However, as shown, it's not a silver bullet. Always profile your specific data distribution before committing to such micro-optimizations.