Optimization of divide in C/C++

I’d heard about that before:

The modern compiler will optimized division by 2 (and its power) as right-shift for integer.

But rently I’m tring to optimized some codes and found that’s only half right, morden compiler will only try to replace division by 2 (and its power) by right-shift implicit for unsigned integer.

Since shift of an negative integer is implement defined, which means the result of right shift on a negative interger might be different with division, for example:

-5 / 2  // => -2
-5 >> 1 // => -3

That’s why compiler wouldn’t replace division by 2 (and its power) as right-shift for signed integer, following is bench mark result for division by 2 and right shift with various integer:

Run on (4 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 3145K (x1)
---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
BM_div_int                   29162 ns      29101 ns      23729
BM_div_unsigned_int           3779 ns       3761 ns     187839
BM_div_short                 50725 ns      50643 ns      12916
BM_div_unsigned_short         7282 ns       7266 ns      94185
BM_div_char                  29228 ns      29177 ns      23323
BM_div_unsigned_char          3707 ns       3694 ns     197558
BM_shift_int                  3671 ns       3653 ns     186107
BM_shift_unsigned_int         3896 ns       3873 ns     184216
BM_shift_short                3683 ns       3673 ns     183546
BM_shift_unsigned_short       7466 ns       7447 ns      95329
BM_shift_char                 3656 ns       3651 ns     184243
BM_shift_unsigned_char        3680 ns       3666 ns     191142

Here’s the code for benchmark, I tested on Apple LLVM version 9.0.0 (clang-900.0.38), with -O3 switch.

#include <benchmark/benchmark.h>

template <typename T>
static T div(T x)
{
    for (size_t i = 0; i < 100000; ++i)
        x = x / 2;
    return x;
}

template <typename T>
static T shift(T x)
{
    for (size_t i = 0; i < 100000; ++i)
        x = x >> 1;
    return x;
}

template <typename T>
static void BM_div(benchmark::State& state)
{
    for (auto _ : state)
        benchmark::DoNotOptimize(div<T>(rand()));
}

static void BM_div_int(benchmark::State& state) { BM_div<int>(state); }
BENCHMARK(BM_div_int);
static void BM_div_unsigned_int(benchmark::State& state) { BM_div<unsigned int>(state); }
BENCHMARK(BM_div_unsigned_int);
static void BM_div_short(benchmark::State& state) { BM_div<short>(state); }
BENCHMARK(BM_div_short);
static void BM_div_unsigned_short(benchmark::State& state) { BM_div<unsigned short>(state); }
BENCHMARK(BM_div_unsigned_short);
static void BM_div_char(benchmark::State& state) { BM_div<char>(state); }
BENCHMARK(BM_div_char);
static void BM_div_unsigned_char(benchmark::State& state) { BM_div<unsigned char>(state); }
BENCHMARK(BM_div_unsigned_char);

template <typename T>
static void BM_shift(benchmark::State& state)
{
    for (auto _ : state)
        benchmark::DoNotOptimize(shift<T>(rand()));
}

static void BM_shift_int(benchmark::State& state) { BM_shift<int>(state); }
BENCHMARK(BM_shift_int);
static void BM_shift_unsigned_int(benchmark::State& state) { BM_shift<unsigned int>(state); }
BENCHMARK(BM_shift_unsigned_int);
static void BM_shift_short(benchmark::State& state) { BM_shift<short>(state); }
BENCHMARK(BM_shift_short);
static void BM_shift_unsigned_short(benchmark::State& state) { BM_shift<unsigned short>(state); }
BENCHMARK(BM_shift_unsigned_short);
static void BM_shift_char(benchmark::State& state) { BM_shift<char>(state); }
BENCHMARK(BM_shift_char);
static void BM_shift_unsigned_char(benchmark::State& state) { BM_shift<unsigned char>(state); }
BENCHMARK(BM_shift_unsigned_char);

BENCHMARK_MAIN();