Memory allocation made right
NOTE: This article is for those who are new to code optimization. If you already understand the design rationale of small vectors and SOO (small object optimization), feel free to skip it.
TL;DR
Try to avoid dynamic allocation on the critical path. Always prefer stack/static memory, or at least leverage small vectors sized according to the distribution of your data lengths.
Memory Storage Types
As you were taught in college, there are commonly three memory storage types (stuff like thread_local is not my story today):
1) static; 2) dynamic (so-called heap memory); 3) automatic/stack
A dummy example
Look at this example, which contains all three types of memory usage.
int static_var {233333}; // static storage

int main() {
    auto ptr_to_dynamic_var = new int{55555555}; // dynamic storage
    int stack_var {666666}; // automatic/stack storage
    delete ptr_to_dynamic_var;
}
The corresponding assembly (unused labels, library functions, and directives filtered out), generated by GCC 11.1 with the -O0 -std=c++2a flags:
static_var:
        .long 233333                      # a non-const static-storage variable lives in the static data region (marked by the `.long` directive).
main:
        push rbp
        mov rbp, rsp
        sub rsp, 16                       # You can simply regard this as stack allocation, making rsp point to the end of the current stack frame.
        mov edi, 4
        call operator new(unsigned long)  # allocate dynamic memory via `operator new` (usually libc's malloc by default)
        mov DWORD PTR [rax], 55555555     # after the allocation succeeds, ship the value (55555555) into the returned address.
        mov QWORD PTR [rbp-8], rax
        mov DWORD PTR [rbp-12], 666666    # set the value of `stack_var`
        mov rax, QWORD PTR [rbp-8]
        test rax, rax
        je .L2
        mov esi, 4
        mov rdi, rax
        call operator delete(void*, unsigned long)
.L2:
        mov eax, 0
        leave
        ret
To help you read the assembly, let’s quickly review some important x86 registers:
- `%rax`: the return-value register. `operator new` returns the allocated address in `rax`; we then ship the value 55555555 into the memory `rax` points to, and store `rax` itself to `[rbp - 8]`.
- `%rbp`: the start address of the current stack frame (the base pointer for `rsp`, the stack pointer). `[rbp - 8]` is where `ptr_to_dynamic_var` lives (in 64-bit mode, pointers are 8 bytes) and `[rbp - 12]` (12 = 8 + 4) is where `stack_var` lives. Makes a lot more sense, right?
Some intuitive analysis
The (de)allocation of static variables is super cheap: you cannot see any instructions for it at all. (De)allocation of static variables is handled by the OS (the space is reserved by the compiler). Like the memory for the code (instructions), it is created when the program starts and cleaned up when the program ends. Hence, there is no runtime cost to (de)allocating them.
Stack memory is also cheap. Theoretically (let’s ignore interrupts), only two instructions (`push rbp` and `mov rbp, rsp`) are used to set up the frame and one more (`sub rsp, 16`) to “allocate” it. Since we can simply assume infinite memory in the direction of stack growth, allocation is as cheap as bumping the stack pointer (or statically encoding its offset in the assembly). Stack allocation behavior is fixed at compile time, since it depends only on your code. It is slightly dynamic in that the program decides at runtime which path to take, and some paths may build a deeper stack than others. But in general, stack allocation is quite cheap precisely because it is so “static”.
Well, as for “the type of crime”, dynamic allocation is entirely the user’s (or input’s) call. It is your call when to allocate (e.g., `new`) and deallocate (e.g., `delete`). The cost of that flexibility is complexity and efficiency. First, flexibility lets you `new`/`delete` anywhere you want, but the result might not be legal or well-formed: you may kill your program with a double free, by deallocating a non-existent pointer, through a memory leak, and in many other ways. To avoid such issues, you eventually ask for help from garbage collection, smart pointers, and other ownership machinery, introducing complexity to your code (tell me the probability of successfully compiling a Rust program on the first try, and how much you love `unsafe`) and some performance degradation. Second, everything is a trade-off, and “allocate anywhere” does not come for free. You, or the memory allocation library you rely on (e.g., jemalloc), need to carefully maintain one or more logically contiguous buffers to make allocation either fast (monotonic allocator) or long-lived/robust (compacting allocator). The actual instructions executed at runtime behind that `call operator new` can be dramatic. Such mechanisms usually make allocation cost unstable: it can be fast when there is plenty of space left in the pool, but extremely slow when memory is heavily fragmented (did you allocate a lot of small objects with `malloc`?). Besides, for small objects, stack memory is almost always more cache-friendly than heap memory: you are constantly touching the stack (see the example assembly above), so it can be considered always in cache.
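If you want the fast path of a monotonic allocator without writing one, C++17’s std::pmr ships it out of the box. A minimal sketch (the 1 KiB buffer size is an arbitrary choice for illustration):

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    // A stack buffer backs a monotonic allocator: each allocation is
    // just a pointer bump inside `buf`, and everything is released at once.
    std::array<std::byte, 1024> buf; // 1 KiB is an arbitrary size for this sketch
    std::pmr::monotonic_buffer_resource pool { buf.data(), buf.size() };
    std::pmr::vector<int> v { &pool }; // elements live in `buf`, not on the heap
    for (int i = 0; i < 8; ++i)
        v.push_back(i);
    return static_cast<int>(v.size()); // 8
}
```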
In a word, dynamic memory is guilty, and we should try our best to avoid it when the program is performance-critical (today’s compilers cannot solve this problem for us at all).
Also, I’d like to quote a passage from Optimized C++, whose author seems to hate dynamic memory even more than I do:
That’s where the money is.
—Bank robber Willie Sutton (1901–1980)
This quote was attributed to Sutton as the answer to a reporter’s 1952 question, “Why do you rob banks?” Sutton later denied ever having said it.
Except for the use of less-than-optimal algorithms, the naïve use of dynamically allocated variables is the greatest performance killer in C++ programs. Improving a program’s use of dynamically allocated variables is so often “where the money is” that a developer can be an effective optimizer knowing nothing other than how to reduce calls into the memory manager.
But he also said:
Lest I start a panic, let me say that the goal in optimizing memory management is not to live an ascetic life free of distraction from the many useful C++ features that use dynamically allocated variables. Rather, the goal is to remove performance-robbing, unneeded calls into the memory manager through skillful use of these same features.
Making std::array and small vectors your second option
Well, talk is cheap; let’s see some real C++ code.
Introducing std::array
A standard array occupies stack memory (or static memory if it has static storage duration).
// ill-advised: triggers dynamic allocation.
const std::vector ill_vec {0, 1, 2, 3, 4, 5, 6, 7};
// better (remember to include <experimental/array>)
const auto good_vec = std::experimental::make_array(0, 1, 2, 3, 4, 5, 6, 7);
// way better
constexpr auto better_vec = std::experimental::make_array(0, 1, 2, 3, 4, 5, 6, 7);
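If you can use C++20 (the post already compiles with -std=c++2a), std::to_array from <array> does the same job without the experimental header:

```cpp
#include <array>

// C++20: deduces std::array<int, 8>; no experimental header needed.
constexpr auto cxx20_vec = std::to_array({0, 1, 2, 3, 4, 5, 6, 7});
```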
Limitations of std::array:
- Its size is statically fixed:
- Limited initialization;
- Cannot append or fall back to a dynamic vector;
- No size/capacity distinction (see the sketch after the code below);
// illegal: N is not a compile-time constant
int N;
std::cin >> N;
std::array<int, N> arr {};

// correct: size fixed at compile time
std::array<int, 1000> arr {};
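For illustration, here is a hypothetical hand-rolled workaround for the size/capacity limitation: pair a fixed std::array with a manual element count. It is exactly this bookkeeping (plus a heap fallback) that small vectors package up for you:

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: std::array has no size/capacity distinction,
// so you end up tracking the "used" count by hand.
struct SmallBuf {
    std::array<int, 1000> data {};
    std::size_t used = 0;
    void push_back(int v) { data[used++] = v; } // no bounds check; sketch only
    std::size_t size() const { return used; }
};

int main() {
    SmallBuf buf;
    buf.push_back(42);
    return static_cast<int>(buf.size()); // 1
}
```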
To resolve these issues, try a small vector such as boost::container::small_vector.
Small vectors
// Try this on: https://wandbox.org/
#include <boost/container/small_vector.hpp>
#include <boost/range/irange.hpp>
#include <cassert>

int main() {
    boost::container::small_vector<int, 8> small_vec;
    for (auto i : boost::irange(10)) {
        small_vec.push_back(i);
    }
    assert(small_vec.size() == 10);
    return sizeof(small_vec);
    // 56 = sizeof(ptr) + sizeof(size) + sizeof(capacity) + 8 * sizeof(int)
    //    = 8 + 8 + 8 + 8 * 4
}
Looks quite like a vector, right? In fact, a small vector like small_vector<int, 8> consists of:
- a stack-array path: 8 ints stored inline on the stack while size <= 8;
- a full-vector fallback: it falls back to a normal heap-backed vector once size > 8.
If your data distribution is such that most data.size() <= 8, you definitely want to try a small vector, since it keeps most of your data from being shipped to the heap.
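To see the inline-then-fallback behavior for yourself, here is a quick sketch; stores_inline is a hypothetical, non-portable probe (it compares raw addresses) that checks whether the element buffer lives inside the vector object itself:

```cpp
#include <boost/container/small_vector.hpp>
#include <cstdio>

// Hypothetical probe: does the element buffer sit inside the vector
// object (inline storage) or outside it (heap fallback)?
template <class V>
bool stores_inline(const V& v) {
    auto* p = reinterpret_cast<const char*>(v.data());
    auto* obj = reinterpret_cast<const char*>(&v);
    return p >= obj && p < obj + sizeof(V);
}

int main() {
    boost::container::small_vector<int, 8> v;
    for (int i = 0; i < 8; ++i)
        v.push_back(i);
    std::printf("size 8 inline? %d\n", stores_inline(v)); // expected: 1
    v.push_back(8); // exceeds the inline capacity of 8
    std::printf("size 9 inline? %d\n", stores_inline(v)); // expected: 0
}
```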
Benchmark
We compare push_back efficiency between std::vector and boost::container::small_vector.
#include <benchmark/benchmark.h>
#include <boost/container/small_vector.hpp>
#include <boost/range/irange.hpp>
#include <chrono>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

const std::vector<int> workload = [] {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(100ms);
    std::vector<int> v;
    constexpr size_t n_workload = 100;
    v.reserve(n_workload);
    std::random_device rd {};
    std::mt19937 gen { rd() };
    std::chi_squared_distribution<float> dist(8.);
    // see https://en.cppreference.com/w/cpp/numeric/random/chi_squared_distribution
    std::cout << "Dataset size distribution:\n";
    for (auto i : boost::irange(n_workload)) {
        int s = static_cast<int>(dist(gen));
        v.push_back(s);
        std::cout << s << '\t';
        if (0 == (i + 1) % 20)
            std::cout << std::endl;
    }
    return v;
}();

template <size_t N> void PushBackSmallVecBench(benchmark::State& state)
{
    for (auto _ : state) {
        for (int s : workload) {
            boost::container::small_vector<int, N> v;
            for (int i : boost::irange(s))
                v.push_back(i);
        }
    }
}

template <size_t N> void PushBackStdVecBench(benchmark::State& state)
{
    for (auto _ : state) {
        for (int s : workload) {
            std::vector<int> v;
            v.reserve(N);
            for (int i : boost::irange(s))
                v.push_back(i);
        }
    }
}

static void PushBackSmallVector_BaseSize8(benchmark::State& state) { PushBackSmallVecBench<8>(state); }
static void PushBackSmallVector_BaseSize12(benchmark::State& state) { PushBackSmallVecBench<12>(state); }
static void PushBackSmallVector_BaseSize16(benchmark::State& state) { PushBackSmallVecBench<16>(state); }
static void PushBackSmallVector_BaseSize32(benchmark::State& state) { PushBackSmallVecBench<32>(state); }
static void PushBackVector_BaseSize0(benchmark::State& state) { PushBackStdVecBench<0>(state); }
static void PushBackVector_BaseSize8(benchmark::State& state) { PushBackStdVecBench<8>(state); }
static void PushBackVector_BaseSize12(benchmark::State& state) { PushBackStdVecBench<12>(state); }
static void PushBackVector_BaseSize16(benchmark::State& state) { PushBackStdVecBench<16>(state); }
static void PushBackVector_BaseSize32(benchmark::State& state) { PushBackStdVecBench<32>(state); }

// Register the functions as benchmarks
BENCHMARK(PushBackVector_BaseSize0);
BENCHMARK(PushBackVector_BaseSize8);
BENCHMARK(PushBackSmallVector_BaseSize8);
BENCHMARK(PushBackVector_BaseSize12);
BENCHMARK(PushBackSmallVector_BaseSize12);
BENCHMARK(PushBackVector_BaseSize16);
BENCHMARK(PushBackSmallVector_BaseSize16);
BENCHMARK(PushBackVector_BaseSize32);
BENCHMARK(PushBackSmallVector_BaseSize32);

// Run the benchmark
BENCHMARK_MAIN();
The logic of the benchmark is to create a “length set” that follows a chi-squared distribution (dof = 8). For each benchmark case, we let the (small) vector append that many values and measure its efficiency. The differences among the cases are:
- Type: small vector or standard vector;
- Pre-allocated size:
  - for std::vector, we use .reserve to reserve some heap memory;
  - for boost::container::small_vector, we simply leverage the pre-allocated stack buffer as the reserved memory.
The results:
Dataset size distribution:
5 9 10 5 10 5 6 10 5 5 6 2 6 4 8 8 4 12 13 5
4 13 14 5 10 11 0 10 13 3 16 5 4 3 10 6 3 18 6 13
5 16 14 25 1 10 3 7 7 14 9 11 12 7 14 7 9 4 8 5
8 4 9 8 7 19 7 11 6 7 2 2 9 7 4 8 4 8 7 4
8 3 7 14 2 6 8 4 5 3 8 6 7 2 9 7 8 9 7 14
2021-07-10T14:33:01+08:00
Running /Users/ganler-mac/CLionProjects/PlayBoost/cmake-build-release/PlayBoost
Run on (8 X 2200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 3.08, 3.10, 3.03
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
PushBackVector_BaseSize0 56440 ns 56058 ns 12727
PushBackVector_BaseSize8 18255 ns 17942 ns 40257
PushBackSmallVector_BaseSize8 9183 ns 9000 ns 96773
PushBackVector_BaseSize12 21918 ns 20460 ns 34947
PushBackSmallVector_BaseSize12 6073 ns 5211 ns 100000
PushBackVector_BaseSize16 13196 ns 12863 ns 44217
PushBackSmallVector_BaseSize16 1776 ns 1753 ns 371258
PushBackVector_BaseSize32 12384 ns 12059 ns 78193
PushBackSmallVector_BaseSize32 2435 ns 2104 ns 336143
Conclusions:
- reserving memory up front is faster;
- stack memory is faster;
- a reserved size closer to the length distribution is faster.
Conclusion
- Use stack memory if you can (but watch out for stack overflow, so do this for small objects);
- Even if stack memory is not applicable in your case (to avoid stack overflow), try to reserve some heap memory;
- Check whether your data follows some distribution, and reserve memory according to it (a sketch follows).
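For that last point, here is a tiny hypothetical helper (pick_capacity is my own name, not a library function): measure your lengths once, then choose a reserve/inline capacity covering, say, 95% of cases:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: pick a capacity that covers fraction `q`
// (e.g., 0.95) of the observed lengths, so most instances never
// touch the heap. Assumes `lengths` is non-empty.
std::size_t pick_capacity(std::vector<std::size_t> lengths, double q) {
    std::sort(lengths.begin(), lengths.end());
    auto idx = static_cast<std::size_t>(q * (lengths.size() - 1));
    return lengths[idx];
}

int main() {
    std::vector<std::size_t> lengths { 5, 9, 10, 5, 12, 7, 8, 16, 4, 6 };
    return static_cast<int>(pick_capacity(lengths, 0.95)); // 12
}
```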
To extend:
A VLA (variable-length array) allocates user-space stack memory of arbitrary, runtime-determined size. It is standard in C99 but not in C++.
// valid in C99
int n = 10;
int arr[n]; // size decided at runtime
Some points regarding VLAs:
- Fast:
  - Can be allocated on the stack (most of the time).
- Slow:
  - Many more instructions than a fixed-size array allocation, since adjusting the stack size at runtime takes extra work. See slide 9 of Kees’ presentation.
  - (In the general case) it may not sit right after your existing stack in physical memory: since the size is variable, the underlying implementation might allocate it somewhere else and virtually “concat” it with your old stack.
- Dangerous:
  - Since it is unrestricted, it is a blazing-fast way to trigger a stack overflow.
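If you want the VLA fast path in C++ without the unbounded-stack risk, a small vector is a safer spelling of the same idea. A minimal sketch (the inline capacity of 64 is an arbitrary choice):

```cpp
#include <boost/container/small_vector.hpp>

// Like `int arr[n]`: stack storage while n <= 64, but with a safe heap
// fallback (instead of a stack overflow) for larger n.
void process(int n) {
    boost::container::small_vector<int, 64> arr(n); // n value-initialized ints
    // ... use arr.data() / arr.size() as you would the VLA ...
}

int main() { process(100); } // n > 64: falls back to the heap gracefully
```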