On the memory layout of objects and zero-cost abstractions of C++

When I was writing a financial application that stored millions of vectors some years ago, I was intrigued by the overhead of the Visual Studio 2008 implementation of std::vector. Recently I discovered an (undocumented) compiler switch that explains why that happened: /d1reportSingleClassLayoutXXX, where XXX is a class name. If we compile this file, main.cpp:

#include <vector>

std::vector<int> v;

like this:

cl /c /EHsc /nologo /W4 /MT main.cpp /d1reportSingleClassLayout?$vector@HV?$allocator@H@std@@
main.cpp
class ?$vector@HV?$allocator@H@std@@    size(48):
        +---
        | +--- (base class ?$_Vector_val@HV?$allocator@H@std@@)
        | | +--- (base class ?$_Container_base_aux_alloc_real@V?$allocator@H@std@@)
        | | | +--- (base class _Container_base_aux)
 0      | | | | _Myownedaux
        | | | +---
 8      | | | ?$allocator@V_Aux_cont@std@@ _Alaux
        | | | <alignment member> (size=7)
        | | +---
16      | | ?$allocator@H _Alval
        | | <alignment member> (size=7)
        | +---
24      | _Myfirst
32      | _Mylast
40      | _Myend
        +---

we can see that std::vector carries 48 bytes of overhead on top of the raw data (in this case I passed the mangled name of std::vector when invoking cl).

Fortunately, this changed with time and in VS 2017 you get this output:

class std::vector<int,class std::allocator<int> >       size(24):
        +---
 0      | +--- (base class std::_Vector_alloc<struct std::_Vec_base_types<int,class std::allocator<int> > >)
 0      | | ?$_Compressed_pair@V?$allocator@H@std@@V?$_Vector_val@U?$_Simple_types@H@std@@@2@$00 _Mypair
        | +---
        +---

So the size was cut in half: you just need a pointer to the allocation (_Myfirst), a pointer to the end of the used part of the allocation (_Mylast), and a pointer to the end of the allocation (_Myend). The VS implementation now truly follows the zero-overhead principle: it stores only those 3 pointers = 24 bytes.

Looking at the memory layout in VS 2008 we can see that the compiler was storing allocator objects (plus an auxiliary pointer) as data members, which is unnecessary and was removed in later versions. Similarly, std::string reduced its overhead from 48 bytes in VS 2008 to 32 bytes in VS 2017.

The compiler switch has a sister, /d1reportAllClassLayout, which outputs the memory layout of all the classes in the .obj. With options like these it is easy to see why it is usually suggested to declare data members in order of decreasing alignment. E.g.

struct MyClass {
  char a;
  int* b;
  char c;
};

produces:

class MyClass   size(24):
        +---
 0      | a
        | <alignment member> (size=7)
 8      | b
16      | c
        | <alignment member> (size=7)
        +---

but the chars are 1-byte aligned and the pointer is 8-byte aligned, so grouping members of the same alignment together, which is what the rule achieves, removes the wasted padding:

struct MyClass {
  char a;
  char c;
  int* b;
};
class MyClass   size(16):
        +---
 0      | a
 1      | c
        | <alignment member> (size=6)
 8      | b
        +---

This saves 8 bytes. It is also easy to see why the compiler needs padding in the first place. If there were no padding in the last example we would have:

*--------*--------*
|acbbbbbb|bb      |
*--------*--------* 

Accessing b would require fetching two words instead of one, which would not be efficient. Padding is inserted so that the memory access is aligned:

*--------*--------*
|ac      |bbbbbbbb|
*--------*--------* 

The compiler will not reorder the data members by itself, because the C standard requires struct members to have increasing addresses in declaration order. C was designed for direct memory access, and that rule lets programmers predict memory layouts and map blocks of data read from devices directly onto structures.

But reducing memory consumption may not be the main issue at stake:
– On 64-bit x86, a cache line is 64 bytes beginning on a self-aligned address, so you may want to keep data members that are frequently accessed together within the same line
– You may also want to prevent false sharing by separating concurrently accessed data members, so that threads running on different processors do not constantly invalidate each other's cached copies
– In embedded systems, the offsets encoded inside instructions can be very small; e.g. some 16-bit ARM Thumb load instructions have a 5-bit offset field (word offsets from 0 to 124 bytes), so you may want to keep frequently used data members at the beginning of a structure's layout

So with this post I would like people to see the importance of measuring, rather than assuming, zero-cost abstractions in C++. This compiler option, /d1reportSingleClassLayout, is mentioned in an MSDN blog post on how to debug ODR violations and in one of Stephan T. Lavavej's great videos about the STL.


Introducing pct, a tool to help reduce C/C++ compilation times

Until C++ modules become widely available (Microsoft released experimental support for the import statement last December: https://blogs.msdn.microsoft.com/vcblog/2015/12/03/c-modules-in-vs-2015-update-1/ ), we still need to resort to precompiled headers as one way to reduce compilation times on Windows.

Today I released the first version of a tool that auto-generates precompiled headers (usually named stdafx.h on Windows). Auto-generating stdafx.h is not as simple as it may seem. One might think it is enough to grep for standard headers and include all those lines in the project's stdafx.h, but that does not take into account that some of those lines may be disabled depending on macro values, for instance. The tool uses the Boost Wave preprocessor to preprocess the source code of a project and generates a header to be precompiled, including all the standard or third-party library headers referenced in the code.

Using the tool I have been able to reduce compilation times on one of my Visual Studio projects by a factor of six. The source code and the binaries are at:

https://github.com/g-h-c/pct