Sunday, September 6, 2015

C++ encapsulation for Data-Oriented Design: performance

(Many thanks to Manu Sánchez for his help with running tests and analyzing results.)
In a past entry, we implemented a little C++ framework that allows us to do DOD while retaining some of the encapsulation benefits and the general look and feel of traditional object-based programming. We complete here the framework by adding a critical piece from the point of view of usability, namely the ability to process sequences of DOD entities with as terse a syntax as we would have in OOP.
To enable DOD for a particular class (like the particle we used in the previous entry), i.e., to distribute its different data members in separate memory locations, we change the class source code to turn it into a class template particle<Access> where Access is a framework-provided entity in charge of granting access to the external data members with a similar syntax as if they were an integral part of the class itself. Now, particle<Access> is no longer a regular class with value semantics, but a mere proxy to the external data without ownership to it. Importantly, it is the members and not the particle objects that are stored: particles are constructed on the fly when needed to use its interface in order to process the data. So, code like
for(const auto& p:particle_)p.render();
cannot possibly work because the application does not have any particle_ container to begin with: instead, the information is stored in separate locations:
std::vector<char> color_;
std::vector<int>  x_,y_,dx_,dy_;
and "traversing" the particles requires that we go through the associated containers in parallel and invoke render on a temporary particle object constructed out of them:
auto itc=&color_[0],ec=itc+color_.size();
auto itx=&x_[0];
auto ity=&y_[0];
auto itdx=&dx_[0];
auto itdy=&dy_[0];
  
for(;itc!=ec;++itc,++itx,++ity,++itdx,++itdy){
  auto p=make_particle(
    access<color,x,y,dx,dy>(itc,itx,ity,itdx,itdy));
  p.render();
}
Fortunately, this boilerplate code can be hidden by the framework by using these auxiliary constructs:
template<typename T> class pointer;

template<template <typename> class Class,typename Access>
class pointer<Class<Access>>
{
  // behaves as Class<Access>>*
};

template<template <typename> class Class,typename Access>
pointer<Class<Access>> make_pointer(const Access& a)
{
  return pointer<Class<Access>>(a);
}
We won't delve into the implementation details of pointer (the interested reader can see the actual code in the test program given below): from the point of view of the user, this utility class accepts an access entity, which is a collection of pointers to the data members plus an offset member (this offset has been added to the former version of the framework), it keeps everything in sync when doing pointer arithmetic and dereferences to a temporary particle object. The resulting user code is as simple as it gets:
auto n=color_.size();
auto beg_=make_pointer<particle>(access<color,x,y,dx,dy>(
  &color_[0],&x_[0],&y_[0],&dx_[0],&dy_[0]));
auto end_=beg_+n;
  
for(auto it=beg_;it!=end_;++it)it->render();
Index-based traversal is also possible:
for(std::size_t i=0;i<n;++i)beg_[i].render();
Once the containers are populated and beg_ and end_ defined, user code can handle particles as if they were stored in [beg_, end_), thus effectively isolated from the fact that the actual data is scattered around different containers for maximum processing performance.
Are we paying an abstraction penalty for the convenience this framework affords? There are two sources of concern:
  • Even though traversal code is in principle equivalent to hand-written DOD code, compilers might not be able to optimize all the template scaffolding away.
  • Traversing with access<color,x,y,dx,dy> for rendering when only color, x and y are needed (because render does not access dx or dy) involves iterating over dx_ and dy_ without actually accessing either one: again, the compiler might or might not optimize this extra code.
We provide a test program (Boost required) that measures the performance of this framework against some alternatives. The looped-over render procedure simply updates a global variable so that resulting execution times are basically those of the produced iteration code. The different options compared are:
  • oop: iteration over a traditional object-based structure
  • raw: hand-written data-processing loop
  • dod: DOD framework with access<color,x,y,dx,dy>
  • render_dod: DOD framework with  access<color,x,y>
  • oop[i]: index-based access instead of iterator traversal
  • raw[i]: hand-written index-based loop
  • dod[i]: index-based with access<color,x,y,dx,dy>
  • render_dod[i]: index-based with access<color,x,y>
The difference between dod and render_dod (and the same applies to their index-based variants) is that the latter keeps access only to the data members strictly required by render: if the compiler were not able to optimize unnecessary pointer manipulations in dod, render_dod would be expected to be faster; the drawback is that this would require fine tuning the access entity for each member function.
Manu Sánchez has set up an extensive testing environment to build and run the program using different compilers and machines:
The figures show the release-mode execution times of the eight options described above when traversing sequences of n = 104, 105, 106 and 107 particles.
GCC 5.1, MinGW, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
As expected, OOP is the slowest due to cache effects. The rest of options are basically equivalent, which shows that GCC is able to entirely optimize away the syntactic niceties brought in by our DOD framework.
MSVC 14.0, Windows, Intel Core i7-4790k @4.5GHz
Execution times / number of elements.
Here, again, all DOD options are roughly equivalent, although raw (pointer-based hand-written loop) is slightly slower. Curiously enough, MSVC is much worse at optimizing DOD with respect to OOP than GCC is, with execution times up to 4 times higher for n = 104 and 1.3 times higher for n = 107, the latter scenario being presumably dominated by cache efficiencies.
GCC 5.2, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
From a qualitative point of view, these results are in line with those obtained for GCC 5.1 under an Intel Core i7, although as the AMD A6 is a much less powerful processor execution times are higher (×8-10 for n = 104, ×4-5.5 for n = 107).
Clang 3.6, Linux, AMD A6-1450 APU @1.0 GHz
Execution times / number of elements.
As it happens with the rest of compilers, DOD options (both manual and framework-supported) perform equally well. However, the comparison with GCC 5.2 on the same machine shows important differences: iterator-based OOP is faster (×1.1-1.4) in Clang, index-based OOP yields the same results for both compilers, and the DOD options in Clang are consistently slower (×2.3-3.4) than in GCC, to the point that OOP outperforms them for low values of n. A detailed analysis of the assembly code produced would probably gain us more insight into these contrasting behaviors: interested readers can access the resulting assembly listings at the associated GitHub repository.