Monday, June 29, 2015

Design patterns for invariant suspension

One of the most important aspects of data encapsulation as embodied by classes and similar constructs is the ability to maintain class invariants by restricting access to the implementation data through a well-defined interface. (Incidentally, a class without significant invariants is little more than a tuple of values, which is why the presence of getters and setters is a sure sign of poor design.) Sometimes, however, maintaining the class invariant can lead to unacceptable costs in terms of complexity or performance. Consider two examples:
  • A sorted_vector class that maintains an internal buffer of ordered elements.
  • A spread_sheet class that implements automatic formula calculation.
These classes restrict the potential states of their implementation data so that they meet some extra requirements the interface relies on (for instance, the fact that the elements of sorted_vector are ordered allows for fast lookup operations, the automatically calculated values in the spread_sheet are kept consistent with the input data cells, etc.). State can then only be changed through the provided interface, which takes proper care that the invariant holds upon exit from any method —and this is all good and desirable.
The problem arises when we need to do bulk modifications on the internal data and the class interface is too rigid to support this:
  • sorted_vector typically only allows for deletion of elements or insertion of new values, and does not provide mechanisms for directly modifying the elements or temporarily rearranging them.
  • spread_sheet might only accept changing one data cell at a time so as to check which formulas are affected and recalculate those alone.
If we granted unrestricted access to the data, the class would have to renormalize (i.e. restore the invariant) inside every method. Let us see this for the case of sorted_vector:
class sorted_vector
{
  value_type* data(); // unrestricted access to data
  iterator find(const key_type& x)const
  {
    sort(); // must renormalize before every lookup
    ...
  }
  ...
private:
  void sort()const; // sorts the internal buffer, which is declared mutable
};
This is obviously impracticable for efficiency reasons. Another alternative, sometimes found in questionable designs, is to put the burden of maintaining the invariant on the user herself:
class sorted_vector
{
  value_type* data();
  void sort();
  iterator find(const key_type& x)const // must be sorted
  ...
};
The problem with this approach is that it is all too easy to forget to restore the invariant. What is worse, there is no way for the class to detect the oversight and raise some signal (an error code or an exception). A poor attempt at solving this:
class sorted_vector
{
  value_type* data();
  void modifying_data() // user must call this to signal the invariant breach
  {
    sorted=false;
  }
  void sort(); // restores the invariant and sets sorted back to true
  iterator find(const key_type& x)const
  {
    if(!sorted)throw std::runtime_error("not sorted");
    ...
  }
  ...
private:
  bool sorted;
  ...
};
is just as brittle as the previous situation and more cumbersome for the user. If we finally renounce giving access to the internal data, the only way to do bulk modifications of the internal state is to construct the data externally and then reassign:
sorted_vector v;
std::vector<value_type> aux;
... // manipulate data in aux
v=sorted_vector(aux.begin(),aux.end());
This last option is not entirely bad, but incurs serious inefficiencies related to the external allocation of the auxiliary data and the recreation of a new sorted_vector object.
We look for design patterns that enable invariant suspension (i.e. unrestricted access to a class's internal data) in a controlled and maximally efficient way.
Internal invariant suspension
In this pattern, we allow for suspension only within the execution of a class method:
class X
{
  template<typename F>
  void modify_data(F f); // f is invoked with X internal data
};
How does this work? The idea is that f is a user-provided piece of code that is given access to the internal data to manipulate freely. After f is done, modify_data, which issued the invocation, can renormalize the data before exiting. Let us see this in action for sorted_vector:
class sorted_vector
{
public:
  template<typename F>
  void modify_data(F f)
  {
    f(data); // f manipulates the internal data freely
    sort();  // restore invariant
  }
private:
  value_type* data;
  void sort();
};
...
sorted_vector v;
...
v.modify_data([](value_type* data){
  ... // manipulate data as we see fit
});
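For illustration, here is a minimal self-contained sketch of the pattern; it assumes, hypothetically, that the container is backed by a std::vector<int>, whereas the real sorted_vector interface above is of course richer:
#include <algorithm>
#include <cassert>
#include <vector>

class sorted_vector
{
public:
  // invoke f with unrestricted access to the buffer, then renormalize
  template<typename F>
  void modify_data(F f)
  {
    f(data_);
    std::sort(data_.begin(),data_.end()); // restore invariant on exit
  }

  bool contains(int x)const
  {
    // relies on data_ being sorted
    return std::binary_search(data_.begin(),data_.end(),x);
  }

private:
  std::vector<int> data_;
};
...
sorted_vector v;
v.modify_data([](std::vector<int>& data){
  data.push_back(5);
  data.push_back(1); // buffer temporarily unsorted here
});
assert(v.contains(5)); // invariant already restored upon return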
This pattern is most suitable when modifications are relatively localized in code. When the modification takes place in a more uncontrolled setting, we need something even more flexible.
External invariant suspension
Here, we move the data out and accept it back again when the user is done with it.
class X
{
  data extract_data(); // X becomes "empty"
  void accept_data(data& d); // X should be "empty"
};
At first sight, this seems just as bad as our first example with sorted_vector::data(): as soon as we grant access to the internal state, we can no longer be sure the invariant is kept. The key aspect of the pattern, however, is that extract_data sets X to some empty state, i.e., a state without internal data where the invariant is trivially true in all cases —in other words, X gives its internal data away.
class sorted_vector
{
  value_type* extract_data()
  {
    value_type* res=data;
    data=nullptr;
    return res;
  }
  void accept_data(value_type*& d)
  {
    if(data)throw std::runtime_error("not empty");
    // alternatively: delete data
    
    data=d;
    d=nullptr;
    sort(); // restore invariant
  }
};
...
sorted_vector v;
...
value_type* data=v.extract_data();
... // manipulate data
v.accept_data(data);
How does this meet our requirements of control and efficiency?
  • Control: the invariant of v is always true regardless of the user code: when invoking extract_data, v becomes empty, i.e. a vector with zero elements which is of course sorted. The methods of sorted_vector need not check for special cases of invariant suspension.
  • Efficiency: the data is never re-created, but merely moved around; no extra allocations are required.
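As before, a minimal self-contained sketch may help; it assumes, hypothetically, a std::vector<int> buffer and uses swap/move semantics to hand the data over without copying:
#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

class sorted_vector
{
public:
  // gives the internal buffer away; *this is left empty,
  // so the invariant (sortedness) holds trivially
  std::vector<int> extract_data()
  {
    std::vector<int> res;
    res.swap(data_); // data_ is now guaranteed empty
    return res;
  }

  // takes a buffer back and restores the invariant
  void accept_data(std::vector<int>&& d)
  {
    if(!data_.empty())throw std::runtime_error("not empty");
    data_=std::move(d);
    std::sort(data_.begin(),data_.end()); // renormalize
  }

private:
  std::vector<int> data_;
};
...
sorted_vector v;
...
auto data=v.extract_data();
... // manipulate data freely; v remains in a valid (empty) state
v.accept_data(std::move(data));
Compared with the raw-pointer version above, the move-based interface has the additional benefit that v can never be left referring to data it no longer owns.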
Conclusions
Class invariants are a key aspect of data encapsulation, but can lead to inefficient code when the class interface is too restrictive for data manipulation. We have described two general design patterns, internal invariant suspension and external invariant suspension, that allow for internal state modification in a controlled fashion without interfering with the class's general interface or requiring any kind of data copying.

Sunday, June 28, 2015

Traversing a linearized tree: cache friendliness analysis

We do a formal analysis of the cache-related behavior of traversing a levelorder_vector, a linear array of elements laid out according to the breadth-first order of a binary tree. Ignoring roundoff effects, let us assume that the array has n = 2^d − 1 elements and the CPU provides a cache system of N lines capable of holding L contiguous array elements each. We further assume that N > d.
Under these conditions, the elements of level i, being contiguous, are paged to 2^i/L lines.
The traversal of the array corresponds to the in-order sequence of the tree implicitly encoded, and it is easy to see that this visits the elements of each level i exactly in the order they are laid out in memory: so, given that N > d, i.e. the CPU can hold at least as many lines in cache as there are levels in the tree, an optimal caching scheduler will result in only 2^i/L misses per level, or n/L overall. The number of cache misses per element during traversal is then constant and equal to 1/L, which is the minimum attainable for any traversal scheme (like the simplest one consisting of walking the array linearly from first to last).
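Indeed, summing the per-level misses over the d levels makes the total explicit:
$$\sum_{i=0}^{d-1}\frac{2^i}{L}=\frac{2^d-1}{L}=\frac{n}{L}.$$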
The question remains of whether we can justifiably hold the assumption that N > d in modern architectures. In the Intel Core family, for instance, the data L1 cache is 32 KB (per core) and the line size is 64 B, so N = 512. On the other hand, the maximum d possible in 64-bit architectures is precisely 64, so the condition N > d is indeed met by a large margin.

Saturday, June 27, 2015

Traversing a linearized tree

In a previous entry we have introduced the data structure levelorder_vector<T>, which provides faster binary search than sorted vectors (such as sported by Boost.Container flat associative containers) by laying out the elements in a linear fashion following the breadth-first or level order of the binary tree associated to the sequence. As usual with data structures, this is expected to come at a price in the form of poorer performance for other operations.
Let us begin with iteration: traversing a levelorder_vector is algorithmically equivalent to doing so in its homologous binary tree:
The picture above shows the three generic cases the algorithm for incrementing a position has to deal with, highlighted with the same color in the following pseudocode:
void increment(position& i)
{
  position j=right(i);
  if(j){                      // right child exists
    do{
      i=j;                    // take the path down
      j=left(i);              // and follow left children
    }while(j);
  }
  else if(is_last_leaf(i)){
    i=end;                    // next of last is end
  }
  else{
    while(is_right_child(i)){ // go up while the node is
      i=parent(i);            // the right child of its parent 
    }
    i=parent(i);              // and up an extra level
  }
}
In the case of levelorder_vector, positions are numerical indexes into an array, so these operations are purely arithmetic (by contrast, in a node-based tree we would need to follow pointers around, which is more expensive as we will see shortly):
  • parent(i) = (i − 1)/2,
  • right(i) = 2i + 2 (if less than size),
  • left(i) = 2i + 1 (if less than size),
  • is_right_child(i) = (i mod 2 = 0),
  • is_last_leaf(i) = i does not have a right child and i + 2 is a power of 2.
The last identity might be a little trickier to understand: it simply relies on the fact that the last leaf of the tree must also be the last element of a fully occupied level, and the number of elements in a complete binary tree with d levels is 2^d − 1, hence the zero-based index of the element + 2 has to be a power of two.
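As an illustration, here is a sketch of these index operations in C++ (names and signatures are mine, not taken from the actual levelorder_vector implementation), with the increment algorithm rewritten in terms of them:
#include <cstddef>

typedef std::size_t pos; // index into the level-order array; n acts as "end"

pos  parent(pos i)             {return (i-1)/2;}
pos  left(pos i)               {return 2*i+1;}
pos  right(pos i)              {return 2*i+2;}
bool has_right(pos i,pos n)    {return right(i)<n;}
bool is_right_child(pos i)     {return i%2==0;} // right children have even indexes
bool is_power_of_two(pos m)    {return m!=0&&(m&(m-1))==0;}
bool is_last_leaf(pos i,pos n) {return !has_right(i,n)&&is_power_of_two(i+2);}

pos increment(pos i,pos n)
{
  if(has_right(i,n)){           // right child exists
    pos j=right(i);
    do{
      i=j;                      // take the path down
      j=left(i);                // and follow left children
    }while(j<n);
    return i;
  }
  else if(is_last_leaf(i,n)){
    return n;                   // next of last is end
  }
  else{
    // as in the pseudocode, the root is never reached here:
    // the last element has already been handled by is_last_leaf
    while(is_right_child(i)){   // go up while the node is
      i=parent(i);              // the right child of its parent
    }
    return parent(i);           // and up an extra level
  }
}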
A test program is provided that implements forward iterators for levelorder_vector (backwards traversal is mostly symmetrical) and measures the time taken to traverse std::set, boost::container::flat_set and levelorder_vector instances holding n = 10,000 to 3 million integer elements.
MSVC on Windows
Test compiled with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz.
Traversal time / number of elements.
As expected, boost::container::flat_set is the fastest: nothing (single threaded) can beat traversing an array of memory in sequential order. levelorder_vector is around 5x slower than boost::container::flat_set but up to 4x faster than std::set. More interestingly, its traversal time per element is virtually unaffected by the size of the container. This can be further confirmed by measuring a non-touching traversal where the range [begin, end) is walked through without actually dereferencing the iterator, and consequently without accessing the container memory at all: resulting times remain almost the same, which indicates that the traversal time is dominated by the arithmetic operations on indexes. In a later entry we provide a formal proof that traversing a levelorder_vector is indeed cache-friendly.
GCC on Linux
Contributed by Manu Sánchez (GCC 5.1, Arch Linux 3.16, Intel Core i7 4790k @ 4.5GHz) and Laurentiu Nicola (GCC 5.1, Arch Linux x64, Intel Core i7-2760QM CPU @ 2.40GHz and Intel Celeron J1900 @ 1.99GHz). Rather than showing the plots, we provide the observed ranges in execution speed ratios between the three containers (for instance, the cell "10-30" means that boost::container::flat_set is between 10 and 30 times faster than levelorder_vector for Intel Core i7).
            flat_set / levelorder_vector    levelorder_vector / std::set
i7          10-30                           3-8
Celeron     10-12                           2.5-4.5
Clang on Linux
Clang 3.6.1 with libstdc++-v3 and libc++ on Arch Linux 3.16 with an Intel Core i7 4790k @ 4.5GHz, and Clang 3.6.1 on Arch Linux x64 with an Intel Core i7-2760QM CPU @ 2.40GHz and an Intel Celeron J1900 @ 1.99GHz.
                    flat_set / levelorder_vector    levelorder_vector / std::set
libstdc++-v3, i7    10-50                           2-7
libc++, i7          10-50                           2.5-8
Celeron             15-30                           2.5-4.5
Conclusions
Traversing a levelorder_vector takes a constant time per element that sits in between those of boost::container::flat_set (10-50 times slower, depending on the number of elements and the environment) and std::set (2-8 times faster). Users need to consider their particular execution scenarios to weigh this fact against the superior performance offered at lookup.

Thursday, June 25, 2015

Cache-friendly binary search

High-speed memory caches present in modern computer architectures favor data structures with good locality of reference, i.e. the property by which elements accessed in sequence are located in memory addresses close to each other. This is the rationale behind classes such as Boost.Container flat associative containers, which emulate the functionality of standard C++ node-based associative containers while storing the elements contiguously (and in order). This is an example of how binary search works in a boost::container::flat_set with elements 0 through 30:
As searching narrows down towards the looked-for value, visited elements are progressively closer, so locality of reference is very good for the last part of  the process, even if the first steps hop over large distances. All in all, the process is more cache-friendly than searching in a regular std::set (where elements are totally scattered throughout memory according to the allocator's whims) and resulting lookup times are consequently much better. But we can get even more cache-friendly than that.
Consider the structure of a classical red-black tree holding values 0 to 30:
and lay out the elements in a contiguous array following the tree's breadth-first (or level) order:
We can perform binary search on this level-order vector beginning with the first element (the root) and then jumping to either the "left" or "right" child as comparisons indicate:
The hopping pattern is somewhat different from that of boost::container::flat_set: here, the first visited elements are close to each other and hops become increasingly longer. So, we get good locality of reference for the first part of the process, exactly the opposite of before. It could seem then that we are no better off with this new layout, but in fact we are. Let us have a look at the heat map of boost::container::flat_set:
This shows how frequently a given element is visited through an average binary search operation (brighter means "hotter", that is, accessed more). As every lookup begins at the mid element 15, this gets visited 100% of the times, then 7 and 23 have 50% visit frequency, etc. On the other hand, the analogous heat map for a level-order vector looks very different:
Now hotter elements are much closer: as cache management mechanisms try to keep hot areas longer in cache, we can then expect this layout to result in fewer cache misses overall. To summarize, we are improving locality of reference for hotter elements at the expense of colder areas of the array.
Let us measure this data structure in practice. I have written a prototype class template levelorder_vector<T> that stores its contents in level-order sequence and provides a member function for binary search:
const_iterator lower_bound(const T& x)const
{
  size_type n=impl.size(),i=n,j=0; // i: best candidate so far, j: current node
  while(j<n){
    if(impl[j]<x){
      j=2*j+2; // descend to the right child
    }
    else{
      i=j;     // impl[j]>=x: record it as the candidate
      j=2*j+1; // and descend to the left child
    }
  }
  return begin()+i; // i==n (end) if no element >=x was found
}
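For reference, one possible way to produce this layout (not necessarily how the actual levelorder_vector prototype builds it) is to walk the implicit heap-shaped tree of indexes in-order while consuming the sorted input sequence:
#include <cstddef>
#include <vector>

template<typename T>
void fill_inorder(const std::vector<T>& sorted,std::vector<T>& out,
                  std::size_t j,std::size_t& k)
{
  if(j>=out.size())return;
  fill_inorder(sorted,out,2*j+1,k); // left subtree first
  out[j]=sorted[k++];               // then the node at index j
  fill_inorder(sorted,out,2*j+2,k); // then the right subtree
}

template<typename T>
std::vector<T> to_levelorder(const std::vector<T>& sorted)
{
  std::vector<T> out(sorted.size());
  std::size_t k=0;
  fill_inorder(sorted,out,0,k);
  return out; // out[0] is the root, children of j sit at 2j+1 and 2j+2
}
With this layout, the lower_bound above navigates exactly the binary search tree implied by the sorted sequence.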
A test program is provided that measures lower_bound execution times with a random sequence of values on containers std::set, boost::container::flat_set and levelorder_vector holding n = 10,000 to 3 million elements.
MSVC on Windows
Test built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the container.
lower_bound execution times / number of elements.
The theoretical lookup times for the three containers are O(log n), which should show as straight lines (horizontal scale is logarithmic), but increasing cache misses as n grows result in some upwards bending. As predicted, levelorder_vector performs better than boost::container::flat_set, with improvement factors ranging from 10% to 30%. Both containers are nevertheless much faster than std::set, and don't show much degradation until n ≃ 5·10^5 (~7·10^4 for std::set), which is a direct consequence of their superior cache friendliness.
GCC on Linux
The following people have kindly sent in test outputs for GCC on various Linux platforms: Philippe Navaux, xckd, Manu Sánchez. Results are very similar, so we just show those corresponding to GCC 5.1 on an Intel Core i7-860 @ 2.80GHz with Linux kernel 4.0 from Debian:
lower_bound execution times / number of elements.
Although levelorder_vector performs generally better than boost::container::flat_set, there is a strange phenomenon for which I don't have an explanation: improvement factors are around 40% for low values of n, but then decrease or even become negative as n approaches 3·10^6. I'd be grateful if readers can offer interpretations of this.
Laurentiu Nicola later submitted results with -march=native -O2 settings that differ significantly from the previous ones: GCC 5.1 on an Intel Core i7-2760QM CPU @ 2.40GHz (Linux distro undisclosed):
lower_bound execution times / number of elements.
and GCC 5.1 on an Intel Celeron J1900 @ 1.99GHz:
lower_bound execution times / number of elements.
-march=native seems to make a big difference: now levelorder_vector execution times are up to 25% (i7) and 40% (Celeron) lower than those of boost::container::flat_set when n approaches 3·10^6, but there is no significant gain for low values of n.
Clang on OS X
As provided by fortch: clang-500.2.79 in -std=c++0x mode on a Darwin 13.1.0 box with an Intel Core i7-3740QM CPU @ 2.70GHz.
lower_bound execution times / number of elements.
The improvement of levelorder_vector with respect to boost::container::flat_set ranges from 10% to 20%.
Clang on Linux
Manu Sánchez has run the tests with Clang 3.6.1 on an Arch Linux 3.16 box with an Intel Core i7 4790k @4.5GHz. In the first case he has used the libstdc++-v3 standard library:
lower_bound execution times / number of elements.
and then repeated with libc++:
lower_bound execution times / number of elements.
Results are similar, with levelorder_vector execution times between 5% and 20% lower than those of boost::container::flat_set. The very noticeable bumps on the graphs for these two containers when using libc++ are also present in the tests for Clang on OS X but do not appear if the standard library used is libstdc++-v3, which suggests that they might be due to some artifact related to libc++'s memory allocator.
Laurentiu's results for Clang 3.6.1 with -march=native -O2 are similar to Manu's for an i7 CPU, whereas for an Intel Celeron J1900 @ 1.99GHz we have:
lower_bound execution times / number of elements.
where levelorder_vector execution times are up to 35% lower than those of boost::container::flat_set.
Conclusions
Even though binary search on sorted vectors performs much better than on node-based data structures, alternative contiguous layouts can achieve even more cache friendliness. We have shown a possibility based on arranging elements according to the breadth-first order of a binary tree holding the sorted sequence. This data structure can possibly perform worse, though, in other operations like insertion and deletion: we defer the study of this matter to a future entry.