Tuesday, November 26, 2013

Measuring lookup times for C++ unordered associative containers

This is the final installment of our series of entries on the performance of C++ unordered associative containers as implemented by Dinkumware, Boost.Unordered and Boost.MultiIndex. We now analyze lookup according to the following two scenarios:
void successful_lookup(const container& c,unsigned int n)
{
  while(n--)c.find(rnd());  // replay the values inserted into c
}

void unsuccessful_lookup(const container& c,unsigned int n)
{
  while(n--)c.find(rnd2()); // values almost certainly not in c
}
where rnd generates the same random sequence of values used to populate c with n elements and rnd2 is a different, statistically independent sequence: lookups in the first scenario always succeed, while those in the second one are very likely to fail.
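A minimal sketch of how these sources might be realized (the generator type and seeds are assumptions of ours, not necessarily what the profiling programs use):

#include <random>

std::mt19937                                gen1(12345),gen2(67890);
std::uniform_int_distribution<unsigned int> dist;

// gen1 is reset to its original seed before the lookup pass, so that
// rnd replays exactly the sequence of values inserted into c.
unsigned int rnd() {return dist(gen1);}

// independent seed: generated values are very unlikely to be in c
unsigned int rnd2(){return dist(gen2);}

As usual, we provide profiling programs for the non-duplicate and duplicate versions of the containers. The build and test environment stays the same as before (Microsoft Visual Studio 2012, default release mode settings, Windows box with an Intel Core i5-2520M CPU @2.50GHz), times are in microseconds/element, and n runs from 10,000 to 3 million. The results for the non-duplicate case are depicted in the figure (first successful lookup, then unsuccessful):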
The algorithms implemented by the three libraries are essentially the same (invoke the hash function, map the hash value to a bucket, traverse until either a matching element or the end of the bucket is met; see the sketch after this list); the differences we see in performance are explained by these two factors:
  • Dinkumware, as we have discussed in a previous entry, has the fastest hash→bucket mapping.
  • Boost.Unordered and Boost.MultiIndex use the same approach to locate the bucket to search through, namely mapping hash values to bucket entries with an expensive modulo calculation and getting to the bucket via the preceding element (thus worsening locality with respect to Dinkumware's procedure). Boost.Unordered calculates this modulo slightly less efficiently and, in unsuccessful lookups, often has to do the calculation twice: once to locate the bucket, and a second time to determine where that bucket ends (if it is not empty).
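To make the shared structure of these lookup procedures concrete, here is a minimal sketch of a bucket-based find (the node layout is a hypothetical one of our own, not the actual code of any of the three libraries):

#include <cstddef>
#include <functional>
#include <vector>

struct node{unsigned int key;node* next;};

const node* find(const std::vector<node*>& buckets,unsigned int k)
{
  std::size_t h=std::hash<unsigned int>()(k);  // invoke hash function
  std::size_t b=h%buckets.size();              // map to bucket
  for(const node* p=buckets[b];p;p=p->next){   // traverse the bucket
    if(p->key==k)return p;                     // matching element found
  }
  return nullptr;                              // end of bucket reached
}

Dinkumware would replace the modulo with a bitwise AND (its bucket array sizes are powers of two), which is where its mapping advantage comes from.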
The duplicate case with average group size G = 5 and maximum load factor Fmax = 1 looks like this (first successful lookup, then unsuccessful):
For both scenarios, Dinkumware's performance degrades more than that of the other two libs with respect to the non-duplicate case: when a collision happens, i.e. when a group other than the one searched for occupies the same bucket, Dinkumware has to check every element of that group, whereas Boost.Unordered and Boost.MultiIndex take advantage of their group-skipping capabilities. Boost.MultiIndex has a poorer group-skipping mechanism, as it only applies to groups of size > 2, which is more apparent on unsuccessful lookups (for successful lookups the probability that two groups end up in the same bucket is very small).
Lastly, this is the situation when we set Fmax = 5 (first successful lookup, then unsuccessful):
Now the disadvantage of Dinkumware is much clearer: with a typical load F = 0.75·Fmax = 3.75, the average number of elements checked is 1 + F/2 = 2.875 for successful lookups and F = 3.75 for unsuccessful ones (a successful lookup traverses about half of its bucket before hitting the match, while an unsuccessful one scans the whole bucket), while for Boost.Unordered the corresponding numbers are 1 + F/(2G) = 1.375 and F/G = 0.75, respectively, and slightly worse than these for Boost.MultiIndex, as this library does not perform group skipping for group sizes ≤ 2.
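A quick check of the arithmetic:

#include <iostream>

int main()
{
  const double Fmax=5.0,G=5.0;
  const double F=0.75*Fmax;   // typical load: 3.75
  std::cout<<1+F/2    <<"\n"; // Dinkumware, successful: 2.875
  std::cout<<F        <<"\n"; // Dinkumware, unsuccessful: 3.75
  std::cout<<1+F/(2*G)<<"\n"; // Boost.Unordered, successful: 1.375
  std::cout<<F/G      <<"\n"; // Boost.Unordered, unsuccessful: 0.75
}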

Monday, November 18, 2013

Measuring erasure times for C++ unordered associative containers

After measuring insertion times for Dinkumware, Boost.Unordered and Boost.MultiIndex implementations of C++ unordered associative containers without and with duplicates, we now turn to profiling erasure. As a model for this operation we choose the following scenario:
void erase(container& c)
{
  for(auto it:rnd_it_range(c))c.erase(it);
}
where rnd_it_range(c) is a random shuffling of [c.begin(), ++c.begin(), ... , c.end()). One possible realization is sketched below (a sketch of ours, not necessarily the profiling programs' exact code):
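#include <algorithm>
#include <random>
#include <vector>

// Collect all iterators of c and return them in random order. Erasing
// through them is safe: in unordered containers, erasure only
// invalidates iterators to the erased element.
template<typename Container>
std::vector<typename Container::iterator> rnd_it_range(Container& c)
{
  std::vector<typename Container::iterator> v;
  for(auto it=c.begin();it!=c.end();++it)v.push_back(it);
  std::mt19937 gen(34862);  // arbitrary seed
  std::shuffle(v.begin(),v.end(),gen);
  return v;
}

I've written profiling programs for the non-duplicate and duplicate versions of the containers, which were built and run on the same environment as in previous entries. The figures show the resulting erasure times in microseconds/element for n = 10,000 to 3 million elements.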
These are the results for containers without duplicate elements:
Dinkumware and Boost.MultiIndex have unconditional O(1) erasure, since buckets need not be traversed to delete an element, and moreover the locality of their corresponding erasure procedures is very good: all of this shows up as superior performance with respect to Boost.Unordered, which needs to traverse the bucket in order to locate the node preceding the one being deleted. The CPU memory cache, which rewards algorithm locality, widens the performance differences as n grows.
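The predecessor problem Boost.Unordered faces can be sketched as follows (a hypothetical singly linked bucket of our own, not the library's actual code):

struct node{unsigned int value;node* next;};

// Erasing victim from a singly linked bucket: unless it comes first,
// the bucket has to be traversed to find the preceding node.
void erase_from_bucket(node*& bucket,node* victim)
{
  if(bucket==victim)bucket=victim->next;
  else{
    node* prev=bucket;
    while(prev->next!=victim)prev=prev->next;  // the costly traversal
    prev->next=victim->next;
  }
  delete victim;
}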
Containers with duplicate elements are populated with a random sequence of elements with groups of equivalent values having an average size G = 5. We first set the maximum load factor Fmax = 1:
Dinkumware's performance does not change, as this library's containers treat equivalent elements in exactly the same manner as in the non-duplicate case, and bucket occupancy, although greater here, does not affect the erasure procedure. Boost.Unordered improves somewhat: when deleting an element in the middle of a group of equivalent values, the library does not need to traverse the bucket, because the preceding and following elements are linked in a circular list; the net result is that locality improves for these elements and stays the same for elements at the beginning or end of a group, so overall performance is better. Boost.MultiIndex behaves the worst here: even though erasure is still unconditionally constant-time, the complexity of the underlying data structure implies that as many as seven memory locations can be visited to erase an element, 75% more than Dinkumware and about the same as Boost.Unordered.
When we set Fmax = 5, the situation does not change dramatically:
Boost.Unordered is the only library that could theoretically be impacted by the higher bucket occupancy, but the effect is nonetheless not noticeable.
It should be noted that Dinkumware (in disagreement with the C++ standard) invokes the hash function of the element being erased, which would negatively affect its performance for elements harder to hash than simple unsigned ints. That said, the special provisions that Boost.Unordered and Boost.MultiIndex implement to accelerate the processing of groups of equivalent elements do not achieve spectacular performance improvements for erasure, and can even result in decidedly slower times. It is at lookup that these libraries excel, as we will see in a future entry.
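The non-conforming hash invocation is easy to observe with a counting hash functor (a sketch of ours; the exact count is implementation-dependent):

#include <cstddef>
#include <functional>
#include <iostream>
#include <unordered_set>

struct counting_hash
{
  static unsigned long count;  // invocations so far
  std::size_t operator()(unsigned int x)const
  {
    ++count;
    return std::hash<unsigned int>()(x);
  }
};
unsigned long counting_hash::count=0;

int main()
{
  std::unordered_set<unsigned int,counting_hash> c={0,1,2,3,4};
  counting_hash::count=0;
  c.erase(c.begin());
  // 0 for a conforming implementation, >0 when erase rehashes the
  // element being deleted, as Dinkumware's does
  std::cout<<"hash invocations during erase: "<<counting_hash::count<<"\n";
}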

Saturday, November 16, 2013

Measuring insertion times for C++ unordered associative containers with duplicate elements

We continue our measuring plan for C++ unordered associative containers and focus now on insertion in containers with duplicate elements as implemented by Dinkumware, Boost.Unordered and Boost.MultiIndex. As before, we consider non-rehashing and rehashing scenarios:
// non-rehashing
void insert(container& c,unsigned int n,float Fmax)
{
  c.max_load_factor(Fmax);
  c.reserve(n);
  while(n--)c.insert(rnd());
}

// rehashing
void insert(container& c,unsigned int n,float Fmax)
{
  c.max_load_factor(Fmax);
  while(n--)c.insert(rnd());
}
with two additional complications:
  • The random source produces duplicate elements where each different value is repeated G = 5 times on average within a run of n invocations (see the sketch after this list).
  • We study the performance with maximum load factors Fmax = 1 (the default specified by the standard) and Fmax = G = 5. The latter case results in an element distribution across buckets equivalent to that of a container without duplicate elements being fed the same random source (i.e. where each group of equivalent elements is reduced to just one element). As studied before, this favors implementations that are able to skip groups of equivalent elements in constant time, which is the case for Boost.Unordered and Boost.MultiIndex. 
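Such a source can be sketched as follows (constants and seed are assumptions of ours): drawing uniformly from a pool of n/G different values makes each value appear G times on average across the n insertions.

#include <random>

// Values drawn uniformly from [0,n/G) appear G = 5 times on average
// over a run of n invocations.
struct duplicate_source
{
  duplicate_source(unsigned int n,unsigned int G=5):
    gen(73642),dist(0,n/G-1){}

  unsigned int operator()(){return dist(gen);}

  std::mt19937                                gen;
  std::uniform_int_distribution<unsigned int> dist;
};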
The test program was compiled and run on the same environment as in our previous entry. The first figure shows the results for the non-rehashing scenario and Fmax = 1, times in microseconds/element.
The sigmoid shapes due to CPU memory cache effects should be familiar by now. Boost.Unordered and Boost.MultiIndex perform very similarly despite their different data structures and insertion algorithms: the former is slower at determining the end of a bucket whereas the latter does worse for groups of size < 3, and these two effects seem to cancel each other out. Why is Dinkumware as fast or even faster than the other two libraries?
  • As we have proved in another entry, the distribution of elements when duplicates are allowed varies significantly with respect to the non-duplicate situation for the same load factor: many more buckets are empty and the probability that two groups of elements end up in the same bucket is much lower. Consequently, most of the time the insertion procedure need not traverse the entire bucket, and the fact that Dinkumware cannot skip groups in constant time is irrelevant.
  • Furthermore, the locality of Dinkumware is better than that of Boost.Unordered and Boost.MultiIndex, which need to access the first and last elements of a group upon each new insertion; this shows up as an additional advantage for Dinkumware when n is large and most of the active memory set lies outside the CPU cache. 
When we set Fmax = 5, results look like this:
Although the level of bucket occupancy is five times higher, Boost.Unordered and Boost.MultiIndex insertion times do not grow much, precisely because they are not affected by the size of groups of equivalent elements. Dinkumware performance, on the other hand, degrades noticeably, as expected: when two groups occupy the same bucket, which is now more likely, reaching the second group implies traversing all the elements of the first.
The following figures show the insertion times for the rehashing scenario and Fmax = 1 and 5, respectively.
The results are not so easy to interpret as before. There are several competing forces acting here:
  • Dinkumware's rehashing procedure is the slowest from an algorithmic point of view (each element is rehashed separately), but it has the best locality, which compensates for its initial disadvantage when n grows.
  • Boost.Unordered rehashes fastest of all three libraries because the implementation stores precomputed hash values within element nodes, as sketched below. This is more apparent for large values of n, as evidenced by the height of the graph "steps", and would show even more distinctly if the elements were harder to hash than the plain unsigned ints we have used (for instance, if we had dealt with std::strings).
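The rehashing advantage of stored hash values can be sketched like this (a hypothetical node layout of ours, not Boost.Unordered's actual one; new_buckets is assumed to be already sized and null-filled):

#include <cstddef>
#include <vector>

struct node{unsigned int value;std::size_t hash;node* next;};

// Relocate every node into the new bucket array: the stored hash is
// reused, so no hash function is invoked during rehashing.
void rehash(std::vector<node*>& old_buckets,std::vector<node*>& new_buckets)
{
  for(node*& b:old_buckets){
    for(node* p=b;p;){
      node* next=p->next;
      std::size_t pos=p->hash%new_buckets.size();  // no hash(p->value) here
      p->next=new_buckets[pos];
      new_buckets[pos]=p;
      p=next;
    }
    b=nullptr;
  }
}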
In future entries we will go on to profiling erasure and lookup times.

Saturday, November 9, 2013

Measuring insertion times for C++ unordered associative containers

Equipped with our theoretical analysis on the complexity of insertion for C++ unordered associative containers without duplicate elements, let us now measure the real thing. I have written a program that profiles running insertion times for Dinkumware, Boost.Unordered and Boost.MultiIndex. Tests were compiled with Microsoft Visual Studio 2012, default release mode settings, and run on a Windows machine with an Intel Core i5-2520M processor running at 2.50GHz.
We begin with the non-rehashing scenario:
void insert(container& c,unsigned int n)
{
  c.reserve(n);
  while(n--)c.insert(rnd());
}
The results for n = 10,000 to 3 million elements are depicted in the figure. Times are microseconds/element.
Well, this is anything but average constant time! What we are seeing is in all likelihood the effect of the CPU memory cache, which in our testing environment has a size of 3 MB. This means that the active memory set fits into the CPU cache up to n ≈ 150,000 for Dinkumware and n ≈ 200,000 for Boost.Unordered and Boost.MultiIndex: beyond that point, average memory read/write time gradually increases and is expected to stabilize when the size of the memory set is much larger than the CPU cache. Leaving these effects aside, what the figure tells us about the performance of the containers is:
  • Boost.MultiIndex is slightly faster than Boost.Unordered because determining the end of a bucket only involves some pointer checks in the former lib, whereas the latter has to use more expensive verifications based on stored hash values.
  • Dinkumware's bucket array sizes are powers of two, whereas Boost.Unordered and Boost.MultiIndex resort to an (approximately) geometric progression of prime numbers. The former scenario allows for a much faster mapping of hash values to bucket entries because n%m, in general an expensive operation, can be efficiently implemented as n&(m-1) when m is a power of two (see the snippet after this list). On the other hand, Dinkumware's pointer manipulation is the heaviest of all three libs, and the net result is that its observed insertion performance sits more or less between those of the other two.
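The equivalence holds whenever m is a power of two, as this little check illustrates:

#include <cassert>

int main()
{
  const unsigned int m=64;   // a power of two: 2^6
  for(unsigned int n=0;n<1000;++n){
    assert(n%m==(n&(m-1)));  // bitwise AND replaces integer division
  }
}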
Now we consider the rehashing scenario:
void insert(container& c,unsigned int n)
{
  while(n--)c.insert(rnd());
}
The differences with respect to the previous case are due, obviously, to the presence of rehashing:
  • Dinkumware excels when n is large because of the aforementioned advantage of using powers of two for bucket array sizes, which is more prevalent under rehashing conditions.
  • Unlike Boost.Unordered, Boost.MultiIndex needs to calculate and temporarily store hash values each time a rehashing occurs, which eats up its head start when n grows.
Note that the test uses unsigned ints as elements, whose associated hash function is trivial: if the elements were harder to hash (as in the case of std::strings), Boost.Unordered, the only lib that stores hash values, should beat the other two for large numbers of elements.