The Radix Heap

The binary heap is a well-known data structure that implements a priority queue with O(log n) operations. It is quite efficient because it can be stored in an array. However, if you accept a few restrictions, it is possible to do better from a caching point of view. The restrictions you must accept are (a) that all the keys in the heap are integers and (b) that you never insert an item that is smaller than the most recently extracted item. These restrictions are not that severe. The resulting radix heap still works in many algorithms that use heaps as a subroutine: Dijkstra's shortest-path algorithm, Prim's minimum spanning tree algorithm, and various sweepline algorithms in computational geometry.

Here is how it works. We'll assume that the keys are 32-bit integers. The radix heap has 33 buckets of items:

      0     1     2     3     4     5     6     7     8           31    32
    [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]  ...  [   ] [   ]

each one containing a list of keys. We also maintain one global value, last_deleted, which is initially MIN_INT and otherwise contains the last value deleted from the queue. The invariant is as follows:

    Bucket k contains items where bit k - 1 is the highest bit that is
    different from last_deleted. The items in bucket 0 are equal to
    last_deleted.

That means that for an item in bucket 10, bit 9 is different from last_deleted, bits 0 to 8 are possibly different, and bits 10 to 31 are equal. Items stored in bucket 32 are different in bit 31 and possibly in any other bit.

Operations

When a new item is inserted, it has to be added to the correct bucket. How can we compute the bucket number? We have to find the highest bit where the new item differs from last_deleted. This is easily done by XOR'ing them together and then finding the highest set bit. Adding one gives the bucket number. Inserting clearly preserves the invariant, because last_deleted didn't change and the new item will be in the correct bucket.

Extracting the minimum involves first finding the minimal item, by walking the lowest-numbered non-empty bucket and locating the smallest item in it. That item is then deleted and last_deleted is updated. Then the bucket is walked again and each remaining item is reinserted into a new bucket according to the new last_deleted.

We have to be sure this doesn't violate the invariant. First note that all the items in the bucket being redistributed will satisfy the invariant, because they are simply being reinserted. Second, an item in a bucket with a higher number c will still differ from the new last_deleted in bit c - 1: if it didn't, the new last_deleted would itself have belonged in bucket c. Finally, there are no non-empty buckets with a lower number than the one being redistributed.
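To make the operations concrete, here is a minimal sketch in C. It is an illustration of the scheme above, not a tuned implementation: it assumes unsigned 32-bit keys, so last_deleted starts at 0 rather than MIN_INT (the signed variant is analogous); it uses the GCC/Clang builtin __builtin_clz to find the highest set bit (the bsr instruction mentioned below); and the names radix_heap_t, bucket_push and bucket_index are invented for this sketch.

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NBUCKETS 33

    /* One bucket: a dynamically growing array of keys. */
    typedef struct {
        uint32_t *data;
        size_t len, cap;
    } bucket_t;

    typedef struct {
        bucket_t buckets[NBUCKETS];
        uint32_t last_deleted;    /* 0 initially, since keys are unsigned */
        size_t count;
    } radix_heap_t;

    static void bucket_push(bucket_t *b, uint32_t key)
    {
        if (b->len == b->cap) {
            b->cap = b->cap ? 2 * b->cap : 16;
            b->data = realloc(b->data, b->cap * sizeof *b->data);
            assert(b->data);
        }
        b->data[b->len++] = key;
    }

    /* Highest bit where key differs from last_deleted, plus one;
     * bucket 0 if key is equal to last_deleted. */
    static int bucket_index(const radix_heap_t *h, uint32_t key)
    {
        uint32_t diff = key ^ h->last_deleted;
        return diff ? 32 - __builtin_clz(diff) : 0;
    }

    void radix_heap_insert(radix_heap_t *h, uint32_t key)
    {
        assert(key >= h->last_deleted);   /* restriction (b) */
        bucket_push(&h->buckets[bucket_index(h, key)], key);
        h->count++;
    }

    uint32_t radix_heap_extract_min(radix_heap_t *h)
    {
        assert(h->count > 0);
        h->count--;

        int k = 0;
        while (h->buckets[k].len == 0)    /* lowest non-empty bucket */
            k++;
        bucket_t *b = &h->buckets[k];

        if (k == 0)                       /* bucket 0: all equal last_deleted */
            return b->data[--b->len];

        size_t min_i = 0;                 /* find the minimal item */
        for (size_t i = 1; i < b->len; i++)
            if (b->data[i] < b->data[min_i])
                min_i = i;

        uint32_t min = b->data[min_i];
        b->data[min_i] = b->data[--b->len];   /* delete it */
        h->last_deleted = min;

        /* Redistribute: every remaining item now agrees with the new
         * last_deleted in bit k - 1 and above, so each one moves to a
         * strictly lower-numbered bucket. */
        size_t n = b->len;
        b->len = 0;
        for (size_t i = 0; i < n; i++)
            bucket_push(&h->buckets[bucket_index(h, b->data[i])], b->data[i]);

        return min;
    }

Note that the buckets here are plain dynamically growing arrays; why that matters is the subject of the caching section below.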
Performance

The time complexity of insertion is O(1), since all that is required is appending the new item to a list. (We also have to determine the highest set bit, but that can be done in constant time with an instruction such as bsr.)

The performance of extraction is dominated by the redistribution of items. When a bucket is redistributed, it ends up empty. To see why, remember that all the items in bucket k are different from last_deleted in bit k - 1. Because the new last_deleted comes from bucket k, the items are now all *equal* to last_deleted in bit k - 1. Hence they will all be redistributed to lower-numbered buckets.

Now consider the life cycle of a single element. In the worst case it starts out being added to bucket 32. Each redistribution then moves the element to a lower-numbered bucket, and when it reaches bucket 0, it is next in line for extraction. Since a redistribution takes constant time per element distributed, the total amount of redistribution a single element experiences is proportional to the number of bits in the keys. In other words, in the worst case the time complexity of extraction is O(d), where d is the number of bits in the integer keys. In practice the complexity will often be better, since most items will not move through all the buckets.

Caching performance

Some descriptions of the radix heap recommend implementing the buckets as doubly linked lists. That is a big mistake, because the cache locality will be terrible. Instead, implement them as dynamically growing arrays. If you do that, the tops of the buckets will tend to be hot in the cache, which means the per-item time to redistribute a bucket will be more like O(1/B), where B is the number of integers in a cache line. The amortized complexity of extraction will then be closer to O(d/B) than O(d).

In a regular binary heap, both insertion and extraction require O(log n) swaps, and each swap (except those very close to the top of the heap) will cause a cache miss. In other words, if d = Theta(log n), extraction from a radix heap will cause O(log n / B) cache misses, whereas extraction from a binary heap will cause O(log n).
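As a small usage example, here is a test driver that can be appended to the sketch above; the values are made up, and the comments show the expected output. It also illustrates restriction (b): inserting 9 after extracting 7 is fine even though 9 is smaller than items still in the heap; only inserting below last_deleted is forbidden.

    #include <stdio.h>

    int main(void)
    {
        radix_heap_t h = { 0 };   /* empty buckets, last_deleted = 0 */

        radix_heap_insert(&h, 13);
        radix_heap_insert(&h, 7);
        radix_heap_insert(&h, 42);

        printf("%u\n", radix_heap_extract_min(&h));   /* prints 7 */

        /* Allowed: 9 >= last_deleted (7), even though 13 and 42
         * are still in the heap. */
        radix_heap_insert(&h, 9);

        printf("%u\n", radix_heap_extract_min(&h));   /* prints 9 */
        printf("%u\n", radix_heap_extract_min(&h));   /* prints 13 */
        printf("%u\n", radix_heap_extract_min(&h));   /* prints 42 */

        return 0;
    }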