The amortized cost of vector insert is 3

C++ requires many algorithms to be implemented in amortized constant time. I was always curious what this meant for vector insertion. To get started I first wrote an article about amortized time in general. Now I’d like to apply that to appending vector items. A vector is a sequence container that supports random access iterators. In addition, it supports (amortized) constant time insert and erase operations at the end;

In this analysis I’m looking only at end insertion, which is usually done with the C++ vector.push_back function.

Cost of copying

To do complexity analysis we need to know what is being counted. Often we count the number of comparisons, such as with trees, but with an unordered vector that is silly. Maybe we should include memory allocation, or is that constant anyway? What is the cost model, shall I assume a “random-access machine”? The standard only gives us this description:

23.2.1-1 All of the complexity requirements in this Clause are stated solely in terms of the number of operations on the contained objects. [ Example: the copy constructor of type vector <vector<int> > has linear complexity, even though the complexity of copying each contained vector<int> is itself linear. — end example ]

Let’s just focus on the cost of copying data, this should be okay by that definition. For common use this is the most interesting cost since it is usually the highest cost.

Given a type T we define the cost of an operation as the number of times an object of type T is copied. That is, each time an item is copied it costs 1.

What is a vector?

It’s also helpful to know what a vector actually is. It has a current number of items and a total capacity to store more. The requirements also state:

The elements of a vector are stored contiguously, meaning that if v is a vector<T, Allocator> … then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().

A pseudo-structure for the vector can be:

vector {
    T* store
    int size
    int capacity

capacity is the total space available in store. size is the current number of items in store that are actually defined. The remaining are just undefined memory.

Two insertion scenarios

When we insert a new item into a vector one of two things can happen:

  1. size < capacity: The item is copied to store[size] and size increments. Cost of 1.
  2. size == capacity: New space must be allocated first. The old items must be copied to the new store. Then the new item added. Cost of size + 1.

That second cost is clearly linear, not the constant time the standard requires.

This where amortized cost comes in. We aren’t interested in the worst case, but the average case. We don’t always have to resize the vector, that is the purpose of having extra capacity. The amortized cost is the average cost over many insertions. There must be some way to get this cost to be constant.

How many resizes

Let’s first try the approach of increasing the capacity by k each time, some constant amount. This means that every k inserts will require a resize. Call r the resize operation, so at r_1 we have k items in the list, at r_2 we have k*2 items, at r_m we have k*m items. For completeness at r_0 we have 0 items.

Operation r_m requires that r_{m-1} has already happened. We can only insert one item at a time into the list, thus no resize operations can be skipped. The total cost of this sequence of resizes can be expressed as this recurisve relation:

\begin{array}{lcl}  SumResizeCost( 0 ) & = & 0 \\  SumResizeCost( m ) & = & m k + SumResizeCost( m-1 )  \end{array}

Reducing to a single closed form equation that gives us:

SumResizeCost( m ) = k m (m + 1 ) / 2

We need a form related to n, the number of insert opertions we perform in total. This is easy, since m = n / k, just assume integer math to avoid needing a floor operation. In addition to the resizing cost we need the individual insert cost, 1 for each item. So the total cost of the sequence of n insertions is:

\begin{array}{rcl}  SumCost( n ) & = & SumResizeCost( n/k ) + n \\  & = & k (n / k) (n / k + 1) / 2 + n \\  & = & (n^2 + 3 k n) / (2 k)  \end{array}

To get the amortized cost we divide by the total number of operations:

\begin{array}{rcl}  AmortizedCost( n ) & = & SumCost( n ) / n \\  & = & (n + 3 k) / (2 k)  \end{array}

Strip out the constants and we’re left with an amortized cost of O(n). Thus any constant size for k results in linear time, not the desired constant time.

Geometric size

Let’s move on to another option for resizing: start with 1 capacity and double each resize. To simplify the math we’ll use indexing starting at 0. The r_0 operation needs to move 1 item, the r_1 moves 2 items, and the r_m operation moves 2^m items.

\begin{array}{lcl}  SumResizeCost( 0 ) & = & 1 \\  SumResizeCost( m ) & = & 2^m + SumResizeCost( m-1 )  \end{array}

Basic geometric series reduce very nicely:

SumResizeCost( m ) = 2^{m+1} - 1

At this point we could swallow the cost of 1 copy and drop the - 1 part. Given the exponential term before it the constant 1 isn’t relevant. For thoroughness I’ll leave it in.

Relate back to n where m = log_2(n), and add in the individual inserts:

\begin{array}{rcl}  SumCost( n ) & = & SumResizeCost( log_2(n) ) + n \\      & = & 2^{log_2(n)+1} - 1 + n \\      & = & 3 n - 1  \end{array}

Divide by n, the number of inserts, to get the amortized cost:

\begin{array}{rcl}  AmortizedCost( n ) & = & SumCost( n ) / n \\      & = & (3 n - 1) / n \\      & = & 3 + 1/n  \end{array}

The constant 3 is the largest term, so our amortized cost of insert is O(1). Success! That’s constant linear time as the standard has requested.

The intuitive solution

That 3 represents a concrete number of operations: how many times each item will be copied. To be constant linear time it could have been 2 or 4, it doesn’t matter. Indeed, if we were off by just 1 in our definition of m it would have been either of those. So let’s think about what those 3 copies actually are.

The first copy is clear, when you first insert an element it has to be copied into the store memory. That leaves us with two copies.

Think of the structure of store at the moment a resize happens. It starts with size == capacity, it’s full, that’s why it’s being resized. After resize it ends up at size == capacity/2 as the capacity has doubled. The second half of the store is empty.

We know this second half will eventually be filled. Once filled all items have to copied somewhere new. This means that for each item inserted, 2 copies happen. The item itself will need to be copied somewhere new, as well as one item from the first half. This relation holds regardless of how many resizings have been done. Each resize always results in one full and one empty half.

If we think in terms of the accounting method that means insertion costs 3 credits. The first credit pays for its immediate insertion. The second credit pays for it’s eventual copy to a new store. The third credit pays for copying an existing item to the new store. The doubling in capacity each resize ensures these new items always have enough credit to cover the copying cost.

Categories: Efficiency, Programming

Tagged as: , ,

2 replies »

    • By the formal definition of big O I can write whatever I want for the equation. I could even leave it as O(3+1/n).

      However, for common use you are correct, we normally simplify to the largest term, so O(1)

      This is actually also \Theta (1) as well, isn’t it?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s