The Life of a Programmer


The amortized cost of vector insert is 3

The amortized cost of vector insert is 3

C++ requires many algorithms to be implemented in amortized constant time. I was curious about what this meant for vector insertion. To get started I first wrote an article about amortized time in general. Now I'd like to apply that to appending vector items. A vector is a sequence container that supports random access iterators. In addition, it supports (amortized) constant time insert and erase operations at the end;

In this analysis I'm looking at end insertion, which is done with the C++ vector::push_back function.

Cost of copying

Complexity analysis is usually based on counting how many times a dominant operation occurs. Often we count the number of comparisons, such as with trees, but that doesn't make sense for a vector insert.

What should we count? Maybe we should include memory allocation, or is that done in constant time? Do we need to consider a specific cost model, such as a "random-access machine"? The C++ standard only gives us this description:

23.2.1-1 All of the complexity requirements in this Clause are stated solely in terms of the number of operations on the contained objects. [Example: the copy constructor of type vector <vector<int> > has linear complexity, even though the complexity of copying each contained vector<int> is itself linear. β€” end example]

Let's focus on the cost of copying data, this should be okay by that definition. For common use, this is the most interesting cost since it is usually the highest cost.

Given a type T we define the cost of an operation as the number of times an object of type T is copied. Each time an item is copied, it costs 1.

What is a vector?

It's helpful to know what a vector actually is. It has a current number of items, its size, and capacity to store more. The requirements also state:

The elements of a vector are stored contiguously, meaning that if v is a vector<T, Allocator> … then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().

We can use this pseudo-structure to represent the vector:

vector {
    T* store
    int size
    int capacity

capacity is the total space available in store. size is the current number of items in store that are actually defined. The remaining are undefined memory.

Two insertion scenarios

When we insert a new item into a vector, one of two things can happen:

1. size < capacity: The item is copied to store[size] and size is incremented. That's one copy: a cost of 1. 2. size == capacity: New space must be allocated first. The old items must be copied to the new store. Then the new item added. We copy once for each existing item, plus once for the new item: a cost of size + 1.

That second cost is clearly linear, not the constant time the standard requires.

This where the amortized cost comes in. We aren't interested in the worst case, but the average case. As the vector reserves capacity beyond its current size, we don't always have to resize the vector to add a new item. The amortized cost is the average cost of many insertions.

There must be some way to make this cost be constant time.

How many resizes

Let's first try the approach of increasing the capacity by \(k\) each time, some constant amount. This means that every \(k\) inserts will require a resize. Call \(r\) the resize operation, so at \(r_1\) we have \(k\) items in the list, at \(r_2\) we have \(k*2\) items, at \(r_m\) we have \(k*m\) items. For completeness at \(r_0\) we have \(0\) items.

Operation \(r_m\) requires that \(r_{m-1}\) has already happened. We can only insert one item at a time into the list, thus no resize operations can be skipped. The total cost of this sequence of resizes can be expressed as this recurisve relation:

\(\begin{array}{lcl} SumResizeCost( 0 ) & = & 0 \\ SumResizeCost( m ) & = & m k + SumResizeCost( m-1 ) \end{array}\)

Reducing to a single closed form equation that gives us:

\(SumResizeCost( m ) = k m (m + 1 ) / 2\)

We need a form related to \(n\), the number of insert opertions we perform in total. This is easy, since \(m = n / k\); assume integer math to avoid needing a floor operation. In addition to the resizing cost we need the individual insert cost, 1 for each item. So the total cost of the sequence of \(n\) insertions is:

\( \begin{array}{rcl} SumCost( n ) & = & SumResizeCost( n/k ) + n \\ & = & k (n / k) (n / k + 1) / 2 + n \\ & = & (n^2 + 3 k n) / (2 k) \end{array} \)

To get the amortized cost we divide by the total number of operations:

\( \begin{array}{rcl} AmortizedCost( n ) & = & SumCost( n ) / n \\ & = & (n + 3 k) / (2 k) \end{array} \)

Strip out the constants, and we're left with an amortized cost of \(O(n)\). Thus any constant size of \(k\) results in linear time, not the desired constant time.

Geometric size

Let's move on to another option for resizing: start with 1 capacity and double each resize. Let \(r_m\) refer to the resizing of the vector. To simplify the math, we'll use indexing starting at 0. The \(r_0\) operation needs to move 1 item; this is the first resize. The \(r_1\) moves 2 items; this is the second resize. The \(r_m\) operation moves \(2^m\) items.

\( \begin{array}{lcl} SumResizeCost( 0 ) & = & 1 \\ SumResizeCost( m ) & = & 2^m + SumResizeCost( m-1 ) \end{array} \)

Basic geometric series reduce very nicely:

\(SumResizeCost( m ) = 2^{m+1} – 1\)

At this point we could swallow the cost of 1 copy and drop the \(– 1\) part. Given the exponential term before it, the constant 1 isn't relevant. For thoroughness, I'll leave it in.

Relate back to \(n\) where \(m = log_2(n)\), and add in the initial copying of each item on insert:

\( \begin{array}{rcl} SumCost( n ) & = & SumResizeCost( log_2(n) ) + n \\ & = & 2^{log_2(n)+1} – 1 + n \\ & = & 3 n – 1 \end{array} \)

Divide by \(n\), the number of inserts, to get the amortized cost:

\( \begin{array}{rcl} AmortizedCost( n ) & = & SumCost( n ) / n \\ & = & (3 n – 1) / n \\ & = & 3 – 1/n \end{array} \)

The constant 3 is the most significant term, so our amortized cost of insert is \(O(1)\). Success! That's constant linear time as the standard has requested.

The intuitive solution

That 3 represents a concrete number of operations: how many times each item will be copied. It could have been 2 or 4 and still been constant time. Indeed, if we were off by 1 in our definition of \(m\), it would have been 2 or 4, not 3.

What are those 3 copies? The first copy is clear: when you first insert an element it has to be copied into the store memory. That leaves us with two copies.

Think of the structure of store at the moment a resize happens. It starts with size == capacity, it's full, that's why it's being resized. After resize it ends up at size == capacity/2 as the capacity has doubled. The second half of the store is empty.

This second half will eventually be filled, and once filled, all items have to be copied somewhere new. This means that for each item inserted, 2 copies happen. The item itself will need to be copied somewhere new, as well as one item from the first half. This relation holds regardless of how many resizes have been done. Each resize always results in one full and one empty half.

If we think in terms of the accounting method that means insertion costs 3 credits. The first credit pays for its immediate insertion. The second credit pays for it's eventual copy to a new store. The third credit pays for copying an existing item to the new store. The doubling in capacity each resize ensures these new items always have enough credits to cover the copying cost.

Please join me on Discord to discuss, or ping me on Mastadon.

The amortized cost of vector insert is 3

A Harmony of People. Code That Runs the World. And the Individual Behind the Keyboard.

Mailing List

Signup to my mailing list to get notified of each article I publish.

Recent Posts