Is reference counting slower than GC?

“Reference counting is slower than garbage collection” is a claim often made in discussions of memory management. I heard it again recently when discussing Leaf; somebody took issue with my arguments against garbage collection. Making a fair comparison is extremely difficult, but I will attempt to look at some of the contributing costs here.

In this article when I mention GC I’m speaking of a tracing garbage collector. I find this is what’s most often meant when people say GC, so I’m going with that term. Keep in mind that no modern memory management scheme is trivial enough to use just reference counting or tracing garbage collection; I’m considering the broad implications of the general approaches here.

The different models

There is a definite overhead to using reference counting for memory, but in practice it’s generally efficient. The overhead is incurred throughout the code during pointer assignment. It comes primarily in the form of memory synchronization between CPU cores, and a little bit in added code size. Though most pointer assignments don’t need this overhead, it still happens often enough to be noticed in overall performance.

In a scanning GC there is no real overhead during pointer assignment. The question of whether an object is free is deferred to a later time. The overhead in the GC comes from having to scan memory at frequent intervals searching for things to release.

If the GC language is doing a synchronized memory operation on assignment, it is likely incurring the same overhead cost as reference counting does; it’s the memory synchronization that is most costly. I haven’t looked much at GC systems, but I have to assume they don’t use synchronized assignment, otherwise reference counting would certainly be the faster solution.

This means the normal code paths will definitely be faster than under reference counting, so long as the GC thread doesn’t interrupt them. The moment the GC thread starts working, however, it impacts all code. A lot of synchronized memory access is required by the GC, causing caching issues on the CPU (it requires memory flushing much like reference counting’s increment and decrement do).

Many GCs produce a stop-the-world effect, suspending all processing during some collection cycles. In some domains this type of pause is unacceptable to the application, but in many domains it really isn’t a problem if it’s kept short enough. A more incremental GC may reduce the stopping time, but the CPU cache synchronization will impact the code paths on a continual basis, much as in reference counting.

Size matters

GC’s overhead relates to the total memory allocated, not just the memory actively used in the code. Each new allocation results in an increased time per scanning cycle. Certainly there are many optimizations involved in a good GC to limit this, but the fundamental relationship is still linear. As a program uses more memory the overhead of its GC increases.

Reference counting touches only the objects involved in an assignment; its cost is not related to the total memory used by an application.

Freeing an object during reference counting could however have a cascading effect. The local release of an object could result in a large tree of objects being released. In theory this type of delay could, like a GC pause, negatively impact some applications. I’ve not seen this problem in practice before though, as destructors and memory freeing tend to be quite fast.

Reference counting with cycle detection adds an interesting twist. I believe both D and Python do this. In order to eliminate memory loops, objects are “sometimes” scanned for reference cycles. This involves much more work than simply touching a reference count — though still not nearly as much as scanning all allocated memory. The impact of this loop detection depends heavily on how often the scan runs, and what triggers it.

Total memory matters

GC tends to result in more memory being used in total. The trigger for scanning is often a low-memory condition, meaning that a lot of memory must be allocated before a scanning operation runs. This has significant OS-level implications: memory used by the app must be swapped in and out, and it also prevents other apps, and file caches, from using that memory.

It’s my feeling, from experience, that it’s this aspect of GC’d applications that causes the biggest performance impact. The overall system just goes slower due to the large amount of memory involved. It doesn’t seem like this needs to be a fundamental problem of GC though. The structure of a standard library, and typical programming paradigms, contribute a lot to this effect.

Structure matters

One problem in making comparisons is that programmers often design around the limitations of the system. Despite C++ having a reference-counted shared_ptr, the vast majority of pointer operations don’t use that type. Direct pointers, stack objects and memory pools are common ways to reduce allocation overhead. Even manual memory management can be used to improve performance well beyond what a GC can achieve.

But that requires a lot of direct programmer optimization, and those techniques can’t all be easily used in a language like Python. Certain techniques, like object pools, can also be used in GC languages, though the setup may be more work.

High level comparisons

The level of complexity in memory management makes comparison of reference counting to a scanning garbage collector quite difficult. We must consider the entire system, not just an individual program. We also can’t just switch from one strategy to another, even if it were an option, since programs tend to be written around their memory manager.

I don’t see how a claim that “garbage collection is faster” can be substantiated. One would need to construct a test that considers all of the above points. My general feeling is that GC is probably slower, strictly based on the increased total memory use of the applications. Perhaps that’s an issue of bulky and poorly constructed standard libraries more than an issue of GC itself though.

8 replies »

  1. Yeah, it always bugged me seeing comparisons between GC and RC because they always assumed that the RC code would naively throw every possible object onto the heap and/or insist on using reference counting when the compiler could easily prove that the reference is incremented to 1 and then decremented to 0 all within the same scope.

    I loved seeing a diagram a while back indicating that (I hope I don’t get this backwards) RC is better for objects with long lives (rarely hit the cost of increment/decrement), and GC is better for objects with short lives (rarely hit the cost of traversing dying objects).

  2. Hmm I think more research would have helped the article. Such a contentious subject naturally has plenty of papers and articles written about it that include hard data, something sorely lacking from this article.

    A bit of googling could have greatly helped here…

    • I’ve seen benchmarks done in isolation before, but never system-wide benchmarks that consider the total memory requirement and process interaction. If you know of any please provide a link.

  3. Reference counting is garbage collection. All garbage collectors are hybrids between tracing for throughput, and reference counting for incrementality [1]. So let’s be precise and say that you’re biased against tracing, not garbage collection.

    Since you asked about benchmarks, Bacon in the paper below has designed and built many reference counting garbage collectors over the years. Until very recently [2], reference counting performance lagged behind tracing variants by ~30%. This was largely due to inefficiencies related to collecting cyclic garbage.

    Reference counting latency was better since it has incremental properties as previously mentioned, and it did better in smaller heaps because of this, but in most scenarios it just couldn’t compete.

    [1] http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon04Unified.pdf
    [2] https://users.cecs.anu.edu.au/~steveb/downloads/pdf/rc-ismm-2012.pdf

    • Note in my comment after the first paragraph I explain how I’m using the term GC.

      I also disagree with the results of the first article. There’s no general way to derive conclusions from various idealized garbage collectors as opposed to ones that actually exist. There’s also no reason why a garbage collector actually needs a scanning component — it may simply be incomplete instead, not resolving loops.

      I’m not clear the second paper is actually doing time measurement; they seem to be focused on counting mutations instead. They also seem to assume that “tracing GC is superior” is somehow an accepted and common conclusion. That seems very odd.

  4. You can disagree all you like, that doesn’t make your disagreement justifiable. You might as well claim there’s no way to discuss the runtime of idealized algorithms as opposed to actual implementations, despite the existence of big-O notation. While cache effects sometimes skew results to violate big-O, by and large, big-O is a precise and valid way of reasoning about runtime complexity. Bacon’s results hold for GC for the same reasons.

    Re: incomplete ref counting, sure, and tracing could be even faster if it didn’t have to actually be complete either. It’s frankly absurd to suggest relying on an unreliable service. If you want reliable memory management, use a complete GC. If you want fast allocation with only some space overhead, then use arena allocation. There’s your safety/performance trade-off.

    Furthermore, tracing *is* superior for throughput. This has been well established over the past 50 years. Some variants of ref counting are superior for incrementality/bounded-latency. Those are the facts corroborated by literally hundreds of papers.

    Even a naive, simplistic mark-sweep has better throughput than a highly tuned shared_ptr based on ref counting [1].

    Finally, ref counting gets an order of magnitude even *more* expensive in concurrent scenarios unless you violate its incremental properties by deferring updates somehow. Then you’re doing the same things you would in a concurrent tracing GC, i.e. write barriers and the like to defer ref count updates.

    [1] https://github.com/knizhnik/cppgc/wiki/Mark&Sweep-garbage-collector-for-C

  5. I happened upon this: I’m just jumping in to say that, as written, the following sentence is incorrect for many implementations:

    “GC’s overhead relates to the total memory allocated, not just the memory actively used in the code. Each new allocation results in an increased time per scanning cycle.”

    Some relatively common implementations have collection-time overhead proportional only to the number of live objects. Collection frequency is a different matter, but that’s not how this is worded.
