, , , , , ,

“Reference counting is slower than garbage collection”, a claim often made in the discussion of memory management. I heard it again recently when discussing Leaf; somebody took issue with my arguments against garbage collection. Making a comparison though is extremely difficult, but I will attempt to look at some of the contributing costs here.

In this article when I mention GC I’m speaking of a tracing garbage collector. I find this is most often meant when people say GC, so I’m going with that term. Keep in mind that no modern memory management scheme is likely trivial enough to use just reference counting or tracing garbage collection; I’m considering the broad implications here of the general approaches.

The different models

There is a definite overhead to using reference counting for memory, but in practice it’s generally efficient. The overhead is incurred throughout the code during pointer assignment. It comes primarily in the form of memory synchronization between CPU cores, and a little bit in added code size. Though most pointer assignments don’t need this overhead, it still happens enough be noticed in the overall performance.

In a scanning GC there is no real overhead during pointer assignment. The question of whether an object is free is deferred to a later time. The overhead in the GC comes from having to scan memory at frequent intervals searching for things to release.

If the GC language is doing a synchronized memory operation on assignment, it is likely incurring the same overhead cost as reference counting does. It’s the memory synchronization which is the most costly. I haven’t looked much at GC systems, but I have to assume they don’t use synchronized assignment, otherwise reference counting would certainly be the faster solution.

This means that the functional code paths will definitely be faster than reference counting so long as the GC thread doesn’t interrupt it. The moment the GC thread starts working however it will impact all code. A lot of synchronized memory access is required by the GC, causing some caching issues on the CPU (it requires memory flushing much like reference counting’s increment and decrement do).

Many GC’s result in a stop-the-world effect, suspending all processing during some collection cycles. In some domains this type of pause is unacceptable to the application, but in many domains it really isn’t a problem if it’s kept short enough. A more iterative GC may reduce the stopping time, but the CPU cache synchronization will impact the code paths on a continual basis, much like in reference counting.

Size matters

GC’s overhead relates to the total memory allocated, not just the memory actively used in the code. Each new allocation results in an increased time per scanning cycle. Certainly there are many optimizations involved in a good GC to limit this, but the fundamental relationship is still linear. As a program uses more memory the overhead of it’s GC increases.

Reference counting only involves objects involved in an assignment; it’s cost is not related to the total memory used by an application.

Freeing an object during reference counting could however have a cascading effect. The local release of an object could result in a large tree of objects being released. In theory this type of delay could, like a GC pause, negatively impact some applications. I’ve not seen this problem in practice before though, as destructors and memory freeing tend to be quite fast.

Reference counting with cycle detection adds an interesting twist. I believe both D and Python do this. In order to eliminate memory loops objects are “sometimes” scanned for reference loops. This involves a lot more work than simply touching a reference count — though still not nearly as much as scanning all allocated memory. The impact of this loop detection depends heavily on how often it is scanned, and what the triggers are.

Total memory matters

GC tends to result in more memory being used in total. The trigger for scanning is often low memory conditions, meaning that a lot of memory needs to be allocated prior to a scanning operation. This has significant OS level implications: memory used by the app must be swapped in/out, and also prevents other apps, and file caches, from using the memory.

It’s my feeling, from experience, that it’s this aspect of GC’d applications that cause the biggest performance impact. The overall system just goes slower due to the large amount of memory involved. It doesn’t seem like it needs to be a fundamental problem of a GC though. The structure of a standard library, and typical programming paradigms, contributes a lot to this effect.

Structure matters

One problem in comparison is that programmers often design around the limitations of the system. Despite C++ having a reference counted shared_ptr, the vast majority of pointer operations aren’t using that type. Direct pointers, stack objects and memory pools are common ways to reduce allocation overhead. Even manual memory management can used to improve performance well beyond what a GC can achieve.

But that requires a lot of direct programmer optimization, and those techniques can’t all be easily used in language like Python. Certain techniques, like object pools, can also be used in GC languages, though the setup may be more work.

High level comparisons

The level of complexity in memory management makes comparison of reference counting to a scanning garbage collector quite difficult. We must consider the entire system, not just an individual program. We also can’t just switch from one strategy to another, even if it were an option, since programs tend to be written around their memory manager.

I don’t see how a claim that “garbage collection is faster” can be substantiated. One would need to construct a test that considers all of the above points. My general feeling is that GC is probably slower, strictly based on the increased total memory use of the applications. Perhaps that’s an issue of bulky and poorly constructed standard libraries more than an issue of GC itself though.