Garbage Collection in C Programs

Gianluca Insolvibile

Issue #113, September 2003

LISP and Java programmers take garbage collection for granted. With the Boehm-Demers-Weiser library, you easily can use it in C and C++ projects, too.

The first word that came to mind when I heard about introducing Garbage Collection techniques into a C or C++ program was “nonsense”. As with any other decent C programmer who loves this language, the thought of leaving the management of my own memory to others seemed nearly offensive. I had a similar feeling 15 years ago, when I first heard about compilers that would generate assembly code on my behalf. I was used to writing my code directly in 6510 opcodes, but that was Commodore 64—and a totally different story.

What Is Garbage Collection?

Garbage Collection (GC) is a mechanism that provides automatic memory reclamation for unused memory blocks. Programmers dynamically allocate memory, but when a block is no longer needed, they do not have to return it to the system explicitly with a free() call. The GC engine takes care of recognizing that a particular block of allocated memory (heap) is not used anymore and puts it back into the free memory area. GC was introduced by John McCarthy in 1958, as the memory management mechanism of the LISP language. Since then, GC algorithms have evolved and now can compete with explicit memory management. Several languages are natively based on GC. Java probably is the most popular one, and others include LISP, Scheme, Smalltalk, Perl and Python. C and C++, in the tradition of a respectable, low-level approach to system resources management, are the most notable exceptions to this list.

Many different approaches to garbage collection exist, resulting in some families of algorithms that include reference counting, mark and sweep and copying GCs. Hybrid algorithms, as well as generational and conservative variants, complete the picture. Choosing a particular GC algorithm usually is not a programmer's task, as the memory management system is imposed by the adopted programming language. An exception to this rule is the Boehm-Demers-Weiser (BDW) GC library, a popular package that allows C and C++ programmers to include automatic memory management into their programs. The question is: Why would they want to do a thing like this?

The Boehm-Demers-Weiser GC Library

The BDW library is a freely available library that provides C and C++ programs with garbage collection capabilities. The algorithm it employs belongs to the family of mark and sweep collectors, where GC is split into two phases. First, a scan of all the live memory is done in order to mark unused blocks. Then, a sweep phase takes care of putting the marked blocks in the free blocks list. The two phases can be, and usually are, performed separately to increase the general response time of the library. The BDW algorithm also is generational; it concentrates free space searches on newer blocks. This is based on the idea that older blocks statistically live longer. To put it another way, most allocated blocks have short lifetimes. Finally, the BDW algorithm is conservative in that it needs to make assumptions on which variables are actually pointers to dynamic data and which ones only look that way. This is a consequence of C and C++ being weakly typed languages.

The BDW collector comes as a static or dynamic library and is installed easily by downloading the corresponding package (see Resources) and running the traditional configure, make and make install commands. Some Linux distributions also come with an already-made package. For example, with Gentoo you need to type only emerge boehm-gc to install it. The installed files include both a shared object (libgc.o) and a static library (libgc.a).

Using the library is a fairly straightforward task; for newly developed programs, you simply call GC_alloc() to get memory and then forget about it when you do not need it anymore. “Forget about it” means setting all the pointers that reference it to NULL. For already existing sources, substitute all allocation calls (malloc, calloc, realloc) with the GC-endowed ones. All free() calls are replaced with nothing at all, but do set any relevant pointers to NULL.

GC_alloc() actually sets the allocated memory to zero to minimize the risk that preexisting values are misinterpreted as valid pointers by the GC engine. Hence, GC_alloc() behaves more like calloc() than malloc().

Using GC in Existing C Programs

If you want to try GC in an existing application, manually editing the source code to change mallocs and frees is not necessary. In order to redirect those calls to the GC version, you basically have three options: using a macro, modifying the malloc hooks and overriding glibc's malloc() with libgc's malloc(). The first approach is the easiest one; you simply need to insert something like:


#define malloc(x) GC_malloc(x)
#define calloc(n,x) GC_malloc((n)*(x))
#define realloc(p,x) GC_realloc((p),(x))
#define free(x) (x) = NULL

into the appropriate include files. This code substitutes only the explicit calls contained in your code, leaving startup and library allocations to traditional malloc/free calls.

A different approach is to hook malloc and friends to functions of your own, which in turn would call the GC versions. Listing 1 shows how to do it, and it can be linked directly to an existing program. See my article “Advanced Memory Allocation” [LJ, May 2003] for details on these hooks. With this method, any heap allocation is guaranteed to go through libgc, even if it is not performed directly by your code.

Listing 1. Using glibc's malloc and free Hooks to Enable Garbage Collection


/*
 * Using the malloc hooks to substitute GC functions
 * to existing malloc/free.
 * Similar wrapper functions can be written
 * to redirect calloc() and realloc()
 */

#include <malloc.h>
#include <gc.h>

static void gc_wrapper_init(void);
static void *gc_wrapper_malloc(size_t,const void *);
static void gc_wrapper_free(void*, const void *);

__malloc_ptr_t (*old_malloc_hook)
    __MALLOC_PMT((size_t __size,
                  __const __malloc_ptr_t));
void (*old_free_hook)
    __MALLOC_PMT ((__malloc_ptr_t __ptr,
                   __const __malloc_ptr_t));


/* Override initializing hook from the C library. */
void (*__malloc_initialize_hook)(void) =
                                    gc_wrapper_init;

static void gc_wrapper_init() {
    old_malloc_hook = __malloc_hook;
    old_free_hook = __free_hook;
    __malloc_hook = gc_wrapper_malloc;
    __free_hook = gc_wrapper_free;
}

void *
gc_wrapper_malloc(size_t size, const void *ptr) {
    void *result;
    /* Restore all old hooks */
    __malloc_hook = old_malloc_hook;
    __free_hook = old_free_hook;

    /* Call the Boehm malloc */
    result = GC_malloc(size);


    /* Save underlying hooks */
    old_malloc_hook = __malloc_hook;
    old_free_hook = __free_hook;

    /* Restore our own hooks */
    __malloc_hook = gc_wrapper_malloc;
    __free_hook = gc_wrapper_free;

    return result;
}

static void
gc_wrapper_free(void *ptr, const void *caller)
{
  /* Nothing done! */
}

As a third alternative, you can pass --enable-redirect-malloc to configure before compiling the libgc library. Doing so provides the library with wrapper functions that have the same names as the standard glibc malloc family. When linking with your code, the functions in libgc override the standard ones, with a net effect similar to using malloc hooks. In this case, though, the effect is system-wise, as any program linked with libgc is affected by the change.

Do not expect to endow huge programs with GC easily using any of these methods. Some simple tricks are needed in order to exploit GC functions and help the collector algorithm work efficiently. For example, I tried to recompile gawk (version 3.1.1) using GC and obtained an executable ten times slower than the original. With some adjustments, such as setting each pointer to NULL after having freed it, the execution time improved significantly, even if it was still greater than the original time.

Garbage Collection in New Programs

If you are developing a new program and would like to take advantage of automated memory management, all you need to do is use the GC_malloc() family in place of the plain malloc() one and link with libgc. Memory blocks no longer needed simply can be disposed of by setting any referencing pointers to NULL. Alternatively, you can call GC_free() to free the block immediately.

Always remember that your whole heap is scanned periodically by the collector to look for unused blocks. If the heap is large, this operation may take some time, causing the performance of the program to degrade. This behavior is suboptimal, because large blocks of memory often are guaranteed never to contain pointers, including buffers used for file or network I/O and large strings. Typically, pointers are contained only in fixed positions within small data structures, such as list and tree nodes. Were C and C++ strongly typed languages, the collector could have decided whether to scan a memory block, based on the type of pointer. Unfortunately, this is not possible because it is perfectly legal in C to have a char pointer reference a list node.

For optimal performance, the programmer should try to provide some basic runtime type information to the collector. To this end, the BDW library has a set of alternative functions that can be used to allocate memory. GC_malloc_atomic() can be used in place of GC_malloc() to obtain memory blocks that will never contain valid pointers. That is, the collector skips those blocks when looking for live memory references. Furthermore, those blocks do not need to be cleared on allocation. GC_malloc_uncollectable() and GC_malloc_stubborn() also can be used to allocate fixed and rarely changing blocks, respectively. Finally, it is possible to provide some rough type information by using GC_malloc_explicitly_typed() and building block maps with GC_make_descriptor(). See gc_typed.h on the Linux Journal FTP site for more information [available at ftp.linuxjournal.com/pub/lj/listings/issue113/6679.tgz].

The collector's behavior also can be controlled by the user through a number of function calls and variables. Among the most useful ones are GC_gcollect(), which forces a full garbage collection on the whole heap; GC_enable_incremental(), which enables incremental mode collection; and GC_free_space_divisor, which tunes the trade-off between frequent collections (high values, causing low heap expansion and high CPU overhead) and time efficiency (low values).

Heap status and debug information is available through a number of functions, including GC_get_heap_size(), GC_get_free_bytes(), GC_get_bytes_since_gc(), GC_get_total_bytes() and GC_dump(). Many of these parameters and functions are not documented at all, not even in the source code itself. As always, a good editor is your friend.

Performance and Behavior

A single best approach to memory management that is effective for any program does not exist. Given a specific application, the optimal solution must be found by compromising on a number of different factors, including CPU overhead, heap expansion, allocation latency and, last but not least, manageability and robustness of the code. Profiling the program and testing different memory strategies probably is the best solution for dealing with these issues.

Garbage Collection and Compiler Optimizations

One subtle point against GC is it requires extra care if compiler optimizations are switched on. The collector may wrongly assume a certain pointer has disappeared simply because references to it have been optimized out. Thus, the corresponding memory block might be freed even if it still is in use by the program, with the obvious consequences. Hence, it might be tempting to turn compiler optimizations off to be safe, losing part of the performance gain obtained by using GC.

In order to have an idea of the behavior and performance of GC vs. traditional memory management, we have experimented with a test program, gctest, which loops over the creation and destruction of a simple list. Simple as it may seem, the test raises some interesting points. The source code is not really instructive and is too long to be printed, so it is available for download on the FTP site (ftp.linuxjournal.com/pub/lj/listings/issue113/6679.tgz).

gctest can be controlled with a number of options that allow you to experiment with different list lengths and node sizes, as well as to change working parameters—enabling and disabling specific features of the GC library. Before we comment on the results we obtained with this test tool, it is important to point out that they were obtained in an artificial and not excessively realistic environment. So, again, you are invited to test the GC library yourself and evaluate it for your own code. Take the parameters presented here as possible indicators of the library suitability to a particular application.

The measurements we collected are overall execution time, the CPU load; heap expansion, how much memory is requested by the system with respect to how much actually was allocated by the program; and allocation latency average and standard deviation, how long it takes to allocate a single block of memory and how much this time varies across different allocations. The meaning of the first parameter is quite obvious and needs no further explanation. Heap expansion is a measure of how much of the allocated memory is fragmented and what amount of extra memory is requested by the library from the operating system. As we will see, the library may allocate a 1MB block ten times, freeing it after each allocation and requesting a total of 10MB of the system, as if freed memory was not put back on the free list. Although this sometimes is desired behavior, needed to optimize the allocation strategy, it can be annoying on systems with a limited availability of RAM. It can become a further source of CPU overhead if swap space is involved. Finally, allocation latency is important for real-time applications, which need the longest allocation time to be bounded. Typical cases are multimedia playback applications and specialized embedded systems that need to react to external events in a predictable amount of time.

Some Comments on the Results

Our test box A is a Pentium 4 2.53GHz system, with 1GB of RAM running Gentoo Linux (all code is optimized for the CPU architecture) and glibc 2.3, which has an improved memory management algorithm over glibc 2.2. Test box B is a K6-II 400MHz laptop with 128MB of RAM, running Slackware Linux with glibc 2.2.

Our first test consisted of allocating a list of 150,000 nodes, 16 bytes each, 30 times. On each loop, the allocated list was destroyed, that is, free()ed in case of traditional management, unlinked in case of GC. The test commands were:


gctest -tu -s 4 -n 150000 -l 30

and:


gctest -gu -s 4 -n 150000 -l 30

The overall execution time, on box A, was 3.80 seconds with traditional management and 2.43 seconds with GC, an improvement of about 35%. The same test performed on box B showed an even greater improvement, around 40%. This first test shows that, contrary to popular belief, GC actually can be quite faster than malloc/free. Heap expansion is rather limited and amounts to about 2 in both cases. What is even more surprising is that allocation latency is the same—6.7 microseconds—with a slightly larger deviation for the GC case. Also interesting is that by calling GC_gcollect() at each loop (option -G), the overall execution time decreases by 0.1 second. This result is counterintuitive, because we have one more function call in the loop.

Now, let's see what happens if we forget to destroy the list at the end of each loop. In the traditional management case, the test executes faster, 2.58 vs. 3.80 seconds, but the peak heap expansion is 140MB, which is twice the overall allocated memory. In the GC case, the test aborts due to memory exhaustion unless we call for an explicit collection (-G) at the end of each loop. By doing so, we obtain the lowest execution time, 2.32 seconds. This probably is quite far from what we could have imagined a priori—that's why actual experimentation is important for finding the optimal solutions.

The same test also has been performed on box B, but with an unoptimized Slackware distribution and glibc 2.2. It is interesting that although the improvement of GC over malloc/free still was around 40%, the test ran 27% faster under Gentoo.

The second test we made shows some limitations of the GC library. The test conditions actually were quite extreme: we tried to allocate five lists, each having 1,500,000 nodes, with each node being 16 bytes long. Although the malloc/free version ran correctly, the GC version did not complete the test because of memory exhaustion. The problem probably is due to the large number of blocks consecutively allocated.

The third test used larger nodes, 140 bytes each, and a shorter list length, 150,000 nodes. We ran:


gctest -tu -s 128 -n 150000 -l 5

and:


gctest -gu -s 128 -n 150000 -l 5

Under these conditions, the improvement of GC over malloc/free was around 20%, 0.85 vs. 0.60 seconds. Recall, though, the GC library always clears the allocated blocks, but malloc() does not. The overhead associated with this operation grows linearly with node size and thus is more important with larger nodes. To make a fair comparison, then, we need to substitute malloc() with calloc() at those points where GC_malloc() is called, as happens in:


gctest -tuc -s 128 -n 150000 -l 5

This test yields an execution time of 0.88 seconds and brings the GC improvement to 32%. Heap expansion is greater in the GC case, with a value of 1.7 vs. 1.0. Allocation latency is practically the same for both traditional and GC management, although a larger latency variation was experienced in the latter. Enabling incremental collection (-i option) did not lower the variation, although introducing calls to GC_free() (-F option) to explicitly free the list nodes explicitly actually did yield better results than the malloc/free case, on both execution time and latency. However, in this case we are not strictly using a real garbage collection approach.

Testing even larger memory blocks makes the difference between traditional and GC memory management quite noticeable. On 4KB nodes, GC performed quite poorly in comparison with malloc/free on execution time, 0.85 vs. 2 seconds; heap expansion, 2.75 vs. 1; and latency, 0.7 milliseconds vs. 1.6 microseconds. When compared with calloc/free performance, the execution time of GC still is quite competitive (40% faster). But, issues related to heap and latency remain.

Conclusion

GC techniques often are surrounded by myths and legends. In this article we have shown that GC actually can perform better than malloc/free. These advantages do not come for free, however, and the correct use of the library requires minimum knowledge on its internal mechanisms.

There is no final verdict on the suitability of GC for C programs. For now, the BDW library can be one more tool in your box, to be given serious consideration the next time you deal with a complex piece of software. Several open-source projects, like the Mono Project and GNU gcj Java runtime, have been using it for a while.

Pros and Cons

Before deciding that GC is for wimps and a hard-core C programmer would never need it, it may be beneficial to consider the actual advantages GC may offer with respect to the traditional C/C++ memory management schemes. As Ellis and Stroustrup say in The Annotated C++ Reference Manual, “C programmers think memory management is too important to be left to the computer. LISP programmers think memory management is too important to be left to the user.”

That memory-related problems contain some of the most insidious and time-wasting bugs a C programmer can encounter is a well-known fact. Even an experienced programmer might have a hard time tracking down bugs caused by invalid accesses, overflowing writes, accesses to dead memory, memory leaks and the like. Furthermore, from a software design point of view, the need to prevent such situations often leads designers to unclean interfaces between modules that should be decoupled. Think of functions that return a dynamically allocated memory block that then must be freed by the caller or a pointer to an internal static buffer, spoiling threadsafety—strtok() is a typical example in this sense—not to mention problems arising from memory areas shared among several modules. Each module has to free such areas only if no other modules are or will be using it, resulting in a tighter coupling between the modules themselves. This problem usually is addressed by using reference counting, where an internal counter is kept for each active user of the area; multiple copies, where one copy of the area is kept for each module; or, for those who fancy C++, smart pointers.

GC is an effective way of dealing with memory-related problems in C, relieving the programmer of memory accounting burdens. Each memory block knows when it is in use, actually, the GC engine knows, and the block automatically disappears when it is not referenced anymore. GC eliminates a priori all premature frees and memory leak problems.

GC has some drawbacks, of course, the most annoying being the feeling of having lost control that afflicts C programmers. More concretely, drawbacks stem from the automated management of resources, which translates into:

Not knowing when an unused memory block actually is freed and if it ever will be. This has further consequences when the memory block is a C++ object, whose destructor is called at an unspecified time.
Not being able to determine in advance how much time an allocation takes (allocation latency and jitter), which often is an issue for real-time systems. It must be said, though, that traditional malloc() also presents this problem sometimes. Incremental GC can help alleviate the problem and provide limited constraints to allocation times.
Longer execution time due to GC processing overhead. This often is not true, as sometimes the overhead associated with traditional free()s is equal to or even greater than that of GC. In this sense, only storage allocators written specifically for a given program might be faster. Even so, programmers often write their own memory management systems without a preliminary profiling activity, sometimes resulting in negative effects on performance.
Higher memory fragmentation (heap expansion), caused by unused objects not being freed by the GC engine when more memory is allocated by the program. It is quite common to have a heap much larger than the amount of memory in actual use at a certain time. Even worse, the heap size sometimes is proportional to the total amount of allocation performed since the program started. This may be a serious issue on systems with scarce RAM availability, where frequent collections and explicit free() calls might be necessary to limit fragmentation.
Reduced locality of reference, due to the garbage detection phase that has to traverse the whole addressable heap space to look for unused blocks. This results in cache and virtual memory misses more often than happens with traditional allocations. Generational GC algorithms limit the severity of this problem.

GC enthusiasts claim that these issues are not so relevant, considering the cleaner design and greater robustness that GC offers. Even if these advantages are difficult to be evaluated concretely, they sometimes provide a convenient trade-off between code manageability and loss of resources.

Finally, the BDW library also can be used proficiently as a leak detector; the Mozilla team uses it this way, for example. For more information, have a look at the included documentation.

Resources

The Boehm-Demers-Weiser home page at www.hpl.hp.com/personal/Hans_Boehm/gc provides the source code for the library and a wealth of documentation and links for more information on GC. Good starting points are Hans Boehm and David Chase, “A proposal for Garbage-Collector-Safe C Compilation”, 1992, available at www.hpl.hp.com/personal/Hans_Boehm/gc/papers/boecha.ps.gz; Paul Wilson, “Uniprocessor Garbage Collection Techniques”, University of Texas, 1992, available at ftp.cs.utexas.edu/pub/garbage/gcsurvey.ps; and Benjamin Zorn, “The Measured Cost of Conservative Garbage Collection”, Tech Report, University of Colorado at Boulder, 1992.