Two-Space Copying Garbage Collection

Post Metadata

When choosing a programming language to use for a project—or when implementing one—there are many considerations to keep in mind. One of those is whether or not the programming language uses any sort of garbage collection. Today, I’m going to talk about one method of garbage collection that we can implement ourselves fairly easily: two-space copying garbage collection.

What is Two-Space Copying Garbage Collection?

Two-space copying garbage collection (or semi-space garbage collection) involves dividing the available memory in half: half is used for normal allocation and half is used for garbage collection. The specific method of doing so that I’ll be talking about is Cheney’s algorithm. Two-space copying garbage collection is a stop and copy method of tracing garbage collection. “Stop and copy” means that, at some point, the garbage collector will stop the execution of a program and will copy all of alive memory. “Tracing” means that the garbage collector determines what memory is still alive by tracing through objects in memory from starting point. This means that our two-space copying garbage collection traces which objects are being used by a program by building a chain of referenced objects from some initial objects—which we’ll call roots and, for simplicity’s sake, today we’ll just pretend that we know what they are automatically; we’re not going to talk about finding them—and stops program execution to copy all memory still in use while discarding unused objects.

As a concrete example of how a tracing garbage collector works, let’s say that we’re running a program that has the following memory stored and that our roots are the memory stored at addresses 0x0, 0x5, and 0x7:

Address Value Type Root?
0x0 0x1 Pointer Yes
0x1 9 Int No
0x2 40 Int No
0x3 A Char No
0x4 0x3 Pointer No
0x5 0x6 Pointer Yes
0x6 0x2 Pointer No
0x7 1 Int Yes

And let’s say that we now need to collect garbage. To do so, let’s trace through memory to figure out which objects are still alive and need to be kept after garbage collection.

We start at our roots. By definition, our roots are still alive memory. So, 0x0, 0x5, and 0x7 are all still alive—i.e., they are still being used by our program. Therefore, our garbage collector needs to keep these values. Next we need to figure out if any of our roots reference any objects. The data stored at address 0x7 is an integer, it doesn’t reference anything. However, the values at 0x0 and 0x5 are pointers, they do reference things. And, since they’re alive, the objects they point to are also still alive. So, addresses 0x1 and 0x6 are also alive. What do we do now? Well, we still need to trace through these new alive objects. Since the value at 0x6 is again a pointer, what it points to is also still alive. So, the data at address 0x2 is still alive. Because it’s an integer and doesn’t reference anything, we’re done. In the end, this means that memory at addresses 0x0, 0x1, 0x2, 0x5, and 0x7 all needs to be alive after garbage collection. Just like a tracing garbage collector, we figured that out by tracing through references from some starting points in memory.

If we were a stop and copy garbage collector, we would have run our program until we ran out of memory. Then, we would have stopped program execution, figured out what memory we needed to keep—in this example, by tracing—and copied that memory into some new space. Finally, we would have discard all the un-copied memory.

To do this, a two-space copying garbage collector splits our memory—in this case, the heap—up into two equal parts: a “From Space” and a “To Space.” During normal program execution, we allocate memory into only the From Space—nothing is allocated into the To Space. When the From Space is full, the garbage collector beings collecting garbage. Tracing references through the From Space from the roots, the collector copies all alive objects into the To Space, leaving only garbage left in the From Space. Then, the collector empties everything in the From Space. Finally, the collector swaps the To and From Space. The old From Space (which is now empty) becomes the new To Space and the old To Space (which has all alive objects) becomes the new From Space. Then, program execution continues.

Pros and Cons of Two-Space Copying Garbage Collection

Now that we know how a two-space copying garbage collector works, why might someone choose to use a two-space copying garbage collection? Here are three pros of this type of garbage collector:

  1. No memory fragmentation: Two-space copying collection moves all live objects close together, preventing memory fragmentation. This is because two-space copying collection, live objects post-collection are all at the start of the old To Space/new From Space. This is unlike strategies like mark and sweep which leave live objects in the same memory locations they were in before garbage collection.
  2. Collection time is proportional to the amount of live data: Two-space copying garbage collection only requires traversing the chains of live objects and live pointers instead of iterating through all memory. We don’t go spend time looking at memory that won’t kept post-collection.
  3. Allocation only requires incrementing a pointer: To store objects in memory with a two-space copying garbage collector, we don’t have to do much work. Allocation only requires storing an object in the next available memory location and then incrementing the pointer storing the location of the next available memory location.

That being said, it does have downsides:

  1. Only half of the heap is able to be used for storing objects since the other half is needed for collection: Since we need to have heap space open for copying live objects during collection, we can’t use the entire heap to store data. In fact, since theoretically all objects could still be live when collecting garbage, we need the same amount of space available in both the From Space and To Space. So, we can only use half the heap when allocating data.
  2. Has terrible memory locality: In two-space copying garbage collection, objects aren’t copied over to the To Space next to the objects they reference. This means that the garbage collection has terrible virtual memory performance.
  3. Requires the entire program to stop executing to collect garbage: Like other stop-the-world methods, program execution has to completely stop while garbage is collected.

There are definitely more pros and cons to two-space copying garbage collection, but these are just a few.

How Does Two-Space Copying Garbage Collection Work

Now that we know what two-space copying collection is, let’s figure out how we could implement it! Since we’re talking about Cheney’s algorithm, let’s go over all of the steps we need to follow:

  1. Copy program roots: Iterate through the roots of the current state of the program in order. For each:
    1. Copy the root from the From Space into the first available space in the To Space.
    2. Replace the memory location in the From Space originally inhabited by the root with a forwarding pointer pointing to its new location in the To Space—this is so that we know where to find this alive memory in its new location.
  2. Copy remaining live memory: Iterate through the objects in the To Space in order. For each:
    1. If the object references anything, find the referenced memory in the From Space; at this point, all references will still be pointing to the From Space. Then:
      1. If that memory is a forwarding pointer, update the reference in the To Space with the location pointed to by the forwarding pointer and move on.
      2. Otherwise, copy the object referenced into the next available space in the To Space.
      3. Replace the memory location in the From Space originally inhabited by the object with a forwarding pointer pointing to its new location in the To Space.
      4. Update the reference in the To Space that pointed to this location in the From Space with the object’s new location in the To Space.
  3. Empty the From Space: Empty all memory in the From Space.
  4. Swap the From and To Spaces: Flip the To Space and the From Space and return to program execution, allocating new memory in the new From Space.

That’s it! Those are all the steps to Cheney’s algorithm. However, that was a bit abstract, so let’s try to clarify with a concrete example.

Mock Example

Let’s see how this works using a simple, concrete example. Suppose we have a programming language with only two data types:

  1. Integers (represented by i) that only store a single number
  2. Pairs (represented by c for cons) that store two pointers each of which references some address in memory

To denote what type each object in memory is, we’ll put the data type before the values stored in the object. This means that each integer requires 2 “spots” in memory, one for the data type and one for the number. Pairs require 3 “spots,” one for the data type and one for each pointer, of which there are two. We’ll use a space to represent empty memory. Let’s also say that our From Space is in the first half of our heap, addresses 0x00 through 0x0F, and our To Space is in the second half, addresses 0x10 through 0x1F. Let’s also say our program has been executing, and we’re now out of memory. Remember that this means that the entire From Space is full, the To Space is still empty because it hasn’t been used yet. The current state of the allocated memory is:

Address Value Address Value
0x00 i 0x10  
0x01 1 0x11  
0x02 i 0x12  
0x03 2 0x13  
0x04 i 0x14  
0x05 3 0x15  
0x06 i 0x16  
0x07 4 0x17  
0x08 c 0x18  
0x09 0x00 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

Finally, let’s say that the roots are at 0x00 and 0x08.

What happens when we collect garbage? First, we need to copy over our two roots and leave forwarding pointers, represented by f, in their places (Step 1 Copy program roots from above). When we copy objects over from the From Space to the To Space, we place them at the first available free address in the To Space. Therefore, we need some pointer pointing to the first free address in the To Space. We’ll call it our free pointer, since it points to the first free space in the To Space. When we start collecting garbage, free points to the start of the To Space—address 0x10 in this example—and we’ll increment it after we copy over each object in the From Space.

After copying over the roots, we need to copy the remaining live memory by iterating through everything already in the To Space (Step 2 Copy remaining live memory). To do so, we need a second pointer, the scan pointer (since it scans the current values in the To Space). Initially, scan also needs to point to the start of the To Space because we haven’t iterated through anything yet. In our example, we’ll bold changes in memory after each step.

free = 0x10, scan = 0x10

Address Value Address Value
0x00 i 0x10<- free,scan  
0x01 1 0x11  
0x02 i 0x12  
0x03 2 0x13  
0x04 i 0x14  
0x05 3 0x15  
0x06 i 0x16  
0x07 4 0x17  
0x08 c 0x18  
0x09 0x00 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

Now that we know where to start copying objects over, let’s copy our roots! Our first root, at 0x00, is an integer, so it takes up two addresses. We’ll copy it over to the first two address starting at our free pointer: 0x10 and 0x11. Then, we need to leave a forwarding pointer at its old location, 0x00, pointing to its new address, 0x10—remember that these forwarding pointers will help us find the new locations of objects already copied to the To Space if anything else references them. Now, we increment our free pointer by two since we took up two addresses. Notice that we don’t increment our scan pointer yet because we haven’t iterated through anything in the To Space yet.

free = 0x12, scan = 0x10

Address Value Address Value
0x00 f 0x10<-scan i
0x01 0x10 0x11 1
0x02 i 0x12<-free  
0x03 2 0x13  
0x04 i 0x14  
0x05 3 0x15  
0x06 i 0x16  
0x07 4 0x17  
0x08 c 0x18  
0x09 0x00 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

Next, we do this same thing for our second root. We copy it over to first free address in the To Space, leave a forwarding pointer behind, and increment our free pointer again.

free = 0x15, scan = 0x10

Address Value Address Value
0x00 f 0x10<-scan i
0x01 0x10 0x11 1
0x02 i 0x12 c
0x03 2 0x13 0x00
0x04 i 0x14 0x04
0x05 3 0x15<-free  
0x06 i 0x16  
0x07 4 0x17  
0x08 f 0x18  
0x09 0x12 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

We’ve copied our two roots over to the To space, so what’s next? The objects we just copied over might reference other objects. If they do that, then those objects still need to be alive too. Otherwise, we would throw away memory that might be referenced in the future, which we don’t want to do. That’s a great way to cause a lot of seg faults. Now, we need to iterate through the objects currently in the To Space and copy over any objects that they reference (Step 2. Copy remaining live memory); this is the point (get it?) of the scan pointer.

Right now, scan points to address 0x10. At 0x10, there’s just have an integer. Integers don’t reference anything, so we can move on to the next object in the To Space. Let’s increment scan just like we increment free, bumping it up by two since integers take up two addresses in memory.

free= 0x15, scan = 0x12

Address Value Address Value
0x00 f 0x10 i
0x01 0x10 0x11 1
0x02 i 0x12<-scan c
0x03 2 0x13 0x00
0x04 i 0x14 0x04
0x05 3 0x15<-free  
0x06 i 0x16  
0x07 4 0x17  
0x08 f 0x18  
0x09 0x12 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

The data at address 0x12 is a pair (a cons), meaning it does reference other memory! We need to copy what it references over to the To Space as well. Since we have a pair at 0x12, it references two addresses in memory. The first location referenced by this pair is 0x00, so let’s go there first. At 0x00, we have a forwarding pointer to 0x10. This makes our life easy, we just need to update the address that 0x13 points to to the new location: 0x10. Now, instead of pointing to 0x00, the first pointer in our pair points to 0x10:

free = 0x15, scan = 0x12

Address Value Address Value
0x00 f 0x10 i
0x01 0x10 0x11 1
0x02 i 0x12<-scan c
0x03 2 0x13 0x10
0x04 i 0x14 0x04
0x05 3 0x15<-free  
0x06 i 0x16  
0x07 4 0x17  
0x08 f 0x18  
0x09 0x12 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

The cons at 0x12 also references 0x04, so we need to copy that too. We don’t have a forwarding pointer here, so we move the object at 0x04 to the next available space in the To Space, the address of which is stored in free. Then, we leave a forwarding pointer in 0x04 pointing to the new address in the To Space. Finally, we increment both free and scan.

free = 0x17, scan = 0x15

Address Value Address Value
0x00 f 0x10 i
0x01 0x10 0x11 1
0x02 i 0x12 c
0x03 2 0x13 0x10
0x04 f 0x14 0x15
0x05 0x15 0x15<-scan i
0x06 i 0x16 3
0x07 4 0x17<-free  
0x08 f 0x18  
0x09 0x12 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

Now, scan still points to an object in the To Space. So, we repeat the process. Since the object at 0x15 doesn’t reference anything, we have nothing new from the From Space to copy. We just update scan.

free = 0x17, scan = 0x17

Address Value Address Value
0x00 f 0x10 i
0x01 0x10 0x11 1
0x02 i 0x12 c
0x03 2 0x13 0x10
0x04 f 0x14 0x15
0x05 0x15 0x15 i
0x06 i 0x16 3
0x07 4 0x17<-free,scan  
0x08 f 0x18  
0x09 0x12 0x19  
0x0A 0x04 0x1A  
0x0B i 0x1B  
0x0C 5 0x1C  
0x0D c 0x1D  
0x0E 0x08 0x1E  
0x0F 0x0B 0x1F  

At this point, scan == free. This means that we’re done copying! Because scan == free, we know there are no more objects left to scan over in the To Space. So we’ve collected all the live data and just need to empty out the old From Space (Step 3. Empty the From Space).

Address Value Address Value
0x00   0x10 i
0x01   0x11 1
0x02   0x12 c
0x03   0x13 0x10
0x04   0x14 0x15
0x05   0x15 i
0x06   0x16 3
0x07   0x17  
0x08   0x18  
0x09   0x19  
0x0A   0x1A  
0x0B   0x1B  
0x0C   0x1C  
0x0D   0x1D  
0x0E   0x1E  
0x0F   0x1F  

Finally, we swap the start of the From and To Spaces, leaving the live memory as the start of our new From Space and leaving our new To Space empty for the next time we need to run the garbage collector (Step 4. Swap the From and To Spaces). Two-space copying garbage collection is as simple as that!

Tl;dr

Two-Space Garbage Collection is a garbage collection strategy that involves copying live from one half of the heap to the other, discarding all dead memory left behind in the first half. The strategy has some valuable benefits, such as no memory fragmentation, collection time being proportional to the amount of live data, allocation only requiring incrementing a pointer, and collection not requiring much state. However, there are also some important downsides to consider, including only half of the heap being able to be used for storing objects since the other half is needed for collection, having terrible memory locality, and requiring the entire program to stop executing to collect garbage.

References and Further Reading