Two-Space Copying Garbage Collection
Post Metadata
When choosing a programming language to use for a project—or when implementing one—there are many considerations to keep in mind. One of those is whether or not the programming language uses any sort of garbage collection. Today, I’m going to talk about one method of garbage collection that we can implement ourselves fairly easily: two-space copying garbage collection.
What is Two-Space Copying Garbage Collection?
Two-space copying garbage collection (or semi-space garbage collection) involves dividing the available memory in half: half is used for normal allocation and half is used for garbage collection. The specific method of doing so that I’ll be talking about is Cheney’s algorithm. Two-space copying garbage collection is a stop and copy method of tracing garbage collection. “Stop and copy” means that, at some point, the garbage collector will stop the execution of a program and will copy all of alive memory. “Tracing” means that the garbage collector determines what memory is still alive by tracing through objects in memory from starting point. This means that our two-space copying garbage collection traces which objects are being used by a program by building a chain of referenced objects from some initial objects—which we’ll call roots and, for simplicity’s sake, today we’ll just pretend that we know what they are automatically; we’re not going to talk about finding them—and stops program execution to copy all memory still in use while discarding unused objects.
As a concrete example of how a tracing garbage collector works, let’s say that we’re running a program that has the following memory stored and that our roots are the memory stored at addresses 0x0
, 0x5
, and 0x7
:
Address | Value | Type | Root? |
---|---|---|---|
0x0 |
0x1 |
Pointer | Yes |
0x1 |
9 |
Int | No |
0x2 |
40 |
Int | No |
0x3 |
A |
Char | No |
0x4 |
0x3 |
Pointer | No |
0x5 |
0x6 |
Pointer | Yes |
0x6 |
0x2 |
Pointer | No |
0x7 |
1 |
Int | Yes |
And let’s say that we now need to collect garbage. To do so, let’s trace through memory to figure out which objects are still alive and need to be kept after garbage collection.
We start at our roots.
By definition, our roots are still alive memory.
So, 0x0
, 0x5
, and 0x7
are all still alive—i.e., they are still being used by our program.
Therefore, our garbage collector needs to keep these values.
Next we need to figure out if any of our roots reference any objects.
The data stored at address 0x7
is an integer, it doesn’t reference anything.
However, the values at 0x0
and 0x5
are pointers, they do reference things.
And, since they’re alive, the objects they point to are also still alive.
So, addresses 0x1
and 0x6
are also alive.
What do we do now?
Well, we still need to trace through these new alive objects.
Since the value at 0x6
is again a pointer, what it points to is also still alive.
So, the data at address 0x2
is still alive.
Because it’s an integer and doesn’t reference anything, we’re done.
In the end, this means that memory at addresses 0x0
, 0x1
, 0x2
, 0x5
, and 0x7
all needs to be alive after garbage collection.
Just like a tracing garbage collector, we figured that out by tracing through references from some starting points in memory.
If we were a stop and copy garbage collector, we would have run our program until we ran out of memory. Then, we would have stopped program execution, figured out what memory we needed to keep—in this example, by tracing—and copied that memory into some new space. Finally, we would have discard all the un-copied memory.
To do this, a two-space copying garbage collector splits our memory—in this case, the heap—up into two equal parts: a “From Space” and a “To Space.” During normal program execution, we allocate memory into only the From Space—nothing is allocated into the To Space. When the From Space is full, the garbage collector beings collecting garbage. Tracing references through the From Space from the roots, the collector copies all alive objects into the To Space, leaving only garbage left in the From Space. Then, the collector empties everything in the From Space. Finally, the collector swaps the To and From Space. The old From Space (which is now empty) becomes the new To Space and the old To Space (which has all alive objects) becomes the new From Space. Then, program execution continues.
Pros and Cons of Two-Space Copying Garbage Collection
Now that we know how a two-space copying garbage collector works, why might someone choose to use a two-space copying garbage collection? Here are three pros of this type of garbage collector:
- No memory fragmentation: Two-space copying collection moves all live objects close together, preventing memory fragmentation. This is because two-space copying collection, live objects post-collection are all at the start of the old To Space/new From Space. This is unlike strategies like mark and sweep which leave live objects in the same memory locations they were in before garbage collection.
- Collection time is proportional to the amount of live data: Two-space copying garbage collection only requires traversing the chains of live objects and live pointers instead of iterating through all memory. We don’t go spend time looking at memory that won’t kept post-collection.
- Allocation only requires incrementing a pointer: To store objects in memory with a two-space copying garbage collector, we don’t have to do much work. Allocation only requires storing an object in the next available memory location and then incrementing the pointer storing the location of the next available memory location.
That being said, it does have downsides:
- Only half of the heap is able to be used for storing objects since the other half is needed for collection: Since we need to have heap space open for copying live objects during collection, we can’t use the entire heap to store data. In fact, since theoretically all objects could still be live when collecting garbage, we need the same amount of space available in both the From Space and To Space. So, we can only use half the heap when allocating data.
- Has terrible memory locality: In two-space copying garbage collection, objects aren’t copied over to the To Space next to the objects they reference. This means that the garbage collection has terrible virtual memory performance.
- Requires the entire program to stop executing to collect garbage: Like other stop-the-world methods, program execution has to completely stop while garbage is collected.
There are definitely more pros and cons to two-space copying garbage collection, but these are just a few.
How Does Two-Space Copying Garbage Collection Work
Now that we know what two-space copying collection is, let’s figure out how we could implement it! Since we’re talking about Cheney’s algorithm, let’s go over all of the steps we need to follow:
- Copy program roots: Iterate through the roots of the current state of the program in order. For each:
- Copy the root from the From Space into the first available space in the To Space.
- Replace the memory location in the From Space originally inhabited by the root with a forwarding pointer pointing to its new location in the To Space—this is so that we know where to find this alive memory in its new location.
- Copy remaining live memory: Iterate through the objects in the To Space in order. For each:
- If the object references anything, find the referenced memory in the From Space; at this point, all references will still be pointing to the From Space. Then:
- If that memory is a forwarding pointer, update the reference in the To Space with the location pointed to by the forwarding pointer and move on.
- Otherwise, copy the object referenced into the next available space in the To Space.
- Replace the memory location in the From Space originally inhabited by the object with a forwarding pointer pointing to its new location in the To Space.
- Update the reference in the To Space that pointed to this location in the From Space with the object’s new location in the To Space.
- If the object references anything, find the referenced memory in the From Space; at this point, all references will still be pointing to the From Space. Then:
- Empty the From Space: Empty all memory in the From Space.
- Swap the From and To Spaces: Flip the To Space and the From Space and return to program execution, allocating new memory in the new From Space.
That’s it! Those are all the steps to Cheney’s algorithm. However, that was a bit abstract, so let’s try to clarify with a concrete example.
Mock Example
Let’s see how this works using a simple, concrete example. Suppose we have a programming language with only two data types:
- Integers (represented by
i
) that only store a single number - Pairs (represented by
c
forcons
) that store two pointers each of which references some address in memory
To denote what type each object in memory is, we’ll put the data type before the values stored in the object.
This means that each integer requires 2 “spots” in memory, one for the data type and one for the number.
Pairs require 3 “spots,” one for the data type and one for each pointer, of which there are two.
We’ll use a space to represent empty memory.
Let’s also say that our From Space is in the first half of our heap, addresses 0x00
through 0x0F
, and our To Space is in the second half, addresses 0x10
through 0x1F
.
Let’s also say our program has been executing, and we’re now out of memory.
Remember that this means that the entire From Space is full, the To Space is still empty because it hasn’t been used yet.
The current state of the allocated memory is:
Address | Value | Address | Value |
---|---|---|---|
0x00 |
i |
0x10 |
|
0x01 |
1 |
0x11 |
|
0x02 |
i |
0x12 |
|
0x03 |
2 |
0x13 |
|
0x04 |
i |
0x14 |
|
0x05 |
3 |
0x15 |
|
0x06 |
i |
0x16 |
|
0x07 |
4 |
0x17 |
|
0x08 |
c |
0x18 |
|
0x09 |
0x00 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
Finally, let’s say that the roots are at 0x00
and 0x08
.
What happens when we collect garbage?
First, we need to copy over our two roots and leave forwarding pointers, represented by f
, in their places (Step 1 Copy program roots from above).
When we copy objects over from the From Space to the To Space, we place them at the first available free address in the To Space.
Therefore, we need some pointer pointing to the first free address in the To Space.
We’ll call it our free
pointer, since it points to the first free space in the To Space.
When we start collecting garbage, free
points to the start of the To Space—address 0x10
in this example—and we’ll increment it after we copy over each object in the From Space.
After copying over the roots, we need to copy the remaining live memory by iterating through everything already in the To Space (Step 2 Copy remaining live memory).
To do so, we need a second pointer, the scan
pointer (since it scans the current values in the To Space).
Initially, scan
also needs to point to the start of the To Space because we haven’t iterated through anything yet.
In our example, we’ll bold changes in memory after each step.
free = 0x10
, scan = 0x10
Address | Value | Address | Value |
---|---|---|---|
0x00 |
i |
0x10 <- free ,scan |
|
0x01 |
1 |
0x11 |
|
0x02 |
i |
0x12 |
|
0x03 |
2 |
0x13 |
|
0x04 |
i |
0x14 |
|
0x05 |
3 |
0x15 |
|
0x06 |
i |
0x16 |
|
0x07 |
4 |
0x17 |
|
0x08 |
c |
0x18 |
|
0x09 |
0x00 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
Now that we know where to start copying objects over, let’s copy our roots!
Our first root, at 0x00
, is an integer, so it takes up two addresses.
We’ll copy it over to the first two address starting at our free
pointer: 0x10
and 0x11
.
Then, we need to leave a forwarding pointer at its old location, 0x00
, pointing to its new address, 0x10
—remember that these forwarding pointers will help us find the new locations of objects already copied to the To Space if anything else references them.
Now, we increment our free
pointer by two since we took up two addresses.
Notice that we don’t increment our scan
pointer yet because we haven’t iterated through anything in the To Space yet.
free = 0x12
, scan = 0x10
Address | Value | Address | Value |
---|---|---|---|
0x00 |
f |
0x10 <-scan |
i |
0x01 |
0x10 |
0x11 |
1 |
0x02 |
i |
0x12 <-free |
|
0x03 |
2 |
0x13 |
|
0x04 |
i |
0x14 |
|
0x05 |
3 |
0x15 |
|
0x06 |
i |
0x16 |
|
0x07 |
4 |
0x17 |
|
0x08 |
c |
0x18 |
|
0x09 |
0x00 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
Next, we do this same thing for our second root.
We copy it over to first free address in the To Space, leave a forwarding pointer behind, and increment our free
pointer again.
free = 0x15
, scan = 0x10
Address | Value | Address | Value |
---|---|---|---|
0x00 |
f |
0x10 <-scan |
i |
0x01 |
0x10 |
0x11 |
1 |
0x02 |
i |
0x12 |
c |
0x03 |
2 |
0x13 |
0x00 |
0x04 |
i |
0x14 |
0x04 |
0x05 |
3 |
0x15 <-free |
|
0x06 |
i |
0x16 |
|
0x07 |
4 |
0x17 |
|
0x08 |
f |
0x18 |
|
0x09 |
0x12 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
We’ve copied our two roots over to the To space, so what’s next?
The objects we just copied over might reference other objects.
If they do that, then those objects still need to be alive too.
Otherwise, we would throw away memory that might be referenced in the future, which we don’t want to do.
That’s a great way to cause a lot of seg faults.
Now, we need to iterate through the objects currently in the To Space and copy over any objects that they reference (Step 2. Copy remaining live memory); this is the point (get it?) of the scan
pointer.
Right now, scan
points to address 0x10
.
At 0x10
, there’s just have an integer.
Integers don’t reference anything, so we can move on to the next object in the To Space.
Let’s increment scan
just like we increment free
, bumping it up by two since integers take up two addresses in memory.
free= 0x15
, scan = 0x12
Address | Value | Address | Value |
---|---|---|---|
0x00 |
f |
0x10 |
i |
0x01 |
0x10 |
0x11 |
1 |
0x02 |
i |
0x12 <-scan |
c |
0x03 |
2 |
0x13 |
0x00 |
0x04 |
i |
0x14 |
0x04 |
0x05 |
3 |
0x15 <-free |
|
0x06 |
i |
0x16 |
|
0x07 |
4 |
0x17 |
|
0x08 |
f |
0x18 |
|
0x09 |
0x12 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
The data at address 0x12
is a pair (a cons), meaning it does reference other memory!
We need to copy what it references over to the To Space as well.
Since we have a pair at 0x12
, it references two addresses in memory.
The first location referenced by this pair is 0x00
, so let’s go there first.
At 0x00
, we have a forwarding pointer to 0x10
.
This makes our life easy, we just need to update the address that 0x13
points to to the new location: 0x10
.
Now, instead of pointing to 0x00
, the first pointer in our pair points to 0x10
:
free = 0x15
, scan = 0x12
Address | Value | Address | Value |
---|---|---|---|
0x00 |
f |
0x10 |
i |
0x01 |
0x10 |
0x11 |
1 |
0x02 |
i |
0x12 <-scan |
c |
0x03 |
2 |
0x13 |
0x10 |
0x04 |
i |
0x14 |
0x04 |
0x05 |
3 |
0x15 <-free |
|
0x06 |
i |
0x16 |
|
0x07 |
4 |
0x17 |
|
0x08 |
f |
0x18 |
|
0x09 |
0x12 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
The cons
at 0x12
also references 0x04
, so we need to copy that too.
We don’t have a forwarding pointer here, so we move the object at 0x04
to the next available space in the To Space, the address of which is stored in free
.
Then, we leave a forwarding pointer in 0x04
pointing to the new address in the To Space.
Finally, we increment both free
and scan
.
free = 0x17
, scan = 0x15
Address | Value | Address | Value |
---|---|---|---|
0x00 |
f |
0x10 |
i |
0x01 |
0x10 |
0x11 |
1 |
0x02 |
i |
0x12 |
c |
0x03 |
2 |
0x13 |
0x10 |
0x04 |
f |
0x14 |
0x15 |
0x05 |
0x15 |
0x15 <-scan |
i |
0x06 |
i |
0x16 |
3 |
0x07 |
4 |
0x17 <-free |
|
0x08 |
f |
0x18 |
|
0x09 |
0x12 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
Now, scan
still points to an object in the To Space.
So, we repeat the process.
Since the object at 0x15
doesn’t reference anything, we have nothing new from the From Space to copy.
We just update scan
.
free = 0x17
, scan = 0x17
Address | Value | Address | Value |
---|---|---|---|
0x00 |
f |
0x10 |
i |
0x01 |
0x10 |
0x11 |
1 |
0x02 |
i |
0x12 |
c |
0x03 |
2 |
0x13 |
0x10 |
0x04 |
f |
0x14 |
0x15 |
0x05 |
0x15 |
0x15 |
i |
0x06 |
i |
0x16 |
3 |
0x07 |
4 |
0x17 <-free ,scan |
|
0x08 |
f |
0x18 |
|
0x09 |
0x12 |
0x19 |
|
0x0A |
0x04 |
0x1A |
|
0x0B |
i |
0x1B |
|
0x0C |
5 |
0x1C |
|
0x0D |
c |
0x1D |
|
0x0E |
0x08 |
0x1E |
|
0x0F |
0x0B |
0x1F |
At this point, scan == free
.
This means that we’re done copying!
Because scan == free
, we know there are no more objects left to scan over in the To Space.
So we’ve collected all the live data and just need to empty out the old From Space (Step 3. Empty the From Space).
Address | Value | Address | Value |
---|---|---|---|
0x00 |
0x10 |
i |
|
0x01 |
0x11 |
1 |
|
0x02 |
0x12 |
c |
|
0x03 |
0x13 |
0x10 |
|
0x04 |
0x14 |
0x15 |
|
0x05 |
0x15 |
i |
|
0x06 |
0x16 |
3 |
|
0x07 |
0x17 |
||
0x08 |
0x18 |
||
0x09 |
0x19 |
||
0x0A |
0x1A |
||
0x0B |
0x1B |
||
0x0C |
0x1C |
||
0x0D |
0x1D |
||
0x0E |
0x1E |
||
0x0F |
0x1F |
Finally, we swap the start of the From and To Spaces, leaving the live memory as the start of our new From Space and leaving our new To Space empty for the next time we need to run the garbage collector (Step 4. Swap the From and To Spaces). Two-space copying garbage collection is as simple as that!
Tl;dr
Two-Space Garbage Collection is a garbage collection strategy that involves copying live from one half of the heap to the other, discarding all dead memory left behind in the first half. The strategy has some valuable benefits, such as no memory fragmentation, collection time being proportional to the amount of live data, allocation only requiring incrementing a pointer, and collection not requiring much state. However, there are also some important downsides to consider, including only half of the heap being able to be used for storing objects since the other half is needed for collection, having terrible memory locality, and requiring the entire program to stop executing to collect garbage.
References and Further Reading
- Dimoulas, C., Findler, R., St. Amour, V., Northwestern University COMP_SCI 321 Programming Languages Lecture Material: Garbage Collection Basics, Mark and Sweep Garbage Collection, Copying Garbage Collection
- Drakos, N., Moore, R., CSE 5317/4305: Design and Construction of Compilers Section 13.2.2 Copying Garbage Collection
- Wilson, Paul R., Uniprocessor Garbage Collection Techniques