Authors: David Mulder, Curtis Welborn


This paper describes the implementation of a garbage collector as an undergraduate research project. The garbage collector is a continuation of a project where an Assembler, Virtual Machine and Compiler were implemented as a capstone project. The project required modifying the compiler to allocate memory compatible with a mark and sweep algorithm, and adding the mark and sweep algorithm to the virtual machine. Computer Science faculty should take note that this was all completed by an undergraduate student within a year’s time, and that such challenges can reasonably be accomplished by undergraduate seniors.


Challenges of the nature described in this paper are generally reserved for graduate students, and are considered outside the grasp of undergraduates. Computer Science faculty and students alike will be interested to note that the preceding project was completed entirely by an undergraduate senior. It is also interesting to note that the project was completed by an average student of little academic achievement, and that a Compiler, Assembler and Multi-Processed Virtual Machine with Garbage Collection were written within a year’s time.

As part of an undergraduate capstone, an Assembler, Virtual Machine and Compiler were implemented. The two-pass Compiler performs Lexical and Syntax analysis in the first pass, and Semantic analysis and Intermediate code generation in the second pass. Using Intermediate code as input, the Compiler then generates target code (assembly) that can be fed to the Assembler to generate byte-code. The byte-code can then be loaded into the Virtual Machine for execution [1]. Through independent study, a garbage collector was added to the virtual machine, which included extensive changes to the compiler.

The compiler had to be modified to place by-reference parameters onto the run-time stack first. An additional parameter is also added that counts the number of by-reference parameters. The counter is needed by the Mark and Sweep algorithm to determine the location of the by-reference parameters within an Activation Record.

To support the mark and sweep garbage collection algorithm, the virtual machine had to be modified to allocate memory from a list of available blocks. The mark and sweep algorithm is triggered when a call to allocate memory detects no free blocks. In the mark phase of the algorithm, the run-time stack is traversed searching for points that reference blocks allocated on the heap. Any block of memory reachable from the run-time stack must be marked as in use. During the sweep phase of the algorithm, the heap is searched looking for any blocks not marked as in use during the mark phase.
Figure 1: The run-time stack and heap, showing activation records (with reference counter fields and segregated by-reference parameters) and allocated blocks A–D on the heap.
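The allocation behavior described above — collect only when no free block remains — can be sketched as follows. This is an illustrative model, not the project's actual code; the SimpleHeap class and the run_gc callback are invented for the example.

```python
class SimpleHeap:
    """Toy fixed-block heap: a list of flags tracks which blocks are in use."""
    def __init__(self, num_blocks):
        self.in_use = [False] * num_blocks

    def find_free_block(self):
        for i, used in enumerate(self.in_use):
            if not used:
                return i
        return None

def allocate(heap, run_gc):
    """Return a free block index, triggering mark-and-sweep only on demand."""
    block = heap.find_free_block()
    if block is None:
        run_gc()                       # mark live blocks, free the rest
        block = heap.find_free_block()
    if block is None:
        raise MemoryError("heap exhausted even after collection")
    heap.in_use[block] = True
    return block
```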


The compiler was originally written to pass parameters to a function call via an activation record in the order they were written in the source language (a language called KXI, which is similar to Java). It was necessary for the compiler to segregate by-reference parameters from by-value parameters, because the mark phase of the algorithm must know the location of every by-reference parameter within a function call's activation record. See the ordering of variables in the current Activation Record of Figure 1. Because the variables are segregated, references are easily located based on offsets. For example, the first reference is located at an offset of 4*sizeof(int) from the base of the activation record (the return address, previous activation record pointer and reference counters are all integers), and the second reference at an offset of 5*sizeof(int) from the base.

The mark phase of the algorithm needed a special field inserted in each activation record that counts the number of by-reference parameters. This field was added in the activation record in such a way that the mark phase could iterate over each by-reference parameter and set the mark bit in the corresponding block for each object allocated on the heap. The field is referred to as the Reference Counter in Figure 1. The reference counter in this activation record tells us that there are 2 references. If we decremented the reference counter to 1, for example, the second reference on the stack would be interpreted as an integer.
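Given this layout, locating each reference is simple pointer arithmetic. The sketch below is hypothetical; it assumes a header of four integer fields preceding the first reference, inferred from the 4*sizeof(int) and 5*sizeof(int) offsets quoted above.

```python
INT_SIZE = 4      # sizeof(int) in the VM described here
HEADER_INTS = 4   # assumption: integer fields preceding the first reference

def reference_offsets(ref_count, base=0):
    """Offsets (from the activation-record base) of each by-reference
    parameter, assuming references are segregated to the front."""
    return [base + (HEADER_INTS + i) * INT_SIZE for i in range(ref_count)]
```

With a reference counter of 2, this yields offsets 4*sizeof(int) and 5*sizeof(int), matching the example above.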


In order to make space allocation more efficient, the heap was separated into large and small blocks. Bit vectors were used to keep track of which blocks are allocated on the heap.

An instruction was added to the virtual machine for allocating memory. Prior to the instruction, memory allocation consisted of just moving a pointer within the heap. The new allocation instruction calls the mark and sweep algorithm. If there is space available on the heap, the instruction returns a pointer to that block and marks the block as unavailable within one of the vectors that tracks all blocks in the heap.
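The free-block bookkeeping described above might look like the following sketch; the class name and the "small"/"large" size-class labels are invented for illustration, and the bit vectors are modeled as lists of 0/1 flags.

```python
# One vector per size class; a set bit means the block is allocated.
class BlockTable:
    def __init__(self, num_small, num_large):
        self.vectors = {"small": [0] * num_small, "large": [0] * num_large}

    def allocate(self, kind):
        """Return the index of the first free block of the given size
        class, or None when the vector has no clear bit left."""
        vec = self.vectors[kind]
        for i, bit in enumerate(vec):
            if bit == 0:
                vec[i] = 1      # mark the block unavailable
                return i
        return None             # caller triggers mark-and-sweep

    def free(self, kind, index):
        self.vectors[kind][index] = 0
```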

If no available space is found, garbage collection is started. The algorithm first calculates the location of the activation record for the current function call, then calculates the offset to the by-reference counter field within that record. Having the number of by-reference parameters in the current activation record, the algorithm can calculate the offset to each reference, follow it to its allocated block on the heap, and set the block's mark bit to in use.

The algorithm must recursively follow references allocated inside of each block. This was achieved by organizing objects within a block in a similar manner to how function calls are organized in an activation record. In this way, lists, trees and even recursive data structures allocated on the heap can all be traversed and marked. In Figure 1, the mark phase will follow all references inside of Small Block A, so Small Block B is also marked.

Once all objects (blocks) and descendant objects have been marked, the algorithm calculates the location of the previous activation record. The same process of following references and marking objects proceeds for each activation record until all function calls on the run-time stack have been searched.

The sweep phase of the algorithm now iterates over the entire heap, checking the mark bit of each block. If the mark bit is set, the block has a live reference to it originating in an activation record currently on the run-time stack; the algorithm clears the bit and moves on. Blocks A, B and D in Figure 1 have references pointing to them, so they will be un-marked and will remain allocated.

If the mark bit is not set, the algorithm clears the relevant marker in the vector of available blocks, freeing the block for reuse (See Small Block C in Figure 1) [2]. This allocated block will be freed during the sweep phase because it has no references pointing to it on the run-time stack.
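Putting the two phases together, a minimal model of mark and sweep over the scenario in Figure 1 could look like the sketch below. The block ids and the block_refs mapping are illustrative abstractions; the actual project walks raw heap memory using the offsets described earlier.

```python
def mark(block_refs, roots):
    """Mark every block reachable from the run-time-stack roots.
    block_refs maps a block id to the block ids it references."""
    marked = set()
    stack = list(roots)
    while stack:                  # iterative form of the recursive walk
        b = stack.pop()
        if b in marked:
            continue              # already visited (handles cycles)
        marked.add(b)
        stack.extend(block_refs.get(b, []))
    return marked

def sweep(all_blocks, marked):
    """Return the blocks to free: allocated but never marked."""
    return [b for b in all_blocks if b not in marked]
```

For Figure 1, the roots are A and D, and A references B, so the sweep frees only C.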


This method of garbage collection is effective, but it did prove to be inefficient. Because the algorithm is not activated until no free space remains, it is unsuitable for real-time applications. The algorithm could be modified to operate incrementally, but it would still cause occasional lag. These are well-known problems of garbage collection.

The greatest achievement of this project was the recognition of what can be accomplished by undergraduate seniors. Computer Science faculty should recognize that computing projects of this caliber can reasonably be expected in an undergraduate senior project course.


The primary author has started work on an HP-UX PA-RISC virtual machine. The first obstacle will be setting up the correct environment, based on descriptions in the PA-RISC specifications. The second will be translating the pseudo-code in the specification into C; a compiler/translator will be written to automatically convert the specification pseudo-code into C.

A significant part of the project will be writing the loader, since it will need to support features such as run-time linking. Ideally, the virtual machine will run at boot time and will include a boot loader. This would allow PA-RISC software to run on virtually any hardware.


[1] Mulder, D., Welborn, C., Lessons in converting from Python to C++, Journal of Computing Sciences in Colleges, 29(2), 49-57, December 2013, http://dl.acm.org/citation.cfm?id=253542
[2] Jones, R., Lins, R. D., Garbage Collection: Algorithms for Automatic Dynamic Memory Management, Wiley, West Sussex, England, 1996.

© CCSC, (2014). This is the author’s version of the work. It is posted here by permission of CCSC for your personal use. Not for redistribution. The definitive version was published in The Journal of Computing Sciences in Colleges, {volume 30, number 2, December 2014}, http://dl.acm.org/.


Authors: David Mulder; Curtis Welborn


Compilers are a core technology within Computer Science, whether they create a native executable file that runs directly on the hardware or a byte-code file that runs in a Virtual Machine. This paper begins by outlining the primary author's capstone, in which he implemented an Assembler, Virtual Machine and Compiler from scratch. No existing code-base or special-purpose libraries could be used in the implementation. The major components of these programs are discussed, along with the basic approach taken to creating them. Following this overview, a more detailed description is given of lessons the author learned while converting part of his capstone project from Python to C++. The project provided a rather unique perspective on the issue of converting a program from one language to another, as the example program was a non-trivial Virtual Machine. The paper concludes with a brief section describing future work, where a Garbage Collector will be implemented.


As part of a capstone project, students in the Computer Science Department are required to implement an Assembler, Virtual Machine (VM) and a Compiler. The Two-pass Assembler reads an assembly source file (.asm) containing assembly code patterned on the MIPS architecture. During the first pass of the assembler, the syntax of the assembly code is verified and the addresses of labels are computed and stored in a dictionary for later reference. If any syntax errors are detected, an error message is produced and the assembler stops; otherwise, the assembler continues with pass two. In pass two, some semantic checks are performed, such as verifying that all referenced labels are unique and defined. If the assembly code passes the semantic checks, numeric byte-code is produced.
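A sketch of pass one might look like the following. The fixed instruction size and the "NAME:" label syntax are assumptions for illustration; the real assembler also performs the syntax checks described above, and label uniqueness is verified in pass two.

```python
INSTR_SIZE = 6   # assumption: 1-byte opcode plus 1-byte and 4-byte operands

def first_pass(lines):
    """Pass one: map each label to the byte address of the instruction
    that follows it, for use by pass two."""
    labels, address = {}, 0
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":"):          # assumption: a label is 'NAME:'
            labels[text[:-1]] = address
        else:
            address += INSTR_SIZE       # every instruction is a fixed size
    return labels
```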

Because the Two-pass Assembler and the Virtual Machine are constructed as a single project, byte-code is not written to a file; rather, it is loaded directly into the memory of the VM at the time it is produced during pass two of the assembler, as seen in Figure 1. This eliminates the need for students to write a separate Loader program which would have to read the byte-code from the file and load it into the memory of the VM.
Once the source code for an assembly program has been converted into byte-code and loaded into the VM’s memory, the VM begins by executing the first line of the assembly program. A VM is composed of two major components: memory and a byte-code execution engine. The byte-code execution engine is implemented as a loop with a big switch statement. During each pass through the loop, the VM picks the byte-code instruction pointed to by the Program Counter (PC) to execute. A separate case clause is implemented for each byte-code instruction within the switch statement. Details for each instruction, such as number of arguments and addressing modes used, are implemented in the case clause for each instruction. Memory of the VM is implemented as a large byte array which is used to store both byte-code and data. Memory of the VM can be logically segregated into three sections.
The first section which is stored in low memory (addresses 0 … N) contains byte-code, global variables and constants. The second section, which starts at the max address of memory and works backward (address max … L), contains the run-time stack where activation records for function calls are stored. Activation records and the run-time stack where they are stored are what make it possible to execute programs with recursive function calls. The third section of memory located between the first and second sections (address N+1 … L-1) is the heap. Programs that dynamically allocate space at run-time via commands such as ‘new’ or ‘malloc’ are allocating space on the heap.
Once students have completed writing the Two-pass Assembler, the Virtual Machine and various assembly programs, they implement a Two-pass Compiler for an Object-based language created just for the capstone class known as kxi. Kxi source code is a subset of both Java and C++. As kxi is Object-based, classes can be defined and instances of a class can be created at run-time, but there is no inheritance supported. The kxi Compiler is implemented using the standard phases of lexical analysis, syntax analysis, semantic analysis, intermediate code generation and target code generation [1]. No optimizations of intermediate or target code are implemented as part of the capstone.
Lexical analysis reads tokens from the source file and passes them to syntax analysis. Syntax analysis is implemented using a recursive descent parser which loads a symbol table that contains definitions for all the classes, functions and variables defined in a kxi source program, as well as any temporary variables that may be created during the compilation process. Semantic analysis, which runs during pass two of the compiler, relies upon a combination of the recursive descent parser, the symbol table, a semantic action stack and the operator stack as defined in the shunting-yard algorithm. This combination allows the shunting-yard algorithm, which converts infix expressions to post-fix expressions, to be applied at the same time that semantic checks are being performed. Using this combination of structures and algorithms makes it possible to implement a compiler without creating and traversing a syntax tree. This has proven to be significantly easier for students to implement in a single semester. The details of this process are beyond the scope of this paper; however, a tutorial has been presented [2] outlining this process.
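The infix-to-postfix half of this combination is the classic shunting-yard algorithm. A minimal sketch, token-level only and without the semantic-action stack, is shown below; the operator set and precedence table are pared down for illustration.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_postfix(tokens):
    """Convert a tokenized infix expression to postfix using an
    operator stack (Dijkstra's shunting-yard algorithm)."""
    output, ops = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators of greater or equal precedence first.
            while ops and ops[-1] != "(" and PRECEDENCE[ops[-1]] >= PRECEDENCE[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                output.append(ops.pop())
            ops.pop()                  # discard the '('
        else:                          # operand
            output.append(tok)
    while ops:                         # flush remaining operators
        output.append(ops.pop())
    return output
```

In the compiler, each operator pop would additionally fire a semantic action, which is how type checking proceeds without a syntax tree.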
Python was used by the primary author to implement all of these capstone projects over two semesters. After completing the capstone courses, the author wanted to complete a research class where he would add a Mark-and-Sweep garbage collector to the Virtual Machine to reclaim objects allocated on the heap. Currently there is no way to reclaim space once allocated in kxi. During the initial phase of this research project, the VM was converted from Python to C++. Additionally, the kxi Compiler was converted to produce byte-code directly to a file and a Loader was implemented to read the numeric byte-code from the file so it could be executed by the new VM written in C++. The following sections outline lessons learned and observations made by the author during this process.


For students completing the capstone, one of the first issues they must address is how to implement memory in the Virtual Machine and create the code which can read/write to any address as either a byte or integer (multiple bytes). Once a student truly understands how to deal with the memory of the VM, most of the remaining code for the VM is rather straightforward.
The Python implementation of the virtual machine memory used a bytearray object. The bytearray could either be declared with an initial size in bytes, or bytes could be appended individually. Appending to the bytearray is advantageous during compilation, since the size of memory needed to store all the byte-code and data cannot be determined until after the assembler has run.
Inserting byte-code into main memory (the bytearray object) involves casting an 8-bit binary sequence to an integer before making the assignment. Python did not require that the binary sequence be cast as a char, since Python integers are an arbitrary length. To make a 32-bit integer assignment, the bits are split into 8-bit segments, cast to integers, and assigned to the bytearray in sequence. This provides control over the endianness of the integers. Endianness refers to the order in which the bytes are stored in memory. If integers were stored in an integer array, then the endianness of our integers would be dependent on the architecture of the host machine.

data = "{:032b}".format(x & 0xffffffff)
memory[i] = int(data[:8], 2)
memory[i+1] = int(data[8:16], 2)
memory[i+2] = int(data[16:24], 2)
memory[i+3] = int(data[24:], 2)
Example 1: Python memory assignment

Retrieving a 32-bit integer from the bytearray required careful observance of the sign bit in the left-most byte (using big-endian notation). A 32-bit array could be constructed from a 4-byte subsection of main memory using Python list comprehension and string formatting. After checking the sign bit, the bit array can be cast to an integer, and then converted from 2's complement if the sign bit indicated a negative value. Note that inserting the 32-bit integer into main memory did not require any bit manipulation because Python already uses 2's complement notation, but when retrieving the bit array, special care had to be taken because Python integers are arbitrary in length. Technically, a 2's complement integer in Python requires an infinite sequence of sign-extension bits.

binary = ''.join(["{:08b}".format(byte) for byte in memory[y:y+4]])
if binary[0] == '1':  # Convert from 2's complement
    binary = ''.join([str(int(not int(val))) for val in binary])
    return -(int(binary, 2)) - 1
return int(binary, 2)
Example 2: Python memory retrieval

In C++, the Virtual Machine memory was implemented using an unsigned char array. The size of the array is declared statically. Inserting into memory requires bit shifting a 32-bit integer 24, 16, 8 or 0 bits, respectively, and making the assignment in the array (essentially casting each set of shifted bits to an unsigned char).

int shift = 24;
for (int i = 0; i < 4; i++) {
    memory[index + i] = (value >> shift);
    shift -= 8;
}
Example 3: C++ memory assignment

Similarly, a 32-bit integer is retrieved by bit shifting each unsigned char and adding the resulting integers together. There is no need to check the sign bit as long as the result is cast to a 32-bit integer.

return (memory[index] << 24) + (memory[index+1] << 16) +
(memory[index+2] << 8) + memory[index+3];
Example 4: C++ memory retrieval

Accessing individual bytes in memory was similar in both Python and C++, as it only required an assignment statement.


Once single-byte and multiple-byte values can be read from and written to any address in memory, the byte-code execution engine must be implemented. Python does not support a switch statement, so a sequence of cascading if statements was written. During each iteration through the Virtual Machine loop, the execution engine loads the opcode, a single-byte instruction, pointed to by the Program Counter, then looks for the correct opcode instruction to execute. Each instruction includes two operands. The first operand is always a register and is stored in a single byte, which limits the maximum number of registers to 255. The second operand is 4 bytes in length and can be either a register or an address. The lone exception to this rule is the jump command, which places its destination address in the second operand. Retrieving the opcode and operands is done in the same manner as the integer and byte retrieval described in the previous section. Once the opcode and its associated operands are retrieved, the byte-code instruction can be executed.
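A condensed model of this fetch-decode-execute loop is sketched below, assuming the operand layout just described (1-byte opcode, 1-byte register operand, 4-byte second operand) and the big-endian byte order discussed earlier. The opcode numbers and instruction set are made up for the example; the real VM implements many more instructions.

```python
ADD, JMP, TRAP = 1, 2, 3   # hypothetical opcode numbers

def run(memory, registers, pc):
    """Fetch, decode and execute instructions until a TRAP halts the loop."""
    while True:
        opcode = memory[pc]
        op1 = memory[pc + 1]                                # register (1 byte)
        op2 = int.from_bytes(memory[pc + 2:pc + 6], "big")  # register/address
        pc += 6
        if opcode == ADD:
            registers[op1] += registers[op2]
        elif opcode == JMP:
            pc = op2                    # destination lives in operand two
        elif opcode == TRAP:
            return registers            # halt, for the purpose of this sketch
```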
Rather than encapsulate the machine state, the components of the Python Virtual Machine are all global. The functions for setting and retrieving values from memory are also global. This convention was chosen out of convenience, rather than for design purposes.

registers[command_operand1()] = registers[command_operand1()] * registers[command_operand2()]
Example 5: Python MUL instruction execution

In the C++ implementation of the VM loop, the machine state is encapsulated in a singleton class. Machine components and methods for setting and retrieving memory values are referenced in the class. Converting the instruction loop from Python simply involved inserting object references and making syntax adjustments, such as enclosing blocks in braces. For consistency, method and variable names were not changed.

machine.registers[machine.command_operand1()] = machine.registers[machine.command_operand1()] * machine.registers[machine.command_operand2()];
Example 6: C++ MUL instruction execution

The assembly trap instruction required some special adjustments for reading and writing data to and from standard I/O. In Python, the sys module is used to read from stdin and write to stdout. The method sys.stdin.read(1) is used to read a single character from the input buffer. Similarly, the C module stdio.h contains a buffered input method getchar(), which (when cast to an unsigned char) reads a single character from stdin.


The original capstone project did not require a loader, so the Python implementation, an Assembler/VM combination program, began execution by reading assembly instructions from a file and then writing the byte-code directly to the VM's memory for execution. In the C++ implementation, the compilation from assembly to byte-code is extracted from the VM and added to the kxi Compiler. The Compiler outputs byte-code directly to a file in the final stage of compilation. Because the data segment of the byte-code (constants and global variables) comes before the first instruction in the file, a single 4-byte integer is prepended to the file indicating the starting address of the first byte-code instruction.
Loading requires opening the byte-code file for reading, setting the program counter based on the first 4 bytes of the file, then reading the remaining data byte for byte into main memory (the unsigned char array). A counter is kept during the file read to count the number of bytes in the file, which determines the total size of the data segment plus the byte-code segment. This counter becomes the stack limit, which is the start of the heap. The stack base is set to the length of the main memory array. The runtime stack extends from the stack base up to the stack limit. The stack limit will vary at runtime based on allocation on the heap.
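A sketch of that load sequence follows, assuming the 4-byte header is big-endian like the VM's other integers and picking an arbitrary memory size; both are assumptions for illustration.

```python
MEMORY_SIZE = 1 << 20   # assumption: 1 MiB of VM memory

def load(byte_code):
    """Load a byte-code file image: the first 4 bytes give the address of
    the first instruction; the rest is copied byte for byte into memory."""
    memory = bytearray(MEMORY_SIZE)
    pc = int.from_bytes(byte_code[:4], "big")   # start of the code segment
    count = 0
    for b in byte_code[4:]:
        memory[count] = b
        count += 1
    stack_limit = count          # data + byte-code size = start of the heap
    stack_base = len(memory)     # the run-time stack grows down from the top
    return memory, pc, stack_limit, stack_base
```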


The VM can be executed in a debug mode, which outputs each instruction as it is run. It is useful to have the contents of registers output to a file, and some sort of break-point instruction is also necessary. These features make debugging less tedious.
Because the Python VM begins its execution by compiling from assembly to byte-code, it has access to the symbol table at run time. This proved advantageous during debugging because branches and methods could be labeled within a back trace. This also provided access to line numbers within the source assembly file, so that error messages could indicate the exact location of a failure.
The C++ implementation of the VM executes directly from byte-code, which means extra symbol loading would need to be added to the byte-code loader in order to have labels and line numbers in error messages. Line numbers can be roughly estimated, based on the distance of the current instruction from the START label, but this is useless when executing instructions outside the main method (methods are on negative line numbers if they are above the START label). This could be remedied by including debug symbols, during the load phase, which indicate where the code segment begins. Unfortunately, line numbers could only be calculated based on the distance between instructions, which ignores white space and comments. Removing the white space and comment lines from the assembly would also complicate debugging.


The amount of code needed to implement memory management for the VM was similar between Python and C++; however, the Python code was considerably more complex. Whereas in C++ you simply shift and insert bits in their proper order, Python required complicated array manipulations and careful observance of the rules of 2's complement negation. Python's representation of bit arrays as characters may have been convenient, but it certainly affected execution time.
Syntactically, converting the VM’s execution engine from Python code to C++ was relatively simple. The greatest difference between the two languages was Python's indentation blocks. Adding brackets around each block and semi-colons after each statement was simple enough. Creating the singleton object which held the machine state took very little time, and would not have required many syntax changes if it had been implemented originally in the Python code.
The most significant difference between the Python and C++ implementations of the Virtual Machine was performance. After accounting for the extra compile/load time in the Python Virtual Machine (time was measured directly from the start of the execution engine), the Python implementation was much slower. The performance was tested using recursive Fibonacci calls. Both virtual machines were executed using the same byte-code, with the same number of available registers and the same stack space. To calculate the 27th number in the Fibonacci sequence (196,418), the C++ implementation took 1.8 seconds to return the answer; the Python implementation took nearly 13 minutes. The execution time increases exponentially in both implementations, but becomes prohibitive much sooner in Python. It should be pointed out again that the byte-code execution engines are nearly identical between the Python and C++ implementations. The main bottleneck appears to be in the setting and retrieving of integers in Python. Another concern is that Python already executes against its own virtual machine, so kxi adds an additional layer of abstraction between the source and machine code.
Interestingly, each successive Fibonacci number takes approximately 1.62 times longer to compute than the previous one, in both Python and C++. Based on that ratio, it would take the C++ Virtual Machine implementation about 1.4 days to calculate the 50th Fibonacci number; the Python implementation would take about 1.6 years.
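That extrapolation is easy to check with the reported numbers, taking "nearly 13 minutes" as roughly 780 seconds:

```python
# Project the measured fib(27) times forward to fib(50) using the
# observed 1.62x per-number growth factor.
GROWTH = 1.62
steps = 50 - 27

cpp_seconds = 1.8 * GROWTH ** steps        # C++ took 1.8 s for fib(27)
python_seconds = 13 * 60 * GROWTH ** steps # Python took ~13 min for fib(27)

cpp_days = cpp_seconds / 86400                   # roughly 1.4 days
python_years = python_seconds / (86400 * 365)    # roughly 1.6 years
```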
Although these results are interesting, the disadvantage of using Fibonacci is that the algorithm is inherently exponential, so no concrete conclusions about performance can be drawn from the results. A simple time-wasting method would better measure the performance differences between the implementations.

public void traverse(int n) {
    if (n == 0) return;
    traverse(n - 1);
}
Example 7: Algorithm for testing performance

This method pushes n activation records onto the run-time stack, then returns. Over a large data set, the results tell a similar story. The C++ implementation returns so quickly that the algorithm causes a stack overflow after just 0.56 seconds (187 thousand activation records). The Python implementation exceeds 1 second after a little over 1 thousand activation records are pushed onto the stack.
This project demonstrates an obvious need to pick the right tool for the job. In this case, Python was not the right tool for implementing a Virtual Machine; any language that executes against its own virtual machine is probably a poor fit for this project. Python, in particular, did not allow enough low-level manipulation to properly manage memory access. The low-level access that compiled languages such as C and C++ provide for memory management makes them much better suited to this work.
In contrast, although it is not discussed in this paper, Python proved to be a very good choice for implementing a Compiler. Python’s flexibility with string manipulation and easy data passing proved very appropriate for interpreting and compiling the kxi language to byte-code.


While the initial goal of this project was to dramatically increase the performance of the Virtual Machine, the ultimate goal is to implement a Mark-and-Sweep Garbage Collection algorithm. To complete the Garbage Collector, activation records and object instances need to be modified to store all references (pointers) in a known location. The Garbage Collection algorithm will have to traverse the run-time stack searching for each reference to an object instance on the heap so that the memory can be marked as currently in use. In the sweep phase of the algorithm, all the memory allocated to the heap will be traversed to search for unused blocks of memory that can be added back to a list of available memory.


[1] Aho, A., Lam, M., Sethi, R., Ullman, J., Compilers: Principles, Techniques, & Tools, 2nd ed., Pearson Addison-Wesley, 2007.
[2] Welborn, C., Making Your Undergraduate Compiler Course Interesting, Relevant and Passable, tutorial, CCSC Rocky Mountain Region, Ft. Collins, CO, 2010.

© CCSC, (2013). This is the author's version of the work. It is posted here by permission of CCSC for your personal use. Not for redistribution. The definitive version was published in The Journal of Computing Sciences in Colleges, {volume 29, number 2, December 2013}, http://dl.acm.org/.