Chris Harrison

PPC Compiler for the 'A' Language

Lexical Analyzer

The lexical analyzer processes the code (syntax) into its tokens shown below:

  • Keywords: const, case, catch, continue, do, else, elsif, end, then, extends, for, if, private, protected, return, struct, switch, try, while, package, function, procedure, exit, when, type, is, begin, “..”, integer, boolean, float
  • Symbols: . , : ; ( ) [ ] { }
  • Operations: + - * / = != < <= > >= ! & && | || := += **
  • Strings were parsed by looking for character sequences surrounded by quotes. The entire sequence was stored in a single string token.
  • Integers were identified and stored in numeric form as integer tokens.
  • Comments were recognized as character sequences delimited by /* and */.
  • Function names, function calls, variable declarations, and variable uses were stored as identifier tokens.
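
To give a flavor of how the analyzer might represent and classify these tokens, here is a minimal C++ sketch (not the actual source; the TokenType names, the partial keyword list, and the classify helper are illustrative only):

#include <cstdio>
#include <cstring>

// Illustrative token categories; the real analyzer may use different names.
enum TokenType { KEYWORD, ID, SYMBOL, OPERATION, INTEGER, STRING };

struct Token {
    TokenType type;
    char text[64];   // the lexeme as it appeared in the source
};

// Hypothetical helper: check a lexeme against (part of) the keyword list.
static bool isKeyword(const char* s) {
    static const char* keywords[] = { "const", "if", "then", "else", "end",
                                      "function", "return", "begin", "is", 0 };
    for (int i = 0; keywords[i]; i++)
        if (strcmp(s, keywords[i]) == 0) return true;
    return false;
}

// Classify a lexeme that has already been scanned out of the input.
static Token classify(const char* lexeme) {
    Token t;
    strncpy(t.text, lexeme, sizeof(t.text) - 1);
    t.text[sizeof(t.text) - 1] = '\0';
    if (isKeyword(lexeme))                         t.type = KEYWORD;
    else if (lexeme[0] >= '0' && lexeme[0] <= '9') t.type = INTEGER;
    else if (strchr("+-*/=<>!&|", lexeme[0]))      t.type = OPERATION;
    else if (strchr(".,:;()[]{}", lexeme[0]))      t.type = SYMBOL;
    else                                           t.type = ID;
    return t;
}

int main() {
    const char* sample[] = { "function", "add", "(", "x", "+", "42" };
    const char* names[]  = { "KEYWORD", "ID", "SYMBOL", "OPERATION", "INTEGER", "STRING" };
    for (int i = 0; i < 6; i++) {
        Token t = classify(sample[i]);
        printf("%s <%s> ", t.text, names[t.type]);
    }
    printf("\n");
    return 0;
}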

Given a file with...

function add (x : integer; y : integer) return integer is
begin
return x + y;
end;

...the lexical output would be:

function <KEYWORD> add <ID> ( <SYMBOL> x <ID> : <SYMBOL> integer <ID> ; <SYMBOL> y <ID> : <SYMBOL> integer <ID> ) <SYMBOL> return <KEYWORD> integer <ID> is <KEYWORD> begin <KEYWORD> return <KEYWORD> x <ID> + <OPERATION> y <ID> ; <SYMBOL> end <KEYWORD> ; <SYMBOL>


Parser and Compiler

Given a program converted into lexical tokens, the parser and compiler produce executable PPC assembly. There are several stages, which are briefly outlined below:


Building an Abstract Syntax Tree (AST)

The compiler uses a recursive descent technique to parse the program. The grammar for the language is defined by a series of productions, some of which are shown below. The full grammar can be seen here.

assignment_statement ::= name := expression ";"

if_statement ::= IF condition THEN sequence_of_statements
[ELSE sequence_of_statements]
END;

function_declaration ::= FUNCTION defining_identifier
( parameter_list ) RETURN type_name IS
declarations
BEGIN sequence_of_statements END ";"

parameter_list ::= parameter_specification {; parameter_specification}

relational_operator ::= = | /= | < | > | <= | >=

relation ::= simple_expression [relational_operator simple_expression]

simple_expression ::= term {binary_adding_operator term}
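
As a rough sketch of the recursive descent approach (hypothetical names and a deliberately tiny token stream, not the actual source), each production maps onto a parsing function that consumes tokens and builds a subtree; here the term and factor productions are collapsed into bare integer literals:

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical token stream; the real compiler reads tokens from the .lex file.
static const char* tokens[] = { "4", "+", "2", "-", "6", 0 };
static int pos = 0;

struct Node { const char* production; const char* lexeme; Node* left; Node* right; };

static Node* makeNode(const char* prod, const char* lex, Node* l, Node* r) {
    Node* n = (Node*)malloc(sizeof(Node));
    n->production = prod; n->lexeme = lex; n->left = l; n->right = r;
    return n;
}

static bool peek(const char* s) { return tokens[pos] && strcmp(tokens[pos], s) == 0; }

// factor ::= integer_literal   (only literals in this sketch)
static Node* parseFactor() {
    return makeNode("factor", tokens[pos++], 0, 0);
}

// simple_expression ::= term {binary_adding_operator term}
static Node* parseSimpleExpression() {
    Node* left = parseFactor();                 // term reduced to a bare factor here
    while (peek("+") || peek("-")) {
        const char* op = tokens[pos++];         // consume the adding operator
        Node* right = parseFactor();
        left = makeNode("simple_expression", op, left, right);
    }
    return left;
}

int main() {
    Node* root = parseSimpleExpression();       // parses "4 + 2 - 6"
    printf("root: %s (%s)\n", root->production, root->lexeme);
    return 0;
}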

Using the grammar’s rules, a tree representing the structure of the program is constructed. Below is the AST output for the add function:

0> program
1> function_declaration: add
2> defining_identifier: add
2> parameter_list
3> parameter_specification: x
4> defining_identifier: x
4> type_name: integer
3> parameter_specification: y
4> defining_identifier: y
4> type_name: integer
2> type_name: integer
2> declarations
2> sequence_of_statements
3> statement
4> simple_statement
5> return_statement of type integer
6> expression of type integer
7> relation of type integer
8> simple_expression: + of type integer
9> term of type integer
10> factor of type integer
11> primary: id of type integer
12> name: x of type integer
9> simple_expression of type integer
10> term of type integer
11> factor of type integer
12> primary: id of type integer
13> name: y of type integer


Names Table

A names table has to be built to keep track of the variables and functions defined in the program. It also has to keep track of nested scopes, so that when several variables exist with the same name, the one declared in the closest enclosing scope (the smallest distance from the current scope) is used (i.e., static scoping). If a variable exists only in a scope that is not accessible from the current scope, compilation halts and the error is reported (basically, private variables are kept private). Another complexity is that the language allows for the creation of new types, for instance:

type OneDList is array of integer;
type TwoDList is array of OneDList;
type ThreeDList is array of TwoDList;

The names table resolves these user-defined types to the language’s primitive types (integer, float, boolean):

OneDList of kind integer[ ]
TwoDList of kind integer[ ][ ]
ThreeDList of kind integer[ ][ ][ ]
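
A minimal sketch of how this resolution might be implemented (hypothetical table layout and names, not the actual source): each user-defined type records the named type of its elements, and resolution walks the chain, counting array dimensions until it reaches a primitive.

#include <cstdio>
#include <cstring>

// Hypothetical type-table entry: a user-defined type is either a primitive
// or an array of some other named type.
struct TypeEntry {
    const char* name;        // e.g. "TwoDList"
    const char* elementOf;   // named element type, or 0 for a primitive
};

static TypeEntry table[] = {
    { "integer",    0          },
    { "OneDList",   "integer"  },
    { "TwoDList",   "OneDList" },
    { "ThreeDList", "TwoDList" },
};

static TypeEntry* lookup(const char* name) {
    for (int i = 0; i < (int)(sizeof(table) / sizeof(table[0])); i++)
        if (strcmp(table[i].name, name) == 0) return &table[i];
    return 0;
}

// Resolve a user-defined type to its base primitive plus array depth.
static void resolve(const char* name) {
    int depth = 0;
    TypeEntry* t = lookup(name);
    while (t && t->elementOf) { depth++; t = lookup(t->elementOf); }
    printf("%s of kind %s", name, t ? t->name : "unknown");
    for (int i = 0; i < depth; i++) printf("[ ]");
    printf("\n");
}

int main() {
    resolve("OneDList");     // integer[ ]
    resolve("TwoDList");     // integer[ ][ ]
    resolve("ThreeDList");   // integer[ ][ ][ ]
    return 0;
}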

The names table for the add function appears below:

- Function “add”
VARIABLES:
- Parameter #0: 'x' of kind 'parameter_specification' returns integer
- Parameter #1: 'y' of kind 'parameter_specification' returns integer
SUBSCOPES:
[none]


Type Checking

It is also important to check that variables and literal values are used with compatible types. For instance, you can’t add a float and a boolean (illegal: “4.2 + false”), or have an if statement whose condition does not produce a boolean value (illegal: “if 5+2 then”; it must be “if 5+2=7 then”).

Expressions can be quite complex, including nesting of other expressions using parentheses, function calls, and variables.

( 4 + ( 2 * 6 ) - add(1, 5) )

One must traverse down the tree to the bottom-most elements and propagate the types upwards. For example:

( 2 * 6 ) = ( integer * integer ) = integer

This is valid and produces an integer result, so the whole sub-expression can be treated as an integer. Now we have:

( 4 + ( 2 * 6 ) - add(1, 5) ) = ( integer + integer - add(1, 5) )

The return type of the add function can be determined with a simple query to the names table. In this case, it returns an integer value, and so we can resolve the type of the whole expression:

( integer + integer - integer ) = integer

Numeric values can be used in conditional statements once a relational operator switches the type to boolean.

( 4 + ( 2 * 6 ) - add(1, 5) ) > 6
integer > integer = boolean

There are many rules on how types can be used together. For instance, “and” and “or” statements can only be used with booleans.
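
As a rough illustration (hypothetical node structure and names, not the actual source), the checking can be expressed as a bottom-up walk over the expression tree: each node derives its type from its children, relational operators switch the result to boolean, and any mismatch halts compilation.

#include <cstdio>
#include <cstdlib>
#include <cstring>

enum ExprType { T_INTEGER, T_FLOAT, T_BOOLEAN };

// Hypothetical expression node: a literal, or an operator applied to two children.
struct Expr {
    const char* op;    // "+", "*", ">", ... or 0 for a literal
    ExprType    type;  // for literals, filled in by the lexer/parser
    Expr*       left;
    Expr*       right;
};

static const char* typeName(ExprType t) {
    return t == T_INTEGER ? "integer" : t == T_FLOAT ? "float" : "boolean";
}

// Walk down to the bottom-most elements and propagate types upwards.
static ExprType checkType(Expr* e) {
    if (!e->op) return e->type;                         // literal: type already known
    ExprType l = checkType(e->left);
    ExprType r = checkType(e->right);
    if (l != r) {
        printf("Error! Cannot apply '%s' to %s and %s\n", e->op, typeName(l), typeName(r));
        exit(1);                                        // halt compilation
    }
    if (strchr("<>=", e->op[0])) return T_BOOLEAN;      // relational: result is boolean
    return l;                                           // arithmetic keeps the operand type
}

int main() {
    Expr four = { 0, T_INTEGER, 0, 0 };
    Expr six  = { 0, T_INTEGER, 0, 0 };
    Expr sum  = { "+", T_INTEGER, &four, &six };        // 4 + 6
    Expr cond = { ">", T_INTEGER, &sum,  &six };        // (4 + 6) > 6
    printf("condition has type %s\n", typeName(checkType(&cond)));
    return 0;
}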


Determining Local Variable Stack Offsets

Local variables are allocated space on the stack frame of the current function. On the PPC architecture, all primitive types are 4 bytes in size. This simplifies calculation of offsets. The names table keeps track of all local variables, and knows the total count and order of the variables for any given function. Using this information, the local variables are given static stack offsets. This is used during code generation.

One additional point of interest is local arrays. These are handled by allocating a large block on the stack. Other variables’ offsets have to take this into account.
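
A minimal sketch of the offset assignment (hypothetical layout and starting offset, not the actual source): locals are visited in declaration order, scalars advance the running offset by 4 bytes, and arrays advance it by 4 bytes per element, pushing later variables further down the frame.

#include <cstdio>

// Hypothetical local-variable record: arrays carry an element count,
// scalars use a count of 1.  All primitives are 4 bytes on PPC.
struct Local {
    const char* name;
    int         elements;   // 1 for a scalar, N for an N-element array
    int         offset;     // assigned below
};

int main() {
    Local locals[] = { { "i", 1, 0 }, { "buf", 10, 0 }, { "sum", 1, 0 } };
    const int count = sizeof(locals) / sizeof(locals[0]);
    int offset = 88;                       // illustrative: first slot past the saved registers
    for (int i = 0; i < count; i++) {
        locals[i].offset = offset;
        offset += 4 * locals[i].elements;  // a local array pushes later variables further down
        printf("%-4s at %d(r1), %d bytes\n",
               locals[i].name, locals[i].offset, 4 * locals[i].elements);
    }
    return 0;
}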


Code Generation

Good material covering PPC assembly was tough to find. I designed my own stack convention for function calls (allowing for recursion). Code is generated in one final pass over the fully checked and decorated AST; this pass mimics the recursive descent of the parsing stage. Below is the assembly produced for the add function. The compiler generates all the comments and formatting.

//------------------------- add -------------------------
.data ; data section
.text ; code section
.globl _add
_add:
//FUNCTION INIT
mflr r0 //save caller address into r0
stw r1, -256(r1) //save old sp on top of stack
subi r1, r1, 256 //create frame, move sp to 'top' of stack (256 bytes)
stmw r14, 16(r1) //save registers r14-r31 onto stack (72 bytes)
stw r0, 8(r1) //save return address onto stack
//FUNCTION BODY
//RETURN STATEMENT
mr r14, r3 //moving parameter into temporary register
mr r15, r4 //moving parameter into temporary register
add r14, r14, r15 //add two registers
mr r3, r14 //move temporary value into return register (r3)
ba L_END0: //jump to function return
//FUNCTION END
L_END0: //end label
lwz r0, 8(r1) //get the caller's link address
mtlr r0 //move address into link register (r0)
lmw r14, 16(r1) //restore registers r14-r31
lwz r1,0(r1) //set the original stack pointer
blr //branch link register
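
The generation pass itself is just another recursive walk that emits a few instructions per node. A minimal sketch of how the body of add might be emitted (hypothetical helper names and register counter, not the actual source):

#include <cstdio>

// Hypothetical register allocator: temporaries are handed out in order
// starting at r14, mirroring the r14/r15 usage in the listing above.
static int nextTemp = 14;
static int allocTemp() { return nextTemp++; }

// Emit code for "x + y" where the parameters arrived in paramRegX and paramRegY.
static int emitAdd(FILE* out, int paramRegX, int paramRegY) {
    int tx = allocTemp();
    int ty = allocTemp();
    fprintf(out, "mr r%d, r%d \t//moving parameter into temporary register\n", tx, paramRegX);
    fprintf(out, "mr r%d, r%d \t//moving parameter into temporary register\n", ty, paramRegY);
    fprintf(out, "add r%d, r%d, r%d \t//add two registers\n", tx, tx, ty);
    return tx;                            // register now holding the result
}

int main() {
    int result = emitAdd(stdout, 3, 4);   // the first two parameters arrive in r3 and r4
    fprintf(stdout, "mr r3, r%d \t//move temporary value into return register (r3)\n", result);
    return 0;
}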


Major Features

  • Parser shows errors in context:
Error! Looking for ')' but found ',' at token 313
( base : integer , exponent : integer ) is
Error! Encountered malformed or unknown type 'floating' at token 6
add ( x : floating ; y : integer )
Error! Found unknown loop type 'foor' at token 111
begin result := 1 ; foor j in 1 ..
Error! Looking for 'return' but found 'is' at token 318
exponent : integer ) is temp : integer ; begin
  • The parser can resolve chains of user-defined types of unlimited length. Types are resolved to their base type immediately. Although the original type is stored, returnType() determines the base type, which is either a primitive or an array of primitives. An example is shown below:
type boolArray is array of boolean;
type OneDList is array of integer;
type TwoDList is array of OneDList;
boolArray of kind boolean
OneDList of kind integer[]
TwoDList of kind integer[][]
  • The parser supports scopes of unlimited depth. Variables with the same name in higher scopes are hidden, and only references to the “newer” variables are used. This feature was disabled for code generation because of the complexity of referencing previous local variables on the stack.
  • Assembly produced by the code generator allows for function calls and recursion.
  • Single-dimensional arrays are supported. They have a limited length because they are stored on the stack and stack frames are of fixed size.
  • Pointers are supported. Any integer value can be used as a memory location, and memory can be accessed like an array: ptr[0] is equivalent to *ptr in C, and ptr[4] accesses memory location ptr+4. Since there is no memory management in our language, one must allocate (malloc) some memory in the C driver program and pass a pointer to the function or procedure (see the driver sketch after this list).
  • Type checking is done as the parse tree is built. If the types do not match, compilation halts and the error is noted.
  • Compiler generates working PPC assembly.
  • Compiler generated assembly is commented and neatly formatted.
  • AST is neatly formatted (with indentation) for easier reading.
  • Names table is neatly formatted (with indentation) for easier reading. All local variables and sub-scopes are printed (to unlimited depth).
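
As an example of the pointer feature mentioned above, a driver along the following lines could allocate memory and pass it to a compiled routine. This is only a sketch: the function name sumBuffer and its signature are hypothetical and not taken from the sample programs.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical 'A' routine compiled into asm.s; the name and signature
   are for illustration only. */
extern int sumBuffer(int* buf, int count);

int main(void) {
    int* buf = (int*)malloc(4 * sizeof(int));   /* the C driver owns all memory */
    for (int i = 0; i < 4; i++) buf[i] = i + 1;
    printf("sum = %d\n", sumBuffer(buf, 4));    /* the 'A' code indexes buf like an array */
    free(buf);
    return 0;
}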


Other Notes

When I first started the project, I had an urge to do it entirely in C/C++ and not rely on any external programs or libraries. In fact, I haven’t used anything from the standard library besides string.h (for strcmp) and fstream.h (for file IO). It would be fairly trivial to remove the very light object-oriented code that I used. I’d like to think I could go back in time, armed with Pascal, Fortran, or a similar language, and write a compiler.

Strings and floats are disabled in the lexical analyzer.

The code generator does not support elsif statements. One can use multiple if statements or nested if/else statements to achieve the same effect.

Exit statements are also not supported.


Source Files

  • Sample C Driver Program - main.c
  • 'A' Language Sample Functions (Actual Input for Lexical Analyzer) - program.a


Output and Generated Files

  • Output from C Driver Program and Sample Functions (Actual Output) - MainOutput.txt
  • Lexical Tokens of Sample Functions (Actual Output from Lexical Analyzer) - program.lex
  • AST of the Sample Functions (Actual Output from Parser) - ParseTree.txt
  • Names Table for the Sample Functions (Actual Output from Parser) - NamesTable.txt
  • Assembly Generated from Sample Functions (Actual Output from Compiler) - asm.s


Compiler Code

The lexical analyzer writes to STDOUT. To save this output, one must either copy and paste it or redirect the output to a file. There must not be any trailing white space at the end of the file; the final character in the file should be a “>”, which terminates the final token.
A string in main.cpp specifies the input file. The default value is "program.lex". The AST and Names Table are written to STDOUT. The assembly is written to a file called "asm.s".
Once you have generated the assembly file and written a driver function in C, you can use gcc to compile and link them together like so: "gcc main.c asm.s".
© Chris Harrison