PPC Compiler for the 'A' LanguageLexical AnalyzerThe lexical analyzer processes the code (syntax) into its tokens shown below:
Given a file with... function add (x : integer; y : integer) return integer is
...the lexical output would be: function <KEYWORD> add <ID> ( <SYMBOL> x <ID> : <SYMBOL> integer <ID> ; <SYMBOL> y <ID> : <SYMBOL> integer <ID> ) <SYMBOL> return <KEYWORD> integer <ID> is <KEYWORD> begin <KEYWORD> return <KEYWORD> x <ID> + <OPERATION> y <ID> ; <SYMBOL> end <KEYWORD> ; <SYMBOL>
Parser and CompilerGiven a program converted into lexical tokens, the parser and compiler produces executable PPC assembly. There are several stages, which are briefly outlined below:
Building an Abstract Syntax Tree (AST)The compiler uses a recursive decent technique to parse the program. The grammar for the language is defined by a series of productions, some of which are shown below. The full grammar can be seen here.
Using the grammar’s rules, a tree representing the structure of the program is constructed. Below is the AST output for the add function: 0> program
1> function_declaration: add
2> defining_identifier: add
2> parameter_list
3> parameter_specification: x
4> defining_identifier: x
4> type_name: integer
3> parameter_specification: y
4> defining_identifier: y
4> type_name: integer
2> type_name: integer
2> declarations
2> sequence_of_statements
3> statement
4> simple_statement
5> return_statement of type integer
Names TableA names table has to be built to keep track of the variables and functions defined in the program. It also has to keep track of nested scopes, so that if many variables exist with the same name, the variable with the closest scope (least depth) is used (aka, static scoping). If the variable exists in a scope that is not accessible by the current scope, it halts compilation and reports the error (basically, private variables are kept private). Another complexity is that the language allows for the creation of new types, for instance: type OneDList is array of integer;
type TwoDList is array of OneDList;
type ThreeDList is array of TwoDList;
The names table resolves these user-defined types to the language’s primitive types (integer, float, boolean): OneDList of kind integer[ ]
TwoDList of kind integer[ ][ ]
ThreeDList of kind integer[ ][ ][ ]
The names table for the add function appears below: - Function “add”
VARIABLES:
- Parameter #0: 'x' of kind 'parameter_specification' returns integer
- Parameter #1: 'y' of kind 'parameter_specification' returns integer
SUBSCOPES:
[none]
Type CheckingIt is also important to check that the types of variables and literal values are of the correct type. For instance, you can’t add a float and a boolean (illegal: “4.2 + false”), or have an if statement that evaluates a condition that does not produce a boolean value. (illegal: “if 5+2 then”, must be “if 5+2=7 then”). Expressions can be quite complex, including nesting of other expressions using parenthesis, function calls, and variables. ( 4 + ( 2 * 6 ) - add(1, 5) )
One must traverse down the tree to the bottom most elements, and propagate upwards. For example: ( 2 * 6 ) = ( integer * integer ) = integer
This is valid, and produces an integer result, so then we can assume that whole expression is just an integer. Now we have: ( 4 + ( 2 * 6 ) - add(1, 5) ) = ( integer + integer – add(1, 5) )
The return type of the add function can be determined with a simple query to the names table. In this case, it returns an integer value, and so we can resolve the type of the whole expression: ( integer + integer – integer ) = integer
Numeric values can be used in conditional statements, switching the type to boolean. ( 4 + ( 2 * 6 ) - add(1, 5) ) > 6
integer > integer = boolean
There are many rules on how types can be used together. For instance, “and” and “or” statements can only be used with booleans.
Determining Local Variable Stack OffsetsLocal variables are allocated space on the stack frame of the current function. On the PPC architecture, all primitive types are 4 bytes in size. This simplifies calculation of offsets. The names table keeps track of all local variables, and knows the total count and order of the variables for any given function. Using this information, the local variables are given static stack offsets. This is used during code generation. One additional point of interest is local arrays. These are handled by allocating a large block on the stack. Other variable’s offsets have to take this into account.
Code GenerationFinding good material covering PPC assembly was tough to find. I designed my own stack convention for function calls (allowing for recursion). Code is generated in one final pass over the fully checked and decorated AST. It mimics the recursive decent of the parsing stage. Below is the assembly produced from the add function. The compiler generates all the comments and formatting. //------------------------- add -------------------------
.data ; data section
.text ; code section
.globl _add
_add:
//FUNCTION INIT
mflr r0 //save caller address into r0
stw r1, -256(r1) //save old sp on top of stack
subi r1, r1, 256 //create frame, move sp to 'top' of stack (256 bytes)
stmw r14, 16(r1) //save registers r14-r32 onto stack (72 bytes)
stw r0, 8(r1) //save return address onto stack
//FUNCTION BODY
//RETURN STATEMENT
mr r14, r3 //moving parameter into temporary register
mr r15, r4 //moving parameter into temporary register
add r14, r14, r15 //add two registers
mr r3, r14 //move temporary value into return register (r3)
ba L_END0: //jump to function return
//FUNCTION END
L_END0: //end label
lwz r0, 8(r1) //get the caller's link address
mtlr r0 //move address into link register (r0)
lmw r14, 16(r1) //restore registers r14-r32
lwz r1,0(r1) //set the original stack pointer
blr //branch link register
Major Features
Error! Looking for ')' but found ',' at token 313
( base : integer , exponent : integer ) is
Error! Encountered malformed or unknown type 'floating' at token 6
add ( x : floating ; y : integer )
Error! Found unknown loop type 'foor' at token 111
begin result := 1 ; foor j in 1 ..
Error! Looking for 'return' but found 'is' at token 318
exponent : integer ) is temp : integer ; begin
type boolArray is array of boolean;
type OneDList is array of integer;
type TwoDList is array of OneDList;
boolArray of kind boolean
OneDList of kind integer[]
TwoDList of kind integer[][]
Other NotesI had an urge when I was first starting the project to do it entirely in C/C++ and not rely on any external programs or libraries. In fact, I haven’t used any of the STL besides string.h (for strcmp) and fstream.h for file IO. It would be fairly trivial to remove the very light object oriented code that I used. I’d like to think I could go back in time and armed with pascal, fortran or a similar language, and write a compiler. Strings and floats are disabled in the lexical analyzer. The code generator does not support elseif statements. One can have multiple if statements or nested if/else statements to achieve the same effect. Exit statements are also not supported.
Source Files
Output and Generated Files
Compiler Code
The lexical analyzer writes to STDOUT. To save this output, one must either copy and paste, or redirect output to a file. There must not be any trailing white space at the end of the file. The final character in the file should be a “>”, which terminates the final token.
A string in main.cpp specifies the input file. The default value is "program.lex". The AST and Names Table are written to STDOUT. The assembly file is written to a file called "asm.s".
Once you have generated the assembly file and a written a driver function in C, you can use gcc to compile and link them together like so: " gcc main.cpp asm.s".
| |
© Chris Harrison |