|
Lexical
Analyzer
The lexical analyzer processes the code
(syntax) into its tokens shown below:
• Keywords:
const,
case, catch, continue, do, else, elsif, end, then, extends,
for, if, private, protected, return, struct, switch,
try, while, package, function, procedure, exit, when,
type, is, begin, “..”, integer, boolean,
float |
• Symbols:
• Operations:
+
- * / = != < <= > >= ! & &&
| || := += ** |
• Strings were parsed by looking
for character sequences surrounded in quotes. The entire length
was stored in a single string token.
• Integers were identified and
stored in the numeric form as integer tokens
• Comments were recognized by
character sequences surrounded in /* and */.
• Function names, function calls,
variable declarations, and variable uses were stored as identifier
tokens. |
Given a file with...
function
add (x : integer; y : integer) return integer is
begin
return x + y;
end; |
...the lexical output would be:
function
<KEYWORD> add <ID> ( <SYMBOL> x <ID>
: <SYMBOL> integer <ID> ; <SYMBOL> y <ID>
: <SYMBOL> integer <ID> ) <SYMBOL> return
<KEYWORD> integer <ID> is <KEYWORD> begin
<KEYWORD> return <KEYWORD> x <ID> + <OPERATION>
y <ID> ; <SYMBOL> end <KEYWORD> ; <SYMBOL>
|
Parser and Compiler
Given a program converted into lexical tokens,
the parser and compiler produces executable PPC assembly. There
are several stages, which are briefly outlined below:
Building an Abstract Syntax Tree (AST)
The compiler uses a recursive decent technique
to parse the program. The grammar for the language is defined by
a series of productions, some of which are shown below. The full
grammar can be seen here.
assignment_statement
::= name := expression ";"
if_statement
::= IF condition THEN sequence_of_statements
[ELSE sequence_of_statements]
END;
function_declaration
::= FUNCTION defining_identifier
( parameter_list ) RETURN type_name IS
declarations
BEGIN sequence_of_statements END ";"
parameter_list
::= parameter_specification {; parameter_specification}
relational_operator
::= = | /= | < | > | <= | >=
relation
::= simple_expression [relational_operator simple_expression]
simple_expression
::= term {binary_adding_operator term} |
Using the grammar’s rules, a tree
representing the structure of the program is constructed. Below
is the AST output for the add function:
Names Table
A names table has to be built to keep track
of the variables and functions defined in the program. It also has
to keep track of nested scopes, so that if many variables exist
with the same name, the variable with the closest scope (least depth)
is used (aka, static scoping). If the variable exists in a scope
that is not accessible by the current scope, it halts compilation
and reports the error (basically, private variables are kept private).
Another complexity is that the language allows for the creation
of new types, for instance:
type
OneDList is array of integer;
type TwoDList is array of OneDList;
type ThreeDList is array of TwoDList; |
The names table resolves these user-defined types to the language’s
primitive types (integer, float, boolean):
OneDList
of kind integer[ ]
TwoDList of kind integer[ ][ ]
ThreeDList of kind integer[ ][ ][ ] |
The names table for the add function appears below:
-
Function “add”
VARIABLES:
- Parameter #0: 'x' of kind 'parameter_specification' returns
integer
- Parameter #1: 'y' of kind 'parameter_specification' returns
integer
SUBSCOPES:
[none] |
Type Checking

It is also important to check that the types
of variables and literal values are of the correct type. For instance,
you can’t add a float and a boolean (illegal: “4.2 +
false”), or have an if statement that evaluates a condition
that does not produce a boolean value. (illegal: “if 5+2 then”,
must be “if 5+2=7 then”).
Expressions can be quite complex, including
nesting of other expressions using parenthesis, function calls,
and variables.
(
4 + ( 2 * 6 ) - add(1, 5) )
One must traverse down the tree to the bottom
most elements, and propagate upwards. For example:
(
2 * 6 ) = ( integer * integer ) = integer
This is valid, and produces an integer result,
so then we can assume that whole expression is just an integer.
Now we have:
(
4 + ( 2 * 6 ) - add(1, 5) ) = ( integer + integer – add(1,
5) )
The return type of the add function can
be determined with a simple query to the names table. In this case,
it returns an integer value, and so we can resolve the type of the
whole expression:
(
integer + integer – integer ) = integer
Numeric values can be used in conditional
statements, switching the type to boolean.
(
4 + ( 2 * 6 ) - add(1, 5) ) > 6
integer
> integer = boolean
There are many rules on how types can be
used together. For instance, “and”
and “or”
statements can only be used with booleans.
Determining Local Variable Stack
Offsets
Local variables are allocated space on the
stack frame of the current function. On the PPC architecture, all
primitive types are 4 bytes in size. This simplifies calculation
of offsets. The names table keeps track of all local variables,
and knows the total count and order of the variables for any given
function. Using this information, the local variables are given
static stack offsets. This is used during code generation.
One additional point of interest is local
arrays. These are handled by allocating a large block on the stack.
Other variable’s offsets have to take this into account.
Code Generation
Finding good material covering PPC assembly
was tough to find. I designed my own stack convention for function
calls (allowing for recursion). Code is generated in one final pass
over the fully checked and decorated AST. It mimics the recursive
decent of the parsing stage. Below is the assembly produced from
the add function. The compiler generates all the comments and formatting.
//-------------------------
add -------------------------
.data ; data section
.text ; code section
.globl _add
_add:
//FUNCTION INIT
mflr r0 //save caller address into r0
stw r1, -256(r1) //save old sp on top of stack
subi r1, r1, 256 //create frame, move sp to 'top' of stack
(256 bytes)
stmw r14, 16(r1) //save registers r14-r32 onto stack (72 bytes)
stw r0, 8(r1) //save return address onto stack
//FUNCTION BODY
//RETURN STATEMENT
mr r14, r3 //moving parameter into temporary register
mr r15, r4 //moving parameter into temporary register
add r14, r14, r15 //add two registers
mr r3, r14 //move temporary value into return register (r3)
ba L_END0: //jump to function return
//FUNCTION END
L_END0: //end label
lwz r0, 8(r1) //get the caller's link address
mtlr r0 //move address into link register (r0)
lmw r14, 16(r1) //restore registers r14-r32
lwz r1,0(r1) //set the original stack pointer
blr //branch link register |
Major Features
• Parser shows errors in context:
Error!
Looking for ')' but found ',' at token 313
( base : integer , exponent : integer ) is
Error!
Encountered malformed or unknown type 'floating' at token
6
add ( x : floating ; y : integer )
Error!
Found unknown loop type 'foor' at token 111
begin result := 1 ; foor j in 1 ..
Error!
Looking for 'return' but found 'is' at token 318
exponent : integer ) is temp : integer ; begin
|
• Parser can resolve types of unlimited
type chaining. Types are resolved to their base type immediately.
Although the original type is stored, returnType() determines the
base type, either a primitive or an array of primitives. Example
shown below:
type
boolArray is array of boolean;
type OneDList is array of integer;
type TwoDList is array of OneDList;
boolArray
of kind boolean
OneDList of kind integer[]
TwoDList of kind integer[][] |
• The parser supports scopes of unlimited
depth. Variables with the same name in higher scopes are hidden
and only references to “newer” variables are used. This
feature was disabled for code generation because of the complexity
to reference previous local variables on the stack.
• Assembly produced by the code generator
allows for function calls and recursion.
• Single dimensional arrays are supported.
These have a limited length due to the fact they are stored on the
stack and frames are of fixed size.
• Pointers are supported. Any integer value
can be used as a memory location. Memory can be accessed like an
array. ptr[0] is equivalent to *ptr in C; ptr[4] accesses memory
location ptr+4. Since there is no memory management in our language,
one must create (malloc) some memory in the C driver program, and
pass a pointer to the function or procedure.
• Type checking is done as the parse tree
is built. If the types do not match, compilation halts and the error
is noted.
• Compiler generates working PPC assembly.
• Compiler generated assembly is commented
and neatly formatted.
• AST is neatly formatted (with indentation)
for easier reading.
• Names table is neatly formatted (with indentation)
for easier reading. All local variables and sub-scopes are printed
(to unlimited depth)
Other Notes
I had an urge when I was first starting the project
to do it entirely in C/C++ and not rely on any external programs
or libraries. In fact, I haven’t used any of the STL besides
string.h (for strcmp) and fstream.h for file IO. It would be fairly
trivial to remove the very light object oriented code that I used.
I’d like to think I could go back in time and armed with pascal,
fortran or a similar language, and write a compiler.
Strings and floats are disabled in the lexical analyzer.
The code generator does not support elseif statements.
One can have multiple if statements or nested if/else statements
to achieve the same effect.
Exit statements are also not supported.
Source Files
• Sample C Driver Program - main.c
• 'A' Language Sample Functions (Actual Input
for Lexical Analyzer) - program.a
Output and Generated Files
• Output from C Driver Program and Sample
Functions (Actual Output) - MainOutput.txt
• Lexical Tokens of Sample Functions (Actual
Output from Lexical Analyzer) - program.lex
• AST of the Sample Functions (Actual Output
from Parser) - ParseTree.txt
• Names Table for the Sample Functions (Actual
Output from Parser) - NamesTable.txt
• Assembly Generated from Sample Functions
(Actual Output from Compiler) - asm.s
Compiler Code
• Lexical Analyzer- LexicalCode.zip
The lexical analyzer writes to STDOUT. To save this
output, one must either copy and paste, or redirect output to a
file. There must not be any trailing white space at the end of the
file. The final character in the file should be a “>”,
which terminates the final token.
• Parser and Compiler - CompilerCode.zip
A string in main.cpp specifies the input file. The
default value is "program.lex". The AST and Names Table
are written to STDOUT. The assembly file is written to a file called
"asm.s".
Once you have generated the assembly file and a
written a driver function in C, you can use gcc to compile and link
them together like so: " gcc main.cpp asm.s".
Go to
Home Page
Go to Projects Page
|