Compiler for Procedural Programming Language "Starlet".
The following compiler, is a semester project for the subject ΜΥΥ802, at the Department of Computer Science & Engineering, University of Ioannina.
Project refers to a compiler -of a high-level language- written in python which produce MIPS assembly low-level code.
The project is divided in three phases:
-
The first phase concerned the implementation of the lexical and syntactic analysis of the compiler.
-
The second phase involved the generation of the intermediate code as well as the semantic analysis.
-
Finally, in the third and last phase was the production of the final code in symbolic language(assembly) MIPS.
The task of the lexical analyzer is to read the input file character by character and identify symbols and words of the language Starlet. In order to represent the right behaviour of the analyzer in an efficient way, we created an automaton to identify every possible outcome.
States:
- STATE_START(P0): starting state of automaton
- STATE_ALPHA(P1): read an alphabet cahracter ready to return variable name or a keyword.
- STATE_DIGIT(P2): read a number ready to return an integer nuber.
- STATE_LESS(P3): read operator ‘<’.
- STATE_GREATER(P4): read operator ‘>’.
- STATE_COLON(P5): read operator ‘:’.
- STATE_SLASH(P6): read operator ‘/’. Things that might follow: single line comment, multiline comment or a division operation.
- STATE_LINECOMMENT(P7): read operator ‘/’, while in P6. That means we have single line comment.
- STATE_LINECOMMENT_CHECK(P8): read operator ‘/’, while in P7. We need to check for nested comments which are forbidden.
- STATE_MULTILINECOMMENT_OPEN(P9): read operator ‘*’, while in P6. That means we have multiline comment.
- STATE_MULTILINECOMMENT_CHECK(P10): read operator ‘/’, while in P9. We need to check for nested comments which are forbidden.
- STATE_MULTILINECOMMENT_CLOSE(P11): read operator ‘*’ , while in P9. Automaton "waits" for opwrator ‘/’ to complete the multiline comment or any other symbol to return to P9.
Final States:
- (OK): final state where the lexical analyzer returns all the collected characters of the buffer.
- (OKR): final state where the lexical analyzer returns all the collected characters of the buffer except from the last character that was read. This character returns to the input and awaits to be read.
- (Error): an error was detected while reading the input character by character.
Extra Symbolisms:
- ALPHA: Characters in {‘a’, ..., ‘z’} and {‘A’, ..., ‘Z’}
- DIGIT: Integers in {–32767, ..., 32767}
- SYMBOLS: {‘ ‘, ‘\t’, ‘\n’, ‘+’, ‘-’, ‘*’, ‘/’, ‘<’, ‘>’, ‘=’, ‘:’, ‘,’, ‘;’, ‘(‘, ‘)’, ‘[‘, ‘]’, EOF}
- EOL: New line character ‘\n’.
- EOF: End of file.
In this project we simulate the function of the automaton with a two-dimensional array named decision_table.
For each rule of the grammar we created a function with the same name that checks if the words that lexical analyzer returns comply with the corresponding rule.
<program> ::= program id < block > endprogram
<block> ::= <declaclarations> <subprograms> <statements>
<declaclarations> ::= (declare <varlist>;)*
<varlist> ::= ε | id (, id )*
<subprograms> ::= (<subprogam>)*
<subprogram> ::= function id <funcbody> endfunction
<funcbody> ::= <formalpars> <block>
<formalpars> ::= (<formalpalist>)
<formalparlist> ::= <formalparitem> (, <formalparitem>)* | ε
<formalparitem> ::= in id | inout id | inandout id
<statements> ::= <statement> (; <statement>)*
<statement> ::= ε | <assignment-stat> | <if-stat> | <while-stat> | <do-while-stat> | <loop-stat> | <exit-stat> | <forcase-stat> | <incase-stat> | <return-stat> | <input-stat> | <print-stat>
<assignment-stat> ::= id := <expression>
<if-stat> ::= if (<condition>) then <statements> <elsepart> endif
<elsepart> ::= ε | else <statements>
<while-stat> ::= while (<condition>) <statements> endwhile
<do-while-stat> ::= dowhile <statements> enddowhile (<condition>)
<loop-stat> ::= loop <statements> endloop
<exit-stat> ::= exit
<forcase-stat> ::= forcase ( when (<condition>) : <statements>)* default: <statements> enddefault endforcase
<incase-stat> ::= incase ( when (<condition>) : <statements>)* endincase
<return-stat> ::= return <expression>
<input-stat> ::= input id
<print-stat> ::= print <expression>
<actualpars> ::= (<actualparlist>)
<actualparlist> ::= <actualparitem> (, <actualparitem>)* | ε
<actualparitem> ::= in <expression> | inout id | inandout id
<condition> ::= <boolterm> (or <boolterm>)*
<boolterm> ::= <boolfactor> (and <boolfactor>)*
<boolfactor> ::= not [<condition>] | [<condition>] | <expression> <relational-oper> <expression>
<expression> ::= <optional-sign> <term> (<add-ope> <term>)*
<term> ::= <factor> (<mul-oper> <factor>)*
<factor> ::= constant | (<expression>) | id <idtail>
<idtail> ::= ε | <actualpars>
<relational-oper> ::= = | <= | >= | > | < | <>
<add-oper> ::= + | -
<mul-oper> ::= * | /
<optional-sign> ::= ε | <add-oper>
In the analysis-synthesis model of a compiler, the front end of a compiler translates a source program into an independent intermediate code, then the back end of the compiler uses this intermediate code to generate the target code(assembly).
The intermidiate code for this project cosists of quads(op, term0, term1, term2) where:
- op: the operator of the quad {+, -, /, *, :=, >, <, =, >=, <=, <>, inp, out, begin_block, end_block, call, halt, jump, retv, par, call}
- term0, term1, term2: differ depending on the operator
Quads are created inside the functions that implement the rules of the grammar for example:
def genquad(op, term0, term1, term2): is responsible for generating quads and adding them to a list for later use in the generation of the target code.
Symbol table is an important data structure created and maintained by compilers in order to store information about the occurrence of various entities such as variable names, function names, etc. In this project we represent the symbol table with a list named symbol_table that contains objects of the class Scope.
class Scope: the objects of this class represent the nesting depth. Fields of this class:
- name: name of the scope. For example if the scope corresponds to a function "f" then name = f.
- entities: list of entities(an object in the program, such as a class, method, field, variable, etc.) that exist in the current scope
- framelength: length of the scope
The Semantic Analysis phase concerns the satisfaction of specific requirements of the language which are not explicitly defined by its grammar, but are necessary for the correct production of the target code.
These requirements are listed below.
- Every function in the language must have a return statement.
- Only functions can have a return statement.
- The exit command exists only in loops.
- When declaring a variable, check that there is no other function or variable with the same name in the nesting depth being declared.
- When declaring a function, check that there is no other function or variable with the same name in the nesting depth being declared.
- When using a variable, check whether this variable is declared at the current or a smaller nesting depth and that it is declared as a variable or parameter.
- When using a function, check that the function is declared in the current or some smaller nesting depth and that it is declared as a function. Also check that the arguments are passed based on how the function parameters are passed.
In this project we chose as target code to be the assembly language of the MIPS processor. The basic idea in this phase of compilation is that we have a list of quads(op, term0, term1, term2) that were created during the parsing of a block of the input code. So we maintain at any given time a list to which we add each new block. When we find end_blockAt before the current Scope is deleted from the symbol table we should generate the target code for all blocks in this list. The following are the functions that contribute to the generation process of the final code.
-
def gnvlcode(variable): gets called when the value of a non local variable is requested. So this function takes from the accessor link the parent of the function calling the variable and starting from its nesting depth it descends to the lower nesting depths until it finds the variable. When it finds the nesting depth declared for that variable it sets the register $t0 equal to the memory address where that variable is located.
-
def loadvr(variable, register): loads the value of the variable into a register. We distinguish the following cases according to the type of the variable:
- constant:
- li register, v
- global:
- lw register,-offset($s0)
- local variable or parameter passed by value:
- lw register,-offset($sp)
- local parameter passed by reference:
- lw $t0,-offset($sp)
- lw register,($t0)
- not a local variable or parameter passed by value:
- gnlvcode()
- lw register,($t0)
- not a local variable passed by reference:
- gnlvcode()
- lw $t0,($t0)
- lw register,($t0)
- constant:
-
def storerv(register, variable): writes the value of the register in the memory address of the variable. We distinguish the following cases according to the type of the variable:
- global:
- sw register,-offset($s0)
- local variable or parameter passed by value:
- sw register,-offset($sp)
- local parameter passed by reference:
- lw $t0,-offset($sp)
- sw register,($t0)
- not a local variable or parameter passed by value:
- gnlvcode()
- sw register,($t0)
- not a local variable passed by reference:
- gnlvcode()
- lw $t0,($t0)
- sw register,($t0)
- global:
-
def compile_quads(): traverses the list of uncompiled_quads with the quads and calls the compile_quad() function for each of them in order to eventually generate the target code.
-
def compile_quad(quad, args): produces the final code of the input quad. Some of the different quad types can be seen bellow:
- jump quad: (jump, _, _, label)
- j label
- condition jump quad: (relop, x, y, label):
- loadvr(x,$t1)
- loadvr(y,$t2)
- branch, $t1, $t2, label
- assignment quad: (:=, x, _, z)
- loadvr(x,$t1)
- storerv($t1,z)
- sign quad: (+, x, _, z)
- loadvr(x, $t1)
- add $t1, $zero, $t1
- storerv($t1, z)
- arithmetic operation quad: (op x,=,z)
- loadvr(x,$t1)
- loadvr(y,$t2)
- op_instr $t1,$t1,$t2 #op_instr є {add, sub, mul, div}
- storerv($t1,z)
- output quad: (out x, _, _)
- li $v0,1
- loadvr(x, $t1)
- move $a0, $t1
- syscall
- input quad: (inp _, _, x)
- li $v0,5
- syscall
- storerv($v0, x)
- function return value quad: (retv _, _, x)
- loadvr(x,$t1)
- lw $t0, -8($sp)
- sw $t1, ($t0)
- jump quad: (jump, _, _, label)
Regarding the errors that may occur during the analysis, a 3-level directory named error_map was used. The first level corresponds to the errors that can occur in the three main categories: during Verbal Analysis, during Syntactic Analysis, during Semantic Analysis. Level 2 categorizes errors according to:
- the id of the input symbol that caused the error (if it is a Lexical Analysis error) or
- the compiler function in which the error occurred(if it is a Syntactic Analysis erroror a Semantic Analysis error).
Finally, the 3rd level separates the different errors of the second level based on an id chosen to precisely characterize the specific error.
Readme Language: Greek
Team Members:
George Androutsopoulos
Christoforos Karvelis