|
|
6 years ago | |
|---|---|---|
| .. | ||
| README.md | 6 years ago | |
| makefile | 6 years ago | |
| mini.c | 6 years ago | |
| t | 6 years ago | |
| t3 | 6 years ago | |
| tinyc | 6 years ago | |
| tinyc.c | 6 years ago | |
README.md
Implementing a Language in C
In this post we're implementing a small programming language in C. The language, which I extended from tinyc by Marc Feeley, is a statement language. I added a print statement to the language (which means extending the tokenizer and the parser, and adding a new machine instruction and modifying the virtual machine interpreter accordingly). I also removed the restriction that only one-symbol pre-defined identifiers are allowed; this again leads to changes in the tokenizer, the parser, the compiler, and the virtual machine interpreter. Moreover, as will be seen, I added an abstract syntax tree printer and an abstract syntax interpreter to the program.
In this language a program is simply a statement, where a statement can be a
condition statement, a loop statement, a print statement, an empty statement,
or zero or more statements enclosed in brackets. Each statement is followed by
a semi-colon. To be precise, the definition is shown below in the BNF grammar
of the language. The non-terminals are enclosed in <> while the terminals are
shown between double quotes. Note that zero or more is simply enclosed inside a
pair of brackets (instead of the usual *).
<program> := <statement>
<statement> := "if" <paren_expr> <statement>
:= "if" <paren_expr> <statement> "else" <statement>
:= "while" <paren_expr> <statement>
:= "do" <statement> "while" <paren_expr> ";"
:= "print" <paren_expr> ";"
:= "{" { <statement> } "}"
:= <expr> ";"
:= ";"
<paren_expr> := "(" <expr> ")"
<expr> := <test>
:= <id> "=" <expr>
<test> := <sum>
:= <sum> "<" <sum>
<sum> := <term>
:= <sum> "+" <term>
:= <sum> "-" <term>
<term> := <id>
:= <int>
:= <paren_expr>
<id> := <a_finite_sequence_of_acceptable_symbols>
<num> := <an_unsigned_decimal_integer>
The Tokenizer
As is known, the tokenizer makes available the current token and makes possible moving to the next token. In general, a token has a type and a value. In our tokenizer, we have the following token types
enum
{
DO_SYM,
ELSE_SYM,
IF_SYM,
WHILE_SYM,
PRINT_SYM,
LBRA_SYM,
RBRA_SYM,
LPAR_SYM,
RPAR_SYM,
PLUS_SYM,
MINUS_SYM,
LESS_SYM,
SEMI_SYM,
EQUAL_SYM,
NUM_SYM,
ID_SYM,
EOI_SYM
};
Except EOI_SYM to help with parsing later, the other token types are all
obtained from the grammar. Every terminal has a token type, and moreover the
non-terminals that directly generate terminals (<id> and <int>) also belong
to some token types. The latter two have associated values which we'll keep in
an integer and a string:
int num_val;
char id_name[100];
On the other hand, the current token (more precisely the current token type) and getting the next token type are realized by a variable and a function:
int sym;
void next_sym()
{
// To be filled in
}
In order to signal a syntax error we'll have the function
void syntax_error(char *msg)
{
fprintf(stderr, "syntax error - %s\n", msg);
exit(1);
}
Now let's write the tokenizer, i.e. implement next_sym. The basic idea is
to examine the current character and update the token type kept in sym, and
then move on to the next character. In the case of numbers and identifiers the
characters are accumulated and put in num_val and id_name respectively. So
we'll have first of all
int ch = ' ';
void next_ch() { ch = getchar(); }
Space characters are simply ignored, and we'll have
void next_sym()
{
again:
switch (ch)
{
case ' ':
case '\n':
next_ch();
goto again;
// To be continued
}
}
To continue, when EOF is reached, sym is updated to EOI_SYM.
void next_sym()
{
again:
switch (ch)
{
// ...
case EOF:
sym = EOI_SYM;
break;
// To be continued
}
}
In the case of +, -, =, and so on, sym is updated accordingly, and the
next character is made available by a call to next_ch().
void next_sym()
{
again:
switch (ch)
{
// ...
case '{':
next_ch();
sym = LBRA_SYM;
break;
case '}':
next_ch();
sym = RBRA_SYM;
break;
case '(':
next_ch();
sym = LPAR_SYM;
break;
case ')':
next_ch();
sym = RPAR_SYM;
break;
case '+':
next_ch();
sym = PLUS_SYM;
break;
case '-':
next_ch();
sym = MINUS_SYM;
break;
case '<':
next_ch();
sym = LESS_SYM;
break;
case ';':
next_ch();
sym = SEMI_SYM;
break;
case '=':
next_ch();
sym = EQUAL_SYM;
break;
// To be continued
}
}
The remaining case is either a number, an identifier, or a syntax error. If the
current character is a digit, we simply accumulate all the following digits,
convert the token to a number and save it in num_val, and update sym to
NUM_SYM:
void next_sym()
{
again:
switch (ch)
{
// ...
default:
if (ch >= '0' && ch <= '9')
{
num_val = 0;
while (ch >= '0' && ch <= '9')
{
num_val = num_val * 10 + (ch - '0');
next_ch();
}
sym = NUM_SYM;
}
// To be continued
}
}
If on the other hand the current character is an alphabet, we're dealing with
an identifier. Here, there is a subtlety: The identifier could be one of the
words in our language, such as do, else, and so on. Thus we'll have
char *words[] = {"do", "else", "if", "while", "print", NULL};
Note the order that we have put the words: the order of do match that of
DO_SYM, else ELSE_SYM, and so on. This is so, so that after having
accumulated the characters of an identifier to id_name (ended properly with
\0), we can reset sym to 0 which correponds to the index of do in
words (and the index of DO_SYM). Then we can simply increment sym to
check all the words which at the same time makes sure sym have the correct
value if a word has indeed been seen. Otherwise we have seen an identifier and
we simply update sym to ID_SYM. Thus
void next_sym()
{
again:
switch (ch)
{
// ...
default:
// ...
else if (ch >= 'a' && ch <= 'z')
{
int i = 0;
while ((ch >= 'a' && ch <= 'z') || ch == '_' || (ch >= '0' && ch <= '9'))
{
id_name[i++] = ch;
next_ch();
}
id_name[i] = '\0';
sym = 0;
while (words[sym] != NULL && strcmp(words[sym], id_name) != 0)
sym++;
if (words[sym] == NULL)
sym = ID_SYM;
}
// .. To be continued
}
}
Anything else is a syntax error in the language, and we have
void next_sym()
{
again:
switch (ch)
{
// ...
default:
// ...
// ...
else
syntax_error("unknown symbol");
}
}
The tokenizer is done. The following function that prints all the tokens should be written as we were extending the tokenizer:
void print_tokens()
{
again:
next_sym();
switch (sym)
{
case DO_SYM:
printf("DO_SYM \"%s\"\n", id_name);
goto again;
case ELSE_SYM:
printf("ELSE_SYM \"%s\"\n", id_name);
goto again;
case IF_SYM:
printf("IF_SYM \"%s\"\n", id_name);
goto again;
case WHILE_SYM:
printf("WHILE_SYM \"%s\"\n", id_name);
goto again;
case PRINT_SYM:
printf("PRINT_SYM \"%s\"\n", id_name);
goto again;
case LBRA_SYM:
printf("LBRA_SYM\n");
goto again;
case RBRA_SYM:
printf("RBRA_SYM\n");
goto again;
case LPAR_SYM:
printf("LPAR_SYM\n");
goto again;
case RPAR_SYM:
printf("RPAR_SYM\n");
goto again;
case PLUS_SYM:
printf("PLUS_SYM\n");
goto again;
case MINUS_SYM:
printf("MINUS_SYM\n");
goto again;
case LESS_SYM:
printf("LESS_SYM\n");
goto again;
case SEMI_SYM:
printf("SEMI_SYM\n");
goto again;
case EQUAL_SYM:
printf("EQUAL_SYM\n");
goto again;
case NUM_SYM:
printf("NUM_SYM \"%d\"\n", num_val);
goto again;
case ID_SYM:
printf("ID_SYM \"%s\"\n", id_name);
goto again;
case EOI_SYM:
printf("EOI_SYM\n");
break;
}
}
Now a simple demonstration of { i=1; while (i<100) i=i+i; }:
LBRA_SYM
ID_SYM "i"
EQUAL_SYM
NUM_SYM "1"
SEMI_SYM
WHILE_SYM "while"
LPAR_SYM
ID_SYM "i"
LESS_SYM
NUM_SYM "100"
RPAR_SYM
ID_SYM "i"
EQUAL_SYM
ID_SYM "i"
PLUS_SYM
ID_SYM "i"
SEMI_SYM
RBRA_SYM
EOI_SYM
The Parser
In order to write the parser we'll first need to give names to the semantic constructs in the language, which is done by going through the grammar and labelling the productions. The following list is complete:
enum
{
VAR,
CST,
ADD,
SUB,
LT,
SET,
IF,
IFELSE,
WHILE,
DO,
PRINT,
EMPTY,
SEQ,
EXPR,
PROG
};
Next, we'll need a data structure to hold not only the type of a construct but
also its components. We'll need at maximum three pieces of data (the IFELSE
construct) and the following data structure is adequate for all the constructs
in the language:
typedef struct node
{
int kind;
struct node *o1, *o2, *o3;
union {
int val;
char id[100];
};
} node;
A new node is created by calling the following function:
node *new_node(int k)
{
node *x = malloc(sizeof(node));
x->kind = k;
return x;
}
At certain points during the parsing we'll simply need to consume an expected token type and the following function is useful:
void consume(int expected)
{
if (sym == expected)
next_sym();
else
syntax_error("unknown expected");
}
The parser itself is made up of a set of possibly recursive functions calling each
other according to the grammar. To start, id simply creates a new node of
type VAR and copies over the content of id_name; it also makes the next
token available by calling next_sym:
node *id()
{
node *x = new_node(VAR);
strcpy(x->id, id_name);
next_sym();
return x;
}
Next, num is likewise:
node *num()
{
node *x = new_node(CST);
x->val = num_val;
next_sym();
return x;
}
Next, according to its grammar, term decides which non-terminal to follow
according to the current token type. Note that we need a forward declaration
of paren_expr().
node *paren_expr();
node *term()
{
if (sym == ID_SYM)
return id();
else if (sym == NUM_SYM)
return num();
else
return paren_expr();
}
Next in line is sum. According to its grammar, there're three productions
none of which begins with a terminal. However, we can expand an addition or a
subtraction and see that a term always needs to be parsed first. Subsequently
it's zero or more additions or subtractions which translates to a loop
structure. Thus
node *sum()
{
node *x = term();
while (sym == PLUS_SYM || sym == MINUS_SYM)
{
node *t = x;
x = new_node(sym == PLUS_SYM ? ADD : SUB);
next_sym();
x->o1 = t;
x->o2 = term();
}
return x;
}
Let's pause a bit here and try to visualize the work of the parser. Given an
expression such as 1 + 2 - 3 we'll get the parsed tree
+
/ \
1 -
/ \
2 3
The next one is test:
node *test()
{
node *x = sum();
if (sym == LESS_SYM)
{
node *t = x;
x = new_node(LT);
next_sym();
x->o1 = t;
x->o2 = sum();
}
return x;
}
Now, expr, which is a bit tricky. Even though an identifier seems to point to
the second production, if we expand the first production, <test>, we see that
it may start with an identifier as well. In the following, we will simply check
if the current token is not an identifier, in which case we parse the first
production by returning what will be parsed by test. Otherwise, we call
test and check if we have an identifier followed by the equal sign, in which
case we are parsing the SET clause, and if not, it's also what has been
parsed by test.
node *expr()
{
if (sym != ID_SYM)
return test();
node *x = test();
if (x->kind == VAR && sym == EQUAL_SYM)
{
node *t = x;
x = new_node(SET);
next_sym();
x->o1 = t;
x->o2 = expr();
}
return x;
}
Next, paren_expr, it's probably the simplest one:
node *paren_expr()
{
consume(LPAR_SYM);
node *x = expr();
consume(RPAR_SYM);
return x;
}
Next is statement, which has eight productions and is therefore the longest
function. Seven of them start with a distinct terminal indicating which
production is to be parsed (even though there're two conditional statements,
one can be distinguished from the other by the else terminal). Most of them
are straight-forward, except the sequence production. It translates into a loop
structure that keeps parsing and attaching what has been parsed as a component
to what will be parsed. We may roughly see how it works by looking at an
example: The sequence { i=1; while (i<100) i=i+i; } will be parsed by the
parser into the tree below, the lower tree being parsed first and glued to the
upper tree as a branch.
SEQ
/ \
SEQ WHILE
/ \
EMPTY EXPR
node *statement()
{
node *x;
if (sym == IF_SYM)
{
next_sym();
x = new_node(IF);
x->o1 = paren_expr();
x->o2 = statement();
if (sym == ELSE_SYM)
{
x->kind = IFELSE;
next_sym();
x->o3 = statement();
}
}
else if (sym == WHILE_SYM)
{
x = new_node(WHILE);
next_sym();
x->o1 = paren_expr();
x->o2 = statement();
}
else if (sym == DO_SYM)
{
x = new_node(DO);
next_sym();
x->o1 = statement();
consume(WHILE_SYM);
x->o2 = paren_expr();
consume(SEMI_SYM);
}
else if (sym == PRINT_SYM)
{
x = new_node(PRINT);
next_sym();
x->o1 = paren_expr();
consume(SEMI_SYM);
}
else if (sym == LBRA_SYM)
{
x = new_node(EMPTY);
next_sym();
while (sym != RBRA_SYM)
{
node *t = x;
x = new_node(SEQ);
x->o1 = t;
x->o2 = statement();
}
next_sym();
}
else if (sym == SEMI_SYM)
{
x = new_node(EMPTY);
next_sym();
}
else
{
x = new_node(EXPR);
x->o1 = expr();
consume(SEMI_SYM);
}
return x;
}
Finally, program. We need to remember to consume EOF_SYM.
node *program()
{
node *x = new_node(PROG);
x->o1 = statement();
consume(EOI_SYM);
return x;
}
And a function that initiates the parsing:
node *parse()
{
next_sym();
node *x = program();
return x;
}
The following function prints the abstract syntrax tree to give us roughly an idea of how it looks like.
void print_ast(node *x)
{
switch (x->kind)
{
case VAR:
printf("VAR \"%s\" ", x->id);
break;
case CST:
printf("CST \"%d\" ", x->val);
break;
case ADD:
print_ast(x->o1);
printf("ADD ");
print_ast(x->o2);
break;
case SUB:
print_ast(x->o1);
printf("SUB ");
print_ast(x->o2);
break;
case LT:
print_ast(x->o1);
printf("LT ");
print_ast(x->o2);
break;
case SET:
printf("SET ");
print_ast(x->o1);
print_ast(x->o2);
break;
case IF:
printf("IF ");
print_ast(x->o1);
print_ast(x->o2);
break;
case IFELSE:
printf("IF ");
print_ast(x->o1);
print_ast(x->o2);
printf("ELSE ");
print_ast(x->o3);
break;
case EXPR:
printf("EXPR ");
print_ast(x->o1);
break;
case SEQ:
printf("SEQ ");
print_ast(x->o1);
print_ast(x->o2);
break;
case PRINT:
printf("PRINT ");
print_ast(x->o1);
break;
case WHILE:
printf("WHILE ");
print_ast(x->o1);
print_ast(x->o2);
break;
case DO:
printf("DO ");
print_ast(x->o1);
printf("WHILE ");
print_ast(x->o2);
break;
case PROG:
printf("PROG ");
print_ast(x->o1);
break;
case EMPTY:
printf("EMPTY ");
break;
default:
syntax_error("unknown node");
break;
}
}
As an example, we may see the output of { i=1; while (i<100) i=i+i; }:
PROG SEQ SEQ EMPTY EXPR SET VAR "i" CST "1" WHILE VAR "i" LT CST "100"
EXPR SET VAR "i" VAR "i" ADD VAR "i"
Here we see the main node PROG that contains a sequence. As we discussed how
a sequence was parsed, i.e. roughly { i=1; while (i<100) i=i+i; } will parsed
into
SEQ
/ \
SEQ WHILE
/ \
EMPTY EXPR
which was reflected in the output.
The Interpreter
This language is a very small subset of the C language and the semantics of its
statements are clear. The only point that should be discussed is scoping.
Usually brackets open a new scope and interpretation should take scoping into
account properly. In the current language, however, we'll simply treat brackets
as a way to group statements, and accordingly there's only one global scope,
instead of different local scopes for different pair of matching brackets. As
an example, in our language the statement {a = 3; {a = a + a;}; a = a + a; print(a)} will print 12, instead of 6; the inner pair of brackets does
update the variable a.
Thus, instead of creating new local environments when entering brackets we will have a global environment that keeps identifiers and their values. The environment will be a list
typedef struct list
{
char *id;
int value;
struct list *next;
} list;
list *env;
We want to be able to get an identifier, which is an element in the list, and also be able to look up the value of a given identifier:
list *get_id(char *id)
{
for (list *lst = env; lst; lst = lst->next)
if (strcmp(lst->id, id) == 0)
return lst;
return (list *)NULL;
}
void lookup_error(char *id)
{
fprintf(stderr, "error looking up %s\n", id);
exit(1);
}
int lookup_value(char *id)
{
list *pid = get_id(id);
if (pid)
return pid->value;
lookup_error(id);
return -1;
}
Finally, we want to be able to add an identifier and its value to the global environment. Because of our scoping rule, if the identifier already exists its value will simply be replaced by the new value; otherwise the new name-value pair will be added to the beginning of the list.
void add_id(char *id, int value)
{
list *pid = get_id(id);
if (pid)
{
pid->value = value;
return;
}
list *lst = malloc(sizeof(list));
lst->id = id;
lst->value = value;
lst->next = env;
env = lst;
}
Now the interpreter. According to our grammar there should be three functions,
eval_program, eval_statement, and eval_expr. First, eval-expr:
int eval_expr(node *x)
{
node *var;
int val;
switch (x->kind)
{
case VAR:
return lookup_value(x->id);
case CST:
return x->val;
case ADD:
return eval_expr(x->o1) + eval_expr(x->o2);
case SUB:
return eval_expr(x->o1) - eval_expr(x->o2);
case LT:
return eval_expr(x->o1) < eval_expr(x->o2);
case SET:
var = x->o1;
val = eval_expr(x->o2);
add_id(var->id, val);
return val;
default:
eval_error();
return -1;
}
}
Next, eval_statement:
void eval_statement(node *x)
{
switch (x->kind)
{
case PRINT:
printf("%d\n", eval_expr(x->o1));
break;
case IF:
if (eval_expr(x->o1))
eval_statement(x->o2);
break;
case IFELSE:
if (eval_expr(x->o1))
eval_statement(x->o2);
else
eval_statement(x->o3);
break;
case WHILE:
while (eval_expr(x->o1))
eval_statement(x->o2);
break;
case DO:
do
eval_statement(x->o1);
while (eval_expr(x->o2));
break;
case SEQ:
eval_statement(x->o1);
eval_statement(x->o2);
break;
case EXPR:
eval_expr(x->o1);
break;
case EMPTY:
break;
default:
eval_error();
}
}
Finally, eval_program:
void eval_program(node *x)
{
switch (x->kind)
{
case PROG:
eval_statement(x->o1);
break;
default:
eval_error();
}
}
Let's run the interpreter on the program { i=1; j = 10; while (i<100) print(i=i+j); }
11
21
31
41
51
61
71
81
91
101
The Compiler
A compiler translates a program in our language into a set of bytecote instructions that will subsequently be interpreted. Each instruction is associated with a number, and the set of instructions that will be used is
enum
{
IFETCH,
ISTORE,
IPUSH,
IPOP,
IADD,
ISUB,
ILT,
IJZ,
IJNZ,
IJMP,
IPRINT,
IHALT
};
The interpretation of the bytecode will be carried out by a stack virtual machine, starting from the first instruction, with a stack holding computation values. In addition, there will be two additional arrays to keep variables and their associated values. Each variable will be put into a specific location on one array, and the value associated to the variable will be put into the corresponding location on the other array. The location number will be used as a bytecode instruction.
Please note that we are operating at a very low level.
To start, the instructions are put into an array called object. Since we'll
keep pushing the instructions on top of the array, we need a pointer, here.
The function g simply puts a given bytecode on top of the code array and
moves up one element.
typedef char code;
code object[1000], *here = object;
void g(code c) { *here++ = c; }
As we have said, there are two arrays holding the variables and their associated values. We're restricting to only maximum a hundred names, and therefore a hundred values.
char names[100][100], (*namespt)[100] = names;
int globals[100];
An important operation on names is getting the index of a variable. If the
variable is already in the array, its index will be returned; otherwise the
variable is put on top of the array and the index of the top element is
returned:
int get_index(char *name)
{
int i;
for (char(*npt)[100] = names; npt < namespt; npt++)
{
i = npt - names;
if (strcmp(name, names[i]) == 0)
return i;
}
i = namespt++ - names;
strcpy(names[i], name);
return i;
}
Let's move on to generating the code. We'll write a function called c that
takes an abstract syntax tree and generates the corresponding code.
void c(node *x)
{
switch (x->kind)
{
// To be continued
}
}
Since PROG is the containing node, let's work on that first. We need to
generate the bytecode and put IHALT at the end. Thus,
void c(node *x)
{
switch (x->kind)
{
case PROG:
c(x->o1);
g(IHALT);
break;
// To be continued
}
}
Now, according to the grammar of the language, there're eight different kinds
of nodes that could be contained in a PROG node. The easiest one is EMPTY:
void c(node *x)
{
switch (x->kind)
{
// To be continued
case EMPTY:
break;
// ...
}
}
Let's pick EXPR as the next one. In this case a value is expected to be
computed and put on the virtual machine stack, before being used. Thus we'll
generate the instructions followed by the IPOP instruction.
void c(node *x)
{
switch (x->kind)
{
// To be continued
case EXPR:
c(x->o1);
g(IPOP);
break;
// ...
}
}
There're several possible expressions. Let's start with VAR. Given a
variable, as there is already a value associated with it, we simply generate
IFETCH and the index of the location of the variable:
void c(node *x)
{
switch (x->kind)
{
case VAR:
g(IFETCH);
g(get_index(x->id));
break;
// ...
}
}
Next, CST. In this case we generate IPUSH and the value to be put to the
virtual machine stack:
void c(node *x)
{
switch (x->kind)
{
// ...
case CST:
g(IPUSH);
g(x->val);
break;
// To be continued
// ...
}
}
Next are ADD and SUB:
void c(node *x)
{
switch (x->kind)
{
// ...
case ADD:
c(x->o1);
c(x->o2);
g(IADD);
break;
case SUB:
c(x->o1);
c(x->o2);
g(ISUB);
break;
// To be continued
// ...
}
}
LT is similar:
void c(node *x)
{
switch (x->kind)
{
// ...
case LT:
c(x->o1);
c(x->o2);
g(ILT);
break;
// To be continued
// ...
}
}
Finally, SET. Here we compute the value to be set first, then store it, then
generate the index of the variable:
void c(node *x)
{
switch (x->kind)
{
// ...
case SET:
c(x->o2);
g(ISTORE);
g(get_index(x->o1->id));
break;
// To be continued
// ...
}
}
Now let's get back to the nodes that could be contained in the PROG node. We
have implemented EMPTY and EXPR. The next easy one is PRINT:
void c(node *x)
{
switch (x->kind)
{
// ...
case PRINT:
c(x->o1);
g(IPRINT);
break;
// To be continued
// ...
}
}
SEQ is also straight-forward:
void c(node *x)
{
switch (x->kind)
{
// ...
case SEQ:
c(x->o1);
c(x->o2);
break;
// To be continued
// ...
}
}
Next, let's move on to IF. The key to translating this node is to use the
instruction IJZ, meaning jump if zero. Before writing the code, we will look
at an example, if (1 < 10) print(10);. Here the condition 1 < 10 will be
translated into a series of instructions, which we'll mark X below:
... |X| | |...
print(10) must be translated too, but whether or not it should be executed
depends on whether the condition is true (1), in other words the sequence of instructions should be skipped if the condition is false (0). To achieve
this effect, we put IJZ followed by a hole, and generate the instructions
for print(10). The hole is to hold the number of instructions to be skipped.
We can keep the following illustration in mind:
... |X|IJZ| |X|...
To create a hole, we have the function
code *hole() { return here++; }
And to compute the number of steps:
void fix(code *src, code *dst) { *src = dst - src; }
The code for translating IF is:
void c(node *x)
{
switch (x->kind)
{
// ...
case IF:
c(x->o1);
g(IJZ);
p1 = hole();
c(x->o2);
fix(p1, here);
break;
// To be continued
// ...
}
}
Next we will translate IFELSE. We'll use a concrete example, if (1<10) print(1); else print(10);, to follow along. The code for 1<10, print(1),
and print(10) will need to be translated. In a general statement, one of the
consequences will be skipped depending on the condition. Thus we will place
a jump instruction and a hole in front of each. But the second jump is a simple
jump. Thus we use the IJMP instruction.
...|X|IJZ| |X|JMP| |X| |...
The addresses to be kept in the hole need to be calculated at the right places:
void c(node *x)
{
switch (x->kind)
{
// ...
case IFELSE:
c(x->o1);
g(IJZ);
p1 = hole();
c(x->o2);
g(IJMP);
p2 = hole();
fix(p1, here);
c(x->o3);
fix(p2, here);
break;
// To be continued
// ...
}
}
Now translating WHILE. This works almost like a condition. We'll have
|X|IJZ| |X|IJMP|...
We need to be careful with the calculation of the jumping addresses:
void c(node *x)
{
switch (x->kind)
{
// ...
case WHILE:
p1 = here;
c(x->o1);
g(IJZ);
p2 = hole();
c(x->o2);
g(IJMP);
fix(hole(), p1);
fix(p2, here);
break;
// To be continued
// ...
}
}
Finally, DOWHILE. Here the JNZ, or jump if not zero, is used.
void c(node *x)
{
switch (x->kind)
{
// ...
case DO:
p1 = here;
c(x->o1);
c(x->o2);
g(JNZ);
fix(hole(), p1);
break;
// ...
}
}
The Virtual Machine Interpreter
The bytecode instructions generated by the compiler written above are interpreted by a stack virtual machine interpreter. There is a stack to hold computation values, and the instructions are to be interpreted from beginning to end.
void run()
{
int stack[1000], *sp = stack;
code *pc = object;
again:
switch (*pc++)
{
// To be continued ...
}
}
The following instructions manipulate the stack or the value array directly:
void run()
{
int stack[1000], *sp = stack;
code *pc = object;
again:
switch (*pc++)
{
case IFETCH:
*sp++ = globals[*pc++];
goto again;
case ISTORE:
globals[*pc++] = sp[-1];
goto again;
case IPUSH:
*sp++ = *pc++;
goto again;
case IPOP:
--sp;
goto again;
case IADD:
sp[-2] = sp[-2] + sp[-1];
--sp;
goto again;
case ISUB:
sp[-2] = sp[-2] - sp[-1];
--sp;
goto again;
case ILT:
sp[-2] = sp[-2] < sp[-1];
--sp;
goto again;
case IPRINT:
printf("%d\n", *--sp);
goto again;
// To be continued ...
}
}
The remaining instructions involve using the number of steps to jump when needed:
void run()
{
int stack[1000], *sp = stack;
code *pc = object;
again:
switch (*pc++)
{
// ...
case IJZ:
if (*--sp == 0)
pc += *pc;
else
pc++;
goto again;
case IJMP:
pc += *pc;
goto again;
case IJNZ:
if (*--sp != 0)
pc += *pc;
else
pc++;
goto again;
}
}
The virtual machine has been completed. Given a program, such as { i = 0; do { i = i + 10; print(i);} while (i < 50);}, it will produces the output
10
20
30
40
50