In previous chapters we have seen a variety of utility procedures for initialising symbol tables, for reading numbers, identifiers and special characters, for reporting errors and for other purposes. To keep the programs short, only minimal utilities were provided. In no two programs were the utilities sufficiently large and sufficiently complex to warrant collecting them in a common library of utilities. The payoff becomes different when the utilities are large and useful in several different programs. In fact, one may taylor the programs to use as many portions of the utilities as possible, and to accept that some portions will not be used. This is the approach we shall use in the remainder of this book. In this chapter we design and implement some general utilities, in the remaining chapters we use them.
A scanner: The core of the utilities is a scanner, a procedure which reads characters from the input file, assembles short groups of them into symbols such as numbers in decimal or other notation, quoted single characters, quoted strings of characters, identifiers, reserved words and sequences of special characters such as punctuation and other symbols. Reserved words and sequences of special symbols should be looked up in a table; identifiers should not be looked up because the required details are likely to vary in different applications. The scanner must also handle comments.
For many purposes it is useful to let the scanner handle directives.
These are special purpose instructions to the scanner
to affect its behaviour.
They are not passed on to the parser,
indeed as far as the parser is concerned they are invisible just like
One important directive is the
which tells the scanner to suspend reading from the input file
and to start reading from another.
When the end of that file is reached,
reading resumes from the previous file.
It should be possible to nest such inclusions.
Also useful is a
LISTING directive which can set an internal
variable so that any input is copied to a listing file
and any output goes not only to the output file but is also
copied to the listing file.
In special cases one might want the listing file
to contain debugging information,
so it is desirable to have levels of listing.
As in any programming language,
IF directive can be useful.
The kind chosen here is the simplest,
it applies only to lines in the input file.
IF part evaluates to true, the remainder of the input line
is processed, otherwise it is skipped.
A collection of variables
A .. Z can have integer values assigned
to them with the
For numbers in alternative notation,
the radix can be set to any value between 2 and 36 inclusive.
Errors: Any error messages should in the first place go to the output file, which may be the user terminal or a disk file. But, if a listing is being produced, error messages should also go to the listing file. It is best if error messages actually point to where in the input line the error occurred. So, for errors reported to the output file, the input line should be repeated and a pointer placed under the position of the error. In the listing file the line should not be repeated if it is the first error in that line. However, for second and later errors in the same line it is more informative if the line is repeated in the output file and the listing file. In the listing file all lines that are copied from the input should be given line numbers, and all error reports should give the offending line together with its line number.
Other utilities: Any program that uses the scanner will have to enter reserved words into a table; it is useful to include a procedure which enters one reserved word per call (and checks for overflow). A similar procedure is useful for user declared identifiers. Programs that make use of the utilities should not have to bother about the listing file. Any output should be handled by the utilities, writing to the output file and, if necessary, to the listing file.
To make the utilities flexible, some declarations have to occur before the file of utilities can be read in. These declarations are specific to the particular use to which the utilities are being put. Hence these declarations have to occur in the program which uses the utilities, before the utilities file is being included. The details should be apparent in the programs in the next three chapters.
Many of the names or identifiers are similar to those in PLZERO of Wirth (1976) and Wirth's PASCALS decribed in Berry(1982).
By far the largest procedure is the scanner, procedure
It reads characters from the current input file and assembles
them into symbols.
The reading is actually done by a local procedure
getch which is
getsym begins by skipping
any non-printing characters
and then uses a
CASE statement to dispatch on the current
character in the global variable
Some of the cases are familiar from previous programs.
Each of the cases assigns a value to the global
sym which is of the enumeration type symbol.
Some of the cases also assign an integer value
to the global variable num.
These two variables constitute the principal output
of the scanner.
ch is a single quote character,
then this signals a character constant.
Following the quote is normally an ordinary character,
and then the global variable
num is set to the ordinal value
of that character.
However, if the character following is the backslash
then the ordinal value returned in variable
is given by a numeric scan-time expression.
In most cases this will be a literal number,
'\65 is equivalent to
However, following the backslash can be an expression
computed at scan-time by a local function
value described below.
ch is a double quote character,
then what is being read is a string constant.
While the next character is not the terminating double quote
character, further characters are read and stored in a large array
of characters which is made to hold all strings that are read.
The start position of the current string is recorded in a further
of starting positions,
and the next value in this array gives the starting position of
the next string that will be read.
The index in this integer array is assigned to the global variable
Hence the length and content of the currently read string
can be determined by using the value of
as indices into the integer array.
The values recorded there can then be used
to retrieve the string from the large character array.
This part of the scanner is not being used in the three
programs in this book which use the utilities,
but it has been used elsewhere.
Parentheses and comments:
These are handled as in previous programs:
(* signals a comment and causes skipping up to the closing
GOTO the beginning of the scanner.
A left parenthesis character without a following asterisk
is just a parenthesis symbol.
Signed numbers and hyphens:
ch is a decimal digit character or a hyphen
then the symbol is either just the hyphen symbol
in case it is not followed by digits,
or it is a negative number in case it is followed by digits,
or it is a positive number in case there is no preceding
So, if there is no leading
then the number being read is not to be negated,
and a Boolean flag is set to that effect.
Otherwise the flag is set to eventually negate the number being read,
if indeed there is one.
If there are no digits immediately after the
then the symbol is just a hyphen,
GOTO the end of the scanner is effected.
Otherwise a string of decimal digits is read as normal,
and the number is possibly negated.
Numbers in alternative radix:
For some purposes it is useful to be able to read numbers
in other than decimal notation but in a different radix.
Such numerals are signalled by a leading
The default alternative radix is 2,
but it can be set by a directive
where N is a scan-time expression whose value
is between 2 and 36 inclusive.
9 are taken to be
so hexadecimal uses
Reading successive digits is very similar to reading
instead of multiplying values by 10
they have to be multiplied by the alternative radix.
A minor complication arises because in ASCII
the character sequence
is not immediately followed by
there is a gap which has to be taken care of.
If a digit exceeds the maximum allowed for the current radix,
then an error has to be signalled.
ch is a lowercase letter,
the symbol being read is an identifier.
The letter and any further lowercase letters, digits and underscores
are collected into a short variable of type ident
which are strings whose length is determined by a
defined in the program which includes the utilities.
No lookup is performed.
ch is a
%, then procedure
directive is called,
ch is anything else,
a catch-all case is entered.
ch is an uppercase letter, then it and any further letters, digits
and underscores are collected into a string.
ch is any other special character,
then it together with any further special characters that are likely
to occur as further characters in combinations such as
.. are collected into a string.
The string starting with an uppercase letter or a special character
is then used to search a table of reserved words.
If it is found there,
then the variable
sym is set to what is recorded in the table,
otherwise it is treated just like an identifier starting with
a lowercase letter.
The various directives have been implemented as a cascade
IF-THEN-ELSE-IF.. statements in Pascal.
It would have been possible and more efficient to use a binary
search through a table of directives.
However, directives do not occur often,
hence their efficiency is a minor issue.
The method used here is just a little simpler to implement.
This completes the cases of the body of the scanner. There are three routines local to the scanner: a procedure for obtaining the next character, a function for computing values of scan-time expressions, and a procedure for executing directives.
getsym does not read the input file directly
but obtains the next character from a procedure which
deposits a new character in a global variable
Because error messages to the output file should
point to the position in the current input line,
it is necessary to retain all of the current input line
in a global string buffer.
Two global integer variables are made to point into this buffer,
one records the total length, the other the current position.
getch starts by examining whether the buffer is empty.
This will be true if the current position is equal to the total length.
If it is, a new line will have to be read into the buffer.
But first an adjustment might have to be made to the current line number
in case the immediately preceding directive was an
or in case the preceding line was the last line in an included file.
In the first case the line number is set to zero.
In the second case it has to be set back to what it was
in the file containing the
which originally caused the inclusion.
Because inclusions can be nested,
a small global stack is used for names of included files,
the last line number in the file, and,
most importantly, the file variable itself.
Only after this adjustment can the line number be incremented,
and buffer length and position be set to zero.
The line buffer is then filled from the current input file,
but the details vary depending on whether this is the initial
input file or an included file.
If the stack of included files is empty,
the reading occurs from the initial input file.
If the end of file has been reached,
an error is signalled.
Otherwise characters are read into the line buffer
until the end of line is encountered.
On the other hand, if the stack of included files is not empty,
then reading occurs from the current included file.
No error is reported if the end of that file is reached.
But if after the line has been read the end of file is reached,
then the included file is closed and the adjustment flag is
set to pop the stack of included files when the next line is read.
This concludes the steps that have to be taken
getch finds the line buffer empty.
At any rate, the line buffer will not be empty now.
The current position is incremented,
ch is assigned the character value from
the buffer at the current position.
This function computes integer values of scan-time expressions written
in prefix notation.
It is called by the scanner when handling the backslash
in character constants and string constants,
and by the procedure which handles directives.
The function calls
ch is a printing character.
If it is a character that can start a number or a single character,
then it calls
getsym recursively and returns whatever
deposits in the global variable num.
If the character is an uppercase letter,
then it returns the value retrieved from an array of integer values
that would have been assigned there by a previous
Parentheses can be used for readability,
but are not necessary for prefix notation.
The special characters
/ have their
obvious meaning as binary arithmetic operators,
they result in two recursive calls to the function
and then a computation of the value.
< are interpreted as
relational operators returning 1 or 0 for true and false.
? causes a number to be read from the input file.
Any other character produces an error.
This procedure is called from one case in the scanner
when it sees the character
since the procedure is only called there
its body might have been incorporated there.
The procedure reads letters into a string and then examines the string.
For the IF directive the function
value is called,
if that returns a value less than 1,
the remainder of the current input line is to be skipped.
This is done by setting the position variable
of the line buffer to the length of the line buffer ---
this will cause procedure
getch to read from a new line.
SET directive the expected syntax is
%SET V = value,
V is an uppercase letter.
Errors occur when
V is not an uppercase letter
= is missing.
value is called to assign an integer
to the array of scan-time variables at position
For the PUT directive the remainder of the input line
is copied from the line buffer to the output file ---
this is useful only if the input comes from a disk file.
The other three directives,
simply set global variables to an integer returned by the function
This completes the scanning procedure
The procedure for reporting errors takes two value parameters: a single character and a string. The single character serves to indicate the kind of error, the string is the actual error message. The procedure has to write the position in the input line where the error occurred, and then, on a separate line, an error mark that is defined as a (short) string constant in the program which uses the utilities, then the character indicating the kind of error, and then the error message. All this has to be written to the output file, and, if a listing is being written, to the listing file.
Several complications arise.
Firstly, the line which contains the error can be in an included
In that case it is not informative if the line number given
is simply the ordinal number of the line that has been processed.
Instead it is better if the error report states the name of the file
and the line number in terms of that file.
Secondly, in the listing file it is not necessary to repeat
an input line if the current error is the first error in that line.
So the scanning procedure
and the error reporting procedure have to cooperate
in a number of ways.
It is necessary
to keep track of line numbers and names of included files.
This is done by the small explicit global stack
already mentioned in the description of procedure
The body of the error reporting procedure increments
the global error count in all cases except when
the character parameter is an
a merely informational message.
Programs using the utilities can access the error count
for reporting, and they can set it back to zero.
Then the procedure handles any writing to the output file,
and any writing to the listing file.
For the output file,
if the error occurs in an include file,
this is indicated by a remark and the name of the include file.
Then the procedure calls a local procedure
to do the actual error reporting to the output file.
If a listing is being produced,
that same local procedure is called again,
this time to report to the listing file.
The local procedure takes a file variable as formal parameter.
It does all its writing to that file.
A second Boolean value parameter indicates
whether the current input line is to be repeated.
If it is, then the current line number,
the separator between line numbers and text,
and then the line buffer are written as one line to the
In any case,
for the next line it writes firstly a mark under
the previous line number, secondly sufficient spaces
to reach the error position in the input line,
and thirdly the pointer
On the following line it writes
the error mark,
the character indicating the kind of error,
and the actual error message.
All the other utilities are relatively minor.
There is one procedure to initialise
all the global variables used by the utilities;
most of them are integers that have to be set to zero.
A program which uses the utilities would
normally begin by calling this initialisation procedure.
Such a program might then re-assign some of these variables,
for example the default alternative radix or the set
of special characters that can occur in second or later
positions in reserved words such as
There are two procedures for entering reserved words
and standard identifiers into tables.
The sizes of these tables are given by constants
in the program that uses the utilities,
and the two procedures will check for overflow
and produce appropriate error messages.
A program which uses the utilities
would call the entering procedures
once for each reserved word or identifier
that is to be entered.
The scanner will know about the last reserved word
that has been entered,
and it will perform the lookup correctly
even if the table is not filled completely.
No lookup is performed for standard identifiers
and no such procedure is provided;
this is because the method that is required
is likely to vary widely between applications.
Parsing: To facilitate parsing there are two procedures. One checks whether the last symbol returned by the scanner is identical with one of its parameters, otherwise an error is reported using a string parameter as the message. The other tests whether the last symbol is a member of one of its parameters, a set of symbols. If it is not, then an error is reported and symbols are skipped until the symbol is in the first set or in another set that is a further parameter.
There are three principal output procedures.
One is for writing signed integers,
internally it uses a recursive procedure to convert
an integer into a sequence of digit characters.
The other two write strings of the size
of reserved words and standard identifiers;
since these may have different lengths,
separate procedures are used.
All three first check whether the item to be written
really fits onto the line,
and if it does not fit, then a subsidiary procedure
for ending a line is called first.
This procedure will do a writeln to the output file,
and, if a listing is being written, to the listing file.
It also sets a variable to keep track of the length of the output line.
The three principal procedures output characters by calling
another subsidiary procedure for writing a single character.
This writes the character to the output file.
If a listing is being produced, it also writes the character
to the listing file;
but if a new line had just been started it first writes
to the listing file as many spaces as there are in a
line number and the separator between the line number
and the following text.
It also increments the counter which keeps track
of the length of the output line ---
except that if the character to be written
is the newline character,
it sets the counter to zero.
One last output utility, used for finalising a run,
writes to the output file and potentially to the listing file
the CPU time in milliseconds taken by this run.
The following is the standard Pascal source program for the utilities.
The utilities implemented here are being used
by three quite disparate programs
in the next three chapters.
None of the programs use all the facilities,
but that is unavoidable for something as general as this.
One exercise that might be worthwhile is to make
IF directive more powerful by letting it operate
not just on the remainder of the line
but on all following lines up to an
ELSE directive would be useful, too.
The language C has a preprocessor called cpp
which does this sort of thing.
Again, being able to define constants
would be useful; the preprocessor cpp for the language C has this.
Some of the reading given for Chapter 4
on macro expansion is relevant.
A Major Project: This project is not immediately tied to the utilities in this chapter, but it can use them profitably, and it has much less to do with the programs in the later chapters.
Design a small procedural language.
It should include declarations of global variables,
of procedures and functions which can have value parameters
and local variables.
Unlike Pascal, procedure and functions cannot be nested.
Variables and parameters are to be typed,
and so are the values returned by functions.
But there will be only three inbuilt types:
boolean, char and integer,
and no type constructors.
The statements should include assignment, call,
The expressions should include a reasonable collection
of operations on the inbuilt types.
You should begin by writing a context free grammar first. Then write a context free parser which reports errors. Add error recovery in accordance with Wirth (1976 pp 320 - 322) or Terry (1986 pp 208 - 211). You should now be able to have a test program that has many context free errors.
Next, add a symbol table for entering global variables,
procedures and functions and their parameters and local variables.
Note that at the end of a procedure or function declaration
the names of the parameters and locals have to become invisible,
but the types of the parameters still have to accessible.
Take some care to avoid spurious error messages.
It helps to have an additional type
notype in expressions.
Write a test program that has many context sensitive errors.
You can now add code generation to this compiler. There are many possibilities: to translate into some existing language, to translate into some internal language that will be interpreted either recursively or non-recursively, or to translate into some intermediate language first which can then be translated independently into several other languages.