Compilers and How They Work: An Overview
Madison IBM-PC User's Group
What are compilers and how do they work? Many computer
users ask themselves this question after the programming
bug has bitten them. To most people, a compiler is a
"black box program" which takes source code written in some
high-level language, such as FORTRAN, BASIC, Pascal, or C,
and translates (compiles) it into a language the computer
can understand and execute. Compilers take source code and
transform it into virtual machine language. With the IBM
PC, this virtual language is 8088 machine code.
Compilers vs. Interpreters
Computers cannot understand English words and grammar.
Even the highly structured words and sentences of
programming languages must be translated before a computer
can understand them. The compiler or interpreter must look
up each "word" of your programming language in a kind of
dictionary (or lexicon) and, in a series of steps,
translate it into machine code. Each word initiates a
separate logical task.
An interpreter translates one line of source code at a time
into machine code, and then executes it. Debugging and
testing is relatively fast and easy in interpreted
languages, since the entire program doesn't have to be
reprocessed each time a change is made. The BASIC(A).COM
program is an interpreter. Interpreted programs run much
slower than compiled programs, because they must be
translated each time they are run. Programmers often test
and debug their programs using an interpreter and then
compile them for production use.
How Compilers Work
Most compilers convert programs in three steps. Each step
is called a pass. A particular compiler may have one
program per pass, or may combine two or three steps in a
single program. For a very complex language, a step may be
so difficult to perform that it is broken up into many
smaller steps. Regardless of how many passes or programs
are required, the compiler performs only three main
functions: first, lexical analysis; second, syntax
analysis; and third, code generation. During each pass of
the compiler, the source code moves closer to becoming
virtual machine language (or whatever language the compiler
is designed to generate).
In the first pass of the compiler, the source code is
passed through a lexical analyzer, which converts the
source code to a set of tokens. A token is generally a
number representing some keyword in the language. A
compiler has a unique number for each keyword (i.e. IF,
WHILE, END), and each arithmetic or logical operator (i.e.
+, -, *, AND, OR, etc.). Numbers are represented by a
token which indicates that what follows it should be
interpreted as a number. The tokens put the programming
language into a form that can be checked for proper
structure and order.
The other important task of the lexical analyzer is to
build a symbol table. This is a table of all the
identifiers (variable names, procedures, and constants)
used in the program. When an identifier is first
recognized by the analyzer, it is inserted into the symbol
table, along with information about its type, where it is
to be stored, and so forth. This information is used in
subsequent passes of the compiler.
After the lexical analyzer translates a program into tokens
of keywords, variables, constants, symbols and logical
operators, the compiler makes its next pass. To describe
what happens during this function, I will briefly explain
Grammars. Like any language, programming languages have a
set of rules governing the structure of the program. Each
different computer language has its own grammar which makes
it unique. Some grammars are complex (PL/I) and others are
relatively easy (Pascal). The programmer must observe all
the structural rules of a language to make logical sense to
the computer. The next step of the compiling process,
parsing, checks to be sure all the rules were followed.
Parsing. The parsing routines of a compiler check to see
that the program is written correctly (according to the
language rules). The parser reads in the tokens generated
by the lexical analyzer and compares them to the set
grammar of the programming language. If the program
follows the rules of the language, then it is syntactically
correct. When the parser encounters a mistake, it issues a
warning or error message and tries to continue. Some
parsers try to correct a faulty program, others do not.
When the parser reaches the end of the token stream, it
will tell the compiler that either the program is
grammatically correct and compiling can continue or the
program contains too many errors and compiling must be
aborted. If the program is grammatically correct, the
parser will call for semantic routines.
Semantic Routines. The semantic routines of a compiler
perform two tasks: checking to make sure that each series
of tokens will be understood by the computer when it is
fully translated to machine code, and converting the series
of tokens one step closer to machine code. The first task
takes a series of tokens, called a production, and checks
it to see if it makes sense. For example, a production may
be correct as far as the parser is concerned, but the
semantic routines check whether the variables have been
declared, and are of the right type, etc. If the
production makes sense, the semantic routine reduces the
production for the next phase of compilation, code
generation. Most of the code for the compiler lies here in
the semantic routines and thus takes up a majority of the
Summary. Two major routines comprise syntax analysis: the
parsing routine and the semantic routine. The parser
checks for the correct order of the tokens and then calls
the semantic routines to check whether the series of tokens
(a production) will make sense to the computer. The
semantic routine then reduces the production another step
toward complete translation to machine code.
The code generation process determines how fast the code
will run and how large it will be. The first part of code
generation involves optimization, and the second involves
actual machine code generation.
Optimization. In this step, the compiler tries to make the
intermediate code generated by the semantic routines more
efficient. This process can be very slow and may not be
able to improve the code much. Because of this, many
compilers don't include optimizers, and, if they do, they
look only for areas that are easy to optimize.
Code Generation. This process takes the intermediate code
produced by the optimizer (or semantic routines if the
compiler has no optimizer) and generates virtual machine
code, which in our case is 8088 machine code. It is this
part of the compilation phase that is machine dependent.
Each type of computer has an operating system that
processes virtual machine code differently; therefore, the
code generator must be different for each type of
computer. The choice of instructions for the fastest
execution and smallest code size are made at this point,
according to the machine's operating system. Each code
generator is designed specifically for the machine and
operating system the final code will run on.
If the program is free from syntactical errors, code
generation should take place without any problem. When the
code generator is finished, the code produced will be in
8088 machine code, but the format of the code is not yet
executable. It is in a format (an .OBJ file in our case)
that is ready to go to a linker, which will create an
executable *.EXE or *.COM file from the machine code the
compiler has generated.