
A compiler system for windows

Technical Reference

Jacob Navia

based upon

the lcc compiler written by

C.W. Fraser


Dave Hanson



[0.3] Writing software

[0.4] Compiling software.

[0.5] The compilation process

[0.6] A closer look at the compilation process.

[0.6.1] Overview

[0.6.2] The preprocessor.

[0.6.3] The compiler

[0.6.4] Lcc’s intermediate language

[0.6.5] Following an expression:

[0.6.6] A bug

[0.6.7] An overview of the abstract instructions of the virtual machine

[0.6.8] Allocating register variables

[0.6.9] Allocating registers for expressions

[0.6.10] The code generator

[0.7] The assembler.

Modifications done to lcc

[0.8.1] Porting the code

[0.8.2] Integrating the preprocessor into lcc

[0.8.3] Supporting Microsoft’s extensions

[] Calling conventions

[0.8.4] The _stdcall calling convention

[0.8.5] Problems encountered when implementing _stdcall

[0.8.6] The “declspec” construct.

[0.8.7] The different #pragmas

[0.8.8] Structured exception handling

[0.8.9] Writing the startup code

[0.8.10] Writing the stackprobe code

[0.8.11] The windows.h and the other header files

[] The import directive

[0.8.12] The #ident discussion

[0.8.13] Other fine points in the preprocessor

[0.8.14] Generating the debugging information

[0.9] Generating the browsing information

[0.9.1] Design considerations

[0.9.2] Browse file format

[0.9.3] Linking the browsing information.

[0.9.4] Browsegen and the preprocessor

[0.9.5] The long long extension

[0.9.6] Implementing run time debugging support: the shadow stack.

[0.9.7] Error handling within lcc-win32

[0.9.8] The information flow in lcc-win32

[0.9.9] The source files of the compiler proper

[0.10] Building the compiler

[0.11] Lcc’s assembler

[0.11.1] Overview

[0.11.2] The format of the x86 instructions

[0.11.3] Fragments

[0.11.4] The format of object files

[0.11.5] Relocations

[0.11.6] The symbol table

[0.11.7] Pseudo instructions

[0.11.8] Further work

[0.11.9] The debug section

[0.11.10] The format of the debug information

[] Symbols

[] Simple Types

[] Types referenced from the ‘types’ record

[] 0x201 Argument list

[] 0x204 Field list

[0.11.11] Examples

[0.12] Implementing the “inline” operator (C99)

The peephole optimizer

[0.13.1] 1. Motivation

[0.13.2] 2. Implementation

[0.13.3] 3. Patterns

[] 3.1 Cross-loads

[] 3.2 Eliminate unnecessary moves before a push

[] 3.3 Avoiding unnecessary intermediate registers

[0.13.4] 3.4 Eliminate constants if possible.

[0.13.5] 3.5. Use the ‘new’ 486 instructions.

[0.13.6] 3.6. Optimize sequences.

[] 3.6.1 Sequence of calls with the corresponding stack adjustments.

[] 3.6.2 Initialization sequences

[0.13.7] 3.7. Improving the block move instructions.

[0.13.8] 3.8. Inlining library functions.

[0.13.9] 3.9 Avoiding clobbering when possible.

[0.13.10] 3.10 Avoid unnecessary moves

[0.13.11] 4. Construction of the optimizer.

[0.13.12] 5 Speed considerations

[0.13.13] 6. Results:

[0.13.14] 7. Description of the code

[0.13.15] 8 Further work

[0.14] The high level optimizer

[0.14.1] Improving the machine description

[] Rules to directly operate with an address using a constant.

[] Rules concerning the usage of the constant one.

[] Rules for using the special machine modes.

[] Rules for fully using the floating-point unit.

[] Rules for fully using LEA.

[] Problems caused by the larger machine description.

[0.14.2] Improving register usage

[0.14.3] Improving floating point performance

[0.14.4] Front end modifications

[0.14.5] Finding variables with disjoint domains

[0.14.6] Modifications to lburg

[0.14.7] Constant folding

[0.14.8] Dividing by a constant.

[0.14.9] Optimizing ‘const’ variables

[0.14.10] Optimizing structure arguments

[0.14.11] Other changes to the front end.

[0.14.12] Quality control

[0.14.13] Further work

[0.15] Compiler optimizations. A case study with gcc-2.95.2

[0.16] The intrinsics interface

[0.17] Implementing the operators redefinition in lcc

C and C++

[0.18.1] Function overloading.

[0.18.2] Default function arguments

[0.18.3] Namespaces

[0.18.4] Avoiding complexity

The Pentium SSE/SSE2 back-end

[0.20] The ‘pedump’ utility

The linker

[0.21.1] Motivation

[0.21.2] The format of the PE Executable.

[] The ‘Dos stub’

[] The File header

[] The ‘Optional’ Header

[] Standard sections

[0.21.3] The linking process (overview)

[0.21.4] The linking process: detail

[] Parsing the command line

[] Convert the resources to object files

[0.21.5] Parsing and processing the .def files

[0.21.6] Open all files given in the command line

[0.21.7] Symbols

[0.21.8] The discussion about the common storage model

[0.21.9] Relocate all symbols

[0.21.10] Build the symbol table

[] Relocating/building the line numbers.

[0.21.11] Performing the relocations

[] Counting a symbol’s usage

[0.21.12] Linking the debug information

[0.21.13] Mapping source files to addresses

[] The file header

[] The information per source file

[0.21.14] Building the executable headers

[0.21.15] The source distribution of lcclnk.

Dynamic linking

[0.22.1] Motivation

[0.22.2] Implementation

[0.22.3] Binding the object file

The import librarian

[0.23.1] Import libraries

[0.23.2] Summary: Building an import library from a dll.

The resource compiler

[0.24.1] The .rc language specifications.

[0.24.2] Parsing

[0.24.3] The parser

[] Dialog statement

[] MENU Statement

[] The different image resources (icons, bitmaps, cursors)

[] The string table resource

[] The RCDATA resource

[] The Message Table resource

[] The ACCELERATORS resource

[0.24.4] Limitations of the resource compiler

[0.24.5] The source distribution of the resource compiler

The librarian

[0.25.1] The signature

[0.25.2] The ‘first linker member’

[0.25.3] The second linker member

[0.25.4] Usage of lcclib

[0.25.5] Lcclib source files

The resource editor

[0.26.1] The format of the .RES files

[0.26.2] Building a dialog box dynamically

[0.26.3] Testing a dialog box interactively

[0.26.4] Writing the .dlg file

[0.26.5] Writing the .res file

[0.26.6] Writing the .rc file.

Wedit: the Integrated development environment

[0.27.1] History

[0.27.2] How to start an editor.

[0.27.3] Coping with legacy code

[0.27.4] Why global variables?

[0.28] How does Wedit work

[0.28.1] Editor

[0.28.2] The ‘Autosave’ feature

[0.28.3] Real time coloring of text

[0.28.4] Representing text

[] Software dust

[] The virtual screen

[0.28.5] Handling the keyboard

[0.28.6] Redrawing the screen.

[0.28.7] Handling the selection and the clipboard

[0.28.8] Maintaining the status display

[0.28.9] Handling dialogs

[0.28.10] Handling the bookmarks: what is important in a user interface?

[0.28.11] The output window

[0.28.12] Handling the menu

[0.28.13] The right button menu

[0.28.14] Special purpose parsers

[0.28.15] The software metrics module.

[] How Wedit collects the data

[0.28.16] The 'TREE' utility

[0.28.17] Identify an executable or an object file

[0.28.18] Building the history list

[0.28.19] The object file cross-referencing utility

[0.28.20] User interface considerations: an example.

[0.28.21] The built-in utilities

[] Grep (find in files)

[] Diff

[] Displaying function slices

[0.28.22] Showing the #ifdefs

[0.28.23] Showing executable statistics

[0.28.24] Extracting the character strings from a source file.

[0.28.25] Searching for a function

[] Formatting C programs

[0.29] The project maintenance

[0.29.1] The “workspace” display

[0.29.2] The configuration wizard of Wedit.

[0.29.3] Adding files to the project

[0.30] Building the executable

[0.30.1] Generating a Makefile

[0.30.2] Speeding up makefile generation

[0.30.3] Starting a build and displaying the results

[0.30.4] Finding out the missing libraries

[0.31] The application wizard

[0.31.1] Wizard mechanics:

[0.31.2] The source distribution for the wizard

The debugger

[0.32.1] Starting a program under debugger control.

[0.32.2] Preparing for debugging

[] Reading the debug information

[] Reading the imports table

[] Finding out where the real entry point is.

[0.32.3] Debugger mechanics

[0.32.4] Finding out where a call is going to go.

[0.32.5] To trace or not to trace

[0.32.6] The different debugger displays

[] The ‘assembler view’ (code) display

[] The 'events' display

[] The ‘Memory watch’ display

[] The stack display

[] The Locals display

[] Displaying structures

[0.32.7] The different kinds of breakpoints

[0.32.8] The operation in ‘Trace’ or ‘Step’ mode

[] Interactions

[0.32.9] Implementing the ‘step out’ feature

[0.32.10] Implementing a fast ‘breakpoint’ in all lines

[0.32.11] Implementing the error window

[0.32.12] Implementing the data breakpoints.

[0.32.13] What to do when a data breakpoint fires.

[0.32.14] Implementing the ‘ToolTips’ window

[0.32.15] Recovering after a program crash

[0.32.16] Debugging a program that just crashed.

[0.32.17] Evaluating expressions under the debugger

[0.32.18] An example

[0.32.19] Implementing conditional breakpoints

[0.32.20] Getting to the Thread Information Block

[0.32.21] Stopping a program and cleaning up

[0.32.22] Starting a debugging session with a given executable

[0.33] Other debug information formats: the ‘stabs’ format (gcc)

[0.33.1] Organization of the stabs information.

[0.33.2] Accessing the data base

[0.33.3] Bibliography about debugging

[0.34] Customizing the editor

[0.34.1] Adding a new menu item

[0.34.2] Adding a new key handler

[0.35] The source distribution of Wedit.


[0.36.1] Analyzing bugs

[0.36.2] Correcting bugs

[0.36.3] How I introduced the bug 

[0.36.4] Correcting the bug

[0.36.5] Classifying bugs

[0.37] Reporting bugs in lcc-win32

[0.38] Memory management

Practical applications of lcc-win32: C as a universal assembler

Appendix 1: Summary of the assembler syntax.

[0.40.1] Pseudo-instructions

[0.40.2] Syntax

[0.40.3] Opcode naming

[0.40.4] Memory references

Appendix 2: The mmx intrinsics

[0.41.1] Introduction

[0.41.2] Instruction Syntax

[0.41.3] Description of the interface

[] Pack with signed saturation

[] Pack with unsigned saturation

[] Packed Add

[] Packed Add with saturation

[] Packed And

[] Packed And Not

[] Packed compare for equal

[] Packed compare for greater than

[] Packed multiply and add

[] Packed multiply high

[] Packed multiply low

[] Packed or

[] Packed shift left logical

[] Packed shift right arithmetic

[] Packed Subtract

[] Unpack high packed data

[] Unpack low packed data

[] Pxor

[0.42] The problem with the mmx instructions

Appendix 3: Overview of the machine description and the rules

[0.44] Epilogue




This adventure started in an anodyne way: I went to ‘Le monde en tique’, the computer bookshop of Paris. Wandering around, a book caught my attention: ‘A retargetable C compiler: design and implementation’, by Christopher Fraser and David Hanson. Mmm, interesting.

I took the book and started reading. A whole C compiler, neatly explained. There was even a DOS implementation. This is interesting, I told myself. I bought the book, went home, and read some chapters.

I had been working for several years on a compiler environment project. I had written an editor, ‘Wedit’, and sold some copies of it to fellow programmers in Paris. But my editor project was stuck: it lacked a compiler. Once people bought a compiler with its integrated editor, they wouldn’t think about using any other one. The editor was considered an afterthought; the effort of learning a new one was considered too big. Besides, the new programming environments were more and more closed. They were more tightly integrated, so Wedit couldn’t easily start a compilation or call the debugger.

Well, it would be nice to build a whole environment...

But it was a monstrous amount of work! Lcc was just a compiler generating assembler mnemonics: there was no assembler, no linker, no library, no resource compiler, no resource editor, no debugger, no header files.

But then... this would be an opportunity to write an assembler, a linker and a debugger!

I talked to some friends. Everyone told me I was completely crazy. But this, I knew already. So I started doing it.

This was around the beginning of 1995 or so. It is 2001 now. The assembler is written, together with the linker, the import librarian, the resource compiler, and the resource editor, and a first version of the debugger is out.

This document, as you can see from the ‘stubs’ (sections containing just a title), is not finished, just as this project is not finished. I thought it could be useful for understanding what is happening, so I include it in the project documentation.

Books are dangerous. They can change people’s lives. The book of Mr. Fraser and Mr. Hanson changed mine, and I am grateful to them for having published their work. In the process of developing the optimizer, or adjusting it to the windows environment, I have several times destroyed essential data structures of the compiler. But it has never crashed.


It is a solidly built piece of software, full of useful assertions that will pinpoint exactly where you went wrong. It has a clean design, with a good separation of the front end/back end, and several functions that are really beautiful to read: take a look at ‘listnodes’ in dag.c for instance.



Software is the art of illusion. You ‘push’ a button, and you see the button change its form, as if it really existed. What really happens is only that a click of the mouse button (THAT button exists, of course) selects a different bitmap for the screen area where the button is drawn. Nothing has moved, there is no button anywhere on the screen, but it looks as if there were.

The whole windows system is based upon illusions about the existence of ‘windows’, and of a ‘desktop’. Graphics are used extensively to reinforce this paradigm, and to present the user with a number of objects (called ‘controls’ in techspeak) like buttons, trees, list boxes, etc., that exhibit a standard behavior.

A compiler gives you the impression that the machine speaks a high-level language. Those languages allow you to give the machine commands of a more abstract kind than the commands the machine really understands. This process of translating a standardized language into machine instructions is called compiling.

We have to be aware that all ‘high-level’ languages are still just machine ‘languages’. They look somewhat similar to the written languages we use, especially English, but they are in fact completely different.

The input language of lcc-win32 is the C language, as defined by the ANSI committee for standards. This language is fairly close to the machine, and doesn’t have the expressive power of more evolved languages like lisp or smalltalk, for instance. This can be seen as a weakness, or as a strength, depending on your point of view and the application you are writing.

If you are interested in ignoring most software abstractions and writing programs that are close to the native windows API, C will do this very well. On the other hand, if you do not want to be bothered with memory management and obscure machine details, and you would like to use the ‘object oriented programming’ paradigm, C will fall short of your expectations and become an obstacle to expressing your ideas as programs. Java or Eiffel would be better choices.

C is a fairly old programming language, and very stable. Wedit, the IDE of lcc-win32, contains code that was taken from the microEmacs editor, a software package written around the beginning of the eighties. That code still works without any problems, in 2001, under Windows 2000. If you write your code in C, it can last for years and years, and it will still work in the unknown environments of the future.

The illusions created by software break down sometimes. These events are called ‘bugs’, in techspeak, and they show us suddenly what is really behind all software. We have all seen them, but we tend to forget them as soon as possible, becoming again absorbed by the program’s display. For instance, bugs in the Windows operating system destroy the desktop and show us a blue screen full of awe-inspiring numbers, giving the user a glimpse into the innards of the machine we are working with.

Yes, numbers. The only thing the circuit “understands”.

Writing software


Humans can design software; you can too. There is nothing special about it: just putting one instruction after the other, to make the computer do what you want it to do. Suppose you want to tell your user some information about the data he/she is viewing. Like walking, nothing special, just one step after the other. Yes, I know, learning to walk wasn’t easy, but it was doable. So you calculate the information the user needs, or that you yourself want to see. You put it in a file, i.e. a certain amount of space on the magnetic surface of a fast-spinning hard disk. Then you display it.

Here is one piece of software, to give you a feeling for it. It opens a data set, does some string processing on it, and puts the result of another part of the system on the screen. The input data is assumed to have a special format. This routine is not especially important, nor is it very well written. It is not an example of ‘the way’ you have to program; it just shows how I did it. It is here to give you a feeling for what is behind each symbol in a program, and how it happens in detail. This is a very small routine, but it does represent the bulk of software: reading, processing and displaying data. You know all this if you program in C already, but even then, it is instructive to see how much effort and how many words are needed to explain a small routine to a certain level of detail. The purpose here is to show this detail, and some of the things you should have in mind when programming.

static void HandleShowIncludesCommand(void)

This means that here starts a subroutine that will be called ‘HandleShowIncludesCommand’, that doesn’t return a result (void), and that is local to the current compilation module (static), i.e. the significance of this new name is limited in scope.


The brace marks the start of a new scope block. Here I always declare most of the data the subroutine will use. These variables are called ‘local’ variables; they form the immediate context of this routine’s execution. Normally this data resides in the main processor’s primary cache, running at full speed.

FILE *f;

One of them will be the symbol that holds the connection to the data spinning on the disk. I do not like typing much, so I called it ‘f’, just so.

HWND hwnd;

This is the symbol that holds the id of the screen window context that will display the data. Those rectangles are the display area of the application.

char *p,*tmpfile = tmpnam(NULL),*prjOptions = NULL,*q;

And it goes on; everything must be taken care of. The tmpfile symbol holds the name of a temporary file that will receive the output of the calculation. The filename symbol holds the name of the data the user is editing, the main application being a program editor that you may use. Another symbol is prjOptions, which will hold the state of the current project the user is working on, if any; maybe there is no project defined, and the editor is just displaying some random file.

char *filename = CurTxtBuf->BufferName,




The includePath symbol is a data list that contains alternative places where the system looks for information, besides the current directory. Note the quantity of detail that you have to keep track of. This is what makes up the complexity of walking. One step after the other, yes, but… where are we going?

Ah, we do not know. This is no simple file-handling operation.

int tablevel = 0,i;

I decided to indent the information according to the tabulations the user defines, to make it more readable. The output has a tree-like structure, easily made evident by indentation alone. The ‘i’ variable is a counter that I will use later on. I always name counters i (or j, less often), maybe because of the Fortran tradition.

PROJECTINFO *prj = GetCurrentProject();

This is a user-defined type (as were actually HWND and FILE, but those are system-wide). This one is defined in the include header of the application, and contains a project description. Note that I use a pointer to it, i.e. the number of the RAM location where the project description block starts. RAM locations are numbered consecutively, and those sequential numbers are pointers, i.e. they tell us where in RAM the information they represent is located.

Then I always leave a blank line. These declarations prepare the context for the body of this subroutine: the actual instructions that process the data.
The total size of the above context is small, 13 words of 32 bits each, which fits perfectly in today’s caches.

if ((CurTxtBuf->BufferFlags & BFIS_C_FILE)==0) {

 sprintf(buf,"%s: not a C file",filename);





Is the input data correct? Any subroutine does some error checking, to see that there isn’t some incoherence. For instance, if this is not data of the kind this subroutine expects, an error message is shown on the screen, and the routine exits. This is clearer to the user (input data incorrect) than silently trying to go on, with at best absurd consequences or nonsense results. I test for this condition using the C technique of flags, i.e. bits that are tested with an AND operation to see whether they are on or off. One of these flags tells whether the current active data set is of the kind this routine expects, and if this is not the case, the predicate triggers the execution of the sequence.

First, this sequence builds the error message in the buf variable, using a special notation. We call the formatting routine with a format spec and the name of the wrong data set. Then we put everything in lower case (the strlwr call), because I think it looks better. I do not support error messages in the YOU DID SOMETHING WRONG style…

Then I show it on the screen, and return. Not very complicated, is it? Just common sense.

if (prj) {

 prjOptions = IncludesAndDefines(prj);


We test the global context this subroutine is working in. If there is one, we store it in prjOptions. From now on, we shouldn’t forget that prjOptions can hold two kinds of value: either NULL, or a pointer to the current global context options. Both values must produce correct results.

wsprintf(buf,"%s\\bin\\browsegen.exe %s -M1 -Fo%s %s",


prjOptions ? prjOptions : "" ,


To process the data, we have to call the calculator part, which will figure out a whole set of relationships from the current data set. We prepare a string with the command that we are going to pass to the OS to start the calculator. I specified a path, and then an executable name, which is called in a certain configuration (-M1 -Fo) to perform a certain task. There could be user-configured options to pass to the calculator, so we use them, if present. Note that we test for this condition with the C idiom:

condition  ?  TRUE part : FALSE part

This returns a result that is passed to the formatting routine. Note that we keep both paths open: NULL and something.

The name of the output file (the temporary file name stored in tmpfile) is passed to the calculator so that it writes its results there.


OK, we pass the command to the OS. We tell the system not to show any window for the calculator, since we are going to display the data later anyway. The ExecuteCommand routine makes a small set of system calls to start a new process and wait until it finishes.


The program has a rectangle for displaying messages at the bottom of its display area. If it was closed, we open it, since we are going to display the data. This could actually be done later; it is here because this part of the subroutine is older, and I haven’t inserted the statements in the most perfect logical order. But this is going to happen so quickly (the calculations are done in less than a second now) that the instruction can be left here without any big problems.

hwnd = (HWND)SendMessage( ProgramParams->hwndOutput,


0, 0);

This is the ID of the display manager of the display rectangle, hwnd. It will be used to send the data there. We ask the window procedure of that area for this ID, with the usual object-oriented idiom of sending a message to the procedure, invoking a method.

hwnd = GetDlgItem(hwnd,IDLIST);

We reuse a variable, giving it a new meaning. This is not always bad: too many identifiers confuse the reader, and this actually is an hwnd, whatever the system understands by that. The symbol now contains the list box window’s id. We need it to clean the list box. We get it by sending the system a message, eventually in the form of a system call, to obtain the ID of the list box window in the display area’s window list. I assign the result to the same identifier that held the window manager’s id, since that is no longer needed anyway.


We tell that list box to empty itself, because we are going to write a new sheet into it.

f = fopen(tmpfile,"r");

We establish a connection between the fast-spinning hard disk surface and the subroutine in RAM. How to do that is something the OS knows about; if it didn’t, we wouldn’t be running here!

includePath = allocatePath();

The alternative searching path is allocated and initialized to zero.

if (prj)

 sourcesPath = DuplicateString(prj->SourcesDir);


 sourcesPath = DuplicateString(CurrentWorkingDir);

The data location path is either the one in the global context, or just the current directory. We save this information to avoid unnecessary clutter, where the computer displays a lot of redundant information that doesn’t interest anybody. This path information is redundant, and can be eliminated.


Keep lowercase.



The application runs in an established source directory structure. The alternative data paths all come after the default one.

while (fgets(buf+tablevel*tabsize,256,f)) {

We read a line of data from disk into RAM. The system routine fgets will fill the RAM space called buf, starting at location (tablevel*tabsize). Normally tabsize is four, so each tree level will be indented by 4 spaces. Note that the characters 0 to tabsize*tablevel do not contain initialized data.


Keep lowercase

 p = strchr(buf,'\n');

Search the new line character

 if (p)

If there was a new line character

  *p = 0;

Put a zero in it, terminating the chain. We do not need that character.

 for (i=0; i<tablevel*tabsize;i++)

  buf[i] = ' ';

As I told you above, the first positions of the buffer aren’t initialized; I put spaces in them. I tell the circuit to count the characters in the RAM location called ‘i’, testing at each character whether it is below the integer given by multiplying tablevel by tabsize.

 if ((p=strstr(buf,"#include")) != NULL) {

The data set contains a structure delimited by special markers we find in the data. The calculator puts those synchronization signals in the data for us; how nice.

  pStart = buf+tabsize*tablevel;

The start of useful data is after the tabulations, of course.

 /* ignore directory information if the file is in the sources directory */

A comment. Rare, rare. Imagine how long it would take me to write software if I explained it to the reader as I am doing with you here. I would never have the time to explain everything.

  if (!strncmp(pStart,


strlen( sourcesPath )) ||

    !strncmp( pStart,


strlen(includePath))) {

We see whether the initial part of the data is redundant. We compare a certain number of characters with each other, and deduce from that, if they are equal, that they can be erased. If that predicate triggers, the following sequence starts, and will erase those characters from the output:

   q = strchr(pStart,' ');

I search for the first blank in the input data.

   if (q) {

if there was any

    *q = 0;

we break the character chain there.


Close scope. I tend to use scopes even if this is not always necessary.

   p = strrchr(buf,'\\');

The last backslash represents the last path component in the data.

   if (p && q) {                                         

If there was a backslash and a space in the data


skip it,

    *q = ' ';                            

we put a space in there


and we copy that part of the buffer to the start of it. To make this easier to follow, imagine you have the input data:

d:\lcc\wedit\edit5.c: line 14 --> d:\lcc\wedit\edit.h

We copy all characters from ‘edit5.c’ onwards to the start of the path information (the characters d:\… etc.), overwriting what was there.


   else if (q) *q = ' ';

If no backslash was present… we restore what was there. The calculator will not change its output format so easily, but better not to rely on that, and to cope with the situation where the data isn’t as we expected.


What would happen if the ‘else’ instruction wouldn’t be there?


In the case that the predicate of the if was true, the positions of the characters would have moved, and q would overwrite useful data.


How do I know?




What would happen if the if (q) were left out of the last line?


In the case that no space was present, the program would access a void location. This means that the system would put an unpleasant dialog box on the screen saying ‘This program has performed an illegal operation and will be terminated’. What a shame to finish like that. All the data the user had in RAM would be lost.

How did I find out? I corrected this possible bug when I corrected the problem with the space above, and started checking the logic a bit.


We close the scope of the if above, the one that established the predicate about the redundant info. Now we start processing the second part of the data.

  q = strchr(p,'<');

We search a marker, stopping at the first occurrence.

  if (q == NULL)

If it wasn’t found

   q = strchr(p,'"');

search for an alternative marker.

  if (q) {

If any marker at all was found


Skip it

   /* ignore the include path info if it is a global include */

We do the same thing as above

   if (!strncmp(


includePath, strlen(includePath)) ||

This predicate uses two conditions, joined by an OR symbol. Either the first condition OR the second will trigger the predicate. We compare first with includePath, and then with sourcesPath.





strlen(sourcesPath))) {

This sequence will be triggered by that predicate.

    *q++ = 0;

We break the character chain there.

    p = strrchr(q,'\\');

Search for the backslash… I confess I just copied this from the first condition with the text editor. I should have written a routine that encapsulates this and takes just two arguments, doing whatever it takes. Cut and paste, however, was much more convenient, and I am sure you will understand this later on. Programming means working within the current time budget, managing your resources, knowing what you can do and what you can’t, as with walking. Walk to the top of Mount Everest? Possible, but is it really necessary?


This was pointing to the marker, whatever it was (either ‘<’ or ‘"’). We want to output the marker to the user, so we back up one position.

    if (p) {

If there was a backslash


we skip it


add the data set name info


If there wasn’t a backslash we do nothing.


If there wasn’t any redundant info, we do nothing.


If there wasn’t a marker, we do nothing.


We send the output to the list box window, which will take care of displaying it on the screen with the current font, drawing each character line, etc. We just output it. The routine that we call does just a small set of system calls to put the data on the screen.


Increase the level variable using the post-increment construct. This adds one to the variable in question.


If this input line didn’t contain the marker, we check for any command from the calculator to decrease the level

else if (!strncmp(buf+tablevel*tabsize,"closing",7))


If there is one of these commands, we decrease that variable by one.



Finished processing the input line. If it was a marker we did something; if it was a decrease-level command we did something else; otherwise we did nothing with the input.


We finish the processing of each line. At this point, execution returns to the beginning of the while loop above, to read more input.


We break the binding to the disk. This routine has finished its work, and now it cleans up before returning.


The temporary data file is no longer needed. Erase it.


Release the temporary context we allocated.




Finished. No result is necessary, since menu commands are just actions the user tells us to perform. It isn’t necessary to tell the user: COMMAND DID NOT WORK. If there was an error, he will see that it didn’t work anyway.

Compiling software.

The instructions above are all high-level instructions. The integrated circuit that executes them knows nothing about such high-level concepts as FILE or HWND. It knows only a much smaller, lower-level set of instructions called its ‘instruction set’. These are instructions like add, subtract, compare, branch, etc.

To program at that lower level is possible of course, but very tedious. The higher-level instructions allow us to write a shorthand for those instructions, which is translated into real machine instructions by a piece of software called a compiler, which, hence its name, compiles a set of machine instructions for each high-level one. For instance, when we call a procedure, we write in C:

fclose(f);
it will emit:

push f

call _fclose

add $4, esp

These lower-level instructions mean something to the circuit. They tell it to write the value of that pointer to the top of the stack, after decreasing the stack pointer. Then, they tell it to set the program counter to the location of the fclose routine, and to start executing instructions there until it finds a return instruction, which sends it back to where it was. The compiler then readjusts the stack to compensate for the push, and the call is done.

The high-level instructions are part of a standardized input format for a specific computer language. Other languages differ in many ways: in concepts, in idioms, etc. But all of them, sooner or later, must be translated into a set of those basic instructions anyway. The machine doesn’t understand anything else.

Software development in a compiled environment means then, writing code, compiling it, and then executing the instructions compiled, maybe with the aid of a debugger.

Now, let’s see what the compiler will generate for some of the statements in the function above. The assembler representation used is the one used by lcc-win32, i.e. the source of the operation is written first, the destination second. This is the opposite of the Intel convention.

static void HandleShowIncludesCommand(void)


Here the compiler emits the function prologue, i.e. a series of instructions that prepare the execution of the code within the function. This is a typical prologue sequence: establishing a new frame (frames are kept in a linked list), making space for local variables, and saving the registers used.

pushl   %ebp  ; Save frame pointer in the stack

movl    %esp,%ebp ; Establish new frame

subl    $52,%esp ; Make space for locals

pushl   %ebx  ; Save registers. Save EBX

pushl   %esi  ; Save ESI

pushl   %edi  ; Save EDI

FILE *f;   ; Note that these declarations generate no code

HWND hwnd;

char *p,*tmpfile = tmpnam(NULL),

This is the typical case of a function call. The arguments are pushed onto the stack from right to left, then the function call is made, and the stack is adjusted. The result of a function call is left by convention in the register EAX.

 pushl   $0  ; Push first argument: NULL

call    _tmpnam ; call the tmpnam procedure

addl    $4,%esp ; adjust the stack

movl    %eax,-44(%ebp) ; Save the result in the stack frame of the procedure

*prjOptions = NULL,

 movl    $0,-40(%ebp)   ; Put zero in that variable


char *filename = CurTxtBuf->BufferName,

Here the compiler emits code to access a structure. First, the compiler emits code to get the start address of the structure, then adds the offset of the field we want to access.

movl    _CurTxtBuf,%edi ; Move start address of the global into the register edi

addl    $29,%edi  ; Add the offset of the BufferName field

movl    %edi,-52(%ebp) ; save result at the address of the local variable

int tablevel = 0,i;

movl    $0,-8(%ebp)  ; Set the tablevel to zero

PROJECTINFO *prj = GetCurrentProject();

if ((CurTxtBuf->BufferFlags & BFIS_C_FILE)==0) {

Again, we access a field of a structure, to test a certain bit. This is done by reading the value stored there and ANDing it with a power of two. This sets the machine flags accordingly, and we can use a jump instruction, which reads the flags implicitly.

 movl    _CurTxtBuf,%edi ;read the value of CurTxtBuf into register EDI

testw   $1024,27(%edi) ; compare the contents of  short at edi+27 with 1024

jne     _$358   ;If it's nonzero (the bit is set), skip the next block

 sprintf(buf,"%s: not a C file",filename);

We make another function call, this time with three arguments. Note that we push a pointer to an anonymous character string generated by the compiler, not the character string itself.

pushl   -52(%ebp) ;Push the variable filename

pushl   $_$360  ;Push a pointer to the anonymous variable _$360, the string

pushl   _buf  ;Push the variable buf

call    _sprintf ; Make the function call

addl    $12,%esp ;Adjust the stack: each arg was 4 bytes, there were 3, hence 12.


 pushl   _buf  ;Again a function call with one argument. Push it

call    _strlwr ;Make the call

addl    $4,%esp ;Adjust the stack: one argument is 4 bytes.


Here we make a call to a function that has the _stdcall calling convention. Note how the name of the function is mangled with the size of the expected stack. Note, too, that there isn’t any stack adjustment: _stdcall functions clean up the stack themselves.

pushl   _buf

call    _ShowDbgMsg@4

.line   469

jmp     _$357  ;Jump to the epilogue of the function to prepare for exiting.


We stop here. The objective is just to show you the two different levels of representation of a program: the high level, where the programmer writes about logical constructs, structures, FILEs, etc., and the actual machine level, where there is only moving data into and out of the registers, making calls, etc. The compiler is the software that binds these two levels of representation, and makes it possible to pass from the high level to the lower ones, so that the program can be run by a microprocessor.

Note too, that the ‘lower’ level we have shown here is not what the machine sees, of course, but just mnemonics used by lcc-win32’s assembler. The machine sees only the binary representation of the encoding of those mnemonics, i.e. a long sequence of numbers.

The compilation process

To get an executable then you have to 

  1.  Compile with lcc a source file to obtain an object file
  2.  Link the object files with the resource files into an executable using the linker lcclnk
  3.  Find out why the program doesn’t work as expected using the debugger.

Compiling C code into an executable is a fairly involved process, described below. Before going into it, it is necessary to know exactly what the ultimate goal of that process is: the executable.

When you click on a program icon, or type the name of an executable in a command window, the program referenced by the icon or the name starts. This is such an everyday experience that we tend to forget what is really happening behind the scenes.

What happens, in any operating system, and Windows is no exception, is that a program called ‘loader’ is invoked by the operating system. Roughly, the loader works as follows:

It finds the disk file that is associated to the command or icon.

It reads and executes the instructions stored in the file concerning the loading process.

It establishes the memory mappings between the program’s virtual addresses, and the real memory addresses that the program will use. This means allocating a virtual memory space for the program.

It copies from disk into memory, the sequence of numbers (machine instructions) that make the program, and its associated initial data space (static data).

It finds all the dynamic link libraries that are needed by the program, and links them into the process’ address space. The addresses in the program that point to external Dlls are resolved, and patched in memory by the loader. If any one of those references is missing, the program will not be allowed to run, and a message box informs the user that ‘The procedure <such and such> cannot be found in the DLL <such and such>‘.

It determines the starting address of the program.

It creates a process within the OS, and sets the program counter of that process to point to the starting address, also called ‘Entry point’ in the technical literature.

It starts running the code in the entry point of all Dlls that are used by the program, informing them that a new process has loaded them.

If a debugger started the process, it stops, and gives control to the debugger, to allow it to do its initialization.

And then, at last, passes control to the first instruction at the program’s entry point.

And that is it. All this complex sequence of events takes less than a second.

Executable code is much more than a sequence of numbers that the CPU can interpret. It is the product of a complex interaction between several tools that contribute to the final result: the editor, where you type your program in; the compiler, which translates it into machine instructions; the linker, which assembles the different modules into one executable; the resource compiler; and several others, like the loader, which does the final job.

The objective of lcc-win32 is to produce an executable file, derived from the instructions you write in C, and that will execute exactly those high level instructions in machine language. The process of translating the C instructions into an executable is the compilation process.

A closer look at the compilation process.


Here is an overview of the different phases of this process in more detail.

Functional Unit

Tasks performed

Compiling a C source file with lcc.exe



Macro expansion

#define substitution

#if ... #endif conditional compilation

#include file processing

Front end

lex.c, decl.c, expr.c, simp.c

Lexical analysis

Tree generation

Simplification of expressions

Conversion of trees to DAGs

Common sub expression elimination

First optimization pass

Global register allocation


Determining which variables are going to be stored in which registers for the duration of the function. This is performed only if the user requests the optimization option.

Finds out the block structure of the program

Constant folding

Labeler. Win32.c

Choosing the right instruction sequence to execute

Register allocator gen.c

Choosing which registers will contain which information in the instruction sequence generated by the labeler.

Code generator


Combining instructions with registers to produce ASCII assembler mnemonics

Peep-hole optimization


Improving the instruction sequence



Converting the assembler text codes into opcodes

Emitting the relocation information

Emitting the debug information

Building a COFF object file (.obj)

Resource compiling with lrc.exe


Exactly the same as the preprocessor phase above.

Resource compiler

Analyzes the resource file text, and generates the binary resource format. This is done by :

recognizing tokens from the input stream, and then

Recognizing predefined token sequences in the token stream.

Organizing the structures read from the text resource file into a binary resource file (.res).

Its output is a binary resource file (.res).

LINKING with lcclnk.exe


Convert any binary resource files into COFF

Link the debug information (if any).

Combine all object files into an executable file ready to run

Generate an import library if linking a DLL.

Debugging with wedit.exe


Reads the executable and its debug information, and allows the user to single step through the code, or to set breakpoints, etc.

Lcc-win32 supports the separate compilation model; i.e. you first translate one or several modules separately, and then give all these modules to a program called the linker, which assembles them into a single module called the executable file. This is, then, a two-phase process.

The objective of the first phase is to produce a special file called an object file. The details of this first phase are as follows:

  1.  The compiler will open a file of C instructions, and will make several transformations to it. First, it will pre-process those instructions following the pre-processor directives, and produce a translated set of C instructions. This set contains all the files indicated in the #include directives, has been stripped of all comments, has all #defines replaced by their definitions, and has several other transformations applied, like ## substitution, etc.
  2.  The result of this pre-processing will be fed into the compiler proper, which will output a series of more basic instructions called ‘intermediate language’ instructions. Those instructions form the vocabulary of the virtual machine that lcc assumes as an abstraction of the real machine.
  3.  Those instructions will be given to the code generator that will select the best sequence of real machine instructions from several possible choices. This is called labeling the intermediate language tree.
  4.  The labeled tree will be used by the emitter to output a sequence of ASCII assembler mnemonics that corresponds to each intermediate level instruction.
  5.  If the user specified an optimization pass, those ASCII mnemonics will go to the peephole optimizer, which will work further on them, mostly with the objective of eliminating instructions and improving the code emitted by the code generator ‘gen’.
  6.  Then, those text mnemonics will be passed to the assembler, which will translate them into the numerical codes that the CPU can understand. All those numbers will be packed in a special format called ‘COFF’, and written to an ‘object’ file.
  7.  Those numbers are written according to the specifications published by the company that makes integrated circuits like the Pentium, for instance. This company is called ‘intel’ (in lowercase), an acronym for INTegrated ELectronics. That company is not the only one that makes integrated circuits that understand those numbers. There are several others, and those numbers have become very popular. All of them produce circuits that are called ‘lcc-win32’ compatible in the literature...

The second phase is the ‘link-editor’ phase. This program (called lcclnk.exe in the lcc-win32 system) assembles all modules produced in the first phase, together with optional resource files produced by the resource editor, and library code, into a single file, that the operating system knows how to load and run.

Windows uses a set of files called RC files to describe, in textual form, several resources used by the program, like menus, dialog boxes, message strings, etc. These files are written in a special language with a very simple grammar. They can be compiled using the resource compiler ‘lrc.exe’, which transforms an RC file into a binary .res file, much as the C compiler transforms a .c file into a .obj file.

Windows supports an operating system extension called ‘dynamic linking’, i.e. linking a program when it is being loaded rather than in the linking phase described above. If you use this feature, you build an executable called a DLL. Lcc-win32 supports dynamic linking, the associated definition files (.def), and the other associated machinery.

The pre processor.

Dennis Ritchie, the author of the C language, wrote the pre-processor of lcc. He designed C to write the Unix system, which he created together with Ken Thompson.

I asked Dennis Ritchie about this program, and he answered:

« I'm glad you found the preprocessor useful. Incidentally, this particular program is younger than lcc (it is not one of those things that came from early Unix). It's just an attempt, more or less for fun, to implement a C89 preprocessor, which we needed for the Plan 9 system »

The pre-processor reads the C source files from disk and creates an output stream containing the text of each file, without any pre-processor instructions like #define or others. It concatenates all included files into the output stream, and presents to the compiler proper a text without any comments or macro definitions.

The pre-processor scans each token in the input to verify that it is not a macro. This lexical scanning somewhat duplicates the lexical scanning done later by the compiler itself. Many times I have thought about eliminating one of these redundant scanners, but each time I stopped at several difficulties. Basically, I have arrived at the conclusion that it is better after all to have a separate scanner in each part of the system, since they are oriented towards different things. Lumping them together would produce a hopelessly complicated piece of software.

The compiler

The output of the preprocessor is read by the compiler’s lexical scanner, which lives in lex.c. The scanner classifies the input and returns an integer that encodes the meaning of the current token. Lcc’s parsing technique is called ‘recursive descent’; it is well described in the book by Fraser and Hanson, so I will not repeat that book here.

From the input stream, the compiler first builds a tree representation. As each construct of the language is recognized, a parse tree is built containing the information gathered from the input. This sequence of trees is the first abstract representation of the program structure. At the end of each statement, the sequence of trees is passed to the function listnodes, in dag.c, which builds the dags containing statements in lcc’s intermediate language. These dags represent simple operations like ADD, MULTIPLY, etc. This is a much simpler vocabulary than the C language, and that is precisely the goal of the whole operation: to arrive at a lower-level description of the trees generated by the front end. These operations are common to many microprocessors, not only the x86 series. Lcc is a portable compiler, i.e. it works at a level of abstraction where the common features of all microprocessors are recognized, so that porting the software to a new machine implies rewriting only the lower levels and the machine description. The front end is independent of the machine itself.

The nodes built by listnodes are appended to the code list that is stored to be processed later, when the end of the current function is reached. Code can be generated only inside a function, and lcc generates the code for each function when the front end is done with it, and the last matching brace has been read.

When that happens, and if you specified the optimize option, a quick analysis will be performed with the code list, mainly to optimize certain constructs like:

a = (b == 8) ? 6 : 3;

for instance. The analysis phase discards unnecessary labels, to increase the size of the basic blocks that will be seen by the peephole optimizer. This analysis is performed in the file analysis.c.

Then the code list goes through the function gencode, which converts the dags into trees; in this conversion step the dags are used to discover the common sub-expressions in the source code. The resulting trees go through the labeler.

The labeler is one of the finest pieces of software in lcc. It is produced by a utility called ‘lburg’, which reads a machine description in a special format, to be described later, and outputs a C program that will try to find the best sequence of assembly instructions among the many possible sequences. It uses dynamic programming to reach this goal.

The labeled trees are then passed to the genasm() function, which converts them into assembler mnemonics. Those ASCII assembler mnemonics are eventually passed to the assembler, which converts them into actual machine code to be stored in the object file.

Lcc’s intermediate language

Lcc is a retargetable compiler. Each back end for each different machine has its own machine description, specifying the rules to convert the trees generated by the front end into actual assembler mnemonics. These files have the extension .md, for machine description.

This machine description is nothing more than a sophisticated table of correspondences between the instructions of the abstract machine, and the instructions of the concrete machine that will execute the program being compiled, in our case an x86 processor.

Here we have a rule taken from the machine description for the x86:

stmt:         (1)

  ASGNS(addr,CVIS(ADDI(mem16,con1)))      (2)

  "\tincw\t%0\n"     (3)

  immediateOpShort(a)    (4)

We see here four parts:

  1.  The ‘stmt:’ part. This means that this rule is executed for side effect and doesn’t return a result. A rule can return a result, for instance in a register.
  2.  The ASGNS ( ) part. ASGNS means assign short, i.e. 16 bits. This is one of the words of the intermediate language vocabulary. The notation used is lisp like: ASGNS takes two arguments, the first is an address, and the second is the result of the operation CVIS, or ConVert Integer to Short. That operation takes itself only one argument: the result of the ADDI operation. That operation is an integer addition, and takes a 16-bit memory location, and a constant, in this case, the constant one.
  3.  The assembly language part enclosed in double quotes. This part is the translation of the operation ASGNS(addr,CVIS(ADDI(mem16,con1))): it is very short, because the x86 has many powerful instructions (the CISC concept). The instruction that does all that is the INC instruction, decorated with the w suffix to indicate to the assembler that a 16-bit quantity is being incremented. The conversions to/from 16 bits are not necessary. Note that the incw operation takes one argument, denoted here with ‘%0’, i.e. the first argument of the ASGNS instruction: the ‘addr’.
  4.  Note that 'addr', 'mem16' and 'con1' are non-terminals, i.e. terms that are defined elsewhere in the machine description.

  5.  This rule is guarded by a C expression, in this case the call to the function ‘immediateOpShort()’. This rule can only work if mem16 equals addr, and it is the task of that C function to verify that this is indeed the case.

Another rule is:

reg:            (1)

ADDP(ADDP(LSHI(reg,icon),reg),acon)         (2)

"leal %3(%2,%0,%1),%c"     (3)


  1.  This rule returns a result in a register.
  2.  ADDP is a pointer addition operation. The outermost has two arguments: the result of another ADDP operation, and a constant. The innermost ADDP receives the result of a register left shift and adds to it another register.
  3.  All these operations can be mapped to the x86 instruction lea, for Load Effective Address. This versatile instruction can add an arbitrary constant to two registers, where one of them is left shifted by 1, 2, or 3. The %n constructs indicate the kids of the rule. %0 is the first kid (in this case 'reg'), %1 is the 'icon' (integer constant) non-terminal, %2 is the second 'reg', and %3 is the 'acon' (address constant) kid. To better understand this construct, we can read it from its innermost expression and work our way outwards. We take the left shift expression LSHI(reg,icon). This expression shifts a register left by an integer constant. The result of this is the first argument to the inner ADDP, so we are actually adding the result of the shift to a register. The resulting value is used in a pointer addition operation again (the outer ADDP), to yield the access to the table.
  4.  There is no C expression or constant that assigns a cost to this rule so its cost is zero.

It is interesting to see all this in action in an intermediate language file. Here is an example of the last rule in a .lil file:

ADDP(ADDP(LSHI(INDIRI(ADDRGP(blocks)),CNSTI(2)),
          INDIRP(ADDRGP(first_block))),
     CNSTI(-4))        (1)

reg: ADDP(ADDP(LSHI(reg,icon),reg),acon) /      leal (2)

       leal    -4(%ecx,%edx,4),%edx    (3)

  1.  This is a concrete example where this rule is used. Note that the non-terminals have been replaced by concrete operands, not symbols: the first ‘reg’ non-terminal above is the result of INDIRI(ADDRGP(blocks)), an operation that yields a register. The ‘icon’ non-terminal is replaced by the concrete constant 2, and the second ‘reg’ non-terminal is the result of INDIRP(ADDRGP(first_block)), i.e. a pointer indirection (INDIRP) through the global (ADDRGP, address of a global pointer) ‘first_block’.
  2.  The rule is displayed in another format.
  3.  The effective assembler code generated is enclosed in quotes.

From that, the assembler will produce: 8d5491fc.

Following an expression:


  int c;


  printf("%d\n", (5*c+c/2)/2 );

The intermediate code generated looks like this in the .lil file:

ASGNI(         (1)



ARGI(DIVI(        (2)









ARGP(ADDRGP(2))       (3)

CALLI(ADDRGP(printf))      (4)

  1.  In the first tree, the value of the local variable ‘c’ is assigned to a register since it will be used several times in the expression. This ‘common sub expression elimination’ is done in the file dag.c, by the function ‘visit’. The value of c then goes to the EDI register. The code for deciding which register is used is in the file gen.c, specifically in the function ralloc(), for register allocation.
  2.  In the second tree, the first argument to the printf call is calculated and pushed (ARGI).
  3.  Then the second argument, the address of the character string "%d\n", is pushed (ARGP).
  4.  Then the call is done

Now, consider the argument marked (2) above. This is the result of a calculation that runs as follows:

The first expression to be evaluated is the multiplication. This is:




The rule that the labeler returns for this is

reg: MULI(con5,reg) /   leal    (%1,%1,4),%c

       leal    (%edi,%edi,4),%esi

You will maybe wonder where the multiplication by five is. Well, leal (we come to that again; I told you it is versatile) can be used to multiply by five: you multiply by 4 and add the register to the result. The EDI register here will be shifted left by 2, which is equivalent to a multiplication by 4, and then its contents will be added to the result. Because

 5*x == (4*x) + x

The result of this is stored in the ESI register. This is much faster than a multiplication that takes over 40 clocks.

Then, the evaluator goes on with the second argument to the addition, the result of a division.

For the tree




the labeler will choose the rule:

reg: DIVI(reg,reg) /    cdq

       idivl   %1


       idivl   %ecx

This instruction, DIVI, absolutely needs its argument in the EAX register. The x86 is like that. So we first have to put the argument in EAX. This is accomplished by the LOADI instruction, which moves one register to another.

The rule is then:


reg: LOADI(reg) /       movl    %0,%c

       movl    %edi,%eax

Now, we have the first member of the addition in ESI, and the result of the division in EAX.

We can at last, perform the addition.

reg: ADDI(reg,mrc) / ?  movl    %0,%c

       addl    %1,%c

       movl    %esi,%eax

       addl    %eax,%eax

An addition between a register and a constant, another register, or a memory location can be done with the ADDL instruction.

A bug

Wait a minute... Didn’t you notice something weird above?

We move esi into eax, and then add eax to itself... This is complete nonsense!

We are clobbering the input of the addition before the addition is done!

Well, I WAS happy when I finally arrived at this. It was A. K. Mohanty who sent me a message warning that the following program:

#include <stdio.h>

int main()
{
	int c=4, k;

	c = 4;

	printf("%d\n", (5*c+c/2)/2 );

	k = 5*c+c/2;

	printf("%d\n", k/2 );

	return 0;
}


When compiled by LCC, it prints



instead of the correct result



I was completely astounded that such an enormous bug still existed in lcc-win32. But yes, it was true. The problem is that the operation here ((5*c+c/2)/2) has two nested divisions. This is not so common, but it uncovered the fact that I should save the results of the division operation, and of other operations that force a fixed register. The correct code for the above operation is:



(The intermediate-language tree is not reproduced here; the essential point is that the MULI(...) sub-expression is now wrapped in a LOADI instruction.)
That LOADI instruction is absolutely necessary to save the intermediate result!

I patched rtarget() in gen.c to take this fact into account. The code generated is then:

reg: ADDI(reg,mrc) / ?  movl    %0,%c    (1)

       addl    %1,%c

       movl    %esi,%edi

       addl    %eax,%edi








reg: LOADI(reg) /       movl    %0,%c    (2)

       movl    %edi,%eax

Here the value of ESI is saved into EDI, where the addition is performed (1), and then, the result of the addition goes into the EAX register, to prepare for the following division.

This is not optimal: it would be much better to add ESI to EAX and be done with it, but at least it is correct. Nested divisions aren’t all that common, so the few cycles lost here will not make a big impact on overall code quality.

Programmers do not like to speak about their bugs. I am not a masochist, but I think much can be learned from bugs. Personally, I do not like technical manuals that do not expose the real problems to the user.

Tracking down this kind of bug is very difficult. Actually, most of the work consists in finding where the bug is. Once it is found, the solution is in many cases easy to find. In this case, it wasn’t. It cost me a lot of effort to see the problem in rtarget().

The machine description is stored in the win32.md source file. The utility ‘lburg.exe’ translates a machine description into a C language file that can be compiled by the C compiler.

An overview of the abstract instructions of the virtual machine

The types supported by lcc-win32 are described by abbreviations. They are:

Type name



Floating point number of 32 bits


Floating point number of 64 bits


Character type of 8 bits


Integer type of 16 bits


Integer type of 32 bits


Unsigned integer type of 32 bits


Pointer type of 32 bits




Block type (structure)


Long long integer type of 64 bits


Unsigned long long type of 64 bits


Boolean type of one bit, stored into a character of 8 bits for efficiency.


Long double type: 12 bytes

The operations supported by lcc’s abstract machine are few. Here is a description of each of them.

Instruction name

Types supported




Constant value



Argument to a function



Assignment operation



Convert from character to some other type



Convert from double to some other type



Convert from integer to some other type



Convert from short to some other type



Convert from unsigned to some other type



Convert from long long to some other type



Convert from unsigned long long to some other type






Call procedure



Load into register



Return from procedure



Fetch global pointer



Fetch argument from stack



Fetch local variable








All numeric



All numeric



All integer

Left shift


All integer

Modulo operation


All integer

Right shift.


All integer

Bitwise AND operation


All integer

Negate each bit


All integer

Bitwise OR operation


All integer

Bitwise XOR operation.


All numeric



All numeric

Greater equal


All numeric

Greater than.


All numeric

Less or equal


All numeric

Less than


All numeric

Not equal






Named code position

We can see that the operations actually performed fall into only seven groups:

  •  Arithmetic: ADD, SUB, MUL, DIV, MOD
  •  Logical: OR, XOR, AND, BCOM
  •  Shifts: LSH, RSH
  •  Comparison: GE, EQ, GT, LE, LT, NE
  •  Goto: JUMP, LABEL
  •  Calls, return from call and argument passing: CALL, RET, ARG
  •  Assignment: ASGN

Allocating register variables

Before the labeller runs, just after the code list has been finished and the text of the function has been completely processed, the ‘function’ procedure in file w32incl.c calls the SetupRegisterVariables procedure in analysis.c.

This procedure uses the following algorithm:

Decide which registers will be available for register-variable allocation. The ‘SetRegistersMasks’ function in w32incl.c decides this, using the information about the function gathered by the parser and the front end.

A table is constructed with all local variables of the current function, including function arguments. This table is then sorted by the ref field of each symbol, to put the most used variables first.

A variable is a candidate for becoming a register variable only if its address is never taken, its size fits in a register (4 bytes at most), and it is not an aggregate (structure, union or array). Only variables with a ref field bigger than 2.0 will be used.

The sorted table is then inspected, and the first three variables are allocated to registers.

When a register is allocated, the life span of the associated variable is examined, and if another variable in the table has a disjoint domain, i.e. the uses of the two variables never intersect, that variable will be allocated to the same register (register-variable aliasing).

A second pass through the list is done after register allocation to see if there are holes in the register usage we can fill with some variable from our table.

To decide which registers will be used for variables the compiler uses the following heuristics:

If the function has no integer divisions or modulo operations and no block-move instruction, EBX, ESI, and EDI are available for register variables.

If the function is not a leaf function, or uses integer division, only ESI and EBX are available.

Else, use only EBX for register vars.

Allocating registers for expressions

After the labeller is done, the compiler needs to assign registers to each construct. If an expression needs a register, the register allocator looks for a free one and assigns it to the expression. This is done in the function ralloc(), in file gen.c.

In general, lcc-win32 uses two register allocation strategies:

  1.  When no optimizations are requested, there are no register variables, and the compiler starts looking for a free register beginning with ESI or EDI, descending toward EAX. This means that in most cases ESI or EDI will be selected, unless the expression is a really complicated one that needs 3 registers. In that case the lookup can reach EBX. This is the original strategy used in lcc, still in use in the latest version, lcc 4.1.
  2.  Optimizations are requested. In this case, lcc-win32 starts by looking at whether EAX, EDX or ECX are free, and if not, goes on to the expensive registers EBX, ESI, and EDI.

The calling conventions of most systems for the x86 require that EBX, ESI and EDI be preserved across any function call, i.e. the contents of those registers are saved at function entry and restored at function exit. That is why, in an optimization context, it is better to use EAX, EDX and ECX, since they do not have to be saved.

This is much more complicated, however, because some operations like division or block-move instructions require their inputs in a specific set of registers that can’t be changed.

The register allocator has to know if it is safe to use the lower numbered registers or not. This is the meaning of the flag ‘unsafe’ in the code of the function ‘askreg()’. This flag will be set by the high level optimizer that is called before ralloc runs. The compiler (in ‘analysis.c’) looks for operations that would make any use of lower numbered registers unsafe, and marks the nodes accordingly.

There are some operations, however, that make all this extremely difficult. Division, for instance, needs the dividend in EAX, and the divisor in another register. Yet another register is usually used to store the result, and EDX is destroyed by the operation. So this single operation needs 4 of the 6 available registers.

Since any function call will destroy EAX EDX and ECX, we can’t store any of those values across function calls, and we have to either spill them, or try to find a higher numbered register available.

In most functions I have been able to keep EBX, ESI and EDI free for allocation to register variables. This is done with the help of the pre-scanner, which marks interesting nodes as safe or unsafe. See the code in analysis.c, function ‘BuildBasicBlocks’.14

That function linearizes the linked list of nodes, joining them through the x.next field. The list is then parsed by the function ‘FindSymbols’, which essentially tries to:

Build a count of labels, to erase unneeded ones later, and

Mark each node as safe or unsafe, depending on the operation being performed.

With this information available, the register allocator has a difficult but doable job. I have been able to drive it through complicated expressions that need a lot of registers without it breaking down.15

The circuit this software runs on was not designed; it just grew and grew, and not in the right direction. Intel has modified the original 8086 design very little, just widening AX to EAX, but that’s not a really big deal. No new registers have been added to this architecture since the 8086 days, more than 10 years ago. This is a huge span of time, and I have (as probably many other Intel users before me) often wondered why this is so. I have speculated that it is because of compatibility with older software: the wish to preserve the software investment that many people have made in the circuit.

But a register extension wouldn’t need to be incompatible with older software. Just new opcodes would be needed, and older software would go on running. It may be that Intel has painted itself into a corner and there are no more opcodes available, but that can’t be the reason: the escape opcode was always there.

Why didn’t they add a few registers to the circuit then?

Mystery. They have the capability of adding a lot of special circuitry to do MMX and now SIMD parallelism, and I thought for a moment that they would add a

‘MOVG  screen, game’

instruction, that would read your game from disk and play it for you at high speed.

Is it right to program the graphics card into the CPU?

I asked some Intel researchers about that, but they gave no answer. The circuit that Intel sells is a general-purpose circuit, i.e. one that is designed to do anything, not only to run games. Actually, most PCs today do office-like tasks (word processing, business software development, accounting, mail, and similar tasks). For these tasks, high-speed graphics operations can be left to the graphics card, which with its special purpose processors will always do better than Intel. In a specialized graphics display, the algorithms, the way of drawing, and the associated special purpose circuitry can be changed quickly, from one model to the next. True, there are some algorithms that are better programmed in software for flexibility, and the abstraction of the MMX operations would be a good idea in a context where you have plenty of registers to go fast, but not in the crowded conditions of the x86 architecture.

This is going to end with the arrival of the 64-bit Merced, says Intel. Maybe true, but I do not see how alleviating the register-starved condition of the circuit would do any harm… The complicated circuitry to speculate on what the next instruction will be looks more complex to me than adding a set of 26 new 32-bit wide general-purpose registers, which would make the processor much more independent from the main memory subsystem.

There is no point in getting a 1 GHz processor if main memory still runs at 15 MHz and the processor needs a lot of memory accesses because there are no registers to store intermediate results or local variables.

True, there is a cache, but if the cache’s influence is already good, I can readily imagine what the influence of 26 more registers would mean. Gigabytes of memory accesses would be avoided.

But I am digressing here. We were faced with the task of using the incredibly few registers available, taking into account all their quirks, like the one that forces you to save the floating point flags somewhere, set them to truncation in the ‘C’ language sense, do the conversion, and then restore the flags as they were.

The C code is

double a = 7.8;

int i = (int)a;   /* i must be 7: truncation, not rounding */

But if you want to write a compiler for windows you can’t change the circuit. It is done, and it is better to cope with it than to lament. Conclusion: ESI, EDI and possibly EBX are reserved for register variables only.

Lcc-win32 tries hard to use every mode the machine has to offer. Byte- or word-specific instructions will be inserted when possible. The register allocator collaborates in this task by avoiding registers ESI and EDI for character operations, since they do not have byte-wide sub-registers; the registers EAX, EBX, ECX and EDX can be addressed as 16-bit or even 8-bit registers, so for character or short operations it is better to avoid ESI and EDI.

Some operations require their inputs in specific registers. This is handled in the function ‘rtarget’ in w32incl.c.

In a second pass after register allocation, I look for leaf functions, i.e. functions that do not make any function calls. In those functions, it is safe to exchange ESI for ECX, for instance, if ECX isn’t used. This saves the push and pop instructions for saving and restoring the contents of ESI. The cases where this is possible are few, and the register allocation scheme of lcc assumes quite a few more registers than the x86 has to offer.

A better strategy would have been to allocate registers per basic block instead of globally for the whole function. But that would have meant a complete rewrite of the code generator, and especially of the register allocator.

Debugging a register allocator and its corresponding spiller is a task that takes years. The one that Chris Fraser and Dave Hanson wrote has been running for years, and it is possible to improve it for the x86 situation. Rewriting it completely would have been an impossible task.

The code generator

The function emitasm, in gen.c, receives the labeled tree list and builds a textual output stream with the instructions it finds in the rules. If you specified the optimizer option in the command line, the output stream will first be redirected through the peephole optimizer that lives in optimize.c. The peephole will be described later; here it is enough to say that it tries to eliminate as many instructions as it can, and to inline certain functions of the C library like memset or strcmp.

The emitasm function will fill the arguments for the rules, as specified in the ‘%0’ constructs or in the %a or %c directives (for symbols).

The escape instructions inserted by the pre-label pass end up here. They call the emit2() escape hatch, to generate special purpose code for implementing predication, for instance, or for implementing the CMOVxx instruction. You will notice in win32.md that all escape instructions start with ‘#’; that is a signal to emitasm that it should call emit2. There, I do the improvements of the conditional statement described later. Constant folding can generate escape instructions too, for instance for transforming tests into NOPs without disturbing the code list.

The pre-labeller expands constants during the constant-folding optimization phase, and you can end up with comparisons like if (6 == 6). Instead of generating code for such a comparison, the node’s op field is changed to the escape opcode that I assigned in win32.md. In the ‘oldop’ field, I encode information that will later be processed in emitasm and eventually emit2, which will just ignore it, transforming the whole comparison into nothing.

The emitasm() function has many responsibilities:

If the node is a label, the code that was accumulated in the buffer will be sent to the peephole optimizer.

If the function has no frame pointer, adjust the code to use ESP instead of EBP. This in fact makes EBP available as a general-purpose register.

Make an effort to ensure that an instruction doesn’t destroy its source operands when doing the operation.

Process the ‘%x’ directives in rules

If it is an intrinsic call, call the intrinsics interface with the node to generate special purpose inlined functions.16

If the node represents a function call and the debug level is higher than two, the current line number and context is saved before the call.

Lcc accumulates the assembly mnemonics in a buffer that before17 was simply written to disk. I added an optimization phase in emitasm() that calls a parser of those mnemonics, which synthesizes an abstraction of those instructions to be used to update the State variable in the FindStateChanges() function in optim.c.

The directives followed by emitasm are the following:

Directive   Meaning
%a          Use Node->syms[0]->x.name
%b          Use Node->syms[1]->x.name. Not used
%c          Use Node->syms[2]->x.name. This is usually a register name
%0          Use the first child as indicated by the _kids function
%1          Use the second child
%n          Use the nth child
            Generate a temporary label
            Use the last generated temporary label
            Use Node->kids[0]->syms[0]
            Use Node->kids[0]->syms[2]
            Use Node->kids[1]->syms[2]
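
To make these directives concrete, here is an illustrative machine-description rule in the style of lcc’s x86 md files; the exact rules and nonterminal names in win32.md may differ:

```
reg: ADDI(reg,mrc)  "?movl %0,%c\naddl %1,%c\n"  1
```

Here %0 and %1 stand for the two children (the operands), and %c is the register the allocator assigned to the node (Node->syms[2]). In lcc’s emitters a leading ‘?’ means the instruction up to the newline is dropped when the destination register already holds the left operand, avoiding a redundant move.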

The intrinsics interface is implemented as a several-step process:

In the function ’call’ in enode.c, a flag is set if the function name is found in the intrinsics table. To avoid name clashes, the function must be declared as _stdcall.

This flag is copied to the corresponding tree in dag.c.

Emitasm() tests this flag, as does the register allocator, which needs to call the interface when handling the arguments of the intrinsic, if any. The actual interface call is done in emitasm.

The assembler.

The output stream is then fed to the assembler, where it is converted into a sequence of numbers that go in the object file. The assembler is described in detail later. Here it is enough to say that, besides transforming the ASCII mnemonics into numbers, it builds the debug information that will be used by Wedit’s integrated debugger to follow the program execution.

It should be noted here that lcc-win32 assumes at least a 486 processor. If you have an old machine with a 386, lcc-win32 will NOT run, since its code sometimes uses instructions like ‘xadd’ that do not exist in the 386 series.18

You can examine the intermediate result of each step of the compilation process described above with the following directives:

To see the output of the preprocessor, use lcc -E source.c. This will create a file called source.i that will contain the pre-processed source.c file.

To see the output of the front end, and the intermediate language use: lcc -z source.c. This will create a file with the same name as the original .c file, but with a .lil extension for lcc’s intermediate language.

To see the output of the optimizer/input to the assembler phase, use lcc -S source.c. This creates a source.asm file, containing the assembler instructions.

To see a textual description of the binary source.obj file, use the utility ‘pedump’, that comes in the bin directory of lcc-win32 distribution. It will display all information that the object file contains in textual form.

Modifications done to lcc

Lcc-win32 adapts the original code of lcc 3.6 to the windows environment. Here I give a rather sketchy account of the process of writing those adaptations and building a windows compiler.

Porting the code

Lcc’s code ported without great difficulties to Win32. It is a code without surprises, well written, and after only a few minor modifications, I had it running under Win32’s command line.

Integrating the preprocessor into lcc

The traditional approach, used by Unix compilers like gcc, is that the preprocessor produces an intermediate file that is read from disk by the compiler proper, which produces an assembly file. That file is again read from disk by the assembler, which produces an object file. This approach allows a clean separation of the tools, but has an enormous drawback: it is very slow. The amount of disk I/O the machine is forced to do is considerable.

I decided to avoid any intermediate files. For some time I played with the idea of using the pipe mechanism of Win32 to make all the tools communicate, but after a while I decided on a tight coupling of all three tools. Pipes are bound to be slower than a program-specific solution because:

You have to copy the data into an operating system buffer

There, the OS copies the data into another buffer (probably) to pass it to the receiving program

The receiving program has to copy it again into its own buffers

Many context-switches between both programs and the OS happen in between.

Better mechanisms exist, of course, like files shared in memory. But then you have to take into account the synchronization problems and other goodies... And all that for what? Just for sharing a buffer full of data between two modules?

The fastest solution that I know of is to link the two modules into a single program and just share the buffers. Period. No context switches, no copies, no synchronization problems, since there is only a single thread of execution. And this solution follows a guideline I always try to follow: the KISS19 principle.

The preprocessor does the only I/O operation that is absolutely necessary, and makes available to the compiler a buffer containing the characters preprocessed so far. The compiler reads this buffer, and generates a second buffer for the assembler, which reads it and generates the object file, performing the second disk I/O. No copy is needed, since the buffers are simply shared and only a pointer to the start of each is passed around.

The modifications to lcc that were necessary to do this were actually very small. The main one was in the code of the preprocessor: instead of reading the file from start to end, it should read more input only when the compiler needs it. This is done in the function ReadFromCpp(), in the ncpp.c file, in the compiler’s sources directory.

Supporting Microsoft’s extensions

A purely ANSI compiler will not compile the standard <windows.h> header files. The main stumbling block is the ‘stdcall’ calling convention: a windows compiler is forced to recognize this keyword (_stdcall or __stdcall). Other problems with special purpose syntax include the __declspec syntax, unions with anonymous members, and several similar non-ANSI constructs. All in all, we have to acknowledge how much progress has been accomplished since the days when you had all those “_far”, “_near”, “_huge”, whatever…

Calling conventions

Calling conventions are a standard for pushing the arguments onto the stack in preparation for a function call, and for cleaning the stack up after the call, removing the arguments pushed onto it. The normal calling convention in C is to push the arguments from right to left, and to generate code after the call for adjusting the stack.

The sequence then, looks like this

push arg3

push arg2

push arg1

call _foo

addl $12,%esp

This would correspond to the call:

foo(arg1, arg2, arg3);
The _stdcall calling convention

This calling convention is used in almost all the Win32 system calls. Functions that are ‘_stdcall’ should:

  1.  Push the arguments from right to left, as in normal C calls
  2.  Not adjust the stack after the call, since the called function removes the arguments when it executes the ‘ret’ instruction.
  3.  Add the stack size to the function name. For instance, for a function like:

int _stdcall fn(int,int);

the name of the function should be changed to _fn@8.

The rationale for name decoration is very simple. If you omit a header file, and you call a _stdcall function _foo() without any previous declaration, the compiler will generate a call for _foo. Since the name of the function has been changed internally to _foo@8, the link will fail with a missing symbol. This may be seen as harsh or unnecessary, but imagine the mess that would ensue if this weren’t done! It would suffice to forget one header file or declaration somewhere to ruin the application completely; it would crash randomly, at different places and at different times, depending on where the faulty call instruction was.

This needed a lot of modifications to lcc’s source code.

I added the _stdcall keyword to the lexer, so that it would recognize it (see lex.c)

I added a flag to the symbol and type structures so that this information would eventually arrive to ‘emitasm’ in gen.c.

I modified many functions in ‘decl.c’ to keep that flag, accept it, pass it around, etc.

In gen.c I planted a test for that flag, when it emits a ‘call’ instruction. If the test succeeds, it will not emit the ‘add $x , %esp’ code.

In ‘win32.c’ I test for that flag when building an internal symbol name. If it is set, I add the famous ‘@8’ to the name. Besides, in the procedure ‘function’, I emit a ‘ret stacksize’ instead of just ‘ret’. This means that the keyword not only works for system calls: it can be used in user programs too. Now that it is done, I recommend its use, since you save 4 bytes per call. For a function that is called very often this can add up to significant space savings.

Since the headers sometimes used ‘_stdcall’ and sometimes ‘__stdcall’, I added an internal #define to convert one into the other.

Problems encountered when implementing _stdcall

A subtle problem appears when you declare a function as _stdcall and this function returns a structure. Internally, the compiler will pass that function a supplementary argument: the address of a place where the function should copy its result. Consider, then, the following code:

typedef struct tagS { // define some structure

int a;

double b;

} STRUCT;

STRUCT _stdcall function(double a,double b)

{

STRUCT s;

s.a = (int)(a+0.5);

s.b = (int)b;

return s; // This function returns the structure itself, not a pointer to it

}

int main(void)

{

STRUCT s = function(0.7,0.8);

}

If the compiler sees the function definition first, everything works: the function’s stack definition has been augmented with an extra parameter, and the compiler will name it ‘function@20’. But if you write a prototype for it first, and THEN define it later, things will go wrong:

typedef struct tagS {

int a;

double b;

} STRUCT;

STRUCT _stdcall function(double,double); // PROTOTYPE ONLY

int main(void)

{

STRUCT s = function(0.7,0.8);

// call to function will be generated as function@16

}

// This function will be named function@20

STRUCT _stdcall function(double a,double b)

{

STRUCT s;

s.a = (int)(a+0.5);

s.b = (int)b;

return s;

}
The compiler will generate, for the first call to that function, a call to ‘function@16’; then, when seeing the function’s definition, it will generate a function called ‘function@20’, and linking of the object file will fail because of a missing definition of the _function@16 symbol.

This example shows the unexpected ramifications a simple change like this can have in a language system. This bug wasn’t discovered in lcc-win32 until after several YEARS of use!20

The “declspec” construct.

Microsoft has introduced this construct to direct code generation for windows-specific things like DLLs, or export/import tables.

Lcc-win32 implements the following constructs:

  1.  The __declspec(dllimport) construct. This directs the compiler to generate an indirection when referencing a data item that is imported from a DLL. Use this when accessing any variable/data that lives in a DLL.
  2.  The __declspec(dllexport) construct. This causes the following name to be added to the export table of a DLL.


char * __declspec(dllimport) DllName;

int __declspec(dllexport) myFunction(int a)

{

return a;

}


The first example declares that the char * DllName is imported from an external DLL. The second instructs the compiler to add the name ‘myFunction’ to the list of exported names from a DLL, i.e. the names that are visible from outside, and will be included in the import library built by the linker.

Internally, this is accomplished by:

Substituting all occurrences of the symbol by its external, decorated representation,21 and

Adding a level of indirection to it.

The declaration:

int __declspec(dllimport) someInt;

is equivalent to:

#define someInt  (*__imp__someInt)

extern int *__imp__someInt;

This means: take the contents of the memory location pointed to by the symbol __imp__someInt. When you write:

int a = someInt;

you are actually writing:

int a = (*__imp__someInt);

This is implemented using a parser modification in decl.c, and a modification of the routine to generate the external names in w32incl.c. Specifically, look for the constants DLLIMPORTATTRIBUTE and DLLEXPORTATTRIBUTE.

All this machinery is necessary for implementing dynamic linking. The linker writes to the executable file a table of symbols that have to be imported from other DLLs before the program can run. The program loader scans this import table before the program runs, and replaces the addresses in the table with the addresses in memory where the corresponding symbols of the DLL were loaded. That is why we need another level of indirection: we access not the address of the import symbol itself, but use the contents of that address as a pointer to get to the real data contents in the DLL.

The different #pragmas

I added several pragmas to lcc to support some options needed under windows. These pragmas are now case insensitive, correcting an oversight in the earlier versions of lcc-win32.

  1.  #pragma pack(n)

This sets the alignment within structures: 1 (no alignment), 4, or 8. Other values will be accepted but are not very useful...

A variation on this is

#pragma pack(push,n)

and the corresponding

#pragma pack(pop)

The first pragma instructs the compiler to pack structures with the value <n>, but in addition, to save the old value of the packing constant in a pushdown stack. At the end of the file, you should use the corresponding pop to restore the value to whatever it was before the compiler was instructed to change it.

  2.  #pragma optimize(on|off)

This turns the optimizer on or off for a certain part of the source.

  3.  #pragma section(name)

This sets the current section name in the object code. Only needed for some obscure object code manipulations.

  4.  #pragma ref <symbol>

The given symbol will be marked as used.

Structured exception handling

I have tried to implement all constructs that Microsoft defines for the C language. This part has been very difficult, and it is not completely finished, mainly due to a lack of documentation. In any case, __try, __except, work as Microsoft defines.

Writing the startup code

When you click on a program icon, or just type the name of a program at the command line, you implicitly invoke a program that is one of the basic pieces of any operating system: the program loader. This program searches for and reads an executable file whose name corresponds to the name you typed or to the program the icon points to. It loads that program into RAM, fixes it up, and then passes control to the program’s entry point. This is NOT the ‘main’ function (or WinMain) but a piece of code that prepares the execution of the main line, doing some initializations first.

The startup of lcc comes in two flavors: the normal startup of an executable, called lcccrt0.obj, and the DLL startup called libcrt0.obj. Both object files reside in the ‘lib’ directory.

The task of that piece of code is to perform the initialization of the standard input/output, and call the ‘main’ or ‘WinMain’ function. Actually, the startup will always call the ‘_main’ function. If the program is a windows program, it doesn’t have any ‘main’ function, so the linker will find a ‘main’ function in the ‘libc.lib’ library file, and link it in. That function will call WinMain. This allows the lcc-win32 system to use only one startup for all .exe programs.

After the main function returns, the startup code calls the exit function, that returns control to the operating system.

Here is the code for the main function of the startup.


       movl    %fs:0,%eax   ; save the contents of the exception list

       pushl   %ebp    ; build a stack frame

       movl    %esp,%ebp

       push    $0xffffffff   ; build the exception handler structure

       pushl   $_$ExcepData

       pushl   $__except_handler3

       push    %eax

       movl    %esp,%fs:0      ; set this structure at the top of the exception list

       subl    $16,%esp        ; space for local variables

       push    %ebx            ; save registers

       push    %esi

       push    %edi

       mov     %esp,-24(%ebp)

       pushl   $__environ     ;call GetMainArgs to get the command arguments

       pushl   $___argv       ;for the ‘main’ function

       pushl   $___argc

       call    __GetMainArgs

       pushl   __environ       ; now push those for _main()

       pushl   ___argv

       pushl   ___argc

       movl    %esp,__InitialStack ; save the top of stack in a global

       call    _main

       addl    $24,%esp

       xor     %ecx,%ecx    ; invalidate the exception list element above

       movl    %ecx,-4(%ebp)

       push    %eax

       call    _exit        ; finish this process

       ret                  ; this is in case _exit returns, very surprising !

Writing the stackprobe code

In the first versions of lcc, this program would generate a trap:



int main(void)

{

int table[4096];

}


This would generate a fault, even if the program does nothing. After much research, I found the problem: the fault occurred in the procedure’s prologue, when the generated code executed the instruction

sub $16384,%esp


As we will see in the linker chapter, Win32 reserves 1 MB of virtual address space for the program’s stack. Those pages aren’t committed; they are just reserved by the system. They will actually be committed when the program touches them, as it uses more stack space.

When the program touches a page other than the guard page, i.e. the 4096 bytes just below the last used page, the system will make the program trap: stack pages are supposed to be touched one after the other.

Why is this so?

I can only guess, of course, but from the system builder’s standpoint, it is better to trap immediately than to try to allocate several dozen megabytes when there is an error in user software. The penalty incurred is not very high either: a function call and a couple of memory reads, that’s about all.

The mechanism for detecting this is simple. Normally, only the first page (the first 4096 locations) of that space is committed. The next page is the ‘guard page’. Whenever the program touches it, the system commits it and reserves a new stack guard page. When the stack grows by more than 4096 locations in a single jump, however, this mechanism would fail. The system detects this, and that explains the fault.

To correct this problem, I modified the code in ‘function’ (file win32.c) to test if the framesize is bigger than 4096. If it is, it will generate the following instructions:

mov $size,%eax

call __stackprobe

where ‘$size’ is the size of the stackframe.

The function __stackprobe is a complicated one, and there is no possible equivalent for it in the C language. Here it is:

__stackprobe:               ; argument: stacksize in eax
        pop     %ecx        ; pop the return address
_$loop:                     ; for each page we touch a memory
        sub     $4096,%esp  ; location with (%esp)
        sub     $4096,%eax
        test    (%esp),%eax ; access a memory location in the page
        cmp     $4096,%eax
        jae     _$loop
        sub     %eax,%esp   ; do final stack adjustment
        test    (%esp),%eax ; touch the last page
        jmp     %ecx        ; jump to the address popped above

The first instruction pops the return address pushed by the call instruction that called this procedure, saving it in ecx for later use. Then I subtract 4096 from the stack pointer and from the argument in eax. This done, I access the memory location pointed to by esp with a test instruction, which doesn’t modify any register. Here a trap occurs, the OS commits a new stack page, and returns control to me.

All that remains is to test whether more whole pages are needed. If yes, I loop; if not, I subtract the remainder in eax from the stack pointer, do a final memory access, and jump to the return address saved in ecx. This procedure takes only 29 bytes of code.

The windows.h and the other header files

A big problem was the header files. They amount to more than 4MB in modern Win32 compilers, and they are absolutely needed for a compiler system that is to run under Windows.

I was lucky, however: the Free Software Foundation (GNU) had started porting their compiler system ‘gcc’ to Windows too, and Scott Christley, of the net community, proposed to rewrite the Windows header files. I downloaded the header files with excitement and continued Scott’s work. I added several files and, above all, tried to compress the headers as much as possible.

I have worked on those header files for years, updating them, keeping them current with the newest release of the SDK I could get. By now they have nothing to do with Scott’s original files, which haven’t changed for years and keep all the bugs I corrected.

But let’s come back to the header files.

I am convinced that the compilation process is now absolutely I/O bound. With the powerful CPUs around, disk access time outweighs any other consideration. So, I thought, it would be best for the header files to be as small as possible.

Normally you do not think about this when you write your header files. You write comments into them, and size considerations are far from your mind. And this is good, and should stay so. The problem is that in very big files this adds up to a very significant portion of the compilation time, since the compiler has to parse those comments over and over again, at each compilation in fact.

I took the following approach when rewriting Scott’s headers:

  •  Eliminate all comments.

  •  Eliminate all white space, leaving only what is absolutely necessary.

  •  Eliminate all redundant identifiers in function prototypes. For instance, the function

MyWindowsApi(HWND hwnd,
             LPSTR name,
             HANDLE handleToProcess);

would become:

MyWindowsApi(HWND,LPSTR,HANDLE);

This makes for significant space savings. It is true that many of the names are useful for giving you an idea of what a function does, but let’s face it: lcc-win32 is no replacement for the documentation of the Win32 system, which you should get elsewhere. Besides, I think it would be very risky to deduce what a function does and what its parameters are just from its documentation in ... a function prototype!

Every effort was made to reduce windows.h to its actual size, about 430K. This compares very favorably with the 4MB the compiler has to read when compiling with the header files of MSVC, and it is even smaller than what is read when you define the macro WIN32_LEAN_AND_MEAN, which somewhat reduces the size of the Microsoft headers.

But the space savings, and the consequent acceleration of I/O to/from disk (the machine has fewer bytes to read), are only one side of the problem. Other significant savings lie in not parsing and storing all those unnecessary identifiers. A declaration like:

MyWindowsApi(HWND,LPSTR,HANDLE);

adds only one identifier to the compiler’s symbol table, instead of four. This means that the compiler doesn’t have to search for the other three identifiers and, when it doesn’t find them, add them to the table. As a consequence, searches for a given identifier are faster, since the symbol table is smaller; less RAM is used for storing all those ids, hence less paging, etc.

This makes lcc a fast compiler, even though it doesn’t support the precompiled header files option.

I tried to follow the structure of MSVC somewhat, but really there is no close correspondence here. Just include <windows.h> and any special files you need. Most of them, like the property-sheet definitions, are included in windows.h, so there aren’t a lot of files to include at each compilation. It is true that this somewhat contradicts the small-size objective above, but re-creating that file structure would be too complicated now. Frankly, I do not have the energy to do it.

The list of all lcc-win32 header files can be found in the user’s manual.

The header files were reduced to their minimum size, but nevertheless they amount to more than 3MB of code.

The import directive

The Objective-C preprocessor introduced the #import directive. Its only purpose is to include a file if the preprocessor hasn’t seen it already. I introduced it into lcc’s preprocessor to ease development, since there is then no need to enclose every header file you write in the usual construct for avoiding multiple inclusion of a file:

#ifndef __SOME_SYMBOL
#define __SOME_SYMBOL
/* ... contents of the header ... */
#endif

This was discussed at length in the lcc discussion list. It was an involved discussion, and here I will summarize only the best parts of it:

Nelson wrote:

From: "Nelson H. F. Beebe" <beebe@math.utah.edu>

Date: Sat, 2 May 1998 11:37:20 -0600 (MDT)

Subject: #import and #include

The discussion today on #import and #include mentioned that multiple inclusion by #include could be prevented by a "#pragma once".  A better idea, widely implemented in vendor header files, and GNUware header files, is for

       #include <foobar.h>

to include a file which is wrapped with

#if !defined(_FOOBAR_H)

#define _FOOBAR_H

...old contents...

#endif

This requires no addition of #import to the language, and no use of #pragma support.  The big problem with #pragmas is that their contents are implementation dependent, and there is no guarantee that a #pragma from one system will not cause a fatal compilation error on another system, or else be misinterpreted.  Thus, #pragma is effectively useless in portable C code.  Its only justifiable application is inside system header files that will never be seen by any compilers from outside that implementation.

The sole advantage of #import over the above #include and check of _FOOBAR_H is that a file open and read is not required.  While file openings used to cost me US$0.50 each on an IBM mainframe, they are a lot cheaper today, so most people don't worry much about them.  Large C programs already require lots of header files anyway, so avoiding a small number of file openings by use of #import instead of #include is of little significance, and likely not enough to justify introduction of a new statement into the preprocessor language.

The only vendor that I've encountered who uses #import is NeXT; they are also the only UNIX vendor that I know of which has shipped an Objective-C compiler (a gcc 1.xx derivative).  I find no use of them in DEC, HP, IBM, SGI, or Sun header files.

I answered:

From: jacob@jacob.remcomp.fr (Jacob Navia)

Date: Sat, 2 May 1998 22:39:25 +0200 (MET DST)

Subject: Re: #import and #include

I do not see it as a better idea Nelson, I have to disagree here. The problem is i/o, as you yourself say later in your message. At the beginning of all windows header files there is a construct like the one you propose. Microsoft has adopted this, since it didn't have #import. So the machine reads:

#ifndef __LCC_WINDOWS_H

#define __LCC_WINDOWS_H

(more than 450K of stuff...)

#endif

All 450K+ are going to be read searching for the #endif... what a waste. And this can happen several times in the same compilation unit!!!

Rinie Kervel noticed the following:

From: Rinie Kervel <rinie@xs4all.nl>

Date: Sun, 03 May 1998 11:09:21 +0200

Subject: Re: #import and #include

[snip] ...from the GNU C Preprocessor manual (Version  2 1990):

The GNU preprocessor is programmed to notice when a header file uses this particular construct and handle it efficiently. If a header file is contained entirely in a '#ifndef' conditional, then it records that fact. If a subsequent '#include' specifies the same file, and the macro in the '#ifndef' is already defined, then the file is entirely skipped, without even reading it.

I remembered this paragraph because it indicated once more that there might be smarter ways of solving a problem than adding non-standard features. (Although for Objective-C, #import is a standard feature.)

He also states that #import is not a well designed feature.

To quote again: It requires the users of a header file - the application programmers - to know that a certain header file should only be included once. It is much better for the header file’s implementers to write the file so that users don’t need to know this. Using '#ifndef' accomplishes this goal.

To me this seems easy to implement in the preprocessor: tag the #ifndef macro with its include file, and determine in the #include processing whether it is already defined.


So, I discovered that the GNU C preprocessor optimizes this away. But let’s get the facts straight: when should the compiler read a file again in the same compilation unit?

The only reason somebody would want that is that, after the file is read for the first time, some symbols are #undefined and/or redefined with new values, so that including the same source file again generates different definitions or code, because it contains conditional preprocessor directives that make the compiler take another path. This would look like:

file weird.h:

#ifndef FOO
#define FOO 67
#else
#undef FOO
#define FOO 4387
#endif

Then in weird.c you write:

#include "weird.h"
/* ... code that sees FOO as 67 ... */
#include "weird.h"
/* ... code that sees FOO as 4387 ... */

But this makes the file “weird.h” unreadable. You would have to find out, in the #ifdef/#ifndef conditional jungle, what was actually being passed to the compiler proper. Such a file can only be read by running it through the preprocessor! This is a really horrible programming practice.

So, in my opinion, there is never a need to make the compiler read a file again. This is actually a consequence of a design mistake of the C language: the compiler should always read an include file once, period. This would make hacks like the above impossible, and it would accelerate the preprocessing of C files.

But instead of going to the root of the problem, people have devised the #ifndef ... #define ... #endif patch, so that they could write includes that are processed only once. And then, to put the final ironic note on that bad design, GNU’s preprocessor optimizes that construct so that you can write this kind of hack in C very efficiently... This is absurd, in my opinion.

Besides, as Nelson, who investigated this a little further, noted, that optimization has bugs in it:

From: "Nelson H. F. Beebe" <beebe@math.utah.edu>

Date: Mon, 4 May 1998 08:55:36 -0600 (MDT)

Subject: GNU C preprocessor and avoidance of redundant #include


I also found that if one uses the equivalent, but slightly different, code,

#if !defined(FILE_FOO_SEEN)
#define FILE_FOO_SEEN

...

#endif /* FILE_FOO_SEEN */

then gcc WILL INCLUDE the file at each #include, instead of suppressing the redundant includes.

Even though I have implemented the #import directive, I decided after the discussion not to use it in the standard header files of lcc-win32. For the time being the issue rested there.

For a complete list of all the include files see the manual.

The #ident discussion

Wedit’s IDE uses the following construct to identify which source files contributed to building which executable:

static char _cmsid []  = ...

An identification string follows. This will be compiled into the object file, and then linked into the resulting executable, together with similar strings from other modules.

The problem with this approach was that the compiler always emitted a warning at the end, telling me that _cmsid is not used... Annoying. So when a user asked me why I didn’t implement the #ident preprocessor directive, I was happy to jump in and implement it.

But later on, prompted by a remark of Norman Ramsey’s, I started investigating what other compilers do with #ident, and the results were surprising: there seemed to be absolutely no consensus about what this directive should do. So I sent a message to the lcc discussion group asking for help.

From jacob Wed Jul 29 08:44:47 1998

Subject: reconsidering #ident

To: lcc@cs.virginia.edu

After I implemented #ident, and following a conversation with Norman Ramsey, I decided to investigate this question a little more.

I implemented what I considered was the semantics of #ident: put a sequence of bytes in the data segment, so that the executable or the object file can be identified by a corresponding utility.

What I do, is to generate an anonymous data space with the characters contained in the argument string:


#ident  "0123456789"

will make lcc generate a label followed by those bytes in the .data section.

But then, I started observing what other compilers do with it, to make sure that I did things right. The results were startling:

MSVC will accept #ident, but will generate a .comment section  to be read by the linker, instead of a data item like I did.

Microsoft's linker doesn't include those bytes in the executable, so the identification is for the object file only. cl's output looks like this:

section 00 (.drectve)  size: 00061  file offs: 00060

00000000: 2D 3F 63 6F 6D 6D 65 6E  74 3A 22 30 31 32 33 34 -comment:"01234

00000016: 35 36 37 38 39 22 20 2D  64 65 66 61 75 6C 74 6C 56789" -defaultl

00000032: 69 62 3A 4C 49 42 43 20  2D 64 65 66 61 75 6C 74 ib:LIBC -default

00000048: 6C 69 62 3A 4F 4C 44 4E  41 4D 45 53 20          lib:OLDNAMES

gcc under Unix, compiles into nothing. The following file:

#ident "0123456789"

will compile into a completely empty file (without any strings in it). No warnings will be issued. The directive is accepted but ignored. (Linux red hat, gcc version

Intel's compiler icl will compile this into a linker Comment, as Microsoft, but with a different format:

section 00 (.comment)  size: 00070  file offs: 00300

00000000: 49 6E 74 65 6C 20 43 20  43 6F 6D 70 69 6C 65 72 Intel C Compiler

00000016: 20 56 65 72 73 69 6F 6E  20 32 2E 34 20 50 39 37  Version 2.4 P97

00000032: 31 37 36 20 2D 44 5F 4D  53 43 5F 56 45 52 3D 31 176 -D_MSC_VER=1

00000048: 30 31 30 20 2D 4D 44 20  2D 63 20 30 31 32 33 34 010 -MD -c 01234

00000064: 35 36 37 38 39 20                                56789

Borland C for Windows 3.1 (the only version of Borland I have) rejects the #ident statement with an error message, as lcc did before.


I am actually considering dropping support for #ident. This directive seems to be portable as far as syntax is concerned, but each compiler does a different thing with it. Emitting a warning would be, in this case, a better alternative than giving the user a false sense of security.

What do you think?

As always, Nelson Beebe answered with a full investigation of #ident in many compilers. His investigation is probably the most exhaustive comparison of #ident done to date: he examined the output of 133 compilers!

From: "Nelson H. F. Beebe" <beebe@math.utah.edu>

To: jacob@jacob.remcomp.fr (Jacob Navia)

Cc: beebe@math.utah.edu

Subject: Results for #ident on many UNIX systems

I made this test file:

% cat i.c

#ident "0123456789"

void foo(){}

and then tested it on 133 UNIX compilers on various systems; only 15 of them put the ident string in the .o file.

The Makefile says


       $(CC) -c i.c

       @strings i.o | grep 0123456789

Here is the typescript of the test runs:

[snip... it would fill this book!]

So, I sent the following message to the lcc discussion group:

From jacob Thu Jul 30 19:19:10 1998

Subject: conclusion

To: lcc@cs.virginia.edu

As Nelson Beebe noted, only a few of the compilers tested did the same as I did, regarding #ident.

That report makes interesting reading though. Here we have a compiler that emits assembler directives that its own assembler doesn't understand...

/usr/local/bin/g++  -c i.c

/usr/tmp/cc001011.s:1:Unknown pseudo-op: .ident

/usr/tmp/cc001011.s:1:Rest of line ignored.1st junk character valued 34 (").

The "junk" is the assembler pseudo op .ident. Since my red hat linux has those instructions too, seems that the Next's assembler has a problem... Well, I shouldn't laugh too loud: I had a similar bug a year ago...

To conclude, I will drop #ident; people who want the same functionality can always write

static char *rcsid = "something";

The IDE of lcc-win32 has done this for a few years now, when you choose ‘identify file’ in the ‘Search’ menu. The only problem (which has no solution as far as I can see) is the warning at the end:

warning: static char rcsid is never used

It was this problem that made me accept a user’s complaint about my missing #ident support. But I see that the problem is not solved at all. Let’s stick to ANSI C.

Other fine points in the preprocessor

The standard defines several macros, like __DATE__ or __TIME__, that the preprocessor should set to character strings. But, as always in language issues, things become complex when you consider the following code:

#if defined(__DATE__)



Are those symbols really defined or not? As I found it, the preprocessor did not consider those symbols defined. A careful reading of the standard, however, tells a different story:

6.10.8 Predefined macro names

  1.  The following macro names shall be defined by the implementation:

__LINE__ The presumed line number (within the current source file) of the current source line (a decimal constant).

__FILE__ The presumed name of the current source file (a character string literal).

__DATE__ The date of translation of the source file: a character string literal of the form "Mmm dd yyyy", where the names of the months are the same as those generated by the asctime function, and the first character of dd is a space character if the value is less than 10. If the date of translation is not available, an implementation-defined valid date shall be supplied.

__TIME__ The time of translation of the source file: a character string literal of the form "hh:mm:ss" as in the time generated by the asctime function. If the time of translation is not available, an implementation-defined valid time shall be supplied.

  2.  The values of the predefined macros (except for __LINE__ and __FILE__) remain constant throughout the translation unit.
  3.  None of these macro names, nor the identifier defined, shall be the subject of a #define or a #undef preprocessing directive. Any other predefined macro names shall begin with a leading underscore followed by an uppercase letter or a second underscore.

I presume that the rationale for not defining those names was that, since they can’t be #defined or #undefined, they are not really #defines... In any case I decided that they should be defined in lcc-win32, so that the above construct

#if defined(__DATE__)

should evaluate to TRUE.

In the same vein, I implemented the __func__ pseudo-macro, which evaluates to the name of the current function, if any. Its definition in the formal language of ANSI (under “Predefined identifiers”) reads:


1 The identifier __func__ shall be implicitly declared by the translator as if, immediately following the opening brace of each function definition, the declaration

static const char __func__[]="function-name";

appeared, where function-name is the name of the lexically-enclosing function.

2 This name is encoded as if the implicit declaration had been written in the source character set and then translated into the execution character set as indicated in translation phase 5.

The last point sent me into a long reflection about its meaning... and then I decided that the execution character set is the same as the source character set, and left it at that: I do not translate anything in the name of the function, so if the function name was accepted by the parser, it will appear without any translation in the output.

Another issue is what to do when you attempt to use this identifier outside any function scope. I just decided to return the string “<none>” and emit a warning. The standard is silent about this and, anyway, since this identifier begins with an underscore, it belongs to the reserved name space of the compiler, so you shouldn’t use it.

I added several options to the preprocessor. Specifically, the ‘M’ option, which prints to standard output the names of the included files. This wasn’t especially difficult to do: just look at the function doinclude() in ncpp.c and you will find the printf that outputs the opened file. A more detailed output can be obtained with the option ‘M1’, which lists all files together with the include stack of the preprocessor. This has proved very useful for debugging the complex includes of the OLE automation headers, for instance.

Generating the debugging information

Lcc generates debug information following the NB09 standard, as defined by Intel and Microsoft in their document about debugging information. This means that the debugging info is compatible with the MSVC debugger, which will work better than lcc’s own for some time still.

This information is generated by the code in the file CV.C in the source tree of lcc.

An innovation in lcc-win32 is that I decided to include in the debug information all the #defines that are actually used somewhere in the program. This allows the debugger to give the value of any #define, which can really help when you have statements like:


for example. Now the debugger can show you what is behind that name.

Since the only use of that debug information is to display a value, I decided to store all ‘defines’ as character strings, so no type information is available. The type information is absent from the source anyway. When you write:

#define NUMBER 8

Is this a char? An integer? A short?

Any choice would be arbitrary, so I think it is better to leave things as character strings.

The debug information is stored in a special area and accumulated during the compilation process. Once the compiler has finished, the assembler outputs two sections: .debug$S with the symbol information, and .debug$T with the types defined in the module. More on this later.

Generating the browsing information

Using the browsegen utility, or the -x command line option of the compiler, generates cross-referencing information in .xrf files. These files contain information about all the symbols used by the compilation unit. The name of the .xrf file is derived from the input file: if you compile foo.c, the .xrf file is named foo.xrf.

The browsegen utility is a modified version of the compiler that does just this. It accepts two extra command line parameters that aren’t used in the compiler proper. The first one makes browsegen include in the generated cross-reference file all the symbols defined, even if they are not used in the main file; normally, browsegen outputs a definition only when that definition is used somewhere else.

-q <symbol name>

Query just one symbol. For example, the command line:

browsegen -q WM_COMMAND windows.h

will output a short .xrf file containing just the definition of the WM_COMMAND symbol.

If the symbol ends in an asterisk (WM_*, for instance), browsegen will output the definitions of all symbols that start with “WM_”.

Any tool that wants to gather information about the program can use these files.

Design considerations

The purpose of the browse subsystem is to find quick answers to questions like:

  •  Where is this symbol defined?
  •  Where is this symbol used?
  •  Where is the field <x> of the structure <z> used?
  •  Where is the address of this symbol taken? This is important for finding where the aliased symbols are.

The browse subsystem doesn’t encode the actual type of each symbol it registers; only a very broad classification is done, encoded in the first letter of each line, just before the identifier. The main purpose is to associate with each symbol of the program a file/line coordinate that allows the browser program, or the editor, to position itself at the place where the definition is.

Within this context, it is important to see that the coordinates of a symbol (where it is defined) aren’t necessarily within a single line. A typedef can sometimes span several pages of program text. In those cases it is necessary to give two coordinates for the symbol. Normally, however, most definitions need only a source file and a source line to correctly position the cursor of an editor at the place where the user can read them.

Only symbols that are actually used are included. It is surely not necessary to repeat all the definitions of windows.h within the browse database, which should be kept as small as possible for efficient access. An exception to this rule are the type definitions in the main compilation file: even if those definitions aren’t used anywhere in the program, it is better to add them to the browse index. This poses the problem of name clashes. Suppose that in file1.c we define an enumeration with the same name as a structure in file2.c. This is legal C, but it is not handled in the current version. Done on purpose, this is a horrible programming practice anyway, but an accidental name clash could appear; the browse linker should warn the user if this happens. This is another item currently on the agenda.

To make the editor’s task easier, subtypes are carried along. For instance, when you define:

typedef struct _someTag {
        int a ;
        int b ;
} someTag ;

a declaration like:

someTag s ;

will cause the browse generator to mark ‘s’ as being a _someTag, and to mark the STRUCT name at the end of the line, indicating the type definition.

Another problem arises with preprocessor defines. They can be used in the header files without ever being used by the main program. Currently I add them anyway, but it would probably be better to add them only if they are used in the main file.

The generator ignores all local variable declarations, considering that they are not relevant. The editor (Wedit) is able to find the local variables of any function and show their definitions, but it doesn’t rely on the browse information for that.

The system relies heavily on plain ASCII text files. Each record is written on a line of its own, terminated by a newline character. This sometimes makes for quite big files, but during this development phase, where the format and the kind of information written are bound to change somewhat, it is better to be able to inspect the output with a plain text editor. Other people who would like to use these files will surely find it easier to parse a text file than to read a binary description.

Of course, to keep the size of the generated files small, the information is written using more or less meaningless one-letter codes. This makes the parsing much simpler, even if the generated files look quite ugly to anyone who cares to read them.

Browse file format

Each symbol used or defined in the compilation unit is on a line by itself. The files are plain text (ASCII). I thought for a while of using a binary, more space-efficient format, but this would put the information out of reach of others, who could surely use it as well as Wedit does. So I decided to stay with ASCII.

The file begins with a file table, listing the names of all files opened during the scan. The files are introduced with the letter ‘F’ (main file) or ‘f’ (included file).

The format of each line is as follows:

  •  A single letter giving the type of symbol (the letters are explained below).
  •  The symbol name.
  •  Two numbers, giving the index in the file table for the file, and the line number within that file.

The first letter in the line determines the type of information that follows:

  •  Assignments done to this symbol.
  •  Assignment to a structure field. The name of the field follows.
  •  Loci where the address of this symbol is taken.
  •  ‘E’: Exported data symbol. This is followed by the name of the export, and the line number in the input file where the export is defined.
  •  Exported function. Same format as ‘E’ above.
  •  ‘F’: File. A path for the input file follows.
  •  ‘f’: File. A path for an include file follows.
  •  Loci where this symbol is dereferenced and its value used as a pointer.
  •  Externally defined function. Usually the loci of the prototype.
  •  Parameter passed to browsegen on the command line.
  •  References done to this symbol.
  •  ‘S’: Static data symbol. Same format as ‘E’ above.
  •  Static function. Same format as ‘S’ above.
  •  Typedef. File number, line number, and the line number where the definition starts. At the end can follow the type this typedef refers to.
  •  Structure. The name of the type, the index of the file that defines it, and the line number within that file where the definition ends, followed by the line number where the definition starts.
  •  Member of a structure or union. The name of the member follows.
  •  Preprocessor definition. Followed by the name, the file index of its definition file, and the line number within that file.

The generated files are relatively small. For an input file of 229,826 bytes, for instance, the generated .xrf file is only 23,447 bytes.

Linking the browsing information.

For each source file, then, the compiler generates a corresponding .xrf file containing the location of its definitions. It is then necessary to consolidate this information into a central index that contains the dictionary of the whole program. A small utility called ‘browselink’ does this.

This utility reads all the .xrf files and generates a browse.index file containing the definition loci of all symbols. External symbols are discarded, and only one definition of each symbol is retained. All file references in a .xrf file are relative to that file’s own file table; the linker translates them into indexes into a single global file table.

The format of that file is as follows:

First comes the file table. The format of each line is as above: the letter ‘f’, followed by a space, and then the name of the file. All file references in the identifier entries below are indexes into this table. The indexes start at 1, i.e. file 3 is the third line of this table.

Then come the identifiers, sorted alphabetically. The general format is as above: a ‘type’ letter, followed by a space, the identifier, then the file number, and then the line number where the definition of the symbol appears.

I was tempted to use a binary representation that would be more compact, but then decided that the advantages and simplicity of a textual representation outweigh the space considerations. You can use a normal text editor (notepad will do) to view this file, and all the text-related utilities like ‘grep’ and ‘diff’ will work with it. This makes it easy to write utilities that read this information.

The size of the index is small. For instance, a huge project consisting of more than 15MB of C source produces an index file of only 213K.

The name of the index defaults to browse.index; the file is stored together with the object files in the lcc make directory.

Browsegen and the preprocessor

The utility can also be used to show information about the preprocessor. The option

-showifdeflines

will make the utility print to standard output the line numbers of all the lines in the source file that do NOT contribute to the compilation, i.e. a list of inactive lines.

This option can be combined with the /Fo option to produce that list in a file that can be further processed. With the command:

browsegen –showifdeflines /Foinactive.txt myfile.c

you will obtain a list of inactive lines in the file called inactive.txt.

The long long extension

The long long extension accounts for several dozen rules in win32.md. Each 64 bit operation is implemented as several assembly instructions, since no native 64 bit instructions exist in the x86 series.

I defined first a mem64 target that represents all 64 bit memory references.

mem64: INDIRL(addr) "%0"

mem64: INDIRL(ADDRGP) "%0"

Another new target is the EAX:EDX register pair. This holds a 64 bit quantity in those registers:

rpair: mem64

 movl %0,%%eax  (1)

 leal %0,%%edx  (2)

 movl 4(%%edx),%%edx (3)

Note that I use the edx register as a pointer. First, I load the lower 32 bits into EAX, then use EDX as a pointer to the upper 32 bits. This is not very efficient, especially when the 64 bit memory being accessed is in the local stack frame, but I found it very difficult to express an address addition within the framework of the intermediate language, so I settled for this.

Then I implemented assignment:


movl %eax,%0

leal %0,%eax

movl %edx,4(%eax)

This assigns the value contained in the EAX:EDX register pair to a mem64 operand. The same technique as above is used, this time with EAX as the pointer.

I needed some way to store temporaries in memory. The x86, with its 6 available registers, could hold only two of those numbers anyway, because some registers must be left free so that the machine can perform useful work. This led me to represent a stack allocated intermediate as an ‘argl’ goal.


leal  %0,%%eax

pushl 4(%%eax)

pushl (%%eax)

This leaves the contents of the 64 bit memory location on the stack.

Obviously, it is fundamental that nothing is left on the stack. All rules that leave their results on the stack are followed by other rules that consume a stack location, eventually generating an assignment or a function result.

Using those primitives, I implemented the rest of the operations. Here is an example for  addition:

rpair:ADDL(mem64,rpair)    (1)

leal  %0,%ecx   (2)

addl  (%ecx),%eax  (3)

adc  4(%ecx),%edx  (4)

This rule produces an rpair, i.e. the EAX:EDX register pair (1). It takes a mem64 and a register pair, and adds the contents of the memory location to the pair. It uses ECX as a pointer to the memory operand (2), then performs a 32 bit addition (3) followed by an add-with-carry for the upper halves (4).
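The effect of this add/add-with-carry pair can be sketched in C. This is an illustration of what the two generated instructions compute on the 32 bit halves, not code that lcc itself contains:

```c
#include <stdint.h>

/* What the two 32 bit additions compute: add the low words (addl, which
   sets the carry flag), then add the high words plus that carry (adc). */
uint64_t add64_sketch(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;                  /* addl */
    uint32_t carry = lo < (uint32_t)a;                        /* carry flag */
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32) + carry; /* adc */
    return ((uint64_t)hi << 32) | lo;
}
```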

Not all operations are implemented inline like this example. Division and multiplication are too long to inline; the rules for those operations generate a function call.

The cost of a function call (1 cycle if the procedure is in the cache) doesn’t justify the code bloat that would be the consequence of inlining all those operations.

Here is one of the rules for multiplication:


leal  %0,%%eax (1)

pushl 4(%%eax)

pushl (%%eax)

leal  %1,%%eax (2)

pushl 4(%%eax)

pushl (%%eax)

.extern __allmul (3)

call  __allmul

This rule is straightforward: using EAX as a pointer, I push the first argument of the rule (1); then, using the same register, I push the second argument (2); then I make the call. No stack cleanup is needed, because __allmul cleans up the stack when exiting (this is the _stdcall calling convention).
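The helper itself is not shown here, but a 64x64 to 64 bit multiply built from 32 bit halves — the standard decomposition a routine like __allmul performs — can be sketched as follows (a sketch, not lcc's __allmul source):

```c
#include <stdint.h>

/* Only the low 64 bits of the product are kept, so the aH*bH term and
   the high halves of the two cross products drop out entirely. */
uint64_t mul64_sketch(uint64_t a, uint64_t b)
{
    uint32_t aL = (uint32_t)a, aH = (uint32_t)(a >> 32);
    uint32_t bL = (uint32_t)b, bH = (uint32_t)(b >> 32);

    uint64_t low   = (uint64_t)aL * bL;   /* full 64 bit partial product */
    uint32_t cross = aH * bL + aL * bH;   /* taken mod 2^32, which is enough */
    return low + ((uint64_t)cross << 32);
}
```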

Implementing run time debugging support: the shadow stack.

The motivation for this was that terrible ‘dialog box’ with the register dump that appeared every time an application compiled with lcc crashed. It was completely meaningless, confronting the user with an imposing display of hexadecimal numbers: the values of all the registers.

I decided that a better display would really help people find out where the bugs in their programs were. So I decided to maintain, at a fixed point in the stack, an internal program counter that would always hold the current source line. In the same way, each function would hold in its stack frame an ASCII string with the name of the source module it was compiled from, and the function name.

I then modified the compiler, augmenting the debug levels from -g2 to -g3 and -g4. The two new levels work as follows:

At level -g3, the function stack is available, but not the source line. This mode is really fast, almost as fast as normal code.

At level -g4, the compiler generates code to maintain at all times a pointer to the currently executing source line. This can slow the program down by at most a factor of two.

The implementation follows a proposal described in an article by Dave Hanson, who was kind enough to point me to it.

I maintain a shadow stack, using only one global variable, called just ShadowStack, to make things clear. Actually, that variable contains the last value of the register EBP, the frame pointer.

The function prologue is modified to record the function that is currently active, building the following structure in the stack:

struct ShadowStackList {

char *ModuleName;

int CurrentCoordinate;

char *FunctionName;

};

The members are:

The name of the module where the function is defined. This points to an automatically generated character string.

The current coordinate. This is a 4-byte integer that contains in the upper 16 bits the coordinate of the starting line of the function within the module, and in the lower 16 bits the line number of the current executing line.

The name of the function. This is an automatically generated character string.
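The packing of the coordinate word can be sketched with two small helpers (the names are hypothetical, not lcc's): the starting line of the function goes in the upper 16 bits, and the line relative to that start in the lower 16 bits, so the absolute line is the sum of the two halves.

```c
#include <stdint.h>

/* Hypothetical helpers illustrating the packed coordinate word. */
uint32_t pack_coord(uint32_t start_line, uint32_t relative_line)
{
    return (start_line << 16) | (relative_line & 0xFFFF);
}

uint32_t absolute_line(uint32_t coord)
{
    return (coord >> 16) + (coord & 0xFFFF);
}
```

For a function starting at line 36, pack_coord(36, 0) yields 2359296 (36<<16), the constant the prologue shown below pushes.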

At each line, the compiler generates code to update the current line number in this structure. This is the equivalent of what would happen if at each line of your program you wrote:

dosomething();LOWORD(ShadowStack.CurrentCoordinate) = 365;

At each function call, the compiler generates code to restore the shadow stack to the current value after the call.

The entry point of the main() function (or the WinMain() function for windows programs) is augmented with code that installs the trap handler using the signal() system function.

When a trap occurs, the code in the startup detects it and calls the _lccShadowStack() function, which walks the stack frames, showing all the functions that are on the stack at the moment of the fault.

This is all very good if the stack is not corrupted, of course. To make recovery possible when there is a problem with the stack, at level 5 (-g5) the compiler emits code at the function epilogue to save EBP before doing the actual return. Since EBP is never used for anything else (at least not in any code generated by lcc), and it is saved immediately upon procedure entry, this value should allow the stack walking code to detect the situation and at least show in which function the stack fault occurred.

Another feature of the –g4 option is that I test at the function’s epilogue whether the stack has been corrupted. I test specifically that the pointer to the function name is still there, so that any stack overwrites get caught early.

The generated code for a function’s prologue then looks like this:


       pushl   %ebp

       movl    %esp,%ebp ; establish stack frame

       pushl   $_$Module ; push the name of the file

       pushl   $2359296  ; push the starting line number

; shifted left 16 bits (36<<16)

       pushl   $_$Ffoo  ; push the function name

       movl    %ebp,__ShadowStack ; save frame pointer

       pushl   %esi  ; save registers

       pushl   %edi

       .line   37

       movb    $1,-8(%ebp) ; relative line nr goes into the lower 16 bits of the line

     ; number field

At the function’s exit, the compiler generates code to check that the pointer to the name of the function is still there. This catches pointers that overwrite stack values.

       cmpl    $_$Ffoo,-12(%ebp); test if pointer still there

       je      _$G4foo  ; if it is, nothing is done

       pushl   $0   ; call lccStackTrace

       call    __lccStackTrace

       addl    $4,%esp


       popl    %edi  ; exit function

       popl    %esi
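Given this frame layout, the walk performed by _lccShadowStack() can be sketched in C. This is an illustration, not lcc's source: it assumes pointer-sized stack slots, so the slot indexes correspond to EBP-4, EBP-8 and EBP-12 on a 32 bit machine.

```c
#include <stdio.h>
#include <stdint.h>

/* Each -g4 frame has, relative to its EBP: the module name one slot
   below, the packed coordinate two slots below, and the function name
   three slots below. The word at EBP itself holds the caller's EBP,
   forming a linked list. Returns the number of frames visited. */
int WalkShadowStack(void **ebp)
{
    int frames = 0;
    while (ebp != NULL) {
        void **slot = ebp;
        char *module   = (char *)slot[-1];               /* EBP-4  */
        uint32_t coord = (uint32_t)(uintptr_t)slot[-2];  /* EBP-8  */
        char *function = (char *)slot[-3];               /* EBP-12 */
        printf("%s!%s, line %u\n", module, function,
               (unsigned)((coord >> 16) + (coord & 0xFFFF)));
        frames++;
        ebp = (void **)*ebp;                             /* caller's EBP */
    }
    return frames;
}
```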



Error handling within lcc-win32

The compiler will stop after a fixed number of errors. In general, error recovery is very simple, and in many cases a missing ‘;’ will throw the compiler into a cascade of errors where only the first few are relevant.

The production versions of the compiler run without checking any of the assertions specified in the code, i.e. with NDEBUG defined. In some cases this provokes a trap within the parser, especially where a type is not correctly initialized because of an error. All traps are caught by a global error handler established by the main() function at startup. This trap handler first looks at the contents of the nerrors variable, which keeps the count of the hard errors encountered so far. If it is greater than zero, the trap is likely a direct consequence of previous errors, so the message shown to the user is:

Confused by earlier errors. Goodbye.

And the compiler quits.

If the error count is zero, this is a genuine compiler error, and it is acknowledged to the user. The exact wording of the message is:

Compiler error (trap). Stopping compilation.

I think this is an efficient and quite appropriate solution for the problem. The code is kept small, avoiding the clutter of many tests for ‘obvious’ things like NULL or bad pointers, and those tests are left to the hardware, which performs them much faster than software could.
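The choice the trap handler makes reduces to a single test on the error count; a sketch, using a hypothetical helper name (lcc's handler does more, of course, such as actually exiting):

```c
/* Hypothetical helper: a nonzero hard error count means the trap is
   presumed to be a consequence of earlier errors; otherwise it is
   reported as a compiler error. The two strings are those quoted above. */
const char *TrapMessage(int nerrors)
{
    if (nerrors > 0)
        return "Confused by earlier errors. Goodbye.";
    return "Compiler error (trap). Stopping compilation.";
}
```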

The information flow in lcc-win32

All input to the compiler enters the system through the preprocessor (file ncpp.c). There the files are opened, read in, preprocessed, and copied into a buffer supplied by the input.c routines. The central routine forming the interface between the preprocessor and the rest of lcc is ReadFromCpp(), which fills the buffer passed to it with preprocessed code.

If the -E option is on, the stream from the preprocessor is not processed further, and goes directly to the .i (or other) output file. No code is generated.

In the normal case, the preprocessed ASCII stream is passed to the lexer, specifically to the gettok() function, which classifies the input into tokens, reads character strings and other constants, etc. The lexer emits a token stream that is further processed by many functions in the compiler, which transform it into a series of ‘Trees’.

Those ‘Trees’ are then passed to the function that emits the intermediate language: listnodes(), in dag.c.

The dags are then reconverted into trees in the technical sense, i.e. hierarchical structures without any cycles. Common subexpression elimination is done during this conversion, since a node that is referenced more than once is a common subexpression. Those common subexpressions are recorded and assigned to temporary variables or registers.

When the end of each function is reached, the whole code list is passed to the code generator, which labels the trees to determine the best rule that will execute each tree with minimum cost. The labeler is generated automatically by the lburg utility from the machine description in win32.md, and is located in the source file win32.c.

When the labeling operation is done, the resulting labeled tree list is passed to the register allocator, which determines which registers to use to evaluate each expression. All the machinery of the register allocator and the spiller is located in gen.c. If the user has specified the -z option to see the intermediate language code, the printing is done at this stage.

After register allocation is done, the code goes through yet another pass: the emitter, which generates an ASCII stream of assembler instructions. If optimization is turned on, the stream is redirected to the peephole optimizer; if not, it is passed directly to the assembler.

The assembler is the only module that does any output: the final object file. If the -S option is on, i.e. if the user has specified an assembler output, the stream of ASCII machine instructions will be printed before passing it to the assembler.

The source files of the compiler proper

bind.c. This is used in the context of a cross compiler, a feature not supported by lcc-win32. Only one machine description (the x86 machine description) is available.

profio.c. This is used for the profiler, but the lcc-win32 profiler doesn’t use this file.

prof.c. Same as above.

trace.c. Prints function call trace information. Not used in lcc-win32.

seh.c. The structured exception handling module. The entry point is the function dotry(), which is called directly from the lexer (in lex.c) when a “__try” construction is seen. The whole exception handling machinery works underneath the compiler proper; all its constructs are invisible to it, being processed within the lexer.

tree.c. The generation of the tree structures. This is the first intermediate representation, and from this one, the dags are later generated. You will find in this file the tree constructors, and the main function ‘root()’, that dispatches according to the type of expression it sees.

string.c. String handling and storage. The compiler always writes a symbol only once. The hashing of symbols into the hash table is done in this file.

alloc.c. Memory management routines.

lex.c. The lexer. The main function is ‘gettok()’, that returns the token scanned. Other quite unrelated functions live here, like the ‘assem()’ function, that parses the arguments of the _asm() construct, or most of the functions that do pragma processing.

expr.c. Expression parsing. You will find here the routines for expr0, expr1, expr2, incr, value, expr3, unary, postfix, primary, idtree, rvalue, lvalue, retype, and others, all dealing with trees.

w32incl.c. The x86 back end. This has two main parts: the automatically generated routines produced by lburg (in the file win32.c), and the file w32incl.c itself, which contains the ‘function()’ routine that controls code generation for a whole function.

decl.c. Parsing of declarations. Here is the entry point from the front end, called directly from the ‘main()’ function: program(). Parsing of all declarations starts here, with functions to parse argument lists, functions, etc.

gen.c. The code generator. The main function here is the ‘gen()’ function, that generates the code, invoking the labeler in win32.c. Another important function is the ‘emitasm()’ function, that outputs the assembler mnemonics.

dag.c. The generation of the intermediate representation as Directed Acyclic Graphs (DAGs). These dags are converted to normal trees by the function ‘visit’, in this same file.

analysis.c. First part of the optimizer. Scans the converted dags looking for certain patterns.  The function ‘SetupRegisterVariables’ decides which variables will be assigned to registers.

cv.c. The generation of the NB09 debug information.

enode.c Emits nodes for certain constructs like call, and others.

error.c Displaying errors and warnings.

event.c. Not used by lcc-win32.

simp.c. Simplifies expressions. The main function here is called simplify. It tries to replace costly operations by cheaper ones, for instance replacing a division by 4 with a right shift; it drops unneeded operations like additions of zero or multiplications by one, and performs many other transformations. The input to simplify is the trees generated by the parser, before they reach listnodes, which builds the DAGs from them.

init.c. Handles initialization of tables, structures, or plain variables.

input.c. Handles reading from the preprocessor.

win32.c. The tree labeler. This file is generated automatically from the machine description by the utility lburg.exe. Please do not edit it, since it is regenerated at each modification of the machine description.

intrin.c. Intrinsic and MMX functions. These functions will be detected by the function ‘call’ that lives in enode.c. The processing of the intrinsics is done later, at the register allocation phase, and then in the code-emitting phase, where special functions will be called to actually emit the code for the intrinsic.

list.c. List handling utilities.

asm.c. The assembler. All binary opcode generation is done here, together with the construction of the object file and the debug sections.

ncpp.c. The preprocessor of lcc-win32, written by Dennis Ritchie.

stmt.c. Parsing of statements. The main function in this file is not surprisingly ‘stmt()’, that coordinates the whole statement parsing, calling ‘ifstmt()’, ‘forstmt()’, and all others.

null.c. Not used by lcc-win32.

msg.c. All the character strings containing the messages of the compiler. It suffices to translate this file into another language to obtain a compiler that speaks your native tongue.

optim.c. The peephole optimizer. Its entry point is the routine ‘ChangeBlock()’, which processes a pseudo basic block. It calls the ‘FindEndBlock()’ function, which parses each instruction into its corresponding structure. The structure used to describe an instruction is the ‘Instruction’ structure, defined at the beginning of the file. The peephole optimization proper is done in the function ‘FindStateChanges()’, a huge function making up most of the file.

sym.c. Handling of symbols, symbol table, scopes, etc.

main.c. The ‘main()’ function. Parsing of command line arguments and flags settings are done here.

output.c. Output routines, and the ‘print’ function.

types.c. Type definition and related code.

win32.md. The machine description for the x86 back end of lcc. This machine description is used by the lburg utility to output the win32.c file, which implements the code labeler.

Building the compiler

The sources of the compiler are located in \lcc\src\lccsrc. They come with a makefile, so in principle all you need to do is type ‘make’.

You need to have the ‘lburg’ utility in your path. Besides, the first time, you should copy the lcc.exe binary from \lcc\bin into that directory. The reason is that the makefile is configured to compile the compiler with itself in a loop: the makefile uses the lcc.exe residing in its own directory rather than the one in \lcc\bin. This way, when you make a change, you can test whether the compiler can still compile itself, without touching the compiler that you know is working.

You may want to configure some of the flags in the makefile:

ASM_LIB This flag decides whether the new compiler will include the assembler accelerator functions contained in libasm.asm or not. Note that you should delete the object file from the list of object files if you do NOT define this option.

NDEBUG This flag decides whether the assertions will be included in the generated code.

-O Optimization flag.

If you have a Pentium II or higher, you should add the –p6 option to the compiler flags.

The makefile comes configured with ASM_LIB, with NDEBUG, and with the assembler accelerators, i.e. a maximum speed setting. You may want to change this while you are modifying the compiler.

You can use other compilers to build lcc, if you prefer. You should obviously turn OFF the assembler accelerators, and build a project for the compiler of your choice. Please remember to set the structure packing flag. Normally there shouldn’t be any dependencies on packing in the code, but you never know. You will experience problems if the other compiler doesn’t support 64 bit integers; in that case, please turn off all the 64-bit int support. Obviously, the compiler generated this way will be incomplete. The places to watch are:

lex.c : function readatoi64().

w32incl.c : function defconst().

c.h : The union ‘value’ should be modified to eliminate the 64-bit integer member.

Lcc’s assembler


The tasks that are done in the assembler file asm.c are mainly the following:

  1.  Obviously, assembler mnemonics generated by ‘genasm’ are translated into machine code.
  2.  For each compilation unit, an object file is built using the guidelines specified by Microsoft for the construction of Coff object files.
  3.  The debugging information generated by cv.c is stored in the debug sections of the object file.

The construction of the assembler was one of the first tasks in lcc-win32. For its Dos version, lcc used Borland’s Turbo Assembler and a Dos extender to run in 32 bits. Since I wanted to build a self-contained tool, an assembler was inevitable. There are a few public domain assemblers that I looked at.

The first choice was obviously Gnu’s GAS (the Gnu Assembler). I started looking at it, trying to see how I could separate it from their ‘BFD’ (Binary File Descriptor) machinery, but I found the task so enormous that I gave up. It was impossible to see how the assembly process was done in an enormous mass of bloated code. GAS is ‘machine independent’. What this means for an assembler is not at all clear to me. In any case, each piece of functionality of the program is redirected through function tables that dispatch to the right machine dependent section of it.

This is maybe very clever, but if you want to follow the actions of the program... well, forget it. You are confronted with code like:

(* TableOfFunctions[TableOfIndirections].fn)(arg1,arg2);

Then I examined different assemblers in the public domain. The best I found was the one in the ‘emx’ package for the OS2 system. It was a small (compared to GAS) program, and the source seemed clean and usable. The problem was, obviously, that it generated OMF format files for OS2. This wasn’t so bad. I extracted the section of the program that generated the 386 machine codes and gave it an overall ‘simplification’ pass. Actually, the only things that remain from the original program are the tables of machine instructions and the algorithm for parsing instructions. Still, I am indebted to Eberhard Mattes for publishing his program.

In patient work that lasted a few months, I added the code needed to handle the fixup instructions for Win32, the code needed to build the object file, and the building of the different sections.

The format of the x86 instructions

The x86 integrated circuit is an interpreter. It reads the instruction stream, decodes each instruction, and interprets it, doing what the numbers tell it to do.

Depending on the instruction, x86 CPU instructions follow the general format shown in the schema below. These instructions vary in length and can start at any byte address.

An instruction consists of one or more bytes that can include:

Prefix byte(s),

At least one opcode byte(s)

Mod r/m byte

s-i-b byte

Address displacement byte(s)

Immediate data byte(s).

An instruction can be as short as one byte and as long as 15 bytes.


Following this schema, the core of the assembler, the function Asm386Instruction, first parses the optional instruction prefix and the mnemonic for the operation (the opcode field).

If the operation is a repeated string operation (movsb, or any such), there are obviously no arguments to parse, since the arguments (the registers ESI and EDI for source and destination) are implicit in the instruction. Otherwise, the assembler parses the arguments of the instruction.

Since I started with the machine code tables of lcc for the linux system, I kept their non-compatible assembler syntax. That is the syntax used by gcc: instead of putting the destination first and then the source, as all other x86 assemblers do following a convention introduced by Intel, gcc does the inverse: the source comes first, and then the destination. Since in the linux system gcc is the (only) compiler, the instruction tables followed gcc’s syntax. I decided not to change this, even though that decision later provoked several problems in the disassembler used by the debugger, where I try to hide that non-standard syntax from the user.

The parsing of the instruction’s arguments is done in the i386Operand routine. There the different addressing forms are translated into decorations of the global ‘i’, which describes the current instruction. Those decorations are used later by the calling function ‘Asm386Instruction’ to emit the instruction bits.

When the (optional) arguments are parsed, Asm386Instruction makes several tests for special situations. One of them, for instance, is the test for the INT 3 instruction, which is reserved for breaking into the debugger. We will see in the debugger chapter why this instruction MUST be only 1 byte long.

These tests over, it continues with the output of the instruction: jumps are written first, followed by the other instructions. I left in the code the tests for all instructions, even those that lcc will never emit. Those instructions (like lcall or ljmp) could be useful later to others who would like to build another tool.

Here are some examples for instructions, and the binary code generated for them.

1. Move a memory location into a register

movl _Global,%esi  8b 35 00 00 00 00

     8b = 10001011  (139 decimal)

     35 = 00 110 101 (53 decimal)

The opcode for the move instruction is 8b. Then we have the mod/rm byte. This byte contains three fields: the mod field, the reg field and the r/m field.

The mod field is the first two bits of the modr/m byte. In this example they are 00.

The reg field is the next three bits. They have the value 110 (binary), specifying register number 6: esi.

The r/m field combines with the mod field to form 32 possible values: 8 registers and 24 addressing modes. In our example the value is 00 101, i.e. a 32 bit displacement: the offset of our ‘_Global’ within the data segment. If we change the register to edi, i.e. we write

movl _Global,%edi   8b 3d 00 00 00 00

the binary sequence changes only slightly: only the modr/m byte changes.

      3d = 00 111 101

Our mod and r/m fields stay the same, of course, since we are still accessing the same ‘_Global’; only the register number changes to 7 (111 binary), meaning register edi.

And what about those 4 bytes of zeroes?

Well, here the assembler realizes that this is a memory access to a symbol whose real address only the linker will know. The assembler limits itself to emitting the 4 zeroes as a placeholder, and emits a ‘relocation record’ in the object file, i.e. in today’s fashionable ‘internet’ way:

FROM: lcc

TO: linker

SUBJECT: Relocation at offset 2645

Dear linker:

Would you mind putting here the effective address of the _Global variable?

The ‘_Global’ symbol is the 354th in the symbol table. Please do a DIR32NB reloc.

Yours truly


We will see the format of the relocation instructions below.

Emitting those relocations correctly was one of the tough tasks in building the assembler. An incorrect relocation record would destroy the code somewhere else, because the linker would patch the wrong location! This could make the program crash, or could stay invisible if the program never accessed that part of the code...

I will explain the relocation records more fully below. Let’s come back to our instruction encoding chores.

A more sophisticated example is:

movl  %eax,(%esi,%edi,1)  89 04 3e

This means that the contents of eax are stored at the address formed by the base register esi plus the index register edi, scaled by 1.

Here we are introduced to the ‘sib’ byte mentioned above.

We have:

    89  opcode

    00 000 100 modr/m byte

    00 111 110 sib byte

In this byte we have three bit fields:

1. The SS field, or scale field, is the two most significant bits of the byte. In this case it is zero, meaning a scale of 2 ^ 0 = 1.

2. Then (bits 5, 4 and 3) we have the index field, 111, meaning register edi for the index.

3. Then we have the base field, 110, meaning the esi register for the base.
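The packing of these fields can be sketched as two small helpers (hypothetical names; lcc's assembler builds the bytes inline rather than through functions like these):

```c
#include <stdint.h>

/* mod (2 bits) | reg (3 bits) | r/m (3 bits) */
uint8_t modrm_byte(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* ss scale (2 bits, log2 of the factor) | index (3 bits) | base (3 bits) */
uint8_t sib_byte(unsigned scale, unsigned index, unsigned base)
{
    return (uint8_t)((scale << 6) | (index << 3) | base);
}
```

With the examples above: modrm_byte(0, 6, 5) reproduces the 0x35 byte of the first example, and sib_byte(0, 7, 6) the 0x3e byte of the second.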

For a detailed description of how each instruction is disassembled, look in the pedump source tree, specifically at the file disasm.c. There you will find the disassembler, which is an adaptation of several files from gnu’s as program. The entry point of the disassembler is called ‘DoDisassemble’; it outputs the disassembled instruction into a buffer.


Since the assembler makes only one pass over its input, it is confronted with the problem that the exact size of some instructions is not known. For instance, when it sees:

jmp _$L2





The position of the ‘_$L2’ label is not known. We do not know whether we can generate a jmp instruction with a byte displacement or must use a 4 byte displacement until we find the _$L2 label. We could solve this by always generating a 4-byte displacement, but this would be really wasteful of RAM. The solution Eberhard Mattes used in his program is a ‘Fragment’ data structure that can grow if needed. I kept that, of course, even though the code is very difficult to follow due to a somewhat confusing macrology.
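The size decision itself, once a label is finally resolved, is simple. A sketch (not Mattes' fragment code, just the test the fragment machinery ultimately performs, assuming the usual encodings: 2 bytes for the opcode + rel8 form, 5 bytes for opcode + rel32, with the displacement counted from the end of the instruction):

```c
#include <stdint.h>

/* Once the label address is known, check whether the 2-byte short jmp
   form can hold the displacement; otherwise the 5-byte form is needed. */
int jmp_fits_rel8(uint32_t jmp_addr, uint32_t label_addr)
{
    int32_t disp = (int32_t)(label_addr - (jmp_addr + 2));
    return disp >= -128 && disp <= 127;
}
```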

The format of object files

Object files contain more than code bytes. They contain the initialized data of the module, the size of the non-initialized data (bss), information for the debugger, and maybe some other stuff. This has to be organized in a standard way so that utilities are interoperable.

In a radical departure from the Windows 3.1 days, Microsoft decided to follow the already existing COFF standard. Before Win32, Microsoft used the OMF (Object Module Format) file format, developed by them and IBM for the MS-DOS system. IBM stayed with the OMF format for OS2, extending it to accommodate larger fields, since OS2 is a 32 bit OS too. Microsoft decided otherwise, and introduced the Unix COFF standard for their new system. Actually, this is not a forced decision: in the first versions of the Borland compilers, the object format produced was still OMF, and only the linker was changed to produce a COFF executable.

Since I hadn’t written the linker anyway, I needed an existing linker to test my code, so I had no choice but to use COFF from the start.

Object files, within the Coff system, have the following structure:

  1.  A ‘section’ header, indicating where in the file each section is stored, its length, and other characteristics.
  2.  The different sections generated by lcc-win32 are:

The .text section which, despite its name, contains not text but the machine code for the compilation unit stored in this object file.

A .data section containing the initialized data for the file. In this section go all the statements like:

int startingPos = 45987;

A .bss section, containing the non-initialized data. In this section go statements like

int Table[645];

A .debug section containing the debugging information

Other special sections

  3.  A Symbol Table, describing the symbols contained in the compilation unit.

  4.  The contents of the different sections.

For all the details of the sections in object files, see the pedump sources, in the file ‘common.c’.


Each section can have ‘relocation’ instructions. These appear only in object files. They specify how the section data should be modified when placed in the image file and subsequently loaded into memory. They are fixed-length records, and each element of the array has the following format:

Position and size


0, 4 bytes

The address of the item to which the relocation should be applied. This is an offset from the beginning of the section, plus the starting address of the section.

4, 4 bytes

The index into the symbol table of the object file, referring to the symbol that is being relocated. In our example, the symbol ‘MyTable’.

8, 2 bytes

The type of relocation being performed. There are several relocation types possible. I describe later which ones lcc-win32 uses.
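These three fields correspond to the IMAGE_RELOCATION record declared in the Win32 SDK headers, sketched here with explicit typedefs. The packing pragma is my addition to make the in-memory layout match the 10-byte on-disk record; the real header has its own packing arrangements:

```c
#include <stddef.h>

typedef unsigned int   DWORD;  /* 32 bits */
typedef unsigned short WORD;   /* 16 bits */

#pragma pack(push, 2)          /* the record is 10 bytes on disk */
typedef struct _IMAGE_RELOCATION {
    DWORD VirtualAddress;      /* offset of the item to patch */
    DWORD SymbolTableIndex;    /* which symbol the patch refers to */
    WORD  Type;                /* kind of patch to apply */
} IMAGE_RELOCATION;
#pragma pack(pop)
```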

Basically, the linker builds the executable from the different modules. When it does this, it mixes all the .text sections into a single .text section in the executable file. This would be impossible if all references to offsets weren’t described in each object file in a relative fashion.

For each symbolic reference that ‘genasm’ emits, the assembler generates the code for the instruction and leaves a zero in the object file. This position is recorded by the assembler for building the relocations at the end of each object file.

We then have records of this type:

Offset: the start of the instruction plus the offset of the address part within the instruction, i.e. 2 bytes past the start of the instruction itself.

Symbol table index: there is nothing special here; it is just what its name says, an index.

Type of relocation: in this case 7, i.e. a direct 32-bit relocation relative to the base of the section.

Another type of relocation arises when there is an instruction like:

Mnemonic   Code

call _strcpy  e8 00 00 00 00

The address of the ‘strcpy’ function is not known at assembly time. It will be known to the linker only. The assembler just generates the code and the following relocation instruction:


Offset: the offset of the first zero byte within this section.

Symbol table index: where in the symbol table the _strcpy symbol is.

Type of relocation: PC-relative relocation.

All call instructions generated by lcc use program-counter-relative offsets. This relocation instructs the linker to calculate how far away the ‘strcpy’ function is from that position in the section, and to put that difference into the 4 bytes filled with zeroes.

An optimization here is obviously possible: When the function being called is in the same module, the offset to that function is fully known to the assembler, and we can just put it in there and erase any relocation instructions, since both addresses are known.28

The symbol table

Each object file has a symbol table specifying the names and addresses of the symbols in it.

The structure of a symbol is defined as follows:29

typedef struct _IMAGE_SYMBOL {
    union {
        BYTE    ShortName[8];
        struct {
            DWORD   Short;     // if 0, use LongName
            DWORD   Long;      // offset into string table
        } Name;
        DWORD   LongName[2];    // PBYTE [2]
    } N;
    DWORD   Value;
    SHORT   SectionNumber;
    WORD    Type;
    BYTE    StorageClass;
    BYTE    NumberOfAuxSymbols;
} IMAGE_SYMBOL;

Symbol names can be up to 8 characters long. If they are 8 characters or shorter, the union member ‘ShortName’ is used; if not, the member ‘LongName’ contains the index of the name in the string table.

In an object file, the value corresponds to the address of this symbol relative to the beginning of the file. In an executable, it corresponds to the address of the symbol when loaded.

The section number indicates which section this symbol belongs to. For instance one would be the text section, etc. The numbers aren’t fixed, but are actually indices into the section table.

The type field describes the type of the symbol. Lcc-win32 doesn’t generate this kind of information because it already generates debug type information in the .debug section.

The storage class field can describe many other properties of the symbol, but lcc-win32 uses only the following ones:

  1.  EXTERNAL (2). This symbol has public scope.
  2.  STATIC (3). This symbol has local scope.
  3.  FILE (103). This is a file name type of information.
  4.  LABEL (6). This is a code label.

The code that generates the symbol table is located in asm.c.

Pseudo instructions

The assembler receives directives from the compiler for reserving space (the .space directive), for changing the current section, and for other chores: setting the name of the current file, or building initialized data tables (the .long or .byte instructions). These instructions are fairly straightforward, and they present no special difficulty.

Further work

The assembler uses up between 13% and 15% of the total compilation time with no optimizations. This could be greatly reduced if lburg generated a table of opcodes instead of just ASCII mnemonics.

Another big problem is the assembler syntax, inherited from the machine description for the LINUX system. It is utterly incompatible with every other assembler syntax, and it should be eliminated, in my opinion. But rewriting the assembler’s parser is a job that would cost me months of development...

The debug section

If your compiler system foresees any use of a debugger (and I wouldn’t call a system a compiler system if it doesn’t), you have to solve the problem of the debug information. This means:

Encoding all the type, symbol, and line information in each object file.

Linking all this information at the link stage.

Reading all the information in, from within the debugger.

When I started this project, several alternatives for debug information format were available:

  1.  Using the ‘stabs’ format of gcc,
  2.  Using the CodeView (Microsoft’s debugger) format.
  3.  Using dwarf.
  4.  Using a new debug information standard.

After reading the specifications of several formats I decided to use Microsoft’s. This decision was almost forced, because at that time gcc (and gdb) did not run very well in the Win32 environment. The big advantage of using Microsoft’s format is that the generated code would be compatible with all of Microsoft’s tools, including their debugger. I could test whether I was generating the right thing by using debugged tools like Microsoft’s linker, and I could debug the first programs using their debugger. If I had used another format, I would have been confronted with the problem of not being able to see whether I was generating the code correctly until the debugger was finished!

The Coff system has defined very primitive type/symbol information. Lcc’s debugger uses some of this information, but it has only a marginal role.

To implement Microsoft’s debug information I wrote ‘cv.c’, which is called by lcc when generating code if the ‘debug’ global flag (the variable ‘glevel’) is set. In the file ‘dag.c’ I added, at the beginning of each block, a call to ‘GenBlockInfo’ to generate debug information for all variables that are used within the current block.

Besides this, I modified the functions ‘stabsym’ and ‘stabtype’ to generate debug information in the new format in the file ‘w32incl.c’.

The format of the debug information

This format is described in Microsoft’s document ‘CodeView Symbolic Debug Information Specification 3.1’. Besides, in an effort to promote standards for its x86 architecture, Intel Corp. has started a project where the information concerning the debug specification can be found.

Within lcc, I generate symbolic (.debug$S) and type (.debug$T) information. This will allow the debugger to make sense of the addresses and instructions found during program execution and allow it to present the information in a meaningful way to the user. Basically it will allow interpreting the data, to know the symbolic names that were used in the user’s program, and to position itself in the source code files.


The ‘Symbols’ records are variable length records that contain the following fields:

  1.  Length (2 bytes). Specifies the length of the record, excluding the 2 bytes of the length field.
  2.  Type of the symbol (2 bytes)
  3.  Data specific to each symbol.

The Type field is either a user-defined type (with a value of 4096 or bigger) or a primitive type (like ‘integer’ for instance) that is predefined by Microsoft. The predefined types are very rich, and lcc uses only a small subset of them: the ‘Cobol’ parts of the specification are obviously ignored, as is the symbolic information for Pascal or C++ types.




0x200 BP Relative. Data: the offset of the symbol within the stack frame (4 bytes), the type (2 bytes), and its name. Used for all the local variables of a procedure.

0x201 Local Data. Data: the offset within the section, the section number, and the name of the symbol. Used for the static variables of the module.

0x202 Global Data. Data: the offset within the section, the section number, and the name of the symbol. Used for the global public symbols of the module.

0x204 Local Procedure start. Data: the length in bytes of the procedure’s code; the offset in bytes from the procedure start to the point within the procedure where the stack frame has been set up; and, in a similar fashion, the offset to the last byte before the current stack frame becomes invalid. Used for local procedures.

0x205 Global Procedure start. Format identical to the Local Procedure format. Used for global procedure descriptions.

0x207 Block start. Data: the length of the scope that this block defines, and the name of the block. Used to define procedure scopes.

All symbol names are written with the length of the name in the first byte, followed by the bytes of the name.

Simple Types

The format of the type information is similar to the symbol information. It consists of a 2-byte length, followed by a ‘Type string’, as it is called in Microsoft’s terminology. This type string is a series of leaf structures that contain a 2-byte type, followed by additional data.




0x2 Pointer. Data: the attributes of the pointer (2 bytes), and the type of the object pointed to (2 bytes). All pointers in the code generated by lcc are 32-bit flat-model pointers.

0x3 Simple Array. Data: the type of the objects that form the array (a 2-byte type index); the type of the index that can be used to scan the array (always an integer in C); the length of the array in bytes; and the name of the array. This record will be generated for simple tables like:

int Table[345];

0x6 Unions. Data: the count of fields in the union, and an index into the list of types for each member, as in structures.

0x5 Structures. Data: the count of fields in the structure, and an index into the list of fields. Describes structures. Most fields are not used, since the same record is used to specify classes.

0x7 Enumeration. Data: the number of enumerates (2 bytes), and the type of each. Lcc will always put ‘integer’ in the type field.

Types referenced from the ‘types’ record

This record encodes for complex lists like the lists of arguments of a procedure or the list of fields in a structure.

Lcc uses only two record types:

0x201 Argument list

0x204 Field list

A field list contains the descriptors of the fields of a structure, class, union, or enumeration. The field list is composed of zero or more subfields. To complicate things further, because of the requirement for natural alignment, there may be padding between elements of the field list.

As a program walks down the field list, the address of the next subfield is calculated by adding the length of the previous field to the address of the previous field. The byte at the new address is examined and if it is greater than 0xf0, the low four bits are extracted and added to the address to find the address of the next subfield. These padding fields are not included in the count field of the class, structure, union, or enumeration type records. If the field list is broken into two or more pieces by the compiler, then the last field of each piece is an LF_INDEX with the type being the index of the continuation record. The LF_INDEX and LF_PADx fields of the field list are not included in field list count specified in the class, structure, union, or enumeration record. See Section 3.5 for field list elements.


Suppose the following program:

enum e { F1,F2,F3,F4};

int main()
{
       enum e j=F3;
}
We can display the debug information using the ‘pedump’ utility. We use the command line:

pedump /D program.obj

In the above example we obtain :

section 03 (.debug$S)  size: 00134  file offs: 00730

[  1] offset 4 length 16 index 9 (0x9H) Obj signature 0, name 'tenum.obj'

[  2] offset 22 length 41 index 517 (0x205H) Global Procedure '_main'

     Parent 0 End 0 Next 0 Length 44 debStart 18 debEnd 39 offset 0

     Segment 0 proc record type index 4099(3) flags 0x0H

[  3] offset 65 length 10 index 512 (0x200H) Automatic 'j'

     EBP offset -4 type: '0x1001 (4097 1)'

[  4] offset 77 length 2 index 6 (0x6H) Block End

[  5] offset 81 length 51 index 1 (0x1H) Compile Info:

     Machine 0x4 Language 0 Flags1 0x0 Flags2 0x0

     'Logiciels/Informatique lcc-win32 version 2.4'

In the .debug$S (symbols) section there are five records. They begin at offsets 4, 22, 65, 77 and 81 relative to the start of the section, and they are presented indexed with square brackets by pedump.

The first record contains the object file name and signature field. Lcc doesn’t fill the signature field.

The second symbol record is the information for the global procedure ‘_main’. Lcc doesn’t fill the ‘Parent’ field, since there are no nested procedures in C. The important fields are the procedure length, and the start/end of the debugging information. The numbers represent the number of bytes from the first byte of code of the procedure. Another field here is the ‘type index’ field, in this case 4099. This tells us that the type description for this procedure, i.e. the types of its argument list, is to be found in type record 4099. Since user-defined types start at index 4096, the record containing that information is actually the fourth type record in this file.

The third record contains the description of the local variable ‘j’ in the C code above. It tells us that the value of this variable is to be found at EBP - 4, and that its type description has an index of 4097, i.e. it is the second record in the types section30.

The fourth symbol record is just an indicator that the list of local variable symbols finishes here.

The fifth symbol record is a signature emitted by the compiler describing the translator, the ‘company’ that made it, etc. This record finishes the .debug$S section.

The .debug$T section contains the types defined by this small program. Pedump will display this section like this:

section 04 (.debug$T)  size: 00090  file offs: 00884

[  1] Offset 0 length 50 index 516 (0x204H) Structure member list

     Enumerate. Attribute=0  Value: 0 F1

     Enumerate. Attribute=0  Value: 1 F2

     Enumerate. Attribute=0  Value: 2 F3

     Enumerate. Attribute=0  Value: 3 F4

[  2] Offset 52 length 14 index 7 (0x7H) Enumeration: 'e' 4 members

     Enum type 32 bit signed int,fields 4096 (0)

[  3] Offset 68 length 4 index 513 (0x201H) Argument list. 0 arguments

[  4] Offset 74 length 10 index 8 (0x8H) Procedure:

     return type: 116 call 0 parameters 0 arglist idx 4098 (2)

Here we find four records. The second one was the one indicated by the third symbol record above. It is an enumeration description, telling us the name of the enumeration, and that the field list of the enumeration starts at offset 0 (i.e. with the index 4096) of the .debug$T section.

Looking at offset zero (the first record) we indeed find a list of enumerates, one describing each member of the enumeration.

The other type record referenced from the ‘symbols’ section was the fourth, i.e. the record that describes the ‘_main’ procedure. This record defines the return type of the procedure (116, i.e. an integer), and points to the record containing the description of the procedure arguments: the third type record. Since we declared main with no arguments in the C code, this record is empty.

To get a detailed description of the debug information, look in the ‘pedump’ source tree for the ‘DumpExeCodeViewInfo’ function, in the source file common.c. There you will see the start of the debug info dumping functions of pedump. All the available documentation about that format is included in that source file in the form of comments. You can follow pedump under the debugger to see how it goes about unraveling the binary formats.

Implementing the “inline” operator (C99)

The new standard published by ANSI specifies the “inline” function qualifier. This construct should specify that a function will be expanded at each call site, and the compiler should replace the function call with the actual body of the inlined function.

The implementation is as follows:

  1.  The “inline” keyword was introduced to the lexer (in lex.c)
  2.  The declaration of inline is detected when parsing the function specifiers. A new type qualifier “inline” was introduced.
  3.  The body of the inline function is stored for later use in the form of a token list. This is done in the function “InlineTemplate” in the file templates.c31.
  4.  At each call site, the compiler tests if the function being called is an inline function (see enode.c the “call” function). If this is the case, the compiler will, before any arguments are evaluated, announce this fact to the inline module through the function BuildInlineReturnSymbol.
  5.  The compiler will then gather the arguments, and will assign each argument to a new local temporary variable using the BindOneInlineArg function.
  6.  The compiler will make a copy of the token list, and will replace everywhere the arguments by the newly created locals.
  7.  The token list is transformed into a character string, and fed into the lexer again, as if the user had typed the function body at the call site.
  8.  All occurrences of the “return” statement will be replaced by an assignment to a specially declared local variable, followed by a jump to the code following the inline call.
  9.  If the inline function returns a value, the result of the inline call is an id tree of that return variable; if not, the result is just NULL.
  10.  The first time an inline function is called, the compiler will mark the position of the counter where static variables are stored. The next time a function call to an inlined function appears, the counter of the static data segment will be reset to the value it had in the first call. This guarantees that in all invocations of the inline function the static variables will always point to the same static variable, and not to a new one.
  11.   The tree nodes generated by the inline expansion are converted into dags later on, not immediately.
  12.  The RIGHT nodes necessary to correctly evaluate function arguments are recreated by the inlining process.
  13.  The “listnodes” function in dag.c was modified to emit the correct trees, instead of the call tree that is generated by enode.c.

The core idea of this implementation is to use a character string to compile code dynamically within the compiler. This is an idea that will have a wide use in the future of lcc-win32.

The peephole optimizer

1. Motivation

Lcc generates very efficient code for each instruction taken separately. The problem is, as Fraser and Hanson remark at the end of their book, that «lcc’s instruction selection is optimal for each tree in isolation, but the boundaries between the code for adjacent trees may be suboptimal. Lcc would benefit from a final peephole optimization pass to clean up such problems»

2. Implementation

I started working on this with the idea of a very crude pattern matcher, just to avoid some obvious nonsense instructions. I was worried by the spurious ‘movl %edi,%edi’ that sometimes appeared in the generated code. That bug has long since been corrected, but the design of the optimizer still bears that mark. I haven’t redesigned it since I started it. It just grew and grew and grew...

In retrospect it could be said that a little more effort in the design of the program would have saved a lot of work later, but I do not think that is the case. This ‘keep it simple’ approach allowed me to start seeing immediately the problems involved in assembly optimization, without having to spend a lot of time building complex data structures (and debugging them...)

The basic idea is to consider a ‘basic block’, i.e. a piece of code that contains jumps out of it, but not into it. Since lcc always generates jumps to labels I decided to use labels as a guide: a basic block then, is all code contained between two labels.

I built a scanner for digesting the flow of assembler instructions emitted by ‘genasm’, and to save the information it gathered in data structures. Here again, I just used a ‘Flags’ field with whatever it seemed important to keep: if it’s a ‘move’ instruction, or if the source is a register, or the addressing mode, etc.

Once all the information has been stored in the ‘Instruction’ data structure, the main work can begin. I built a table of the contents of each register, to keep track of the state of the machine as it executes one instruction after the other. In this table I just stored the text representations of the data currently in each register. Whenever I see a ‘load’ instruction, I scan that table to see if the data is already in a register, and if it is, the instruction can be deleted.

The main aim of the optimizer is to delete instructions. I keep thinking that there is nothing that executes faster than a deleted instruction. Besides, other side effects make code size a big win in optimizations: if your program is smaller, it will fit more easily in the processor’s cache, it will take less time to load, and it will cause fewer page faults when executing in a crowded machine.

I followed the following rules:

I store the content of a register in the ‘State’ table only when the register is written to memory, not when it is loaded. This is very conservative, but safer. I miss all the optimizations that could be done if I kept track of the value being ‘moved’ from one register to another.

I update the state of all registers that are invalidated explicitly or implicitly by the instruction. Specifically:

A ‘call’ instruction invalidates EAX, ECX, and EDX. The others, EBX, ESI, and EDI are supposed to be preserved at all times.

A ‘call’ instruction invalidates all registers that contain references to global variables. The function called could modify those variables, so I invalidate all references to global memory.

I keep track of all instructions that implicitly use a register: for instance the ‘rep movsb’ increments ESI and EDI, so their contents are undefined at the end of the instruction.

To avoid the problem of the ‘aliases’, each LEA (load effective address) instruction invalidates all the registers. The problem is that when this instruction is executed, an alias for a memory location is built. If there was a register caching the value contained in that memory location, the register could be wrong after several operations. Rather than take any risk, I just destroy all state stored by the optimizer.

3. Patterns

Besides trying to eliminate redundant loads, the optimizer tries to substitute shorter and faster instructions for sequences of long ones. The list of patterns that I have built up over this year is long (more than 30) and very specific to the code generated by lcc. I never had in mind the idea of building a general-purpose optimizer that could optimize ANY piece of code. I just built a specific one, tailored to lcc’s output. Because of this, replacements that could be of interest in a general-purpose peephole optimizer are not taken into consideration, because lcc never generates them.

3.1 Cross-loads

mov src, dst

mov dst, src

For instance

movl %edi,-4(%ebp) ; store a local variable

movl -4(%ebp),%edi ; delete this one!

Preconditions: None. I can’t imagine a situation where this is not valid.

3.2 Eliminate unnecessary moves before a push

Instead of moving a memory location to a register for pushing it later, push that memory location directly.

mov -8(%ebp),%edi

pushl %edi

becomes:

pushl -8(%ebp)

Preconditions: The value in %edi shouldn’t be used later on. Sometimes lcc (or the optimizer itself) will use that value in another series of pushes. This optimization should be avoided in those cases.

3.3 Avoiding unnecessary intermediate registers

3.3.1 Especially with byte moves, lcc generates a sequence of instructions like this:

movw %ax,%bx

movb %bl,-1(%ebp)

This can be safely replaced with:

movb %al,-1(%ebp)

Preconditions: The value in bl shouldn’t be used in later instructions. Actually, lcc never uses it.

Other variants of this same problem are sequences like this:


movl -24(%ebp),%esi

movw %si,%bx

movb %bl,-1(%ebp)

That can be safely replaced with

movl -24(%ebp),%ebx

movb %bl,-1(%ebp)

sparing the contents of %esi, which could lead to other improvements.

Preconditions: None


This long sequence:

movl -16(%ebp),%esi

movw %si,%bx

movzbl %bl,%eax

movb %al,(%edi)

can be replaced with:

movl -16(%ebp),%ebx

movb %bl,(%edi)

This sequence arises when generating code for the ‘C’ statement:

p[0] = (unsigned char)c;

Where the variable ‘c’ is an unsigned long, and p[0] is an ‘unsigned char’ variable.

Preconditions: None.


Still another sequence is:

movl -4(%ebp),%esi

cmpl $45,%esi

That can be safely replaced with:

cmpl $45,-4(%ebp)

Preconditions: The value in %esi shouldn’t be used later.

3.4 Eliminate constants if possible.

Instead of

movl $0,%eax   b800000000

use

xor %eax,%eax    31c0

which is much shorter: 2 bytes instead of 5.

The same goes for some additions/subtractions:

addl $-1,%eax   83c0ff

can be replaced with

decl %eax    48

3.5. Use the ‘new’ 486 instructions.

These instructions were introduced by Intel for the 486 and higher processors. Since lcc-win32 will surely run on a 486 or higher, it is safe to assume this can be done. The documentation states explicitly that lcc will not run on a 386 or without the math coprocessor...

Instead of the long sequence:

mov mem,reg1  load value

mov reg1,reg2  save old value in reg2

inc reg1   add offset

mov reg1,mem  store updated value

generate the much shorter

mov  $1,reg2

xadd reg2,mem

This corresponds to the much-used ‘C’ idiom:

unsigned char *p;

*p++  = ‘c’;

The old value of the memory location is loaded into the ‘reg2’ register, and the contents of the memory location are incremented by the value that was in ‘reg2’. This allows us to use the old value in reg2 to do the assignment.

Preconditions: The value in reg1 shouldn’t be used later. This is never the case for the code generated by lcc, so this is a safe optimization.

3.6. Optimize sequences.

3.6.1 Sequence of calls with the corresponding stack adjustments.

In the sequence:

push arg1

push arg2

call someProcedure

add  $8 , %esp

push  arg1

push arg2

call OtherProc

add  $8,%esp

The additions to the stack can be concentrated in a single add to the stack. This saves space and time.

Preconditions: There shouldn’t be any intervening jumps or conditional jumps out of the sequence of instructions.

3.6.2 Initialization sequences

Instead of

xor %eax,%eax

movl %eax,-4(%ebp)

xor %eax,%eax

movl %eax,_GlobalVar

We can do the ‘xor’ operation once and then just store the same value (zero) into the indicated memory locations.

3.7. Improving the block move instructions.

Lcc will generate a block move when one structure is assigned to another. The instruction generated will always be ‘movsb’, which moves only 1 byte per iteration. Since the size of the structure is normally an explicit constant, this can be improved by using the ‘movsl’ instruction, which moves 4 bytes per iteration, or, if the size of the structure is not a multiple of four, ‘movsw’, which moves 2 bytes per iteration. It is this optimization that enormously increases the performance of the compiler in the Whetstone benchmark.

3.8. Inlining library functions.

Instead of generating:

push 128   6880000000 (5 bytes)

push src   ff3500000000 (6 bytes)

push dst   ff3500000000 (6 bytes)

call _memcpy  e800000000 (5 bytes)

add $12,%esp   83c40c        (3 bytes)

      Total:                        25 bytes

we can generate:

movl $128,%ecx  b980000000 (5 bytes)

movl dst,%edi  8b3d00000000 (6 bytes)

movl src,%esi  8b3500000000 (6 bytes)

rep  movsb  f3a4          (2 bytes)

    Total                         19 bytes

We save 6 bytes and, generally, this should be faster than the call. If the memcpy function is very well implemented, however, and the blocks to move are very big, it could be that this is not an optimization at all. But most calls to memcpy are for copies of several dozens or hundreds of bytes, where the cost of the call would be significant. This optimization is especially good for older processors (486). The newer Pentium processors have a very small ‘call’ cost, and the quality of the memcpy function matters more.

Preconditions: This destroys esi and edi. They are normally saved at function entry, but in some functions they are not. This happens if they weren’t used in the original, non-optimized version. If that is the case, their contents should be saved/restored in other registers.

Another problem is the stack adjustment. If the call to memcpy was the last of a series of calls,  (see 3.6.1) and the final stack adjustment was done after the call to memcpy, the instruction for adjusting the stack should NOT be just erased. In an example:



call somefunc

add $8,%esp



call someother

add $24,%esp



call memcpy

add $44,%esp (12 for the args of memcpy, and 32 for the others)

That last instruction should be changed to

add $32,%esp

instead of being just erased.

3.9 Avoiding clobbering when possible.

If an instruction will destroy a value in a register, and that value is later loaded from memory, try to substitute the destination register by another one (%ecx or %edx are good choices since lcc almost never uses them).

1. movl -8(%ebp),%edi

2. movl %edi,-16(%ebp)

3. movl -4(%ebp),%edi ; destroys the value in %edi

4. incl %edi

5. movl %edi,-4(%ebp)

6. movl -8(%ebp),%edi ; reloads %edi

We can avoid the redundant load by substituting %ecx instead of %edi for the second load (instruction 3). This allows us to delete instruction 6.

Preconditions: The ‘spill’ register shouldn’t be used in other instructions further down, and it shouldn’t contain a value used later on, since this would nullify the gains.

3.10 Avoid unnecessary moves

When you access a member of a structure whose value is the result of a function call, lcc generates the following code:

movl %eax,%edi

movl 8(%edi),%edi

We can get rid of this unnecessary operation and write directly:

movl 8(%eax),%edi

This optimization is encountered very often, and contributes significantly to reducing the code size.

Preconditions: In general, this should be possible with any operation of the form:

movl %eax,%edi

OP offset(%edi),%edi

I restricted this, however, to mov (in any of its forms) and leal.

4. Construction of the optimizer.

Debugging the peephole optimizer has been a constant chore for months. Endless sleepless nights were lost chasing very hard-to-find errors that wouldn’t surface at first sight. In the ‘lucky’ scenario, the generated program just traps; you disassemble the offending code near the trap and you see where the optimizer went wrong. In the ‘unlucky’ scenario (by far the most common) the program will just produce incorrect results sometimes. Not always. These kinds of bugs have been very difficult to find. The problem is stating clearly the preconditions that are necessary for the optimization to work.

Besides this, I made the optimizer re-enter optimization with its own output until there is no further improvement. This augmented its effectiveness, but... what a nightmare. Many of the optimizations were not really completely debugged for the general case. When I used the optimizer in a recursive fashion, its input at the third or fourth pass would differ significantly from what lcc generates, and assumptions about the code generated by lcc were no longer valid. But in retrospect, I think that helped to produce a more general tool.

All the source code for the optimizer is contained in the optimize.c source file in the compiler’s sources directory.

5 Speed considerations

The slowdown in compilation time due to the optimizer is barely noticeable. Compilation remains an absolutely I/O-bound process, and since the optimizer takes a memory buffer that is intended for the assembler, changes it, and passes it to the assembler, all in core, the difference when compiling a big program is not significant. Besides, a compilation time of 8 seconds goes down to 2 seconds the second time around, since Windows caches the disk systematically. It is very difficult to see which part of that huge difference is due to the optimizer.

6. Results:

In general, the gains in space oscillate between 20 and 30 percent, and the speed improvement can go as high as 50 percent for the Whetstone benchmark. Since that benchmark is very sensitive to block-move operations, this huge improvement is not representative of the overall improvement you can expect from this optimizer.

7. Description of the code

The only interface with the rest of lcc is the function ‘ChangeBlock’ at the end of the source file ‘optim.c’. This function receives a block of assembler instructions from ‘SendToOptimizer’, a function in ‘output.c’.

The optimizer scans the assembly code in the function ‘FindEndBlock’, which searches for an ending label within the assembler instructions. The optimization proper is done in ‘OptimizeBlock’, which calls ‘FindStateChanges’ for each assembler instruction. ‘OptimizeBlock’ also does the stack-adjusting optimization described in 3.6.1.

8 Further work

The next step is of course to build the graph of the program and do tail merging. I hope I find the resources to carry on the work by Fraser (‘Analyzing and Compressing Assembly Code’, C. W. Fraser, ACM SIGPLAN 1984). That work suggests that huge programs could be compressed by 50% or more.

The high level optimizer

Even if the peephole optimizer allows some improvement of the code generated by lcc, it is surely not enough. Impressive speed and code-size gains can only be obtained by modifying the structure of the register allocation, using the available registers more evenly.

The main directions of the improvements done for the second public release of lcc-win32 are:

Improving the machine description, the basis of the tree labeler.

Introducing automatic variable caching in registers.

Improving the code generated for common C idioms

Improving floating-point performance by avoiding unnecessary copies of data from/into the floating-point unit.

Constant folding.

Improving the machine description

The x86 is a CISC architecture. Thus, to generate good code for it, it is necessary to take into account all addressing modes and data types supported natively by the machine. The original version of the machine description, as published by Chris Fraser and Dave Hanson, consisted of approximately 195 rules. I have increased this to approximately 350, taking into account the byte and word addressing modes everywhere possible. This eliminates many unnecessary moves to/from memory and reduces code size, improving speed. As an example of this kind of improvement, consider this rule:

stmt: EQI(CVUI(BANDU(CVIU(CVUI(CVCU(INDIRC(mr)))),con)),zero)

This sequence of instructions of lcc’s virtual machine arises when you use the innocent-looking C idiom:

if (isdigit(c))

This sequence of statements in lcc’s virtual machine operators was translated using the standard backend into the following sequence of operations:

       movzbl  -1(%ebp),%edi

       movzbl  __ctype+1(,%edi),%edi

       andl    $4,%edi

       cmpl    $0,%edi

       je      _$2

Using the new machine description, this entire sequence of operations will be collapsed into:

       movzbl  __ctype+1(,%edi),%edi

       testb   $4,%edi

       je      _$2

Note that we save two 4-byte constants (4 and zero), which are replaced by a single one-byte constant, since the type of the operation is now byte (testb). This allows using the machine instruction directly, instead of doing conversions that are in the end not needed at all.

Most of the new rules added are just attempts to avoid any unnecessary conversions, and use directly the x86 addressing modes.

Another important issue was that several machine instructions, especially among the floating-point instructions, were missing. I added the reverse divide/subtract, and all operations that allow the floating-point unit to directly address 16-bit data, avoiding (again) unnecessary conversions.

Rules to directly operate with an address using a constant.

These rules have the general form:



ASGNX can be any of ASGNI, ASGNP.

OPERATION can be one of SUB, ADD, AND, OR, RSH, or LSH.

Both addresses, i.e. the source and the destination, must be the same.

‘constant’ is an immediate program constant.

The x86 CPU can directly add to an address an immediate program constant. This simplifies many constructs enormously, and makes for smaller, tighter code.
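As an illustration (a hypothetical C fragment, not taken from the lcc sources), such a rule lets the compiler turn a read-modify-write of a memory location into a single instruction such as addl $5,_counter:

```c
/* Hypothetical example: with the address-plus-constant rules, the
   statement in bump() can compile to one "addl $5,_counter"
   instead of a load, an add and a store. */
int counter = 0;

void bump(void)
{
    counter += 5;   /* source and destination address are identical */
}
```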

Rules concerning the usage of the constant one.

The x86 CPU supports many operations that implicitly use the constant one, i.e. the different forms of the inc/dec instructions. Some of them were caught by the original machine description, but not all of them.

Actually, these rules are just a specialization of the first kind of rules above, for the special case where ‘constant’ is one.

Rules for using the special machine modes.

These rules use (always in the case of an immediate constant) the machine modes of the x86 CPU, which has a rich set of data types: words (16 bits), chars (8 bits), floats (32 bits), doubles (64 bits), and even others that are not used in the C language, like BCD digits or 80-bit floats.

Rules for fully using the floating-point unit.

The floating-point unit can load a 32-bit integer or a 16-bit short without any need for extra conversions. Other modes allow collapsing long sequences of intermediate-language statements into one machine instruction.

Rules for fully using LEA.

This versatile instruction allows adding three items with one machine instruction. It can be enormously useful in the context of array indexing. It wasn’t really used in the original machine description.
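As a sketch of what this enables (a hypothetical fragment): an expression combining a base, a scaled index and a displacement has exactly the shape of an x86 address, so it can be emitted as a single leal.

```c
/* Hypothetical illustration: base + index*4 + 12 matches the x86
   addressing form disp(base,index,scale), so the whole expression
   can be computed with one instruction such as
   leal 12(%eax,%ebx,4),%ecx. */
int three_way_add(int base, int index)
{
    return base + index * 4 + 12;
}
```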

Problems caused by the larger machine description.

The proliferation of longer rules produces a tree labeler that has to make more tests to find the optimum solution for a given tree. To ease somewhat the impact of the new rules:

I have modified lburg, the program that generates the tree labeler from the machine description, to generate better code by avoiding unnecessary additions of zero, putting very frequently accessed structure fields in local variables, and other modifications that speed up the labeler.

I have kept the standard back-end for use when the compiler is not called with the optimize option (-O). This faster back-end keeps the same compilation speed lcc had, for when the user is not interested in code optimizations but in compilation speed.

Many of the optimizations that the peephole optimizer did in a very time-consuming way, have been moved to the tree labeler, saving much time in the optimization process, and improving the maintainability of the peephole code that had grown to a huge size.

Improving register usage

The basic problem with lcc was that 90% of the instructions used just two registers: ESI and EDI. As an example, consider the file ‘edit6.c’, which is part of the IDE (Wedit). It has approx. 10,000 lines. This is the output of the utility ‘regusage’:

EAX    821

EBX    158

ECX      9

EDX     20

EDI   7373

ESI   2086

Total: 10467

We see here that 9,459 instructions used ESI or EDI, i.e. more than 90%! The other registers are almost never used: ECX is used sometimes in modulo operations (9 instructions out of 10,467!), EDX holds the result of the modulo operation (20 instructions), and the usage of the other registers is minimal. This means that the machine is used at 25-30% of its real capacity, producing very slow code.

The results using the -O option now look very different:

EAX   1005

EBX    791

ECX   1153

EDX   3431

EDI   1730

ESI   1111

Total: 9221

The first thing that is apparent is that the number of instructions emitted has been reduced by 10%. This is because many unneeded instructions (conversions from char/shorts to int) were eliminated by the new machine description.

The second important fact is that the program is using the registers much more evenly, especially ‘cheap’ ones like EDX and ECX, which need not be saved when they are used.

Register allocation now uses automatic register variables. Before code generation, the optimizer looks for the best three variables to put into EBX, ESI, and EDI. To do that, it analyzes hints about the code of the function left by the parser and the node-construction function ‘labelnodes’ (in dag.c) to determine how many registers can be allocated. The code generator and register allocator now work with only three free registers in some situations.

Spills have not increased though, which was a big fear I had when embarking on the construction of the optimizer. This is due in part to the elimination of unnecessary conversions, which now save a register where one was being wasted before.

Since the x86 instruction set has many ‘special features’, these have to be considered when allocating registers. The biggest troubles are those instructions that take their inputs in fixed registers and clobber them: for instance the block-move instructions, which destroy ESI and EDI, and use up ECX to hold the repeat count. For the time being the compiler will NOT optimize any function that contains these instructions, to avoid generating wrong code in some cases. Since these instructions are not very frequent, this doesn’t hamper the optimization process as a whole.

The rules for choosing the best register variables are as follows:

  1.  If the function calls no other functions (leaf function) and has no block-move or division instructions, EBX, ESI and EDI are free for register variables.
  2.  If the function calls other functions, but has no function call within an argument list, no division/modulo and no block moves, EBX and ESI can be used for register variables.
  3.  If the function has no block moves, EBX is used for a register variable.
  4.  Else... well, no register variables at all.

These rules seem very arbitrary, and they surely are. The problem lies in the function ‘clobber’, and in general in the register allocator of lcc, which doesn’t work when those instructions are encountered. I have spent countless nights trying to find a solution, but I had to give up. So I examine every statement, trying to minimize the trouble that those operations cause.

Before the call to gencode() the high level optimizer looks at the information passed to it by the parser, and determines which registers can be used for register variables. Then it will sort the locals of the function (arguments and local variables), and allocate the best ones to the available registers.

Another small optimization that is done at that stage of compilation is the elimination of any unneeded labels. Since the peephole optimizer uses the labels as a mark for blocks to optimize, it is essential that it receive blocks as large as possible. Unneeded labels would split a block unnecessarily, diminishing the usefulness of the peephole optimizer.

After ‘gencode’ is done, if the function is a leaf function and some cheap registers weren’t used, they are exchanged with the ‘expensive’ registers like EBX, ESI or EDI. Those registers are expensive since they have to be saved/restored at function entry/exit. This can be done safely only if the function is a leaf function, because having a register variable in EAX would be very expensive to maintain: at each call that register variable would have to be saved!

Improving floating point performance

Analyzing why the floating-point performance of lcc was so mediocre, I discovered that the compiler was treating each floating-point number as a common sub-expression, i.e. an expression that should be saved in a register. This is not a bad idea in itself, since loading a floating-point number can be costly. The problem is that lcc doesn’t use the floating-point registers of the x86 architecture, due to the organization of those registers as a stack. This is surely a wise decision on the part of the authors of lcc, since other compilers do the same, probably for the same reasons.

The code for the common sub-expressions was not eliminated, however. This means that each such ‘cse’ is spilled to memory, since there are no floating-point registers. The consequence: for every floating-point operation, lcc would duplicate the floating-point number in memory and load it from there. This slowed down the code enormously.

The remedy for this was very simple: I eliminated the cse, and the floating-point code generated by lcc is now comparable with the code emitted by professional compilers.

Another floating-point improvement was obtained by eliminating unnecessary conversions, since the floating-point unit can directly use 16-bit and 32-bit integers.

Front end modifications

Several modifications were necessary to make the optimizer work. First, since the x86 back-end never used any register variables, I started exercising code that had never been tested in this configuration. Second, I eliminated all the unnecessary conversions from, for instance, unsigned integer to pointer. These conversions are completely vacuous in all targets of lcc, so I just filtered them out in the corresponding ‘case’ of the function ‘listnodes’ in dag.c.

Since the optimizer needs information about the instructions used in a function, I added some fields to the ‘Node’ structure to hold bits indicating whether this function has a block move instruction (ASGNB) somewhere, whether it uses division... etc. This allows the register allocator to make decisions about register usage/allocation.

Heavier modifications were needed in the register allocator, of course. It has been modified to return EDX, ECX, EAX, ESI, and EDI, in that order.

Another small modification is the improvement of the return instruction in the popular C idiom:

if (!SomeFn())

 return FALSE;

In this context I eliminated an unnecessary assignment to a register, since if the function ‘SomeFn’ returns false, that same result can be returned without any further processing. This can happen, of course, only if the return type of the function is an integer.
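A minimal sketch of the idiom in question (‘SomeFn’ is a stand-in name, not from the lcc sources): when the callee’s integer result is already sitting in EAX, returning FALSE needs no extra register move.

```c
#define FALSE 0
#define TRUE  1

/* 'SomeFn' is a placeholder; any int-returning function works.
   If it returns 0, that very value -- already in EAX under the
   x86 calling convention -- can be returned directly. */
static int SomeFn(void)
{
    return 0;
}

int Wrapper(void)
{
    if (!SomeFn())
        return FALSE;   /* reuses the result left in EAX */
    return TRUE;
}
```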

Finding variables with disjoint domains

Lcc has only a ‘global’ register allocator. A register is assigned to a variable once and for all within a function body. Even if the variable is never used again, it blocks a register until the function exits. To improve this situation, the optimizer looks at the ‘firstuse’ and ‘lastuse’ fields that are maintained by lcc-win32’s front end. Those fields define the domain of a variable, i.e. the portion of code where it is ‘alive’. If the domains of two variables are disjoint, they can be assigned to the same register.

Of course this is possible only with local variables or arguments. Global variables are always alive.

A complication for this is loops. Consider for instance the following code snippet:

for (i=0; i<10; i++) {
	...
}
A parser looking at this and ignoring loops would think that the variable ‘i’ is not used after the for statement... To avoid this, I extend the lifetime of all variables used in a loop until the end of the loop.
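The two rules above can be seen together in a small hypothetical example: ‘i’ and ‘j’ have disjoint domains, so they may share one register, while inside the first loop ‘i’ stays alive until the loop ends.

```c
/* Hypothetical illustration: the domain of 'i' ends with the first
   loop (its lifetime is extended to the loop's end, per the rule
   above), and 'j' is first used afterwards, so both variables can
   be assigned the same register. */
int disjoint_domains(void)
{
    int i, j, sum = 0;

    for (i = 0; i < 10; i++)   /* 'i' alive until the loop ends */
        sum += i;              /* 0+1+...+9 = 45 */
    for (j = 0; j < 5; j++)    /* 'j' may reuse i's register */
        sum += j;              /* 45 + (0+1+2+3+4) = 55 */
    return sum;
}
```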

Modifications to lburg

  1.  I added the special symbol “;;” as a comment separator. This allows commenting those hairy rules in the machine description.
  2.  I have tried to eliminate all additions of zero. I didn’t succeed in eliminating them all, but I did eliminate most of them.
  3.  I factored out some common sub-expressions into local variables in the generated code. This speeds up the compilation time, since the labeler uses up 5-8% of the total compilation time (with no optimizations).

The modified sources of lburg are included with the sources of lcc-win32.

Constant folding

Constant folding is an attempt by the compiler to replace every variable that is directly assigned an integer constant with that constant, in an attempt to simplify the code.

For instance :

int foo(int a)
{
	int b = 8 ;
	int c = 9 ;
	return (b+c)/(b * (c-b)) ;
}

can be replaced with:

int foo(int a)
{
	return 2 ;
}

The compiler achieves this by replacing all instances of the constant assignments with constants. This leads to :

int foo(int a)
{
	int b = 8 ;
	int c = 9 ;
	return (8+9)/(8 * (9-8)) ;
}

After this substitution phase, an evaluator re-parses the output and applies the corresponding operations where possible.

Sometimes this optimization will discover dead code :

A = 8 ;

if (A == 0) {
	...
}

In this case, the if construct will be replaced with a jump, but the dead code will not be eliminated.

For a variable to qualify for constant folding, it must be:

A local variable.

Its address must never be taken. This avoids the aliasing problem.

The value assigned must be directly an integer constant. This could be extended to floating-point numbers, of course, but that is not yet implemented.

Folding stops if the variable becomes the target of an assignment, or when a flow-control instruction or a label is reached.

The implementation is done in the file analysis.c.
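The qualification rules can be seen in a small hypothetical example: folding of ‘b’ is legal only until its address is taken.

```c
/* Hypothetical sketch: 'b' is a local directly assigned an integer
   constant, so its uses can be folded -- until the address-of
   operator makes aliasing possible, at which point folding must
   stop and 'b' must be re-read from memory. */
int folding_limits(void)
{
    int b = 8;          /* qualifies: local, directly assigned 8 */
    int c = b + 1;      /* may be folded to 9 */
    int *p = &b;        /* address taken: 'b' is disqualified */

    *p = 20;            /* writes b through the alias */
    return b + c;       /* b must be reloaded here: 20 + 9 = 29 */
}
```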

Dividing by a constant.

Torbjörn Granlund and Peter L. Montgomery proposed, in their research article ‘Division by invariant integers using multiplication’, an algorithm for dividing by a constant using multiplication by the constant’s inverse instead of an expensive division. I implemented this algorithm in lcc-win32, producing two functions: GenerateUDivByConst and GenerateSDivByConst, in win32incl.c.

The algorithm for unsigned division goes roughly like this :

  1.  Put the quantity to be divided in EAX.
  2.  Test if the constant is an exact power of two. This is already done in the front-end simplifications (simp.c), but an uncaught division may have been generated later, by constant folding for instance. If the constant is an exact power of two, just generate a shift right of eax and return. No more is necessary.
  3.  If it is not an exact power of two, call the Montgomery algorithm with 32 bits and precision 32. The algorithm returns the sign of the most significant part, and a multiplier that we use for multiplying by the inverse.
  4.  If the high bit is set and the constant we are dividing by is even, compute a pre-shift.
  5.  If a pre-shift was computed, generate a shift right of eax by that amount.
  6.  Then generate code to put the multiplier into edx.
  7.  If we need the value in eax later, save it with a push or in ecx.
  8.  Generate code to do the multiplication. The result of this is a 64-bit quantity in EDX:EAX.
  9.  If the most significant bit (returned by the call to the algorithm in step 3) is set, generate code to restore the dividend saved in step 7 into eax. At this point the generated code has the dividend in eax, and the most significant 32 bits of the multiplication in edx. If the most significant bit was not set, go to step 13.
  10.  Subtract edx from eax.
  11.  Shift right eax by 1.
  12.  Add eax to edx.
  13.  Shift right edx by the shift amount returned by the call to the Montgomery algorithm in step 3. Only do this, of course, if the shift is different from zero.
  14.  Move edx, which contains the result of the division, into its destination: eax.
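The effect of the generated sequence can be sketched in C (a hand-worked example, not the compiler’s own output): dividing by 10 uses the multiplier 0xCCCCCCCD with a post-shift of 3, a classical result of the Granlund/Montgomery method where the high-bit fixup of steps 9-12 does not arise.

```c
#include <stdint.h>

/* Worked example of the method: unsigned division by the constant
   10 becomes a 32x32->64 multiply by the precomputed inverse
   0xCCCCCCCD, keeping the high 32 bits (what ends up in EDX after
   a mull), then a right shift by 3 -- steps 6, 8, 13 and 14 of
   the list above. */
uint32_t udiv_by_10(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu; /* EDX:EAX */
    uint32_t high = (uint32_t)(product >> 32);    /* the EDX half */
    return high >> 3;                             /* post-shift */
}
```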

Optimizing ‘const’ variables

When you declare

const int C = 4876 ;

the compiler should be able to replace all appearances of that constant with its value, making it possible to use it in all contexts where a constant is required. You should be able to write:

switch (input) {
	case C :
		...
		break ;
}

double array[C] ;

This is implemented by modifying the function idtree() to test whether the symbol being asked for is a constant, and to return code for the constant itself instead of code to fetch it. Some care is needed to ensure that the const int is really initialized with a constant and not, for instance, with a parameter of a function. This is checked at initialization time, when the compiler records whether the variable is initialized or not. Note that with external const ints this optimization is not possible.

The function idtree, then, returns a tree node for the actual constant, instead of returning a tree that accesses that constant. Very good, but a problem will surely occur if the user takes the address of that constant! Suppose:

const int C = 4766 ;

pC = &C ;

This is perfectly legal, but now it will not work, because the compiler will be confronted with the equivalent of:

pC = &4766 ;

We have to set up a flag to tell ‘idtree’ not to do this optimization if an address is required. We see here the problem with optimizations: each one is simple, but their interactions with other parts of the language are what makes them extremely difficult.

A further optimization would be to eliminate the const int from the generated code altogether, which can be done when it is declared static, but this isn’t done (yet).

Optimizing structure arguments

When doing operator overloading, a common situation is to hide a pointer to a big structure within an opaque ‘struct’ with just one member.

Suppose you have then, a structure like this :

typedef struct tagV {

void *handle ;

} SomeType ;

Then you call a function, passing this structure by value:

SomeType st ;

someFunction(st) ;

When you look at the generated code, you will see that the compiler generates code for copying a block of data: it sets EDI to the value of the stack pointer and the source (ESI) to the address of that data, and then generates a block-copy instruction to move the data from its origin to the stack.

This is quite a huge overhead for something that should actually be done with just a push instruction!

In lcc’s intermediate language, the rule for this operation, in the specific case of a structure that resides in the local variable area, is:


This means: push a block onto the stack (ARGB) from the result of fetching the contents of a block that starts at the address of the local variable (ADDRLP) named tmp. Here is a more detailed look at the generated code:


reg: addr

leal    -4(%ebp),%esi

The register allocator chooses the right register for this operation (esi) because it was told to by the function ‘target’, in w32incl.c.


stmt: ARGB(INDIRB(reg))

subl    $4,%esp     ;make place in the stack
movl    %esp,%edi   ;setup destination
movl    $4,%ecx     ;setup count
rep movsb           ;now copy the bytes


A first approach would be to go to dag.c and look at the case for ARG. Then we could just change the ARGB instruction into ARGI, and everything would be perfect: the ‘block’ would be handled as an integer.

Well, this will never work: we immediately get a ‘compiler error in kids’ fatal message, telling us that there is no rule for


Yes, using ARGI on top of an INDIRB is of course not foreseen in the machine description, since it is a contradiction: the type of the ARG operation must match the type of its argument.

Well, another approach would be to avoid generating the ARGB in the first place. As a general rule, the sooner an optimization is done, the cleaner it is. This optimization could be done in the peephole optimizer, for instance, but there it would be much more difficult to extract the right sequence of instructions from the assembly stream. Block copies can be generated in other contexts, and we would have to distinguish the ones that use the stack. Better to do it right at the source of the problem, i.e. when pushing each argument of a function call.

We write then a new function ‘GenerateOnePush()’ like this:

Tree GenerateOnePush(Tree q,Tree args)
{
	if (OptimizeFlag &&
	    q->op == INDIR+B && q->type->size == 4) {
		q->op = INDIR+I;
		args = tree(ARG+I,inttype,q,args);
	}
	else
		args = tree(ARG + widen(q->type), q->type, q, args);
	return args;
}

Note that we require that the size of the structure be exactly equal to 4. Accepting smaller structures would mostly work too, but at the risk of accessing undefined memory. Note too that, as a rule, all these optimizations are done only if the user requested optimizations. If not, the slower, straightforward code is generated.

Other changes to the front end.

  1.  There is no need to convert enumerations into integers, since enumerations are represented as integers. I added some code in stmt.c to eliminate this conversion when reading the arguments of a function.
  2.  The code for assigning registers in decl.c was moved to the optimizer.
  3.  The option ‘-z’ was added to the options of lcc. It generates a .lil (lcc’s intermediate language) file for perusal, if you are interested in looking at this aspect of lcc.
  4.  The file ‘enode.c’ was modified to accommodate the ‘intrinsic’ interface. See the chapter about that below.
  5.  There is no need to generate an intermediate value if the return type of a function is the same as the type of the value being returned. A small modification in stmt.c (function retcode) improves this and eliminates a lot of unnecessary register moves.
  6.  Since lcc doesn’t spill floating-point registers, it would fail with some complicated expressions. This problem hasn’t been solved for the general case, but a small modification in simp.c now produces code that uses the floating-point stack optimally.
  7.  I have added several simplifications that were missing from simp.c: additions/subtractions of zero, AND with 0xFFFFFFFF, and other such operations are now eliminated.
  8.  Instead of doing a right shift by 24, I generate code to access a memory location 3 bytes higher.
  9.  Changed init.c to accept user-configurable packing of structures.
  10.  The building of the stack frame can be avoided in some cases. This is done in w32incl.c.
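Point 8 can be sketched as follows (a portable C rendering of the idea, not the compiler’s internal code); on the little-endian x86, the top byte of a 32-bit value sits 3 bytes above its base address:

```c
#include <stdint.h>

/* Sketch of item 8: on a little-endian x86, x >> 24 of an
   unsigned 32-bit value is simply the byte stored 3 bytes higher,
   so a shift can be replaced by a one-byte memory access.
   (This assumes little-endian byte order.) */
uint8_t high_byte(uint32_t x)
{
    uint8_t *bytes = (uint8_t *)&x;
    return bytes[3];        /* equals (uint8_t)(x >> 24) on x86 */
}
```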

Quality control

The following programs are used as the test-bed for a new version of lcc.

lcc                37,000 lines

lcc’s IDE (Wedit)  72,000 lines

lcclnk              8,000 lines

weditres           31,000 lines

tst directory       9,000 lines

gcc                11 MB of C source

Lclint              5 MB of C source

Further work

This is the first try at a high-level optimizer. Many optimizations are still possible, of course, but this release represents a compromise between the ‘ideal’ optimizer, which would be ready in a year or so, and the state of things several months ago.

Within the framework of the existing register allocator and the existing structure of lcc, some improvements can still readily be done:

  1.  The handling of post-increment/decrement expressions forces the usage of two registers when only one is needed in 99% of the cases. This needs a change in several functions of the front end, and in dag.c. I have attempted that change several times, but I haven’t found a stable configuration yet.
  2.  The handling of function call results uses up a register instead of using EAX directly. This needs some small changes in enode.c, expr.c, and dag.c.
  3.  The rules for assigning registers should be simplified, and the block-move/division operations should work without the contortions needed now. This will be one of the top priorities for the coming months.
  4.  Now that variables are held in registers, this information has to be written to the object file so that the debugger is aware of it. This hasn’t been done yet, so the debugger will not see those variables.

Compiler optimizations. A case study with gcc-2.95.2

When writing the debugger for gcc, I found a bug in that compiler’s code generation. This bug appeared when compiling the following code:


#define MAXPICKLIST 8

char *PickList[MAXPICKLIST];

void UpdateList(char *p)
{
	int i;

	i = 0;
	while (i < MAXPICKLIST) {
		if (PickList[i] == (char *)0) {
			PickList[i] = p;
			return;
		}
		i++;
	}
	if (i == MAXPICKLIST) {
		/* The last assignment that this code specifies is
		   PickList[MAXPICKLIST-2] = PickList[MAXPICKLIST-1]
		   or, in numbers:
		   PickList[6] = PickList[7]
		   This is not respected in the code generated by gcc */
		i = 0;
		while (i < MAXPICKLIST - 1) {
			PickList[i] = PickList[i + 1];
			i++;
		}
		PickList[i] = p;
	}
	return;
}


This code searches for an empty slot in the table, and if it doesn’t find one, it shifts the table one position to the left, making room for the new element at the end of the table. We will concentrate on the second loop, where the bug appears.

To make the time difference measurable I changed the size of the table from 8 to 100,000 elements, and I called the loop 10,000 times.

The code generated by lcc for this loop is :

       xorl    %edi,%edi               (i = 0)
_$10:
       movl    _Table+4(,%edi,4),%ebx  (read table[i+1])
       movl    %ebx,_Table(,%edi,4)    (write table[i])
       incl    %edi                    (i++)
       cmpl    $99999,%edi
       jle     _$10

As you can see, it is straightforward, very simple code. We keep the loop index in a register (edi in this case), and make one read and one write per pass.

Gcc generates the following:

       xorl %ecx,%ecx
       movl $_Table,%esi
L10:
       leal 1(%ecx),%edx         (1) t1 = i+1
       leal 0(,%edx,4),%ebx      (2) t2 = t1*sizeof(int)
       movl (%ebx,%esi),%eax     (3) t3 = table[t2]
       movl %eax,(%esi,%ecx,4)   (4) table[i] = t3
       movl %edx,%ecx            (5) i++
       cmpl $100000,%ecx         (6)
       jle L10

Obviously this is an optimizing compiler… the code generated is quite complicated! But is it really faster?

Note that to add one to a register it uses leal 1(%ecx),%edx. Clever: you add 1, storing the result in edx.

  1.  Then it loads the effective address of 4*index into ebx.
  2.  Then it accesses the table and puts the table element in eax.
  3.  It stores the table element at the index pointed to by ecx. Note that edx contains index + 1, while ecx still contains the old value.
  4.  Now the counter is incremented by copying edx into ecx.
  5.  The comparison. Here there was a bug, since sometimes this upper bound is not correct. But let’s forget this; it is not so important in this context.

I think that the obvious point here is the complexity of the generated code.

I decided to measure the speed of these ‘improvements’. I wrote one test function using the code generated by lcc, another with the code generated by gcc, and a third with hand-optimized assembly.

The results were the following :


lcc: 12.9 seconds

gcc: 20.0 seconds

optimized assembly: 12.9 seconds

The code generated by gcc is almost twice as slow as the simple-minded code generated by lcc. There are several reasons for this:

  •  The code generated by gcc performs many unneeded operations.
  •  It uses too many temporary variables.
  •  It maintains two parallel counters instead of one.

Why this situation?

Gcc is a popular free compiler. Many people have worked on it, and many have added code to it. Most of them want to be remembered for what they left in there, not for the number of things they erased… so nobody ever takes the time to simplify the code, erase unnecessary or buggy routines, etc.

But erasing unnecessary routines, simplifying the code, etc., are things any software needs. It is not possible to always add code; we must get used to erasing code too. When this cleanup work is not done, the software as a whole gets so complicated and buggy that nobody can fully understand how it works any more. It slowly becomes more and more difficult to maintain. The maintainers of gcc are really afraid of touching things, since the interactions between the dozens of passes and optimizations are beyond their understanding.

I sent an e-mail to bugs-gcc@gnu.org and asked whether all this complexity is at all necessary. The answer I received is revealing:

> It would take even more code to disable this transformation in this

> case and cases like this, leading to an even larger compiler and

> buggier compiler...

Well, this is how the simple optimizations of lcc-win32 arrive at 90% of the speed of fully optimized gcc.

The intrinsics interface

In the lcc discussion mailing list, we had a very heated debate last December about in-line assembly. I defended the point of view that in-line assembly is very useful, and can and should be used in any C program.

Now, several moons later, older and wiser, I think that the best thing is to have your cake and eat it too. The solution is to combine the efficiency of assembly language with a high-level framework, using the compiler to recognize pseudo-functions that are inlined to their assembly-instruction counterparts.

All the MMX calls are done this way.

The choice of each intrinsic is somewhat arbitrary. I didn’t want a full-blown interface that would mean you program assembly language in ‘C’. I chose some instructions from the many this CISC machine offers, and I am open to suggestions on your part concerning instructions that you would like to add. This interface is realized in the file ‘intrinsic.c’, in the compiler sources. For the prototypes of these pseudo-functions see the file “intrinsics.h” in the include directory of the distribution.

For the documentation of the mmx functions see Appendix 2.

The intrinsics (non-mmx) recognized by lcc-win32 are:

Table : The intrinsics of lcc-win32




Returns a long long integer (64 bits) containing the number of cycles the machine has executed since it was turned on. Since this counter is incremented automatically at each cycle, to get a time in seconds you have to divide by the clock speed of your machine. For example, if you divide the value by 166 x 1e6 you get the number of seconds elapsed on a 166 MHz machine. This intrinsic will use up some cycles converting the 64-bit value into a floating point number, so there will be an overhead of at most 1000 cycles: at 200 MHz this is 5 millionths of a second...


Returns the byte-swapped long integer.

_fsincos(arg, cosp)

Returns sin(arg), and stores the cosine at the address pointed to by cosp.


Returns the cosine of its argument


Returns the sine of its argument


Returns the constant pi in floating point format


Returns the logarithm base 2 of e.


Returns the logarithm base 10 of ‘e’


Returns the natural logarithm (base e) of 2.


Returns the carry flag as an integer. This value is VERY volatile, since all calculations set or clear it. You shouldn’t assume that the returned value corresponds to the last C statement executed, since lcc’s optimizer could re-arrange the instructions.


This will return a long integer from the given double using the rounding mode that is in effect at the time of the call. Normally this should be rounding to nearest, since lcc-win32 never changes this setting. This allows for very fast rounding. It must be remembered that, to satisfy the rules of ANSI C, lcc-win32 is forced to set the rounding mode to truncation, perform the rounding, and then restore the original mode. This can be avoided by using this intrinsic function to round floating-point data.

int rint(double) 

Rounds the given floating point number to an integer using the current rounding mode. By default, the rounding mode is round to nearest. To use this, you should include <math.h>


Returns an integer with the index of the first non-zero bit in the given integer, starting with the most significant bit.


Returns an integer with the index of the first non-zero bit in the given integer starting with the least significant bit.

Implementing the operators redefinition in lcc

The C language has been more or less stable since its last major revision, which introduced function prototypes and other improvements with C89. The latest standard, C99, adds many necessary things but fails to address the modifications needed to make the language less error prone. It is implicitly assumed that C programmers should become C++ programmers, and have to give up the simplicity of C for a monster language that few people in the world fully understand.

The situation where there is only a primitive language, frozen for all eternity, on one side, and a monster language on the other, forces people to reject this non-alternative and look elsewhere. Most of the popularity of the Java language arises precisely from the fact that it is a less complex language than C++, while maintaining some superficial similarity with C.

As far as I know, lcc-win32 is the only C compiler being actively developed. All other compilers that I know of are C++ compilers, where development is done exclusively on the C++ side. I believe that C is a viable language by itself, and that it could benefit from the latest experience with other languages and from progress in software engineering in general.

The objective is to add some key features that would make C programming safer and less painful than it is now.

  •  Array bounds checking would eliminate nasty memory overwrite bugs, a real nightmare of C programming.
  •  A garbage collector would ease the memory-related problems we all know: keeping an exact accounting of memory chunks is tedious and error-prone work, better suited for machines to do.

I decided then to add these features to the implementation of the C language I distribute: lcc-win32.

To my surprise, the implementation required very little modification to the source code of the compiler. Only a single small file (called operators.c) was added to the distribution. When the operator keyword is detected by the front end (in decl.c), a list of overloaded operators is maintained, and at the appropriate places the corresponding function call is generated. In most cases this was quite easy to do, since I had just to insert the corresponding call before emitting the error message that would otherwise be produced when something like

struct A + struct B

was seen.

C and C++

After implementing operator overloading, many people asked about other features of C++. Here is the explanation why some of those features will not make it into lcc.

Function overloading.

Function overloading is a C++ feature that allows you to define several functions with the same name that are differentiated by the compiler using their arguments list.

This way you can define:


print(int);

print(char *);

print(double);



The justification for this is given in “The C++ programming language” by B. Stroustrup, in the paragraph “Overloaded function names”.  It is interesting to note that the paragraph begins with: “Most often, it is a good idea to give different functions different names”, but then… Bjarne continues: “The overloaded function names are primarily a notational convenience”.

Convenience? Convenience for whom? For the writer of the program obviously, not for the maintenance programmer who has the job of figuring out which of the umpteen “print” functions is being called at this concrete call point!

I have the old fashioned opinion that a language shouldn’t be only easy to write, but also, and that is equally important, easy to read!

The problem is that it may not be obvious which overloaded function is being called, and that the algorithm for the compiler in C++ runs for several pages! And the reader of the program is supposed to redo that algorithm in his/her head when reading the program at each call site?

A nightmare. Compared to this, the famous “computed goto” statement of Fortran is much clearer! This feature leads to “write only” computer languages, i.e. languages that are easy to write programs with, but that are impossible to read after the program has been written.35

The implementation of this feature in lcc-win32 is designed to allow for easy reading. The overloaded functions must be marked with the “overloaded” keyword, so it is clear that this particular function is special.

Another “hidden” but important cost of this feature is that in C++ it forces the compiler to invent a name, an “internal” name, for each function defined. Since the compiler can never know whether another module defines a new print function with different arguments, it must rename all defined functions without exception.36 This feature would thus provoke a host of changes that are absolutely unjustified. One of the main design issues in the implementation of overloading in lcc-win32 has been to avoid this renaming at all costs.

Stroustrup says about this way of doing things:

“Compared to the overloaded print(), we have to remember several names and remember to use those correctly. This can be tedious, defeats attempts to do generic programming and generally encourages the programmer to focus on relatively low-level type issues.”

Lcc-win32 tries to let you have the cake and eat it too! You avoid the complexities of C++, where the class hierarchy compounds the complexity of function overloading, but you let the computer figure out most of the names it needs to implement a generic function.

This is not really generic programming in its real sense, since we are calling different functions, even if we are using the same name. This fact is barely hidden by the compiler machinery that tries to give the user of the language this impression, but as we will soon see, this impression can disappear very quickly.

Another problem with the overloading scheme, is that since the decision which function will be called is passed to the compiler, there is no way to call print_int(double), when we happen to store an integer value in a double. We can make a cast, but then… it is much easier just to call that print_int function in the first place isn’t it?

Up to now we are considering simple functions with only one argument. What happens when several arguments are given? Then, the real nightmare begins:


int     pow(int,int);

double  pow(double,double);

complex pow(double,complex);

complex pow(complex,int);

complex pow(complex,double);

complex pow(complex,complex);

With this scenario, the call

double d = pow(2.0,2);

will be rejected by the compiler!

The reason is that the poor compiler can’t differentiate between pow((int)2.0,2) and the other possible function pow(2.0,(double)2).

Stroustrup says:

“Some C++ novices get irritated by the ambiguity errors reported by the compiler. More experienced programmers appreciate these error messages as useful indicators of design errors”.

Well, wait a minute, what about the need to avoid to “focus on relatively low-level type issues” that he just talked about several lines before? Forgotten?

Function overloading should relieve us from those “low-level” issues, shouldn’t it?

No! The contrary happens. Those low-level issues get worse, and take more time to solve. We can’t just count the time needed by one programmer to write the stuff; we have to take into account the time needed by him and all the others to make sense of those strange results: debugging, finding out which function is called, etc.

Other problems complicate the situation even further: What happens when a pointer to the function ‘print’ is taken, and passed around? Will the match pick the right overloaded function? It all depends on the right declaration of the function pointer, of course. But then we are forced to pay a LOT of attention to the exact types of that function pointer, precisely what Stroustrup said at the beginning we should try to avoid, and the very objective of function overloading!

Conclusion: Function overloading complicates both the programs and the compiler for a very limited usefulness in the general case. This is a feature to avoid, if possible.

The implementation in lcc-win32 was more or less forced by the need to easily implement constructors for data types that can receive several forms of initialization arguments. For instance:

typedef struct tagExample {

char *n ;

int a ;

} EXAMPLE ;
EXAMPLE * overloaded newExample(char *) ;

EXAMPLE * overloaded newExample(int) ;

This is easier than

EXAMPLE *newExampleFromInt(int) ;

EXAMPLE *newExampleFromString(char *) ;

Especially when the structure has many different fields, the number of names to remember increases too much. In these cases it is handy to have a single creation name that branches into different functions according to the creation parameters given.

Default function arguments

What are ‘default’ function arguments?

To keep the above example, it would be nice for instance, if the function ‘print_int’ would take an extra argument, the base in which the given integer should be printed.

The problem with this, is that in 99% of the cases that base will be 10, so we would have to write an unnecessary argument at most calls.

The C++ solution to this problem is to introduce default arguments, i.e. you would write:

 print_int(int n,int base = 10) { /* … */ }

At the call site we would invoke the print_int function like this:

 print_int(n);
Only in very rare cases will we use


 print_int(n,16); or print_int(n,2);

There are several ways to achieve the same result in C.

  1.  #define print_int_base10(n) print_int(n,10)
  2.  void print_int_base10(int n) { print_int(n,10); }
  3.  setbase(10); print_int(n);

Solution 1: The pre-processor is not “in” any more. Fads come and go, and in the year 2000 the pre-processor is considered harmful. Why? It is simple and in many cases very useful. But simple and useful things (like anything else) can be over-used, and Stroustrup warns against that. This solution is equivalent to an inlined function, so it costs extra space at each call site.

Solution 2: Make a cover function for the often used argument. This has the advantage that the program is much clearer, and hints to the reader that printing in other number bases is possible. On the contrary, print_int(n) with default arguments doesn’t ever give the reader the hint that another argument can be used. There is no way to know unless you look up the definition of print_int. This solution  is space conservative, since the passing of the extra argument is avoided in most calls.

Solution 3: Use a global variable. See solution 1: global variables, like many other simple and useful things, are ‘out’. True, they are not thread safe, and used too much they can make program flow very confusing. But a simple solution like having a current output base, with a function to read/write it, can be very effective. Besides, global variables can be protected by semaphores. Used with care they are as good as the other solutions… This solution is very space conservative, since the extra argument is never passed. The variable needs to be changed only when the base changes.

Conclusion: Default arguments do not bring anything really useful. They can be easily avoided.


Namespaces

In C++, you can use “namespaces”, i.e.  you can define a scope that will automatically hide all the names defined within it, and be accessible from outside only with specific qualification. For instance you can define:

namespace debug {

 int debug_print(Mystruct *p);

 int debug_active = 0;

}
later on in your program you can access those names by qualifying them with their namespace prefix:

 debug::debug_print(p);
There are several ways of implementing an equivalent construct in C.

  1.  Each file or compilation unit provides an automatic namespace. Only names that are not declared static are exported.
  2.  You can easily prefix your identifiers with a standard set of letters to avoid name clashes. For instance you could use the prefix appl_debug to all exported names that refer to debugging in your application.
  3.  Use the import/export “wizard” of Wedit to easily tune which names are exported, and which should be kept private in a module.

The cost of implementing namespaces is quite high: the compiler would have to rename all functions (the same cost as function overloading), without really gaining much in program clarity. A solution done “by hand” or with the aid of the IDE is much easier for everybody: for the compiler, which is simplified, and for the user, who has one less construct to learn and be aware of.

To export definitions from a module, place all those definitions in a common header file. All private definitions should be placed in the beginning of a .c file, and declared static. This way, we achieve exactly the same result without any new constructs.


We use the same ‘lexer’ example Stroustrup uses to justify the namespace construct. In C, you would place in the public header file a single definition:

file lexer.h:

 double Parser_interface_get(int);

This file would be included by all clients that want to use the module. Only a small set of symbols would be exported, prefixed by “Parser_interface”.37

Sometimes, as Stroustrup remarks, it is necessary to have a double interface: one for the clients of the module, and another for the people implementing the interface. In that case, it is very easy to make another header file, say “lex_intern.h”, that contains the internal definitions needed by implementation modules, and is not exported to the general public. This is another variation in the “simple is better” argument. All this works, and allows us to avoid the namespace construct, the “alias” directive38, and the using directive.

Namespace composition can be trivially achieved in C by including several headers defining different interfaces. Note that there is no possibility of resolving a name clash at compile time, which implies that the two interfaces must be compatible with composition: they shouldn’t have common names.

Note, too, that the resolution of the name clash can’t be done without modifying one of the conflicting names. The most trivial solution is to add a prefix, a solution that will always work. In C++ you use a “using” directive to disambiguate the name, i.e. you tell the compiler which definition to prefer. This could be a useful addition, since it would allow the compiler to resolve ambiguities directed by the user. A “using” directive would take a header name and a symbol, and would resolve any name conflict that way. But this would introduce yet another feature into the language. For the time being, this is not implemented.

Note that in C++ function overloading works across namespaces, so that you could introduce several unknown complexities in your software by importing a namespace. In C you have no such troubles. If there are any name clashes the compiler will warn you.

Conclusion: In general, in C you have a single point of definition for all names in use in a compilation unit. This rule, which contributes a lot to C’s simplicity and readability, would be destroyed by namespaces without any big practical gain.

Avoiding complexity

The current trend in C++ programming seems to be:

Do not use a clear and simple construct when you can build an obscure and complicated one.

Simple things, simple notation leads to programs that can be understood by most people. This is very bad, and essentially against information hiding, an old OO principle39. You can always add an unexpected construct, a hidden “gotcha”, and a useless complication to make the maintenance programmer go nuts.

An extreme example of this way of thinking is the paper “Techniques for scientific C++” by Todd Veldhuizen. One of the chapters of that document treats the following problem: adding the contents of 3 arrays.

If you use operator overloading for arrays, the expression

Vector<double>40 a, b, c, d;

a = b + c + d;

will generate code for making a temporary array of b+c, then adding to that temporary array the contents of d. All those temporary arrays produce a lot of data movement between memory and the CPU, and performance suffers.

The solution of course is to write:

for (int i=0; i<N; i++)

 a[i] = b[i] + c[i] + d[i];

But this solution is not very “elegant”. Anybody would understand this kind of simple stuff. Let’s do better. And Todd starts an incredible explanation of “expression templates”, i.e. parse-trees that are built from pairs of types by the compiler, and then inlined… After several pages of very abstract C++ code and many incredible contortions, he comes to: (I quote)

[snip snip]

So the end result is:

for (int i=0; i<N; i++)

 a[i] = b[i] + c[i] + d[i];

Nowhere in the paper is it explained why these contortions are needed to arrive at the same result that could have been obtained with much simpler notation! What is basically wrong with writing the loop above in the program text?

Well, maybe this is just an example of expression templates, and maybe there are some situations where this complexity is needed. Unfortunately, the author never explains those cases, so we are left wondering what is so wrong with the simple loop above.

The Pentium SSE/SSE2 back-end

Intel decided at the end of 2000/beginning of 2001 to strip down the very complex architecture it had used (with so much success) until then, and to speed up the clock, to 1.5GHz or even more in later models.

I decided to implement a new back end for this machine, and started working on this part in September 2001.

The new machine features some impressive changes that make it quite an improvement over earlier processors. The new 128-bit registers xmm0 to xmm7 were extended to accept all MMX instructions, and can for the first time work with double precision numbers, which makes them a serious choice for improving lcc-win32’s floating point performance.

The ‘pedump’ utility

All along our discussions of this feature or that feature of the assembler, the linker, whatever, we have come across this utility. Here, I will describe a little bit more how it works.

Matt Pietrek wrote the original pedump that was included in his book Windows 95 System Programming Secrets. I have added many new binary formats to the code, but kept the original structure that Matt designed.

The sources are located in \lcc\src\pedump. They consist of the following files:

com.c. This module dumps the type libraries for ActiveX/OLE objects. It uses the ITypeLib interface for getting the type library, and the ITypeInfo interface for getting the data out of the type lib. It will show recursively all interfaces defined in the dll or type library.

disasm.c. This module disassembles the object code. Parts of this module are used in wedit’s debugger to show the machine instructions.

cplus-dem.c . Demangles gcc mangled names. This part is only necessary if you include the stabs debug info dumping module.

exedump.c. Dumps an executable file. Here is the code for dumping the debug information that appears only in executable files (the Sst* tables described in various places in this documentation), and other constructs, like the relocation table, that exist only in executables.

objdump.c. Dumps an object file. The object file relocations are of course different than the executable file relocations. Their format is analyzed here. You will find in this module the library dumper, that understands all library formats of Win32: the traditional one, and the new import library format.

resdump.c. Dumps a resource file. This part is just thought as a debugger for the resource editor. It could be better…

common.c. Common routines for reading COFF object files and dumping the debug information. Here you will find extensive comments for each function. This is one of the big files in the pedump module, containing the header dumpers, the debug info dumpers, and many others.

elf.c. Dumps an elf executable format. Useful only in Unix.

stabs.c. Dumps the debug information that has the stabs format.  Useful only with gcc.

debug.c. Dumps the debug information in the stabs format.

pedump.c Main module. The routines for dumping .dbg files are here too.

You should define the keyword ‘UNIX’ if you want to include the routines that use the stabs/elf formats. Normally this is not very useful under the windows OS.

To build pedump go to the \lcc\src\pedump directory and type ‘make’ at the command prompt.

You can use pedump to dump the debug information of any program that uses the NB09 Microsoft standard. Specifically, when using the MSVC compiler, you should specify the -Z7 compiler option.

The linker


When I had my assembler working, I could test it with Microsoft’s linker. For the first time I could compile Wedit’s source code with lcc and I could at last experience the thrill of looking at a program that was built with this first, embryonic system.

But I wanted a self-contained tool. It would be annoying to tell the users: well folks, you have to buy MSVC and use the LINK command. They would surely answer: ‘If I buy MSVC, I do not need your system at all...’

So I started working in a linker.

The tasks a linker has to do are the following:

  1.  Consolidate all sections from all its input files into a single section. This means that all .text sections should be thrown together in a single .text section, all .data sections should be consolidated in a single .data section and so on.
  2.  Build an executable with the sections it finds in the specified object files or in the libraries it searches according to Microsoft’s PE specifications.
  3.  Link all the debugging information, consolidating the definitions repeated several times within the object files into a single .debug section, and build a debug directory using Microsoft’s NB09 specifications. Within the MSVC system this is done by Microsoft’s CVPACK utility.
  4.  If any of the input files is a resource file, convert it to COFF format and link it with the rest of the object files.
  5.  Build the table of imported functions from system/user dlls.
  6.  If the user demands a dll, build the .reloc and .edata sections, with the relocation data and the export data.

The format of the PE Executable.

Figure 1.

An executable file contains basically two things:

Image data

Instructions for the loader

Under Win32, Microsoft adopted the COFF format that we have already reviewed in the chapter about the assembler. Executables follow this format, adding several sections that we will discuss in detail later.

The ‘Dos stub’

Windows is a better DOS. But that venerable ‘system’ is still with us, even years after its last version shipped. Microsoft Disk Operating System must and should be supported, even if we are in 98, and the original MSDOS with 32K RAM would fit into the cache of today’s processors! Still, this commitment to preserving the user’s investment in software has made this environment popular, so we should take it seriously.

The very first part of an executable is then a ‘Dos stub’, i.e. a program that will just print ‘This program cannot be run in Dos mode’ and exit. This is a constant part of the executable that the linker writes at the beginning of the exe file. More sophisticated linkers have an option to use another dos ‘stub’ executable instead of the standard one. I thought for a moment of doing this, but later discarded the idea, since it contradicts the KISS principle. Why complicate things?

The File header

Then follow the interesting parts. The file header, the ‘optional’ header (required for executable files), and the data for each section. Here is an example of the output of the dump utility for this header:

File Header

 Machine:                      014C (i386)

 Number of Sections:           5

 TimeDateStamp:                32FBD489  (Sat Feb 08 02:19:05 1997)

 PointerToSymbolTable:         00000000

 NumberOfSymbols:              00000000

 SizeOfOptionalHeader:         00E0

 Characteristics:              010F






The linker writes this part at the end of the linking process, just before it exits. The reason is simple: the linker knows this information only when the whole file has been built, all symbols have been counted (and relocated), and the position of each item is known.

You will find in the source code of the pedump utility, in the file common.c, the function ‘DumpHeader’ that parses this table and prints its values.

The ‘Optional’ Header

Here is an example for this header:

Optional Header

 Magic                              0x10b        267

 linker version                     1.01

 size of code                       0xE200       57856

 size of initialized data           0x1E00       7680

 size of uninitialized data         0x1000       4096

 entrypoint RVA                     0x113D       4413

 base of code                       0x1000       4096

 base of data                       0x10000      65536

 image base                         0x400000     4194304

 section align                      0x1000       4096

 file align                         0x200        512

 required OS version                1.00

 image version                      0.0041

 subsystem version                  4.00

 Reserved1                          0x0

 size of image                      0x15000      86016

 size of headers                    0x400        1024

 checksum                           0x0

 Subsystem                          0x3 (Windows character)

 DLL flags                          0x0

 stack reserve size                 0x100000     1048576

 stack commit size                  0x1000       4096

 heap reserve size                  0x100000     1048576

 heap commit size                   0x1000       4096

 loader flags                       0x0

 RVAs & sizes                       0x10

Data Directory

 EXPORT       rva: 0x0         size:        0

 IMPORT       rva: 0x14000     size:     1794

 RESOURCE     rva: 0x0         size:        0

 EXCEPTION    rva: 0x0         size:        0

 SECURITY     rva: 0x0         size:        0

 BASERELOC    rva: 0x0         size:        0

 DEBUG        rva: 0x11000     size:       84

 COPYRIGHT    rva: 0x0         size:        0

 GLOBALPTR    rva: 0x0         size:        0

 TLS          rva: 0x0         size:        0

 LOAD_CONFIG  rva: 0x0         size:        0

 unused       rva: 0x0         size:        0

 unused       rva: 0x0         size:        0

 unused       rva: 0x0         size:        0

 unused       rva: 0x0         size:        0

 unused       rva: 0x0         size:        0


The different fields are:

‘Magic’: Just that, a number that HAS to be there to verify that this is indeed an executable file.

Linker version: Lcclnk sets this to its version, 1.2 now, 1.01 in the example shown.

Size of code: The total size of the .text section.

Size of initialized data: The total size of the .data section. When you write in a ‘C’ program a global/static variable that is initialized (not a local variable, which is allocated on the stack), the linker arranges for it to point into a portion of the .data section.

Size of uninitialized data: The size of the .bss section. Any variable that is just declared, without any initialization, points here. This .bss section is not written to the file. The linker just declares it, so that the loader will reserve at this address a portion of memory that will be zero filled.

Entry point RVA: The Relative Virtual Address of the entry point. This is simply the offset in bytes of the entry point minus the base of the image. The loader NEEDS to know this, since it will pass control to this address once the program is loaded in memory. If, for any reason, the linker doesn’t know the address of the entry point, it will abort the link, since the generated executable will never run.

Base of Code: An offset from the image base, where the .text section is located. Normally this is 4K. I suppose this is to leave the very first page of memory unmapped, so that the system can catch any writes through NULL pointers. If this weren’t the case, you could read from a NULL pointer...

Base of Data: Where the data of the program begins. This can be, as this example shows, the starting address of the .bss section. It depends on the program.

Image base: An offset from the start of the virtual address space, where this program is being loaded.

Section align: The alignment for each section. Lcclnk uses a 4K alignment, but this wastes some precious RAM. A way of eliminating this wastage would be to cram several sections in one, so that the alignment is not needed. For instance, we could pack the .bss section at the end of the .data section, saving the alignment of the bss section. If I have time and resources, this is definitely a thing to do for a better version of the linker.

File align: The alignment chosen by lcclnk is too big. I should experiment with smaller alignments like 16 instead of 512, its value now.

Required OS version: All programs linked with lcclnk will run in the Win32 environment. I was not sure what to put in there, so I settled for a value of 1.0. I do not think that the loader cares a lot about this value, but this may change in the future.

Image version: This is a user defined value that defaults to 0.00. With the -version option of lcclnk, you can put a value in here.

Subsystem version: Again, I didn’t know what value to put in here, so I settled for the same value as LINK: 4.0

Reserved1: Always zero.

Size of image: The sum of the sizes of all sections.

Size of headers: Lcclnk always puts 1024 in here.

Checksum: Lcclnk doesn’t support this value, because I haven’t found the algorithm needed. It should be in ImageHlp.dll, according to the documentation. Looking at the exports of that dll, you will indeed find a ‘CheckSumMappedFile’ function. This is another thing that future releases will cover. In any case you need that checksum only if you are writing a device driver, or similar stuff. If you do, just send me a mail and we will see how this works.

Subsystem: Lcclnk has a -subsystem option. The value you enter ends up in here.

Dll Flags: The documentation marks this as ‘obsolete’, so I always put zero in there.

Stack Reserve size: The maximum size the stack can grow to. Lcclnk sets this to 1MB, which should be enough for the most demanding applications. There is no option for configuring this, however.

Stack commit size: The amount of real RAM that the system will commit for you before the program begins to run. You can set this value with an option in lcclnk.

Heap reserve size: Same as Stack reserve.

Heap commit size: Same as Stack commit.

RVAs and size: This is a table of 10 positions, each with a relative virtual address and a size. Lcclnk uses the following:

EXPORT: The .edata (exports) section. This is needed for dlls only, and it is described further down.

IMPORT: The rva of the import table and its size. This is the same as the .idata, and consists of the names of the imported functions from other dlls. All programs have this table, because you will surely need to import from the C run time library or Kernel32 or other basic system dlls.

RESOURCE: If you have a windows program with resources you will use this entry.

DEBUG: The debug information address and size.

The function ‘DumpOptionalheader’, in the pedump source file common.c, reads this table and shows its values. In the linker sources, this is done in the “WriteExeFile” function, in lcclnk.c.

Standard sections

The linker builds the image based upon the data found in the object files it uses: each section of the object file will be assembled in a section of the executable file. All the .text (code) sections will build the final .text section of the executable, all .data sections are gathered together in a single .data section, etc.

This sounds simple but is not very easy to do. I spent a lot of time in the linker, trying to do things that are easily said, but not so easily done. For instance, the linker just puts the different sections in the order of the object files given in the command line. Taking a concrete example, if you are linking ‘foo.obj’ and ‘bar.obj’, the text section of foo.obj will be taken first, and then the text section of the file ‘bar.obj’. But then we have to worry about alignment between different sections (I do not use any), and other complications, like the fact that the startup object file must come first in the order, because it contains special sections for building the import libraries, etc.

The standard sections that lcclnk understands and uses are the following:

.text: The code of the program. These are sequences of bytes that the processor you are using ‘understands’. These bytes are interpreted by the processor as a series of instructions, executed, etc. All the executable, and all computer processing, comes down to this: a series of numbers that the processor executes in an absolutely stupid way. It has no ‘soul’, no free will, nothing but this never-ending desire to do what it is told to do!

.data: Here come the initialized data items, i.e. data that is known to the program at program start.

.bss: Non-initialized data. This is just RAM space reserved for use by the program at program start. The name ‘non-initialized data’ is a misnomer, since this RAM will be cleared to zero by the program loader.

.idata: The imports table. This is a table constructed from the different ‘import libraries’ and several sections in the startup code; it tells the program loader which dll to search and which functions in that dll to link at load time.

.edata: The exports table. In the case of a dll, this tells the program loader which functions this dll exports.

.rdata: Several small pieces of information about the executable, like the position of the Coff debug header, and other trivia. Not really necessary, and I am thinking that in a future version of the linker I could well just ignore it.

.reloc: Relocation information telling the program loader which addresses to fix if the program can’t be loaded at its default load address, the Image base value we discussed above. This is used only in Dlls.

.rsrc: Resources of the application.

.debug: Debugging information. The program loader at startup does not load this section.

The linking process (overview)

Roughly, the whole thing goes like this:

Process all command line options.

If any file in the command line has a .res extension, assume it is a resource file and convert it to an object file.

If any file has a .def extension process it as a .def file containing the names of the functions that will be exported from the dll being built.

Open all files given in the command line.

Read in all sections from all object files and calculate the size of each of the output sections. Each output section is just the sum of its component sections.

Store all symbols encountered when reading the object files into the global linker hash table.

Look for undefined symbols, and search in the libraries for them.

Perform the symbol table construction, relocate the line numbers and auxiliary entries.

Perform the relocations, writing out all relocated sections into the executable. At this step all symbols should have been read. If a relocation points to an undefined symbol we can’t go on.

At each relocation above, save the position of the relocation for building the .reloc section, if this is needed.

Link the code view debug info. This step eliminates redundant definitions and compacts the debug info. If the file doesn’t contain CV debug info, the linker tries to figure out some of that information using the COFF symbol table info.

Write the headers (file headers, optional header, etc).

If the -x option was given, scan all global symbols to verify that they are used at least once. Print a report.

If a map file was indicated, write the map of the executable generated.

We are done, close everything and get out of the way!

The linking process: detail

Parsing the command line

This is not very interesting. All command line options are listed at the end of this chapter. File name expansion is done at this stage, and the type of each input file is deduced from its extension, i.e. .res files are handled differently from normal object files. File name expansion is done in expand.c, which returns a table of all the files that correspond to a given specification.

Convert the resources to object files

Well, this is not like parsing the command line options. Definitely not. It took me more than two months to write the ‘cvtres’ utility, to convert resource files to Coff object files. Basically this builds two tables, one with a series of pointers and the directory structure, and the other with the actual data read from the resource file. These two tables are merged into a .rsrc section by the linker.

Parsing and processing the .def files

This isn’t especially exciting. I build a linked list of exports that are read from the .def file. Of course, the usual problems have been detected after the first release... I hope I parse the .def files OK now.

I expect a file containing just:

  •  First line: “LIBRARY”  Name of the dll, without the extension.43
  •  EXPORTS keyword
  •  A list of names, with 1 name per line, specifying the symbols to be included in the export table. I accept two identifiers separated by an equals sign, so that you can rename a symbol for exporting it, maybe eliminating decorations, etc.

As time passed, I have added some missing stuff, like accepting comments (they start with a ‘;’) or accepting (and ignoring) things like “description” or other keywords from the 16 bit dlls. All this is read into a linked list with the following structure:

typedef struct tagExportList {
       struct tagExportList *Next;
       char *Name;
       char *VisibleName;
       int Ordinal;
       unsigned long Address;
       unsigned long Flags;
       struct LinkerEntry *h;
};

#define EXPORTFLAG_ISTEXT       1

I keep the name, optionally the visible name that I will write into the exports table, the ordinal, even if lcclnk doesn’t use that, the actual address of the export, a flag telling me whether this is something that lives in the .text section or not, and a pointer to the linker symbol structure so that if needed I can get any information from this symbol that I need when writing the export table.

The bulk of the work obviously isn’t parsing the rather straightforward .def files, but building the .edata section in the executable. This is done in the ‘BuildEdataSection’ function. That function goes through the exports list, filling in the address field that has to be passed to the program loader. Then I allocate a buffer and write it all.

The edata section contains several tables, that I allocate contiguously:

Export directory table: A table with just one row (unlike the debug directory). This table indicates the locations and sizes of the other export tables.

Export address table: An array of RVAs of exported symbols. These are the actual addresses of the exported functions and data within the executable code and data sections. Other image files can import a symbol by using an index to this table (an ordinal) or, optionally, by using the public name that corresponds to the ordinal if one is defined. Lcclnk doesn’t support linking by ordinals, so the numbers it emits here are mainly for compatibility.

Name pointer table: Array of pointers to the public export names, sorted in ascending order. Sorting is done in the “SortExports” function (lcclnk.c).

Ordinal table: Array of the ordinals that correspond to members of the Name Pointer Table. The correspondence is by position; therefore, the Name Pointer Table and the Ordinal Table must have the same number of members. Each ordinal is an index into the Export Address Table. I subtract the base of the ordinals from all numbers given, but I have never verified that this is actually correct. Beware.

Export name table: A series of null-terminated ASCII strings. Members of the Name Pointer Table point into this area. These names are the public names through which the symbols are imported and exported; they do not necessarily have to be the same as the private names used within the image file. This explains the equals sign in the .def files.

Open all files given in the command line

This needs a little reflection. As with the compilation process, the linking process is limited by I/O. It is essential that the linker reads information from disk in the most efficient manner. Since I am not a disk guru, and besides I don’t think a linker should mess with the type of disk it is running on, I leave that to the operating system. Let’s face it: the operating system will have done a much better job of tuning its I/O than I ever could. So I use the Win32 feature called memory mapped files. This means that I use the swap file for all my input/output, so that in fact the OS does all input/output for me! This speeds up the linking process enormously.

The linker builds its symbol table as it goes, processing all object files. The embedded directives emitted by the assembler as .drectve sections are processed. For the time being, the compiler only uses this for __declspec(dllexport). The linker processes those records and updates the exports list accordingly.

Each object is scanned, and the symbol table of each object goes into the global linker symbol table. Basically there are within the C language only two types of scopes: local to the compilation unit (object file) and global to the whole project/executable. This is not a very sophisticated arrangement, and I have discussed within the lcc mailing group the possibility of introducing ‘name spaces’, i.e. spaces of scopes that could alleviate the problem of having all modules share all global names. For a variety of reasons my proposal was rejected, so we stick to C language conventions for the time being.

The contents of the text, data and bss section of each object file are added up to its containing section, so at the end of the process, we know exactly how many bytes will be in each section, and the position of each global symbol within its section. For instance if the symbol i_table is located at the address 50 in the data section of module 3, and module 3 is starting at address 250, the address of  i_table will be 300.

A weakness of lcclnk is that it doesn’t know (or doesn’t try to guess) what to do with sections that aren’t named « text », « bss », or « data ». It could obviously use the section flags to figure out where a section should go. In the case of a general-purpose linker this would be highly desirable. Lcclnk is not a general-purpose linker, however: it is designed to link the object files generated by lcc, so this scheme is enough for its needs.

A special treatment is done with the .bss section, since its contents are not read in anywhere: they are just implicitly defined by the sum of the .bss sections of all object files.


Symbols are just aliases for memory locations within the program. For people, it is easier to refer to « strcmp » than to 0x479887. The symbol « strcmp » then, is used to indicate the memory location 0x479887.

Symbols can refer to unknown memory locations, i.e. memory locations that will be resolved later in the linking process. The first time the linker sees such a symbol, it appears just as an undefined external. The linker then adds it to the set of undefined symbols.

There are then, the following sets of symbols during the linking process:

The set of defined symbols, not in the common section. All these symbols already have a fixed address.

The set of symbols in the common section

The set of undefined symbols that have been seen as externals but where the definition is not yet processed.

Symbols can be moved from the undefined set into the common set, or into the set of defined symbols.

This needs some explanation. Suppose you have in the file file1.c the following declaration:

int iconst;

The symbol ‘iconst’ will be assigned to the common section that is initialized to zero at program startup. But consider what happens if you include ‘file2.c’ in the link, that contains the declaration:

int iconst = 53433;

The linker will move the symbol ‘iconst’, from the common section to the data section. The definition in file1.c will be lost.

And there are worse things that can be done. Suppose one file contains:

int buf[256];

and another contains:

int buf[512];

The linker will leave ‘buf’ in the common section, but will set its size to the bigger value, i.e. 512. This is harmless, but beware if you add a definition in a file3.c:

int buf[4] = {0,0,0,0};

Your table will have a size of just four positions instead of 512!!

The discussion about the common storage model

This issue has provoked heated debates in the lcc mailing list, especially when David Stes came along with his Objective-C compiler, where he needs this feature for initializing his objects. In earlier versions of the linker, I had a warning when a symbol was moved from the common storage area to the data area. David wanted at all costs that the warning disappear. I argued that the following bug would be impossible to catch without that warning:

From jacob Tue Apr 28 11:53:30 1998

Subject: A case study.

To: lcc@cs.princeton.edu

Let's assume the following scenario:

In the file f1.c, I have code like this:

char *p;

int somefn(void)
{
       if (p == NULL) {
               p = "Initialized";
               doInit();
       }
}
Since p is in the bss, I can safely assume that its contents are zero at program startup.

I compile this, and it works without any problems. Several months later, in another completely unrelated file, let's call it f2.c, in the same program, I write:

char *p="Here is the bug David";

I compile, link, and everything seems OK, but... the program crashes! After several days of work, I realize that the function doInit() is never being called!!!

What happened???

That, following David’s proposition, the warning 'redefinition of p' was dropped!!!

The char *p was assigned to the character string, its contents weren't NULL at startup, and doinit() is never called.

Well, you see the problem?

The linker *should* warn about this. Because there is NO OTHER tool that can issue that warning!!! The compiler will NEVER detect this since it sees only one file at a time.

And this can be IMPOSSIBLE to find if the linker doesn't issue a warning!!!

David wasn’t happy with this, and he answered:

Date: Tue, 28 Apr 1998 17:20:13 +0200 (MET DST)

From: David Stes <stes@mundivia.es>

To: Jacob Navia <jacob@jacob.remcomp.fr>

cc: lcc@CS.Princeton.EDU

Subject: Re: A case study.

[ he cited the example above]

> Since p is in the bss, I can safely assume that its contents are zero at program startup.


No, you can't !

p is called a "tentative definition" of p, and it may very well be non-zero.

So, I discovered this issue about « tentative definitions ». I am not a lawyer, but programming language issues tend to become quite full of « legalese ».

Dave Hanson, one of the authors of lcc, entered the discussion and wrote:

From: Dave Hanson <drh@microsoft.com>

Date: Tue, 28 Apr 1998 13:33:43 -0700

Subject: RE: Re: A case study.

[ he cited the message from David, and then continued ]

For the record, the declaration for p is indeed a tentative definition, but that status persists only until the end of the compilation unit, i.e., the end of f1.c. Since there's no subsequent external definition of p, the tentative declaration acts exactly as if there was a file-scope declaration for p with an initializer equal to 0. (See Sec. 3.7.2 of the ANSI Standard, which I think is Sec. 6.7.2 of the ISO Standard). As a result, p is null at program startup--assuming there are no other similar declarations for p.

This example illustrates nicely a problem with the common storage model: You can't determine whether or not a declaration is also a definition by examining just the program text, and it's easy to get strange behaviors. In this example, there was only one definition, which passes without complaint from linkers. In the stricter definition/reference model, linkers would complain about multiple definitions when combining the object code for f1.c and f2.c. This example also shows why it's best to initialize globals, because linkers will usually catch these kinds of multiple definitions.

The common model also permits C's (admittedly weak) type system to be violated. I've seen programmers declare "int x[2]" in one file and "double x" in another one, for example, just so they can access x as a double and as a pair of ints.

For a good summary of the four models of external definitions/declarations, see Sec. 4.8 in Harbison & Steele, C: A Reference Manual, 4th ed., Prentice-Hall, 1995.

Dave Hanson

Since David needed this feature of the common storage model, I dropped the warning. I think objective C is a nice object oriented language, and wanted to support the work of David with lcc-win32.

Relocate all symbols

The next thing to do is to go through all symbols, and decide whether they will go into the final symbol table or not. Many of them are discarded, since they are local symbols of each compilation unit.44 Global symbols need to be relocated, i.e. the ‘value’ of the symbol has to be set to its final address. This is easy now that the position of the section that contains the symbol is exactly known: we just go through them setting the value field to the right number. The function that does the relocations is called RelocateSection().

Its algorithm outline is simple:

  1.  Read the relocation information from the object file.
  2.  According to the type of relocation, adjust the value of the symbol. The relocations supported by lcclnk are just a few: the pc-relative relocation (code 20), the normal 32-bit relocation (code 6), the 32-bit relocation without the image base (code 7), and two types of relocations for the debug information, codes 10 and 11.
  3.  Save the position within the executable file where the relocation is being done in the case of relocation type 6 (normal 32 bits relocation), to later build the .reloc section if this is needed. Normally this is needed only when generating a dll, since executables aren’t relocated under windows.

Line numbers, in their COFF variant, have to be relocated too, if they exist at all. This is the same as for the symbols: we know the starting virtual address of the section the line number records refer to, so it’s a matter of a few additions and subtractions. Lcclnk uses these COFF line numbers to convert them into an NB09 compatible format. Both line number formats are written to the executable file.

Build the symbol table

The final symbol table of the executable contains all global symbols of the program. This table is generated only if the debug information is to be included in the resulting executable file.

The format used is the same as the symbol table built for the object files, with some minor modifications. The ‘.file’ entries are processed, because their ‘value’ field needs to point to the next ‘.file’ symbol in the symbol table, and this needs to be fixed at link time. In a similar way, the entries for the COFF function descriptions are fixed.

You will find in the source code of the pedump utility the function DumpSymbolTable, in file common.c, that prints the symbol table contents.

Relocating/building the line numbers.

lcclnk supports two types of line numbers: the code view format line numbers, and the COFF line numbers. They have to be relocated by the linker.

The line numbers as generated by the compiler are in COFF format. From Winnt.h we have:

typedef struct _IMAGE_LINENUMBER {
   union {
       DWORD   SymbolTableIndex;    (1)
       DWORD   VirtualAddress;      (2)
   } Type;
   WORD    Linenumber;              (3)
} IMAGE_LINENUMBER;

If the field (3), Linenumber, is zero, the information in the union is interpreted as a symbol table index (1) that identifies the function the line numbers belong to. The line numbers are emitted in sets, each starting with a record with this field set to zero, followed by records where a line number is associated with a virtual address, the second field (2) of the union above.

This scheme is a limitation, since no line number information can be emitted for non-code symbols, i.e. data symbols. They have an address of course, but that information is not passed on to the debugger.

Besides, this whole line number business misses the point that a line can contain a lot of code, and there is no way for the compiler to pass that information to the debugger using this scheme. This means that a line like:

for (i=0;i<5;i++) if(i == 3) fn3(7); else fnX(i);

is not debuggable and should be avoided...

The offset stored by the compiler for each source line refers to the beginning of the text section (code) of the module being compiled. The linker knows where this module will be in the executable when it runs, so it fixes the line number information according to:

offset = original offset + code section offset + displacement - Image Base

The code section offset, is the start address of the .text section. The displacement is the position of the current module within the code section, and the Image base is the start address of the image in virtual memory. Now let's see this in greater detail.

The code section offset is normally 4096. The first page of the executable image is not mapped, so any access to it using a badly formed address will provoke a trap, and the execution will stop. From the linker point of view this is just a number that should be added to the total offset.

The displacement is the position, relative to the beginning of code, of the module we are linking. If we have several modules to link (and that is always the case), the second module begins some bytes after the first. If the first module makes 1019 bytes, the second module will start at offset 1024 since modules are aligned at 16 bytes boundaries by the linker. The displacement of the second module would be then 1024.

The image base is the virtual address where all programs are loaded. This is normally 0x400000 for windows programs.

The line numbers for the code view debug information are simpler: they have just the offset from the base of the module. The linker creates them using the information from the COFF line numbers: at each Coff line number, the corresponding routine for CV line numbers is called, which builds the record needed in the other format. Maintaining two formats is cumbersome, and maybe slows down the linker. It was valuable while the linker was being built, though, since other debuggers understand COFF debug info, and it provided a checkpoint for the CV information.

Performing the relocations

Once symbol and line number relocating is done, we go through all object files performing the relocations as specified in the corresponding sections. We add or subtract the image base, etc. At the same time, we record the position in the output file where each relocation is done, so that we can tell the program loader that, in case the image can’t be loaded at the calculated address, it should patch the address <here>. This is called a .reloc section, and is used mainly in Dlls.

More specifically, what the linker does is fix the data/code references that each module contains to all the others, patching the code with the offsets that the external symbols have, now that the positions of all sections are known. For a C source line like:

foo();

the linker reads the corresponding relocation record emitted by the compiler, and looks up the symbol ‘foo’ in the symbol table. It patches the zeroes stored by the assembler at the position following the call opcode with the relative offset from the point of the call to the address of foo. This allows the processor to make a PC relative call instruction: the 4 bytes after the call instruction contain a 32-bit offset to the address of foo.

Using the utility pedump, you can see this process. Consider the following well-known program:

#include <stdio.h>

int main(int argc,char *argv[])
{
        printf("Hello world\n");
}
Compile this with:

lcc -g2 hello.c

Now, disassemble hello.obj with pedump like this:

pedump /A hello.obj45

You will see near the end of the long listing that follows, the disassembled text section:

section 00 (.text)  size: 00020  file offs: 00220


_main:   Size    18


[0000000] 55               pushl  %ebp

[0000001] 89e5             movl   %esp,%ebp

                                                           Line 5

[0000003] 6800000000       pushl  $0   (_$2) (relocation)

[0000008] e800000000       call   _printf (relocation)

[0000013] 83c404           addl   $4,%esp

                                                           Line 6

[0000016] 5d               popl   %ebp

[0000017] c3               ret

[0000018] 0000             addb   %al,(%eax)

Let’s follow the relocation to the function printf. You will see that pedump has a listing of the relocations that looks like this:

Section 01 (.text) relocations

Address  Type    Symbol Index Symbol Name

-------  ----    ------------ ----- ----

4        DIR32           4   _$2

9        REL32          16   _printf

The linker will then take the bytes starting at address 4, and put there the address of the symbol with index 4 in the symbol table of hello.obj. It will search for the address of printf, and put the relative address, i.e. the difference between the address of printf and the address of main+9, in the bytes starting at byte 9.

As you can see there are several types of relocations, each specifying a different way of doing these additions. The compiler emits only three types of relocations:

Type 6: Direct 32-bit reference to the symbol’s virtual address.

Type 7: Direct 32-bit reference to the symbol’s virtual address, base not included.

Type 20: PC-relative 32-bit reference to the symbol’s virtual address.

This last one is the one used in the relocation to printf. Note too that the relative call is relative to the next instruction, i.e. to byte 13 and not to byte 9. Happily for us, the linker now knows this stuff...46

And we are not done with the relocations. If the linker is building a dll, we have to send a message to the program loader to tell it that there was a relocation here, so that in case the dll can’t be loaded at its preferred address, these addresses, i.e. code bytes 9-13 and 4-8, can be patched with the new load address.

This message is sent in the form of a relocation section that is mandatory for dlls. It can be issued for .exes too, but lcclnk doesn’t do it. What for? The executable is always the first to be loaded into its virtual address space, so there is no need to relocate it.

The documentation for the reloc section says:

The Fix-Up Table contains entries for all fixups in the image. The Total Fix-Up Data Size in the Optional Header is the number of bytes in the fixup table. The fixup table is broken into blocks of fixups. Each block represents the fixups for a 4K page. Each block must start on a 32-bit boundary.

Each block of fixups (as the relocation messages for the loader are called) starts with an address and a size. Then follows the information: the address and the type of relocation to be performed, much like the other relocation records. Since we are describing the relocations for a 4096-byte page, the address fits in 12 bits, leaving the other 4 bits of a 16-bit word for the type of relocation.

Clever isn’t it? But not so easy to generate.

Relocations in lcclnk avoid most of the complexities of the x86 architecture. Since we are using the flat model, where all segments are the same, we avoid the complex issues of segment relocations, etc. Lcclnk is not designed to handle relocations to intersegment jumps that could theoretically be generated in another segmentation model.

You will find in the source code of the pedump utility, in file exedump.c, the code that dumps the relocation table.

Counting a symbol’s usage

If the -x option is specified on the linker’s command line, lcclnk increments the symbol’s ‘usage’ field at each relocation. After the link is done, all the symbols that aren’t used (i.e. have a ‘usage’ field of zero) are printed out. This is not very sophisticated, since in the case of a symbol from the .text section (a function), it could be that it is used implicitly by a relative call instruction that doesn’t use any relocation. Lcc doesn’t generate a ‘call myfunction’, but a ‘call +587’, with 587 being the difference between the value of EIP at this point and the position of the function to call, i.e. the function ‘myfunction’ would be 587 bytes from the program counter (eip), relative to the beginning of the next instruction.

Lcc never generates a call without a relocation when the function called is in another module, so this problem appears only when the function is used within the same module.

Linking the debug information

When you write a windows program and you #include <windows.h>, all the definitions that you use from it generate debug information. This means that when you link several modules that include windows.h, you will have the debug information for structures like LOGFONT (or whatever) several times over. This is wasteful of space. The linker should pack that information so that only one definition of the LOGFONT structure is linked in.

Besides doing that, the linker should build the information relative to the debug directory: start and size of each module that has contributed code/data to the link, and other information.

To support CodeView debug information, the linker:

  1.  Generates the header and "NB09" signature.
  2.  Packages the header with .debug$S and .debug$T sections from object files and synthetic (linker-generated) CV4 information, and creates a debug directory entry.
  3.  Generates the subsection directory containing a pointer to each known subsection, including subsections that are linker-generated.
  4.  Generates the sstSrcModules subsection, which specifies the address and size of each module's contribution(s) to the image address space.
  5.  Generates the sstSegMap subsection, which specifies the address and size of each section in the image.
  6.  Generates the sstPublicSym subsection, which contains the name and address of all externally defined symbols. (A symbol may be represented both by .debug$S information and by an sstPublicSym entry.)

The function that coordinates the linking of the debug information is WriteCVInfo(), located in cvlink.c. It performs the process described above, beginning with a call to CreateSignature(), followed by a call to CreateModulesFromCoff().

That function writes for each module contained in the link, a structure like this:

typedef struct OMFModule {
    unsigned short ovlNumber;   // overlay number; not used, always zero
    unsigned short iLib;        // library that the module was linked from
    unsigned short cSeg;        // count of segments in module
                                // (for now always one)
    char           Style[2];    // debugging style "CV"
    OMFSegDesc     SegInfo[1];  // describes segments in module
    char           Name[1];     // length-prefixed module name padded to
                                // a long word boundary
} OMFModule;

The OMFSegDesc structure is just two integers indicating the start address and the size of each module.

Once this part is finished, the linker writes the descriptions of the source modules. This done, the stage is set for linking the type information.

The task of the linker is to find out all the types that are defined several times in different source modules, and emit only one definition for the type into the executable debug information.

Since symbols refer to the corresponding type information using a number that indexes the types table, consolidating the different definitions of a type would leave all those indexes wrong. The linker therefore arranges to modify the indexes to make them point to the new entries. The function that does all this is LinkTypes(), in cvlink.c.
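The index fix-up can be sketched as follows (a minimal illustration, not LinkTypes()’s actual code; real CodeView type indexes start at higher values, small numbers are used here for brevity):

```c
/* 'remap' maps a module's old type index to the index of the record
   that survived the merge; every symbol's type field is rewritten. */
static void patch_type_indexes(unsigned short *sym_types, int n,
                               const unsigned short *remap)
{
    for (int i = 0; i < n; i++)
        sym_types[i] = remap[sym_types[i]];
}
```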

But we are far from finished. The global tables for the debugger have to be built and written to the output file.

The rationale for writing those tables is that they allow the debugger to quickly find the offset of a symbol in the debug information, without having to scan all debug information for all loaded modules. The format of those tables is fairly involved.

They begin with a header that contains:

  1.  The index of the symbol hash function. For this I use the only published algorithm, which has index 10.
  2.  The index of the address hash function. As above, I use the only known one, index 12.
  3.  The number of bytes in the symbol table.
  4.  The number of bytes in the symbol hash table.
  5.  The number of bytes in the address hashing table.

Once this header is out, the linker writes all public symbols. The linker loops through all modules, reading the symbol information (that was previously patched when linking the types), and building the name hash tables and the address hash tables, as it goes along.

The linker uses a fair amount of memory for this operation, which is not really a problem these days. I shudder when I read the documentation specifying what to do in low memory situations, because in the times of windows 3.0 and windows 3.1, all this had to be written within the limitation of 64K for each data object. Today, machines with 32MB are common, and many machines already have 64MB of RAM or more. In this context, and considering that we have a virtual memory system, the linker just builds all those tables in memory, to be able to write them in one pass to the executable. In this, lcclnk certainly has an advantage over Microsoft Link, which wasn’t rewritten for Win32...

Mapping source files to addresses

The debugger needs to know which source line corresponds to which address. This mapping is built by the linker from the COFF line number information generated by the compiler.

The central table is the sstSrcModule table (type 0x127), a fairly complicated construction that builds a many-to-many mapping between source files, modules, and source lines. Let’s use the following simple example to see the issues involved:

File f0.c:

#include <stdio.h>

int f0(int a)
{      // line 3
return a+6 ;   // line 4
}      // line 5

#include "f1.c"

int main(void)
{
foo1(5) ;
printf("hello\n") ;
}

File f1.c is:

int foo1(int a)
{
return a+9 ;
}
Here we have :

  1.  Several source files contribute to the object code file 
  2.  A source file contributes several different portions to the object file.

We need, then, to specify three code ‘segments’:

  1.  The code of function f0
  2.  The code of function foo1
  3.  The code of ‘main’

We have to associate each section with a consecutive range of addresses, and with a sequence of line number information.

Lcclnk builds the segment list, adding a segment whenever a .file special symbol is seen during the link. For each module we then have a list describing all the files seen for that module. The linker tries to avoid duplicates, i.e. when a .file symbol is seen, it verifies that it is not the same as the last one before adding it to the table.

The information is generated in a table, as mentioned before: the sstSrcModule structure. This structure has several parts.

It begins with a header describing the information that follows.

For each file, a table describing the modules and the line numbers is built. Roughly, the structure of the table is a sequence of file information blocks describing each source file, for each module that participates in the link.

The file header

cFile          2 bytes
cSeg           2 bytes
baseSrcFile    4 * cFile bytes
Start/End      8 * cSeg bytes
seg            2 * cSeg bytes
This header consists of the following :

  1.  The number of files that participate in the object module (cFile).
  2.  The number of code segments that make up the module (cseg).
  3.  A table of integers specifying the offsets, relative to the beginning of the table, where each file’s information will be found (baseSrcFile).
  4.  A table of 2 integers per segment, specifying its start address and its end address (Start/End pairs).
  5.  A table of numbers specifying which section will receive code from this piece of code. This is always one, i.e. the text section, but line number information might later be generated for data descriptions too, for instance to be able to see the place in the source code where a data item is defined (seg).
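The size of this header follows directly from the field list above; a sketch (an illustrative helper, and the rounding to a 4-byte boundary is an assumption):

```c
/* Header size: two 2-byte counts, a 4-byte offset per file, an 8-byte
   start/end pair per segment, and a 2-byte section index per segment. */
static unsigned src_module_header_size(unsigned cFile, unsigned cSeg)
{
    unsigned size = 2 + 2 + 4 * cFile + 8 * cSeg + 2 * cSeg;
    return (size + 3) & ~3u;   /* pad to a 4-byte boundary (assumed) */
}
```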

The information per source file

cSeg           2 bytes
baseSrcLn      4 * cSeg bytes (offsets to the line number table of each piece)
Start/End      8 * cSeg bytes (address range of each piece)
Name len.      length-prefixed file name


For each source file, we write the number of pieces of code that this source contributes. For instance, in the example above, file f0.c contributes two separate pieces of code: the function ‘f0’ and the function ‘main’, separated by a piece that comes from another source file (f1.c). For this example we would write 2.

Afterwards, we build an array of offsets to the line number information that will be written for each of those pieces. These offsets are relative to the beginning of the whole sstSrcModule table.

Then we write a length-prefixed string with the file name proper. Under lcc-win32 the file name will always be a fully qualified file name, containing an absolute path.




cPair          2 bytes
addresses      4 * cPair bytes
line numbers   2 * cPair bytes






Then, at last, we write the line numbers for each piece of code from this source file, as two parallel arrays holding the addresses and their corresponding line numbers, i.e. an array of 4-byte integers for the addresses, and an array of 2-byte shorts for the line numbers.
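A debugger-side lookup over these parallel arrays might be sketched like this (a hypothetical helper, not the debugger’s own code):

```c
#include <stdint.h>

/* Given parallel arrays (addresses sorted ascending), return the line
   whose entry covers 'addr', or -1 if addr precedes the first entry. */
static int line_for_address(const uint32_t *addrs, const uint16_t *lines,
                            int n, uint32_t addr)
{
    int found = -1;
    for (int i = 0; i < n && addrs[i] <= addr; i++)
        found = lines[i];   /* last entry at or below addr wins */
    return found;
}
```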

You will find in the source code of the pedump utility, in the file exedump.c, the function DumpSstSrcModule, that dumps this table.

This table is an example of the complex tables the linker has to build for the debugger. Maybe I would have been able to design a much simpler ad-hoc format, but using the Microsoft standard ensures that the generated code is compatible with many other tools and, last but not least, allowed me to have a reference implementation: it suffices to start the MSVC debugger to see whether I generated the tables correctly.

Building the executable headers

Just before the executable is finally written to disk, the linker sets the corresponding fields in the headers, so that the loader will accept it.

Under Windows 95, holes could be left in the file by moving the file pointer beyond the end of file without actually writing any data. This had the very bad effect of letting garbage creep into the executable: any contents of memory could end up in it, even the documents you were editing or the mail messages you were reading while the linker was working. I corrected that problem, which was an OS problem rather than a linker problem, by writing zeroes into the empty space between sections.

Loaders are not really user-friendly. They will load or refuse to load, and the linker writer has few clues as to what is wrong when the loader just says: ‘This is not a valid executable’. Obviously, it is impossible for the loader to give a meaningful message anyway, since the average user wouldn’t understand anything about sections, loader tables, etc.

The source distribution of lcclnk.

The source files of lcclnk are located in \lcc\src\lcclnk. They are the following :

  •  lcclnk.c. This is the main linker file. Here the relocations are done, the executable format is written to disk, etc.
  •  cvlink.c. This is the part concerned with the linking of the debug information.
  •  cvtres.c. This converts a resource binary file into an object file.
  •  expand.c. This file performs the wild-card file name expansion, i.e. it will transform an argument like *.obj into a list of files.

Dynamic linking


Within the context of the interpreter development, I needed to be able to load an object file into the running program. The idea is that the interpreter or the debugger can load any library object file when the user just enters a command at the interpreter’s prompt.

This dynamic linking package is a nice thing to have when you do not want the problems associated with building a full-blown executable or a dll.

The first motivation for doing this was in the debugger context. You have just modified a function, and you want to recompile and go on with the new version. The object file for the new function is loaded by the debugger, and in the place of the old version a jump instruction refers to the freshly loaded and (hopefully) corrected version.


In principle then, the algorithm is the following :

  •  Load the object file into memory. I did this with memory mapped files.
  •  Copy the code section into a freshly allocated memory block.
  •  Copy data and bss segment (initialized and not initialized data) to the data segment, in a freshly allocated memory block too. 
  •  Relocate the code and the data segments, relative to their loading address. The base of the allocated code block becomes the base of the code segment, and likewise for the data segment. All relocations are relative to these addresses. The dynamic linker handles only two types of relocations: direct addresses, and relative addresses for relative call instructions.
  •  The still unresolved symbols are put in a symbol table. The symbol table of the running executable is then opened, and the dynamic linker figures out where each address is by looking at the load address field in the extended ‘optional’ header. These addresses are then patched into the code and data segments of the loaded object file.
  •  The symbol table of the loaded object file is used to build a small ‘symbol table’, i.e. just a list of names indicating which functions and data are defined in the loaded file. This small symbol table will be used to find out where a concrete procedure is located and to return its address through the GetProcedureAddress call.
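The two relocation kinds can be sketched as the computation of the value to patch into a 4-byte field (names and the enum are illustrative, not the dynamic linker’s actual code):

```c
#include <stdint.h>

enum { RELOC_DIRECT, RELOC_RELATIVE };

/* A direct relocation stores the target address itself; a relative one
   stores the displacement from the end of the 4-byte field being
   patched, as a call rel32 operand expects. */
static uint32_t reloc_value(uint32_t field_addr, uint32_t target, int kind)
{
    if (kind == RELOC_DIRECT)
        return target;
    return target - (field_addr + 4);   /* self-relative displacement */
}
```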

Binding the object file

Object files can reference functions from external DLLs. When they do so (and most of them will), a problem arises: how will the dynamic loader resolve those references?

The solution I retained was to build several utilities to support dynamic loading. The first is a small dll that generates a list of all the APIs of windows in a text file called ‘Apilist.txt’, which resides in the ‘lib’ directory of the lcc-win32 distribution. This file is useful in other contexts: for instance, when you forget an import library and the link step fails with ‘unresolved references’, the IDE can look into this list, find out which import library contains the flagged function, and automatically add the needed library to the link step.

The format of the apilist.txt file is very simple:

  •  The first line contains the number of DLLs used.
  •  The next <n> lines contain the names of each dll.
  •  Then the names follow, each with a number indicating which DLL contains it.

The file is built using the list of known DLLs in \lcc\buildlib\dlls.txt. If you want to add your own dll to the apilist.txt file, just add the name of the dll to this file, and copy your .exp file to the \lcc\buildlib directory.

For instance, if you want to add ‘mydll.dll’ to the list of known DLLs in apilist.txt, you build the DLL. The linker will automatically generate a .exp and a .lib for you. Copy the .exp, in this case mydll.exp, to the \lcc\buildlib directory. Then add a line ‘mydll.dll’ to the dlls.txt file in \lcc\buildlib, and run buildapi.exe, which lives in the \lcc\bin directory.

Once this list was built (no easy task: at last count there were 10,810 entries!), the ‘bind’ utility just has to look up all external references from an object file in this list, to determine which DLL, if any, contains a given symbol.
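The lookup that bind performs can be sketched like this (an illustrative parser, not the real utility; the exact spelling of the name/number pairs and the zero-based index are assumptions):

```c
#include <stdio.h>
#include <string.h>

/* Parse an apilist.txt-style listing (DLL count, DLL names, then
   "name index" pairs) and report which DLL exports 'name'. */
static int find_dll_for(const char *listing, const char *name,
                        char *dll_out)
{
    char dlls[16][64], tok[128];
    int ndlls, idx, pos = 0, adv;

    if (sscanf(listing, "%d%n", &ndlls, &adv) != 1)
        return 0;
    pos += adv;
    for (int i = 0; i < ndlls && i < 16; i++) {
        if (sscanf(listing + pos, "%63s%n", dlls[i], &adv) != 1)
            return 0;
        pos += adv;
    }
    while (sscanf(listing + pos, "%127s %d%n", tok, &idx, &adv) == 2) {
        pos += adv;
        if (strcmp(tok, name) == 0 && idx >= 0 && idx < ndlls) {
            strcpy(dll_out, dlls[idx]);   /* found: copy the dll name */
            return 1;
        }
    }
    return 0;
}
```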

The list of needed dlls is appended to the end of the object file. This list has the following format :



IMPORTS
_somefn@16 somefn
$$end$$


The keyword ‘IMPORTS’ starts the list, and the keyword ‘$$end$$’ ends it. This is all in ASCII, so you can read it with a hex editor.

The import librarian

This utility allows you to build an import library from a dll executable. This is nice when you do not have the source code. For your own dlls, you should leave this problem to the linker, which has been changed to take care of it: it builds an import library automatically when building a dll.

Import libraries

Import libraries are just a series of ‘stubs’ that will satisfy the linker, but contain no code. They are very easy to write, since they are just an indirection through a function pointer table.

When you call any function that is in a dll (all standard C library functions, windows functions, and any other functions the executable imports), you do not call the code of that function directly; you call a ‘stub’ that consists of only one instruction:

call jmptable[10]

This ‘jmptable’ is the .idata section. It is filled with the actual addresses at load time, when the loader searches the PATH for the dll in question and determines the addresses that need to be written into the table.

The task of the ‘import’ utility is just to generate a series of object files in the .lib format so that they contain pointers to that jump table.

Basically, building an import library is the process of creating a series of object files that will in most cases contain just the 6 bytes of that one instruction. Obviously, it is more complicated than that: the whole object file structure has to be built, with the symbol table, the string table, and the different idata sections.
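The 6 bytes in question can be sketched as follows (assuming the indirect-call encoding FF 15 through the .idata slot; an indirect jmp, FF 25, occupies the same 6 bytes; little-endian byte order as on x86):

```c
#include <stdint.h>
#include <string.h>

/* Emit the one-instruction stub: call dword ptr [slot_addr]. */
static void emit_stub(uint8_t out[6], uint32_t slot_addr)
{
    out[0] = 0xFF;
    out[1] = 0x15;                 /* indirect call opcode bytes */
    memcpy(out + 2, &slot_addr, 4);/* 32-bit address of the .idata slot */
}
```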

The import librarian comes in two flavors: one (called implib.exe) that emits the library by reading its information from the dll, and another, called ‘buildlib.exe’, that builds the library from an ASCII description of names.

The importance of ‘implib’ will decrease in the future, since the linker has been modified to use buildlib directly. At the end of the linking process, lcclnk will emit a .exp file for the buildlib utility, which makes it possible to include the building of the library in the link.

The format of the generated ascii .exp file is very simple:

  1.  The first line contains the name of the dll we are building this import library for.
  2.  The next lines contain three columns, the last being optional. The first column contains the name of the symbol as emitted by the compiler. The second column contains the name of the symbol as exported from the dll. The third, optional, column may contain the keyword ‘data’ to indicate to buildlib that the export is a data item, so that no ‘jmp’ instruction is needed.

There aren’t any checks verifying the correspondence of those names; this allows the user to introduce ‘aliases’ for a function exported from a dll. I use this feature intensively in the CRTDLL.EXP file, where there is always the problem of the underscores Microsoft has added to many functions of the standard library, like _open, etc. This is easily solved by making two aliases for the real open dll function: _open and __open. With this, both names will be accepted and produce the same result.
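As an illustration, here is a hypothetical .exp fragment in the format just described, showing the _open/__open aliasing discussed above (the ‘data’ line uses an invented name):

```
crtdll.dll
_open _open
__open _open
_somedata _somedata data
```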

Summary: Building an import library from a dll.

The usage of the implib utility is really simple: you just give it the name of a dll. It eliminates the .dll suffix, replaces it with .lib, and builds a library file with stubs for all the exported functions of the dll.

The resource compiler

The .rc language specifications.

This language was developed to specify resource files. It describes the basic resource types, i.e. the following objects:

Dialog boxes

Menus

String tables

Accelerator tables

Icons, cursors, and bitmaps

User data (RCDATA) and message tables
Each object is introduced in the following manner:

Name or ID OBJECTNAME [Options]

For instance, to specify a Menu object, you would write:

2800 MENU Discardable

The language supposes that there is a preprocessor that allows for #defines, #if, #ifdef, etc. The specifications of the preprocessor are identical to the C preprocessor, so I will not discuss it further here.

This language is only loosely ‘specified’. The specification is actually the behavior of Microsoft’s resource compiler ‘rc’. For instance, rc accepts that instead of writing a keyword object name, you can replace it with a number from 1 to 16, specifying the resource type in numerical form.


Initially, I was tempted to use the public domain ‘rcl’ resource compiler developed by Gunther Ebert, in Leipzig. Gunther used the standard Unix approach of lex/yacc.

His compiler is constructed with a ‘lexer’ that is written in the ‘lex’ language specifications, and a parser, specified in the ‘yacc’ notation. The ‘lexer’ is built by a program called ‘lex’ that generates a lexer from the specifications it reads. The parser is generated by the ‘yacc’ program, that writes a parser using the specifications given in the ‘yacc’ language.

This approach is very flexible, but it had several disadvantages for the purposes of a resource compiler:

It is very hard to debug a ‘yacc’ specification.

Yacc and ‘flex’ are not readily available under Win32. I had to use my linux machine to compile the lexer+parser, then use ftp to move the result to windows, and then recompile; and this for each change done to the specification.

It was very complicated to do apparently simple things like processing optional arguments etc.

So I decided to use the same approach as described by Fraser and Hanson in their book about lcc: the recursive descent approach.

The resulting parser was much smaller, and  much easier to debug.

The parser

Parsing is done following a rather simple ‘algorithm’:

while (not EndOfFile) do {
    token = readtok();
    if (isAnObjectToken(token)) {
        // parse the corresponding object specification
    }
}

The object tokens, i.e. the tokens that introduce an object, are the keywords that name a resource: DIALOG (and DIALOGEX), MENU, ICON, BITMAP, CURSOR, STRINGTABLE, RCDATA, MESSAGETABLE, and ACCELERATORS.

Simple, isn’t it? The parser just ignores all input except those words. When it sees one of them, it parses an object specification that should leave the last token untouched.

To prevent the resource compiler from seeing a token in another context, it #defines the pre-processor symbol RC_INVOKED. Most definitions in the windows header files are enclosed in a pair of ifdefs:

#ifndef RC_INVOKED
typedef ...
#endif
This prevents random matches. If you use structure definitions in header files the resource compiler will see, take care to enclose your definitions as shown. In any case, the consequences of a random match are not very bad, since it will only provoke a cascade of error messages.

The function gettok() classifies the input passed to the parser by the preprocessor. It returns:

A character, i.e. a value between 32 and 127. This will be returned for commas, brackets, etc.

A constant between 1000 and 1999 for different classes of input, indicating whether the input is a number, an identifier, a floating point constant, a string, or other things. The value will be left in a global variable, so that the rest of the parser can use it.

A constant between 2000 and 3000 if the input is an rc keyword.
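These ranges can be sketched as follows (the enum names are illustrative, not lrc’s actual identifiers):

```c
/* Token classes live in 1000..1999, rc keywords in 2000..3000, and
   anything below is a plain character returned as itself. */
enum { TK_NUMBER = 1000, TK_IDENT, TK_FLOAT, TK_STRING,   /* classes  */
       TK_MENU = 2000, TK_DIALOG, TK_STRINGTABLE };        /* keywords */

static int token_kind(int tok)
{
    if (tok >= 2000 && tok <= 3000) return 2;   /* rc keyword */
    if (tok >= 1000 && tok < 2000)  return 1;   /* classified input */
    return 0;                                   /* plain character */
}
```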

The top level parser function looks like this:

static int Parse(void)
{
    int ttype;

    ttype = gettok();
    do {
        switch (ttype) {
        case EOI:
            break;
        // other cases omitted for brevity
        }
        ttype = gettok();
    } while (ttype != EOI);
    return 0;
}

Each of the functions called parses the grammar for the description of its object, leaving the last token in the input. Here is a description of each one:

Dialog statement

nameID DIALOG [ load-mem] x, y, width, height
   . . .

The ParseDialog() function retrieves the last integer/identifier and uses it as the name for the dialog. Then the ‘load-flags’ are parsed but ignored. Those flags were very important under windows 3.1, but they are no longer used; they were kept for backward compatibility with resource scripts written for windows 3.1.

The function goes on parsing the coordinates of the dialog box. After that, we arrive at the optional statements. Those can be:

FONT Followed by a point size, a font name, and if it is a DIALOGEX statement, the weight of the font and an italics flag.

CAPTION Followed by a character string indicating the caption text.

CLASS statement indicating the class of the dialog. Normally this is not used.

CHARACTERISTICS Followed by a 32-bit number. Windows does not use this, and the resource editor never generates it.

STYLE followed by the window style of the dialog box window.

EXSTYLE followed by the extended style of the dialog box window. This can be used only within a DIALOGEX statement, and not a normal DIALOG statement.

LANGUAGE followed by two numbers specifying the language and the sub-language code.

VERSION Followed by a version number for the dialog. Windows does not use this.

Once these preliminaries are done, we can start parsing the contents of the dialog box: the different control statements. These statements all have the same syntax, except the general CONTROL statement.

NAME text, id, x, y, width, height [, style [, extended-style]]

Here is a table indicating the usage of each of the named controls:

GROUPBOX  Builds a transparent rectangle to group some items into a common group.

LTEXT  Left-aligned static text.

CTEXT  Centered static text.

RTEXT  Right-aligned static text.

LISTBOX  Builds a listbox. Normally the ‘text’ field is empty.

EDITTEXT  Builds an edit field. Normally the ‘text’ field is empty.

ICON  Creates a static text window at the position specified. Since icons have a fixed width and height, those fields are not necessary. If you specify them, the resource compiler will read and ignore them.

PUSHBOX  Creates a button with the BS_PUSHBOX style.

PUSHBUTTON  Creates a normal push button.

RADIOBUTTON  Creates a radio button.

SCROLLBAR  Creates a scroll bar.

STATE3  Creates a three-state check box.

COMBOBOX  Creates a combo box.

CHECKBOX  Creates a check box.

AUTORADIOBUTTON  Radio button with the ‘automatic’ style.

AUTOCHECKBOX  Check box with the ‘automatic’ style.

DEFPUSHBUTTON  The default button, used when you press return.



The general Control statement has the following form:

CONTROL text, id, class, style, x, y, width, height [, extended-style]

All the statements named above can be written using the general CONTROL statement; they are just shorthands for it.

The ‘extended style’ can only be used if it is an extended dialog (DIALOGEX).

MENU Statement

menuID MENU [load-mem]
   . . .


This statement introduces a menu description, with each of the POPUPS described within a BEGIN/END block.

The binary format generated is as follows:

  1.  Normal resource header with type 4 (menu).
  2.  A flags word. If the flags contain the ‘POPUP’ flag (0x10), this indicates an implicit begin/end block, and the text of the popup follows in unicode characters. If not, it is a menu item.
  3.  A menu item contains a flags word, followed by the ID of the menu item (the parameter you will receive when processing WM_COMMAND messages), followed by the text of the menu item in unicode characters.
  4.  The last item of the menu contains the flag 0x80 to indicate the end of the menu description.
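A walker for the simplest case, a flat menu without popups, might be sketched from the description above (hypothetical code; the resource header is assumed to be already skipped, and the item text is assumed to be NUL-terminated Unicode):

```c
#include <stdint.h>

/* Count the items of a flat (popup-free) binary menu: each item is a
   flags word, a 16-bit command id, and NUL-terminated Unicode text;
   the entry carrying flag 0x80 is the last one. */
static int count_menu_items(const uint16_t *item)
{
    int count = 0;
    uint16_t flags;
    do {
        flags = *item++;   /* MF_* flags; 0x80 marks the end */
        item++;            /* the id delivered with WM_COMMAND */
        while (*item++)    /* skip the Unicode item text */
            ;
        count++;
    } while (!(flags & 0x80));
    return count;
}
```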

The different image resources (icons, bitmaps, cursors)

nameID ICON [load-mem] filename

nameID BITMAP [load-mem] filename

nameID CURSOR [load-mem] filename

The specifications for those resources are like this:

The ICON statement in the .RC script does not create a single resource object, but creates a group of resources. This allows Windows-based programs a degree of device independence through the use of different pixel bitmaps on hardware configurations with differing capabilities. Icons, most often designed for differing numbers of pixel planes and pixel counts, are grouped and treated by Windows as a single resource. In the .RES and .EXE files, however, they are stored as a group of resources. These groups are stored in a .RES file with the components first (in this case the different icons [type 3]) and a group header following (type 14). The group header contains the information necessary to allow Windows to select the proper icon to display.

The components have the following structure:

    [Resource header (type = 3)]

    [DIB Header]

    [Color DIBits of icon XOR mask]

    [Monochrome DIBits of AND mask]

I had a lot of trouble building that resource header. I think I have gotten it right now, but beware... As a test of the correctness of ‘lrc’, I used ‘weditres’, which displays the icons and decodes that resource header.

Basically, the compiler treats all of these the same: it reads a file specification and puts that file unchanged in the resource file. No magic is performed; for instance, for icons, the XOR of the image is not computed to produce a black and white version next to the color one, and the compiler doesn’t generate different icons for vga, svga, and other resolutions either... So please edit your icons with another tool beforehand. Of course, the resource editor will eventually have an icon editor, but this is not done yet...

The string table resource

STRINGTABLE [load-mem] [optional-statements]
stringID string
   . . .


As usual, the ‘load’ flags are parsed but ignored. The optional statements can be ‘Language’, ‘Characteristics’ or ‘Version’ keywords.

The format of each string is just a numerical ID, followed by the string enclosed in quotes. The usual expansions for C character strings are performed.

Strings are clustered together in blocks of 16 strings, keyed by the numerical ID. The algorithm is rather complicated:

  1.  The strings should be sorted by ID, because the linker lcclnk will not sort them when transforming the resource into object files.
  2.  All strings are transformed into unicode before being stored. They are stored preceded by a word indicating the string length; they are NOT zero terminated.
  3.  The lower 4 bits of the id indicate a position within a block of 16 strings with consecutive ids.
  4.  The upper 12 bits of the id, plus one, give the resource ID for a string resource of type 6 (the string table resource type).

This means that if you give a string an ID that is isolated from the other ids, you will waste space in the resource file, since an almost empty block containing only that string plus the resource header will be stored.
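The id arithmetic of points 3 and 4 can be sketched as:

```c
/* A string with numeric id N lands in slot (N & 0xF) of the block
   whose resource id is (N >> 4) + 1. */
static void string_block(unsigned id, unsigned *block_id, unsigned *slot)
{
    *block_id = (id >> 4) + 1;
    *slot     = id & 0xF;
}
```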

Here is the essential part of the string-resource writing function:

idx = 0;                       // running index through the table
buffer = allocate(20000);      // buffer space
bufferlen = 20000;             // its length
do {
    memset(slots,0,sizeof(slots));  // clean the slots
    first = StringTable[idx]->id;   // first index
    len = 32;                       // 16 * sizeof(WORD)
    lastidx = idx;
    // the variable ‘strings’ is the total number of strings in the table
    while (lastidx < strings) {     // fill a slot of 16 strings
        id = StringTable[lastidx]->id;  // get this string id
        // test if this string belongs to this block
        if ((first & ~0xF) == (id & ~0xF)) {
            slots[id & 0xF] = StringTable[lastidx];
            len += 2*(StringTable[lastidx]->len);
            lastidx++;
        }
        else break;                 // this block is full
    }
    // test if the buffer is big enough
    if (len > bufferlen) {
        len += 500;
        buffer = MyRealloc(buffer,len);
        bufferlen = len;
    }
    // Now copy all the strings found into the output buffer
    p = buffer;
    for (i = 0; i < 16; i++) {
        if (slots[i] && slots[i]->len) {
            len = slots[i]->len;
            p = WriteWord(p,(WORD)len);
            mbstowcs((WORD *)p,slots[i]->Text,len);
            p += 2*len;
        }
        else p = WriteWord(p,0);
    }
    // write the block as a resource (call elided in the original):
    // length = p - buffer, resource id = 1 + (first >> 4)
    idx = lastidx;
} while (idx < strings);

The string table format uses two bytes per character, plus the overhead of storing the blocks. If you want to save space, it is much more efficient and faster to use a structure with a tag and a pointer to the string, and to do the lookup at run time yourself. As an indication: earlier versions of wedit kept the string table in resource format; when I translated it into a normal C structure with ASCII characters, the program shrank by 22K.

The RCDATA resource

nameID RCDATA [[load-mem]]
   . . .

This resource is not extremely difficult to handle. Just parse a normal header (its type in the resource header is 10), then read a series of strings or integers between the BEGIN and END keywords. A thing to remember is that integers, unless they end with ‘L’, are 16 bits...

The Message Table resource

nameID MESSAGETABLE filename

This resource type is generated by the message compiler mc. This tool hasn’t been added to lrc’s capabilities, so you will have to get it from another compiler; probably mc’s capabilities will be incorporated into lrc at a later stage of development. For the time being, it is only possible to use already compiled message tables. This is used in NT service executables, for instance, or in other specialized applications. The format of the binary resource is very similar to the string table resource format.



acctablename ACCELERATORS
event, idvalue, [type] [options]
   . . .

The acctablename token is either a number or a name. The only optional statement that lrc recognizes here is the LANGUAGE keyword.


Specifies the keystroke to be used as an accelerator. It can be any one of the following character types:


A single character enclosed in double quotation marks ("). The character can be preceded by a caret (^), meaning that the character is a control character.


An integer value representing a character. The type parameter must be ASCII.

virtual-key character

An integer value representing a virtual key. The virtual key for alphanumeric keys can be specified by placing the uppercase letter or number in double quotation marks (for example, "9" or "C"). The type parameter must be VIRTKEY.


idvalue

Specifies a 16-bit unsigned integer value that identifies the accelerator.


type

Required only when the event parameter is a character or a virtual-key character. The type parameter specifies either ASCII or VIRTKEY; the integer value of event is interpreted accordingly. When VIRTKEY is specified and event contains a string, event must be uppercase.


options

Specifies the options that define the accelerator. This parameter can be one or more of the following values:


NOINVERT

Specifies that no top-level menu item is highlighted when the accelerator is used. This is useful when defining accelerators for actions such as scrolling that do not correspond to a menu item. If NOINVERT is omitted, a top-level menu item will be highlighted (if possible) when the accelerator is used.


ALT

Causes the accelerator to be activated only if the ALT key is down.


SHIFT

Causes the accelerator to be activated only if the SHIFT key is down.


CONTROL

Defines the character as a control character (the accelerator is only activated if the CONTROL key is down). This has the same effect as using a caret (^) before the accelerator character in the event parameter.

The ALT, SHIFT, and CONTROL options apply only to virtual keys.

Resource format

The format of the accelerators is simple: they consist of a normal resource header, followed by a table of structures like this:

typedef struct tagAccelTableEntry {
   WORD  fFlags;   // This contains the flag fields above
   WORD  wAscii;   // Contains the value of the accelerator
   WORD  wId;      // Contains the value of the identifier
   WORD  padding;  // not used
} AccelTableEntry;

The fFlags field contains the values for the qualifiers described in the 'Syntax' header above. There is one called ACC_LAST, which appears in the last member of the table. Basically, this resource consists of a normal resource header followed by the entries, each containing the structure above.

I always write a last, empty table member, with all its fields set to zero except fFlags, which receives the value ACC_LAST. I am not sure whether this is really necessary, since the number of entries can easily be deduced by dividing the size of the resource by the size of each element.
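A sketch of writing such a table, under the assumption that ACC_LAST has the conventional value 0x80; the function name is hypothetical, and uint16_t stands in for WORD:

```c
#include <stdio.h>
#include <stdint.h>

#define ACC_LAST 0x80    /* assumed value of the end-of-table flag */

typedef struct tagAccelTableEntry {
    uint16_t fFlags;     /* ALT/SHIFT/CONTROL/VIRTKEY flag bits */
    uint16_t wAscii;     /* value of the accelerator key */
    uint16_t wId;        /* value of the identifier */
    uint16_t padding;    /* not used */
} AccelTableEntry;

/* Write the real entries, then one zeroed entry carrying only ACC_LAST.
   Returns the number of entries written, terminator included. */
static size_t write_accel_table(FILE *f, const AccelTableEntry *e, size_t n)
{
    size_t written = fwrite(e, sizeof *e, n, f);
    AccelTableEntry last = {0};
    last.fFlags = ACC_LAST;          /* marks the end of the table */
    written += fwrite(&last, sizeof last, 1, f);
    return written;
}
```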

Limitations of the resource compiler

The FONT resources are not supported yet. They will be added in the future as time permits...

The same is true for the user defined resource types.

The source distribution of the resource compiler

The sources of the resource compiler are located in \lcc\src\lrc. The following files are in there :

  •  parse.c. Main file of the resource compiler. Everything, from lexing to resource file generation is contained in this file.
  •  ncpp.c. The preprocessor for the resource compiler. It is almost the same code as the identically named file in the C compiler distribution. It is there because I decided to keep the modifications made to the preprocessor for the C compiler separate from those for lrc, which needs much less complexity for its own modest needs.

The main file contains a lexer built on a pattern very similar to the one lcc uses. It returns an integer with the token parsed so far. The parsing routines decide how the tokens are interpreted.

The librarian

lcclib was the last utility missing to complete lcc’s environment. Its task is to store several object files into one file called a ‘library’, which contains the object files and three headers describing their contents to the linker. The object files are stored without any modifications. No compression or other processing is required.

The structure is very similar to the one used under UNIX.

The signature

All archives have a signature that marks them as a library. Under win32 the signature used is

!<arch>\n

that is, the seven characters "!<arch>" followed by a newline. All .lib files have those 8 characters at the very beginning of the file.

The « first linker member »

After the signature there are normally three headers, called ‘members’ in techspeak. The first one is a very simple table that tells the linker which symbol is in which object file. The name of the first linker member is "/". Its format is the following:







Number of Symbols

Unsigned long containing the number of symbols indexed. This number is stored in big-endian format. Each object-file member typically defines one or more external symbols.

Offsets (array of n unsigned longs)

Array of file offsets to archive member headers, in which n is equal to the Number of Symbols. Each number in the array is an unsigned long stored in big-endian format. For each symbol named in the String Table, the corresponding element in the Offsets array gives the location of the archive member that contains the symbol.



String table

Series of null-terminated strings that name all the symbols in the directory. Each string begins immediately after the null character of the previous string. The number of strings must be equal to the value of the Number of Symbols field.

The elements in the Offsets array must be arranged in ascending order. This fact implies that the symbols listed in the String Table must be arranged according to the order of archive members. For example, all the symbols in the first object-file member would have to be listed before the symbols in the second object file.
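Since these fields are big-endian and the x86 is little-endian, reading them takes an explicit byte-order conversion. A portable sketch that works regardless of the host's endianness:

```c
#include <stdint.h>

/* Convert a 4-byte big-endian field (as stored in the first linker
   member) to a host-order unsigned 32-bit integer. */
static uint32_t be32_to_host(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```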

The second linker member

The second linker member has the name "/" as does the first linker member.

Lcclnk does not use it. The second linker member includes symbol names in lexical order, which enables faster searching by name.







Number of Members

Unsigned long containing the number of archive members.

Offsets (array of m unsigned longs)

Array of file offsets to archive member headers, arranged in ascending order. Each offset is an unsigned long. The number m is equal to the value of the Number of Members field.



Number of Symbols

Unsigned long containing the number of symbols indexed. Each object-file member typically defines one or more external symbols.

Indices (array of n unsigned shorts)

Array of 1-based indices (unsigned short) which map symbol names to archive member offsets. The number n is equal to the Number of Symbols. For each symbol named in the String Table, the corresponding element in the Indices array gives an index into the Offsets array. The Offsets array, in turn, gives the location of the archive member that contains the symbol.



String table

Series of null-terminated strings that contain the symbol names, sorted alphabetically.

Once all those headers are written, the only thing to do is to write the object files, each preceded by a header.
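A sketch of how a linker could exploit the second member: the string table is sorted, so a binary search over the names yields a 1-based entry in the Indices array, which in turn selects the Offsets entry for the object file. The on-disk arrays are flattened here into parallel C arrays, and the names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Return the archive member offset for symbol 'sym', or -1 if absent. */
static long find_member_offset(const char *const *names,
                               const uint16_t *indices,
                               const uint32_t *offsets,
                               size_t nsyms, const char *sym)
{
    size_t lo = 0, hi = nsyms;
    while (lo < hi) {                        /* names are sorted: bisect */
        size_t mid = lo + (hi - lo) / 2;
        int c = strcmp(sym, names[mid]);
        if (c == 0)
            return (long)offsets[indices[mid] - 1];  /* 1-based index */
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return -1;                               /* symbol not in the library */
}
```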

Usage of lcclib

An option consists of an option specifier, which is either a dash ( - ) or a forward slash ( / ), followed by the name of the option. Option names cannot be abbreviated.

Some options take an argument, specified after a colon (:). No spaces or tabs are allowed within an option specification. Use one or more spaces or tabs to separate option specifications on the command line.

Option names and their keyword or filename arguments are not case sensitive, but identifiers used as arguments are case sensitive.

lcclib processes options in the order specified on the command line and in command files. If an option is repeated with different arguments, the last one to be processed takes precedence.


/VERBOSE

Displays details about the progress of the session. The information is sent to standard output and can be redirected to a file.


/LIST

Displays information about the output library to standard output. The output can be redirected to a file. You can use /LIST to determine the contents of an existing library without modifying it.


/OUT:filename

Overrides the default output filename. By default, the output library is created in the current directory, with the base name of the first library or object file on the command line and the extension .LIB.


/REMOVE:object

Omits the specified object from the output library. LCCLIB creates an output library by first combining all objects (whether in object files or libraries), and then deleting any objects specified with /REMOVE.

Lcclib source files

The whole code of lcclib is in a single file, appropriately called lib.c. It is very small, but it implements all the machinery described above.

The resource editor 

I wrote a resource editor for JFC Informatique & Média in 1994, for Windows 3.1. That editor was specially tailored to them and started from their need to produce high-quality dialog boxes and resources for their applications.

They used the ‘MDI’ paradigm in all their applications. The problem was that it wasn’t easy to create the controls in the MDI windows by hand. They attempted this at the beginning, but quickly came to the conclusion that they needed a tool for writing and designing their dialog boxes interactively. They paid me to do it, and I delivered an editor that could design dialog boxes without using dialog units, as they desired. The problem with dialog box units, as you may have experienced, is that when you use them it is impossible to specify positions in pixels. This means that, depending on the circumstances, your controls will miss a pixel here and there and will not be exactly aligned. This was unacceptable for them.

I decided to use the experience I had accumulated about resources and resource files, to build a resource editor for Lcc-Win32.

The best place for starters was evidently the code of DLGEDIT, a simple dialog box editor that Microsoft distributes freely with the SDK. That code is clear and more or less easy to follow if you understand what the program is doing. So I think the best is to start with just that: what the program is doing, i.e. the format of the .RES files under Win32.

The format of the .RES files

The only document that briefly describes the format of the resource files is the one by Steve Firebaugh that appears in the MSDN CDs under the strange title of ‘Windows NT File format specification’, sub item ‘Win32 Binary Resource formats’

There are currently about a dozen predefined resource types. These include Menus, Dialogs, Accelerators, Strings, Icons, Cursors, Bitmaps, Fonts, and Version. These resources are used by the Windows system to define the appearance of the application window, or to store dialog templates or other data. The resource script (RC file) allows the application writer to represent these features in an easily editable form.

The general format of the entire resource file is simply a number of resource file entries concatenated together. Each resource contains the information about a single resource, such as a dialog template or a string table.

Each entry consists of a resource header followed by the data for that resource. A resource header is composed of four elements: two DWORDs containing the size of the header and the size of the resource data, the resource type, the resource name, and additional resource information. The data for the resource follows the resource header and is specific to each particular type of resource.

The reasons for this are simple: to go from one resource to the next within the file, you just add the two size fields for the header and the data, and you are all set... unless you forget the alignment, of course.
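The walk, alignment included, fits in one helper. Names are hypothetical; the assumption is that each entry is padded to a DWORD (4-byte) boundary:

```c
#include <stdint.h>

/* Given the file offset of a resource entry and its two size fields,
   return the offset of the next entry, rounded up to a 4-byte boundary. */
static uint32_t next_resource(uint32_t offset, uint32_t headerSize,
                              uint32_t dataSize)
{
    uint32_t next = offset + headerSize + dataSize;
    return (next + 3u) & ~3u;        /* round up to a multiple of 4 */
}
```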

After the “Header size” field, we find a data structure we will meet very often here: a ‘Name or ordinal’. Simply put, you examine the first WORD. If it is -1 (0xFFFF), the next WORD indicates the ordinal that is used instead of a name. If the first WORD is NOT -1, this means that a wide character string starts at the given position, terminated with a zero WORD.

The fixed part of the header is then followed by a ‘Name or ordinal’ indicating the type of the resource that follows. The predefined resource types are described in Appendix 1. If the type field is a string, it is a user-defined type; lcc, for the time being, doesn’t use those.

This is followed by the name of the resource that is in most cases just an ID to save space.

The predefined resource IDs that weditres understands are the following:

Resource type           Numerical identifier

Cursor                  1
Bitmap                  2
Icon                    3
Menu                    4
Dialog                  5
String table            6
Font directory          7
Font                    8
Accelerators            9
RCDATA                  10
Message table           11
Group cursor            12
Group Icon              14
Version resource        16
Include file name       17


The other fields of the header are a language ID indicating the language used, some flags specifying how the resource will be loaded (maintained mostly for compatibility with older Windows 3.1 applications), some version information, and a ‘Characteristics’ field.

So, we have for our header the following format:

struct tagResource {
   DWORD DataSize;         // Size of data without header
   DWORD HeaderSize;       // Length of the additional header
   Ordinal or name TYPE    // Type identifier, id or string
   Ordinal or name NAME    // Name identifier, id or string
   DWORD DataVersion;      // Predefined resource data version
   WORD  MemoryFlags;      // State of the resource
   WORD  LanguageId;       // Encoded language ID
   DWORD Version;          // Version of the resource data
   DWORD Characteristics;  // Characteristics of the data
};

Since this is not a regular C structure, we have to find out which type of data (ordinal or name) is present each time we access it. In the code of the resource editor we do this in functions called “SkipResHeader”, etc., which parse the fields according to the type of data stored in them. All those functions, together with the code to read and decode, write and encode resources, are in the file “util.c” in the resource editor source distribution.
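The ‘Name or ordinal’ logic those functions apply can be sketched like this, with uint16_t standing in for WORD and a hypothetical function name:

```c
#include <stdint.h>
#include <stddef.h>

/* Measure a 'Name or ordinal' field: a leading 0xFFFF WORD means an
   ordinal follows in the next WORD; anything else starts a
   zero-terminated wide-character string. Returns the bytes consumed. */
static size_t name_or_ordinal_size(const uint16_t *p)
{
    if (p[0] == 0xFFFF)
        return 2 * sizeof(uint16_t);       /* marker WORD + ordinal WORD */
    size_t n = 0;
    while (p[n] != 0)                      /* scan to the zero WORD */
        n++;
    return (n + 1) * sizeof(uint16_t);     /* include the terminator */
}
```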

Building a dialog box dynamically

Having read the resource file into memory, we have all the information needed to build a resource like a dialog box dynamically. The procedure for doing this is essentially very simple: read the header, which contains the number of controls of the dialog box, and loop over each child window, calling CreateWindowEx() for each control. You get the position of each window to be created from the resource object, together with the class, the font to be used, etc.

Of course this is more easily said than done...

I use the same structure that Microsoft proposed for doing this in the Windows SDK. I use a ‘grabber’ window, which is created in front of each control and exists only to support dragging and dropping the controls, allowing the user to manipulate them visually.

The core of this process is building, in memory, the data structure that will be accepted by the DialogBoxIndirect function. This structure consists of two parts: a header for the dialog box, and a list of controls that follows it in memory. Starting from the resource of the dialog box, I write that structure in the function ResToDialog, in dlgio.c, one of the central parts of weditres.

First then, I get the dialog box header, in the following structure:

typedef struct {
   DWORD lStyle;          // Style for the dialog.
   DWORD lExtendedStyle;  // The extended style.
   WORD  NumberOfItems;   // Number of controls.
   WORD  x;               // Starting x location.
   WORD  y;               // Starting y location.
   WORD  cx;              // Dialog width.
   WORD  cy;              // Dialog height.
   ...
};


As you can see from the above definition, the number of items that follow is given. After this we have, NumberOfItems times, the following structure, used for each control:

typedef struct tagControlItem {
   DWORD Style;           // Style for the control.
   DWORD lExtendedStyle;  // The extended style.
   WORD  x;               // Starting x location.
   WORD  y;               // Starting y location.
   WORD  cx;              // Control width.
   WORD  cy;              // Control height.
   WORD  ID;              // Control id.
   char  Class;
   ...
} ControlItem;


Reading the binary res file, I build this structure in memory, and there it goes, I pass it to DialogBoxIndirect.

I encountered a difficult problem with the dialog boxes that used the extended styles. Microsoft changed the specifications for the dialog box header and the control items. It took me hours under the debugger to try to figure out what needed to be done and what changed. Several months later, I found a small notice in the SDK, that described this structure:

typedef struct tagDLGTEMPLATEEX {
   WORD  wDlgVer;          // use 1
   WORD  wSignature;       // use 0xFFFF
   DWORD dwHelpID;         // Dialog's context help ID
   DWORD dwExStyle;        // Extended style
   DWORD dwStyle;          // Style
   WORD  cDlgItems;        // Number of controls in dialog
   short x;                // Initial position, horizontal
   short y;                // Initial position, vertical
   short cx;               // Width
   short cy;               // Height
   ...
} DLGTEMPLATEEX;


The items had changed too. Here is the definition of the new structure:

typedef struct tagDLGITEMTEMPLATEEX {
   DWORD dwHelpID;         // Context help ID for control
   DWORD dwExStyle;        // Control extended styles
   DWORD dwStyle;          // Style
   short x;                // Initial position, horizontal
   short y;                // Initial position, vertical
   short cx;               // Width
   short cy;               // Height
   DWORD dwID;             // Window ID
   ...
} DLGITEMTEMPLATEEX;


The heart of loading the structures from disk (in the .res file) into memory is the AllocDialogResource procedure, in dlgio.c. This function has grown into a very big and complicated piece of code, reflecting all the complexities inherent in reading this format.

It starts by setting a Boolean flag to indicate if we are building a control with an extended style (the “new” controls in ComCtrl32.dll, tree view, etc) or a traditional control like edit fields or check boxes. Then, it determines if we are building a new dialog from scratch or just using the selection, in the case of a copy operation for instance.

When building a dialog for testing, i.e. when not saving/reading to disk, we have to take care to avoid including the real class of the dialog or control, since those aren’t available in the editor environment.

We then calculate the size of the header, composed of the caption text, the class name (an ordinal), the point size of the dialog font, and all the fields detailed above. The function allocates a buffer (that is later resized as the controls are added), and writes the fixed part of the dialog box resource.

Then we loop over each control in the dialog box, allocating and filling a control description. The buffer is resized at each pass through the loop. We have to take care of the alignment specifications, and account for the fact that the whole editor works in ANSI, not UNICODE, while the resources are all in UNICODE, i.e. two bytes for each character. It might have been easier if the editor had always worked in UNICODE, but Windows 95/98 users would be in bad shape, since UNICODE is not supported on those systems.
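The ANSI-to-UNICODE widening mentioned above can be sketched for the ASCII subset as follows; a real conversion would have to go through the code page as well, and the function name is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Widen an ASCII string to two-byte characters, as the resource format
   requires, writing at most cap-1 characters plus a zero WORD.
   Returns the number of characters converted. */
static size_t ansi_to_wide(const char *src, uint16_t *dst, size_t cap)
{
    size_t i = 0;
    while (src[i] != '\0' && i + 1 < cap) {
        dst[i] = (uint16_t)(unsigned char)src[i];  /* zero-extend each byte */
        i++;
    }
    dst[i] = 0;                                    /* zero WORD terminator */
    return i;
}
```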

The result is a RES structure, containing the size of the data, the size of the header, and followed immediately by the data itself.

Testing a dialog box interactively

To do this, the editor constructs a dialog box specification on the fly and calls DialogBoxIndirect. The procedure of that dialog handles the initialization of the various controls that need some content to look right: list boxes, for instance, are filled with some lines. Some of the new controls, like the tree view control, need to be initialized with some items too.

This is done in the function CreateTestDialog, in styles.c. The detailed process of creating a dialog runs like this:

  •  We start by erasing any current selection.
  •  We copy the current state of the dialog being edited into the list of dialog resources that the editor maintains.
  •  We allocate memory to hold a temporary copy of the dialog and copy the current dialog resource into it.
  •  We call  the Windows  API CreateDialogIndirect, passing to it the resource copy.
  •  If creation of the dialog succeeded, we gray several menu items in the main menu, and the new dialog gains focus.
  •  The dialog receives the WM_INITDLG message. The procedure for handling the messages sent to the dialog is called TestDlgProc (also in styles.c). Here we loop through the controls of the dialog, looking at their types, and perform some initialization for several classes of controls by calling the TestInitDlg function. List boxes are filled with some lines, edit fields are assigned an “Edit” character string, tree views are filled with some levels, etc. The idea is to simulate, as far as it goes, a working dialog box. The “progress” controls need a timer that regularly sends them “update” messages, so that they show a continuous progress; when they arrive at the end they are restarted, which provides a continuous display. Owner-draw controls are provided a default procedure so that they show as rectangles. Drag lists are watched too, so that they receive the messages they need.
  •  When the user presses any button that has either the ID_OK or the ID_CANCEL identifier, the dialog box is destroyed. Note that in dialog boxes where those buttons are absent the only way to close them is to press the “test” switch in the toolbar to stop the testing.

Writing the .dlg file

This file contains the description, in the ‘resources’ language, of all the dialog boxes that you define with the editor. It is built by parsing the binary structures used by the editor and emitting the corresponding keywords. This is a tricky business, especially when you consider the difficulty of interpreting correctly the different bits of the class styles. Very often there is no simple correspondence between a bit and a style; the meaning of a bit changes if another one is on, etc. This has been a constant problem, especially now in Win32, when you have more controls to worry about!

The code for this part is in dlgio.c. The principal entry point is the function Save, called directly from the main window when the menu item is selected. It receives several flags indicating which type of save (normal or save as) is requested. It prompts for a name if “save as” is requested, and then calls WriteTheFile, giving it the type of file to write (res, rc, or dlg).

WriteTheFile is straightforward, limiting its work to resolving the extension (dlg or res) and calling a specific function: WriteRes, WriteRc, or WriteDlg.

Writing the .res file

This consists simply of writing all the binary structures to disk. A special marker is emitted as the first resource to mark it as a Win32 resource file. There is a perfect correspondence between the structures stored in RAM and the format of the file on disk, so only a call to lwrite with the data field of each dialog resource list item is necessary. Note that the resource include header file is written here too. This is a special kind of resource that indicates the path of the include file used to associate the numeric codes for each control (or dialog) with their symbolic counterparts.

Writing the .rc file.

This was a new development. The problem was that weditres edited only dialogs, not accelerators, menus, icons, etc. That is why, at first, I generated a .dlg file that was included in the main .rc file, where the ASCII descriptions for all other resource types could be written.

When I found time for it, and people pressed me to do it, I added the string table editor, and the accelerator table editor to the software.

Basically those table editors are not very difficult to do/use. They edit the string table, whose format was described in the chapter about the resource compiler, and the accelerator table.

The only interesting problem was how to enter the accelerator key... It's not easy to open a window just for keyboard input that will correctly register ANY key combination. One of the biggest problems is how to detect the ALT key, a problem that I haven't solved completely. The proper way to do it is probably to follow the WM_SYSCOMMAND message, but this detects only the combinations Alt+F6, Alt+F4, and Alt+any ASCII key. The other Alt+Fxx combinations aren't detected, and they do not seem to generate any messages.

Another problem is the writing of all the resources in ASCII form. This was solved already for the dialog boxes, but had to be done for the accelerator tables, for the string table, and for all the icons/cursors/whatever.

The big difficulty is to keep the association between the external files needed/used by the image resources, and the binary resource data. I modified the resource compiler to emit that information if given the (undocumented) /a option. That option makes the resource compiler write to the given file name a list of all external files used by the resource. The syntax is:

lrc /afilename.dat

At the end of compilation, "filename.dat" will contain a table of external files and the resource identifiers associated with them.

To complicate this already cluttered problem area even more, we have resource directories for the image resources, which add yet another set of problems. To make it possible to provide icons for different screen resolutions, an icon resource can be associated with several images, from which Windows chooses at run time according to the resolution of the user's screen.

I had to figure out then, which resource to edit, within an icon file. I built correspondence tables, and solved (halfway) the issue.

What happens however, when weditres has loaded a binary resource file? The information about the external file is not available at all. I had two choices:

  1.  Modify the resource compiler so that it would understand a hexadecimal format, where I could enclose in the .rc file a binary description of the resource. Borland adopted this solution in their windows 3.11 resource compiler.
  2.  Edit/write the icons/cursor/bitmaps into temporary files, and generate a normal ICON resource statement for the resource compiler.

There are advantages and problems with each of the two solutions. If I used solution 1, I would lose compatibility with Microsoft's resource compiler.

Wedit: the Integrated development environment


This part of lcc-win32 is the oldest. I started working on an editor for programmers around 1989-1990. I used some code from the then popular ‘micro-emacs’ editor to handle the display logic, mixed in a lot of code of my own, and had a functional editor with the same functionality as Windows’ notepad, but without the limitation to files smaller than 20K...

Time passed, and I continued to work on the editor, adding the function list, the ability to jump to a symbol’s definition with a function key, and a ‘make’ utility.

I developed special-purpose parsers to be able to reparse a whole file on a 486-33 without much waiting. Programmers tend to be impatient, and a long wait always puts people off. These special-purpose scanners just ignore most of the text and concentrate on some sequence of characters of interest. This makes them very fast. No I/O is needed, since the text is already loaded in memory by the editor.

One of the earliest was the parser to scan for function definitions. That one only searches for the sequence of a closing parenthesis and an opening brace, ignoring white space and comments of course. This method is extremely effective, and easy to implement.
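That scanner can be sketched as follows. This is a simplified version that skips comments as described but, unlike a real one, ignores string literals; the function name is hypothetical:

```c
#include <ctype.h>

/* Find the opening brace of a function definition: a ')' followed by '{'
   with only whitespace and comments between them. Returns the index of
   the '{', or -1 if no such pattern exists. */
static long find_fn_brace(const char *s)
{
    long i = 0;
    int after_paren = 0;
    while (s[i]) {
        if (s[i] == '/' && s[i+1] == '*') {          /* skip block comment */
            i += 2;
            while (s[i] && !(s[i] == '*' && s[i+1] == '/')) i++;
            if (s[i]) i += 2;
            continue;
        }
        if (s[i] == '/' && s[i+1] == '/') {          /* skip line comment */
            while (s[i] && s[i] != '\n') i++;
            continue;
        }
        if (s[i] == ')') { after_paren = 1; i++; continue; }
        if (after_paren) {
            if (s[i] == '{') return i;               /* found the pattern */
            if (!isspace((unsigned char)s[i]))
                after_paren = 0;                     /* something else: reset */
        }
        i++;
    }
    return -1;
}
```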

Other things that I added were a versioning system, software metrics, and then the real-time syntax coloring of keywords and comments.

I built several utilities into the editor: ‘grep’, ‘diff’, a ‘camera’ to take screen snapshots, a utility to generate .hlp files automatically from the C sources, another utility to extract strings into .def files, the project management module, the keyboard configuration... the years passed by.

I left my flat in the center of Paris for the suburbs, where I found a home with a small garden and enough space for the two children that Annie gave me.

And I went on building that system.

Programmers are bound to build ephemeral works. Nothing is left from all the programs we write. In a few years, all of our work is thrown away with the obsolete system we wrote it for, a system that can never be like the new one... because of its ridiculous hardware limitations.

I sometimes wondered where those APL programs I wrote during the early seventies are now... Not even the docs of that Siemens 4004 are left, with its ‘enormous’ virtual workspace of 32K. And APL itself has faded into the emptiness awaiting obsolete languages. ‘STSC’, the biggest ‘APL’ company, has disappeared from view, and many people today have never even heard that name.

There were good concepts in that language though, concepts that I have been re-introducing into lcc-win32 with the ‘intrinsics’ interface. Operations with tables were the basics of APL. You could write in that language:

C = A+B

A and B being vectors of the same size, C would become the element-by-element sum of A and B. The notation was simple, and very powerful.
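In C, the same operation has to be spelled out as a loop, which is the kind of code such an interface would expand to:

```c
#include <stddef.h>

/* Element-by-element vector sum: the C rendering of APL's C = A + B,
   for two vectors of the same length n. */
static void vec_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```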

This ephemeral nature of my activity was frustrating, though. I concentrated on Wedit because I was trying to build something that would stand the judgment of the years, something that was worth working for. Most people are happy if they can work and are paid for what they do. No matter what they do.

I don’t.

How to start an editor.

Wedit started with ‘multipad’ the MDI demonstration program that came with the Windows 3.0 SDK. It is an MDI application, i.e. it supports multiple windows with different documents open at the same time.

As you know, it starts in that function... WinMain. That function does all the initialization, and then enters the infinite loop processing the messages that windows sends.

Messages to the frame window are processed in FrameWndProc; messages to the document windows go to the MainWndProc procedure. The commands from the menu are handled in HandleCommand. That function uses a dispatch table mapping each command to the function that treats it, mainly in the file command.c.

Commands lead very often to a dialog box. In command.c you will find the association between a command and a dialog box in the form of:

void HandleNewProject(void)
{
    ...
    r = CallDlgProc(NewPrjDlg, IDD_NEWPROJECT, &tmpPrj);
    ...
}


This means that the dialog box with the numeric identifier IDD_NEWPROJECT and the dialog procedure NewPrjDlg, will ask user input to treat this command.

When you want to find out the sequence that associates a dialog box procedure to a command you just have to follow the association from the command dispatcher to the function that processes that command in command.c, to the dialog box procedure mostly in dialog.c.

Of course, there are a lot of commands that do not involve any dialog box: Cut/Paste, and many others.

Other messages that windows sends are important: WM_PAINT is one of those.

Coping with legacy code

Software is developed on a specific machine, and it runs under a specific version of a concrete OS. This imposes certain limits and conventions on the software. It is really not the same if you are limited by the power of a 386DX-16 with 4MB of real RAM and a 16-bit OS, or if you run on a Pentium 200MHz with 64MB of RAM under a 32-bit OS. No, it is really not the same at all.

Windows imposed in its first versions a lot of limits to the programs. Data couldn’t be handled in chunks of more than 64K at a time, the machine power at your disposal was limited, and many idiosyncrasies of Windows entered into your design. One of the most dangerous, as it turned out, was the reliance on global variables.

Why global variables?

Globals are a very fast and efficient way of passing arguments from one procedure to the next. Since everything is implicit, the machine saves the pushing of the arguments and the restoring of the stack frame afterwards. Besides this, the event-driven nature of Windows, and above all the interface of the dialog boxes, made it much easier to communicate with dialog boxes using some global variable than to pass the parameters explicitly.

Of course, you could pass your parameters to a dialog box procedure, but the way of doing it was (and still is) very cumbersome. You had to call another function, not the usual DialogBox, and pass it a parameter, which Windows would then hand to the dialog box in the WM_INITDIALOG message. Returning results was even more difficult. You were limited to a 16-bit signed number; that was all. You couldn't return any pointers, so you were forced to use globals to return them.

It would have been possible to return values in a structure, whose address would be passed to the dialog box function at entry, stored in a static area, and then used to store all results of the dialog box. But then you would have had to reserve a static area of at least 4 bytes for each dialog box. This looks quite acceptable today, with 64MB of RAM, but in the environment of Windows 3.0 and 3.1 you had to solve a difficult problem: you were limited to 64K of space for holding all the initialized data of the program, all static data, and all the stack...

Fitting all the tables of Wedit, all the character strings used anywhere, and all the stack into 64K was an incredibly difficult undertaking. I moved all the strings into a string table in the resources, and tried to allocate data from the heap instead of using static or global data.

In this context, using 4 bytes for each of the several dozen dialog boxes I had already built seemed out of the question.

So I used globals, especially one called ‘OpenName’ that contained a file name that many dialog boxes returned from user input.

And once this started, I got used to it, even later, when I moved to 32-bit Windows. In that system, many of the horrible constraints that had shaped my software disappeared, but then I was confronted with the problem of changing all those globals into something more readable, work I kept putting off until tomorrow. We all develop software under time constraints, and that code was unreadable, yes, but it worked. Changing it, I knew, would inevitably bring a new set of bugs that I could happily live without...

So my code became legacy software: running, but unmaintainable.

I have only recently started to invest in eliminating all globals from Wedit, and in forcing upon myself a discipline of globals usage. The problem with them, as I painfully discovered, was that I couldn't understand what the code was doing in many places, and that it became impossible to add new features easily. There was always some trashed global that would either provoke a trap or just produce bad results. I was forced to clean it up.

Maintenance work is inevitable, and it is better done in small chunks at each change than left for later, when its bad effects accumulate, provoking other bad consequences that have to be eliminated too. For instance, you can work around a global by saving its value before using it and restoring it afterwards, but this is of course a step in the wrong direction. Later on you have to eliminate the global anyway, and then you have to eliminate the code you just added...

How does Wedit work

Wedit consists of different components: the editor, the debugger, the makefile generation utility, and other utilities. Here is an overview of each one.

Window structure of Wedit:

Main application window (frame window)
	Manages the menu, and the application start/exit sequence.

	MDI client
		Manages the document windows: Document 1, Document 2, ... Document n.

	Open files toolbar
		Manages the file list display: N buttons, each with a file name.

	Output window
		Manages editor output: a subclassed listbox, and the owner-draw buttons at the lower left.

The main application window takes care of the following tasks:

  1.  Creating the MDI client window
  2.  Dispatching all menu commands
  3.  Finalizing the editor at application exit

A child window of the MDI client handles each open source file. The window procedure for those windows is quite complex, since it handles all tasks touching the input and display of program text.

The open files toolbar is an independent window that just displays the open file names. It has some code to resize itself to the right size given the names it contains. Each name is drawn in an owner-draw button. Those buttons send commands to the main frame window to change the display from one source file to another.

The output window at the bottom of the editor displays all utility output: make (build) results, and debugger displays. It contains a list box, and some owner-draw buttons at the lower left.


The editor is a normal MDI Windows application. It uses no controls for drawing or selecting text; everything is done by logic within the editor itself. The procedure for managing each document window is the same, of course, as is the case for most MDI applications.

Scrolling is done either with the ScrollDC primitive of Windows, or just by redrawing the whole screen with a different line origin. Normally, ScrollDC is used when the whole screen has to be shifted by only one line; redrawing is preferred in other cases. The smooth scrolling effect is achieved by scrolling the whole screen just 1 pixel at a time (in the case of very smooth scrolling), or at most 3-4 pixels. This makes the text appear to scroll slowly and smoothly, instead of the "jump" feeling of the normal one-line-at-a-time scroll. On fast machines (more or less any machine today), the slowdown is barely perceptible.

The selection is managed by keeping a list of regions and inverting them. There are still problems with this in some cases, because keeping a list of regions to update becomes fairly complex if you want to follow arbitrary mouse movements. In retrospect, a simpler algorithm of just keeping a selected region of rectangular shape would have been much easier to implement.

Proportional fonts are supported in theory, but this is not well tested. Normal programs look so horrible with proportional fonts that I have barely tested this feature. Most of the code assumes that the user has chosen a fixed-pitch font, which greatly simplifies the drawing, especially the drawing of the selection.

The 'Undo' feature is one of the most acute problems in the editor. It is implemented as a circular list of commands to undo, but it has many problems, especially because not all commands can be undone; sometimes a new command was added to the user interface without implementing its 'Undo' counterpart.