TUTOR.ZIP  This file contains all of the installments of Jack Crenshaw's tutorial on compiler construction, including the new Installment 15. The intended audience is those folks who are not computer scientists, but who enjoy computing and have always wanted to know how compilers work. A lot of compiler theory has been left out, but the practical issues are covered. By the time you have completed the series, you should be able to design and build your own working compiler. It will not be the world's best, nor will it put out incredibly tight code. Your product will probably never put Borland or Microsoft out of business. But it will work, and it will be yours.

A word about the file format: The files were originally created using Borland's DOS editor, Sprint. Sprint could write to a text file only if you formatted the file to go to the selected printer. I used the most common printer I could think of, the Epson MX-80, but even then the files ended up with printer control sequences at the beginning and end of each page.

To bring the files up to date and get myself positioned to continue the series, I recently (1994) converted all the files to work with Microsoft Word for Windows. Unlike Sprint, Word allows you to write the file as a DOS text file. Unfortunately, this gave me a new problem, because when Word is writing to a text file, it doesn't write hard page breaks or page numbers. In other words, in six years we've gone from a file with page breaks and page numbers, but embedded escape sequences, to files with no embedded escape sequences but no page breaks or page numbers. Isn't progress wonderful?

Of course, it's possible for me to insert the page numbers as straight text, rather than asking the editor to do it for me. But since Word won't allow me to write page breaks to the file, we would end up with files with page numbers that may or may not fall at the ends of the pages, depending on your editor and your printer.
It seems to me that almost every file I've ever downloaded from CompuServe or BBS's that had such page numbering was incompatible with my printer, and gave me pages that were one line short or one line long, with the page numbers consequently walking up the page. So perhaps this new format is, after all, the safest one for general distribution. The files as they exist will look just fine if read into any text editor capable of reading DOS text files. Since most editors these days include rather sophisticated word processing capabilities, you should be able to get your editor to paginate for you, prior to printing.

I hope you like the tutorials. Much thought went into them.

Jack W. Crenshaw
CompuServe 72325,1327


                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           24 July 1988


                     Part I: INTRODUCTION


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

This series of articles is a tutorial on the theory and practice of developing language parsers and compilers. Before we are finished, we will have covered every aspect of compiler construction, designed a new programming language, and built a working compiler.

Though I am not a computer scientist by education (my Ph.D. is in a different field, Physics), I have been interested in compilers for many years. I have bought and tried to digest the contents of virtually every book on the subject ever written. I don't mind telling you that it was slow going. Compiler texts are written for Computer Science majors, and are tough sledding for the rest of us. But over the years a bit of it began to seep in. What really caused it to jell was when I began to branch off on my own and begin to try things on my own computer. Now I plan to share with you what I have learned.
At the end of this series you will by no means be a computer scientist, nor will you know all the esoterics of compiler theory. I intend to completely ignore the more theoretical aspects of the subject. What you _WILL_ know is all the practical aspects that one needs to know to build a working system.

This is a "learn-by-doing" series. In the course of the series I will be performing experiments on a computer. You will be expected to follow along, repeating the experiments that I do, and performing some on your own. I will be using Turbo Pascal 4.0 on a PC clone. I will periodically insert examples written in TP. These will be executable code, which you will be expected to copy into your own computer and run. If you don't have a copy of Turbo, you will be severely limited in how well you will be able to follow what's going on. If you don't have a copy, I urge you to get one. After all, it's an excellent product, good for many other uses!

Some articles on compilers show you examples, or show you (as in the case of Small-C) a finished product, which you can then copy and use without a whole lot of understanding of how it works. I hope to do much more than that. I hope to teach you HOW the things get done, so that you can go off on your own and not only reproduce what I have done, but improve on it.

This is admittedly an ambitious undertaking, and it won't be done in one page. I expect to do it in the course of a number of articles. Each article will cover a single aspect of compiler theory, and will pretty much stand alone. If all you're interested in at a given time is one aspect, then you need to look only at that one article. Each article will be uploaded as it is complete, so you will have to wait for the last one before you can consider yourself finished. Please be patient.

The average text on compiler theory covers a lot of ground that we won't be covering here. The typical sequence is:

 o  An introductory chapter describing what a compiler is.
 o  A chapter or two on syntax equations, using Backus-Naur Form (BNF).

 o  A chapter or two on lexical scanning, with emphasis on deterministic and non-deterministic finite automata.

 o  Several chapters on parsing theory, beginning with top-down recursive descent, and ending with LALR parsers.

 o  A chapter on intermediate languages, with emphasis on P-code and similar reverse polish representations.

 o  Many chapters on alternative ways to handle subroutines and parameter passing, type declarations, and such.

 o  A chapter toward the end on code generation, usually for some imaginary CPU with a simple instruction set. Most readers (and in fact, most college classes) never make it this far.

 o  A final chapter or two on optimization. This chapter often goes unread, too.

I'll be taking a much different approach in this series. To begin with, I won't dwell long on options. I'll be giving you _A_ way that works. If you want to explore options, well and good ... I encourage you to do so ... but I'll be sticking to what I know. I also will skip over most of the theory that puts people to sleep. Don't get me wrong: I don't belittle the theory, and it's vitally important when it comes to dealing with the more tricky parts of a given language. But I believe in putting first things first. Here we'll be dealing with the 95% of compiler techniques that don't need a lot of theory to handle.

I also will discuss only one approach to parsing: top-down, recursive descent parsing, which is the _ONLY_ technique that's at all amenable to hand-crafting a compiler. The other approaches are only useful if you have a tool like YACC, and also don't care how much memory space the final product uses.

I also take a page from the work of Ron Cain, the author of the original Small C.
Whereas almost all other compiler authors have historically used an intermediate language like P-code and divided the compiler into two parts (a front end that produces P-code, and a back end that processes P-code to produce executable object code), Ron showed us that it is a straightforward matter to make a compiler directly produce executable object code, in the form of assembler language statements. The code will _NOT_ be the world's tightest code ... producing optimized code is a much more difficult job. But it will work, and work reasonably well. Just so that I don't leave you with the impression that our end product will be worthless, I _DO_ intend to show you how to "soup up" the compiler with some optimization.

Finally, I'll be using some tricks that I've found to be most helpful in letting me understand what's going on without wading through a lot of boiler plate. Chief among these is the use of single-character tokens, with no embedded spaces, for the early design work. I figure that if I can get a parser to recognize and deal with I-T-L, I can get it to do the same with IF-THEN-ELSE. And I can. In the second "lesson," I'll show you just how easy it is to extend a simple parser to handle tokens of arbitrary length.

As another trick, I completely ignore file I/O, figuring that if I can read source from the keyboard and output object to the screen, I can also do it from/to disk files. Experience has proven that once a translator is working correctly, it's a straightforward matter to redirect the I/O to files.

The last trick is that I make no attempt to do error correction/recovery. The programs we'll be building will RECOGNIZE errors, and will not CRASH, but they will simply stop on the first error ... just like good ol' Turbo does. There will be other tricks that you'll see as you go. Most of them can't be found in any compiler textbook, but they work.

A word about style and efficiency.
As you will see, I tend to write programs in _VERY_ small, easily understood pieces. None of the procedures we'll be working with will be more than about 15-20 lines long. I'm a fervent devotee of the KISS (Keep It Simple, Sidney) school of software development. I try to never do something tricky or complex, when something simple will do. Inefficient? Perhaps, but you'll like the results. As Brian Kernighan has said, FIRST make it run, THEN make it run fast. If, later on, you want to go back and tighten up the code in one of our products, you'll be able to do so, since the code will be quite understandable. If you do so, however, I urge you to wait until the program is doing everything you want it to.

I also have a tendency to delay building a module until I discover that I need it. Trying to anticipate every possible future contingency can drive you crazy, and you'll generally guess wrong anyway. In this modern day of screen editors and fast compilers, I don't hesitate to change a module when I feel I need a more powerful one. Until then, I'll write only what I need.

One final caveat: One of the principles we'll be sticking to here is that we don't fool around with P-code or imaginary CPUs, but that we will start out on day one producing working, executable object code, at least in the form of assembler language source. However, you may not like my choice of assembler language ... it's 68000 code, which is what works on my system (under SK*DOS). I think you'll find, though, that the translation to any other CPU such as the 80x86 will be quite obvious, so I don't see a problem here. In fact, I hope someone out there who knows the '86 language better than I do will offer us the equivalent object code fragments as we need them.

THE CRADLE

Every program needs some boiler plate ... I/O routines, error message routines, etc. The programs we develop here will be no exceptions.
I've tried to hold this stuff to an absolute minimum, however, so that we can concentrate on the important stuff without losing it among the trees. The code given below represents about the minimum that we need to get anything done. It consists of some I/O routines, an error-handling routine and a skeleton, null main program. I call it our cradle. As we develop other routines, we'll add them to the cradle, and add the calls to them as we need to. Make a copy of the cradle and save it, because we'll be using it more than once.

There are many different ways to organize the scanning activities of a parser. In Unix systems, authors tend to use getc and ungetc. I've had very good luck with the approach shown here, which is to use a single, global, lookahead character. Part of the initialization procedure (the only part, so far!) serves to "prime the pump" by reading the first character from the input stream. No other special techniques are required with Turbo 4.0 ... each successive call to GetChar will read the next character in the stream.
{--------------------------------------------------------------}
program Cradle;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
end.
{--------------------------------------------------------------}

That's it for this introduction. Copy the code above into TP and compile it. Make sure that it compiles and runs correctly. Then proceed to the first lesson, which is on expression parsing.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           24 July 1988


                  Part II: EXPRESSION PARSING


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


GETTING STARTED

If you've read the introduction document to this series, you will already know what we're about. You will also have copied the cradle software into your Turbo Pascal system, and have compiled it. So you should be ready to go.

The purpose of this article is for us to learn how to parse and translate mathematical expressions. What we would like to see as output is a series of assembler-language statements that perform the desired actions. For purposes of definition, an expression is the right-hand side of an equation, as in

     x = 2*y + 3/(4*z)

In the early going, I'll be taking things in _VERY_ small steps. That's so that the beginners among you won't get totally lost. There are also some very good lessons to be learned early on, that will serve us well later.
For the more experienced readers: bear with me. We'll get rolling soon enough.

SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?), let's start with the absolutely most simple case we can think of. That, to me, is an expression consisting of a single digit. Before starting to code, make sure you have a baseline copy of the "cradle" that I gave last time. We'll be using it again for other experiments. Then add this code:

{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}

And add the line "Expression;" to the main program so that it reads:

{---------------------------------------------------------------}
begin
   Init;
   Expression;
end.
{---------------------------------------------------------------}

Now run the program. Try any single-digit number as input. You should get a single line of assembler-language output. Now try any other character as input, and you'll see that the parser properly reports an error.

CONGRATULATIONS! You have just written a working translator! OK, I grant you that it's pretty limited. But don't brush it off too lightly. This little "compiler" does, on a very limited scale, exactly what any larger compiler does: it correctly recognizes legal statements in the input "language" that we have defined for it, and it produces correct, executable assembler code, suitable for assembling into object format. Just as importantly, it correctly recognizes statements that are NOT legal, and gives a meaningful error message. Who could ask for more? As we expand our parser, we'd better make sure those two characteristics always hold true.

There are some other features of this tiny program worth mentioning. First, you can see that we don't separate code generation from parsing ...
as soon as the parser knows what we want done, it generates the object code directly. In a real compiler, of course, the reads in GetChar would be from a disk file, and the writes to another disk file, but this way is much easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere. I've chosen the 68000 register D0. I could have made some other choices, but this one makes sense.

BINARY EXPRESSIONS

Now that we have that under our belt, let's branch out a bit. Admittedly, an "expression" consisting of only one character is not going to meet our needs for long, so let's see what we can do to extend it. Suppose we want to handle expressions of the form:

     1+2
or   4-3

or, in general, <term> +/- <term>   (That's a bit of Backus-Naur Form, or BNF.)

To do this we need a procedure that recognizes a term and leaves its result somewhere, and another that recognizes and distinguishes between a '+' and a '-' and generates the appropriate code. But if Expression is going to leave its result in D0, where should Term leave its result? Answer: the same place. We're going to have to save the first result of Term somewhere before we get the next one.

OK, basically what we want to do is have procedure Term do what Expression was doing before.
So just RENAME procedure Expression as Term, and enter the following new version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   EmitLn('MOVE D0,D1');
   case Look of
    '+': Add;
    '-': Subtract;
   else Expected('Addop');
   end;
end;
{--------------------------------------------------------------}

Next, just above Expression enter these two procedures:

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD D1,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
end;
{-------------------------------------------------------------}

When you're finished with that, the order of the routines should be:

 o  Term (The OLD Expression)
 o  Add
 o  Subtract
 o  Expression

Now run the program. Try any combination of two single digits you can think of, separated by a '+' or a '-'. You should get a series of four assembler-language instructions out of each run. Now try some expressions with deliberate errors in them. Does the parser catch the errors?

Take a look at the object code generated. There are two observations we can make. First, the code generated is NOT what we would write ourselves. The sequence

     MOVE #n,D0
     MOVE D0,D1

is inefficient. If we were writing this code by hand, we would probably just load the data directly to D1. There is a message here: code generated by our parser is less efficient than the code we would write by hand. Get used to it. That's going to be true throughout this series. It's true of all compilers to some extent. Computer scientists have devoted whole lifetimes to the issue of code optimization, and there are indeed things that can be done to improve the quality of code output.
Some compilers do quite well, but there is a heavy price to pay in complexity, and it's a losing battle anyway ... there will probably never come a time when a good assembler-language programmer can't out-program a compiler. Before this session is over, I'll briefly mention some ways that we can do a little optimization, just to show you that we can indeed improve things without too much trouble. But remember, we're here to learn, not to see how tight we can make the object code. For now, and really throughout this series of articles, we'll studiously ignore optimization and concentrate on getting out code that works.

Speaking of which: ours DOESN'T! The code is _WRONG_! As things are working now, the subtraction process subtracts D1 (which has the FIRST argument in it) from D0 (which has the second). That's the wrong way, so we end up with the wrong sign for the result. So let's fix up procedure Subtract with a sign-changer, so that it reads

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
   EmitLn('NEG D0');
end;
{-------------------------------------------------------------}

Now our code is even less efficient, but at least it gives the right answer! Unfortunately, the rules that give the meaning of math expressions require that the terms in an expression come out in an inconvenient order for us. Again, this is just one of those facts of life you learn to live with. This one will come back to haunt us when we get to division.

OK, at this point we have a parser that can recognize the sum or difference of two digits. Earlier, we could only recognize a single digit. But real expressions can have either form (or an infinity of others). For kicks, go back and run the program with the single input line '1'. Didn't work, did it? And why should it?
We just finished telling our parser that the only kinds of expressions that are legal are those with two terms. We must rewrite procedure Expression to be a lot more broadminded, and this is where things start to take the shape of a real parser.

GENERAL EXPRESSIONS

In the REAL world, an expression can consist of one or more terms, separated by "addops" ('+' or '-'). In BNF, this is written

     <expression> ::= <term> [<addop> <term>]*

We can accommodate this definition of an expression with the addition of a simple loop to procedure Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}

NOW we're getting somewhere! This version handles any number of terms, and it only cost us two extra lines of code. As we go on, you'll discover that this is characteristic of top-down parsers ... it only takes a few lines of code to accommodate extensions to the language. That's what makes our incremental approach possible. Notice, too, how well the code of procedure Expression matches the BNF definition. That, too, is characteristic of the method. As you get proficient in the approach, you'll find that you can turn BNF into parser code just about as fast as you can type!

OK, compile the new version of our parser, and give it a try. As usual, verify that the "compiler" can handle any legal expression, and will give a meaningful error message for an illegal one. Neat, eh? You might note that in our test version, any error message comes out sort of buried in whatever code had already been generated. But remember, that's just because we are using the CRT as our "output file" for this series of experiments. In a production version, the two outputs would be separated ... one to the output file, and one to the screen.
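To see exactly what the looping version of Expression produces, it helps to trace one run by hand. For the input 1+2-3 (and with the NEG fix to Subtract still in place), the output is:

     MOVE #1,D0
     MOVE D0,D1
     MOVE #2,D0
     ADD D1,D0
     MOVE D0,D1
     MOVE #3,D0
     SUB D1,D0
     NEG D0

This is a hand trace, not something the program prints, but it's worth working through once: follow the registers and you'll see that D0 ends up holding zero, which is indeed 1+2-3.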
USING THE STACK

At this point I'm going to violate my rule that we don't introduce any complexity until it's absolutely necessary, long enough to point out a problem with the code we're generating. As things stand now, the parser uses D0 for the "primary" register, and D1 as a place to store the partial sum. That works fine for now, because as long as we deal with only the "addops" '+' and '-', any new term can be added in as soon as it is found. But in general that isn't true. Consider, for example, the expression

     1+(2-(3+(4-5)))

If we put the '1' in D1, where do we put the '2'? Since a general expression can have any degree of complexity, we're going to run out of registers fast!

Fortunately, there's a simple solution. Like every modern microprocessor, the 68000 has a stack, which is the perfect place to save a variable number of items. So instead of moving the term in D0 to D1, let's just push it onto the stack. For the benefit of those unfamiliar with 68000 assembler language, a push is written -(SP) and a pop, (SP)+ .

So let's change the EmitLn in Expression to read:

     EmitLn('MOVE D0,-(SP)');

and the two lines in Add and Subtract to

     EmitLn('ADD (SP)+,D0')   and   EmitLn('SUB (SP)+,D0'),

respectively. Now try the parser again and make sure we haven't broken it. Once again, the generated code is less efficient than before, but it's a necessary step, as you'll see.

MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business. As you all know, there are other math operators than "addops" ... expressions can also have multiply and divide operations. You also know that there is an implied operator PRECEDENCE, or hierarchy, associated with expressions, so that in an expression like

     2 + 3 * 4,

we know that we're supposed to multiply FIRST, then add. (See why we needed the stack?) In the early days of compiler technology, people used some rather complex techniques to ensure that the operator precedence rules were obeyed.
It turns out, though, that none of this is necessary ... the rules can be accommodated quite nicely by our top-down parsing technique. Up till now, the only form that we've considered for a term is that of a single decimal digit. More generally, we can define a term as a PRODUCT of FACTORS; i.e.,

     <term> ::= <factor> [<mulop> <factor>]*

and a factor, in turn, can be either a number or a parenthesized expression:

     <factor> ::= <number> | (<expression>)

(Procedure Term gets the same treatment Expression got: a loop over the mulops '*' and '/', with procedures Multiply and Divide built on the pattern of Add and Subtract.) This is where the recursion comes in. An expression can contain a factor which contains another expression which contains a factor, etc., ad infinitum. Complicated or not, we can take care of this by adding just a few lines of Pascal to procedure Factor:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}

Note again how easily we can extend the parser, and how well the Pascal code matches the BNF syntax. As usual, compile the new version and make sure that it correctly parses legal sentences, and flags illegal ones with an error message.

UNARY MINUS

At this point, we have a parser that can handle just about any expression, right? OK, try this input sentence:

     -1

WOOPS! It doesn't work, does it? Procedure Expression expects everything to start with an integer, so it coughs up the leading minus sign. You'll find that +3 won't work either, nor will something like -(3-2).

There are a couple of ways to fix the problem. The easiest (although not necessarily the best) way is to stick an imaginary leading zero in front of expressions of this type, so that -3 becomes 0-3.
We can easily patch this into our existing version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}

I TOLD you that making changes was easy! This time it cost us only three new lines of Pascal. Note the new reference to function IsAddop. Since the test for an addop appeared twice, I chose to embed it in the new function. The form of IsAddop should be apparent from that for IsAlpha. Here it is:

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}

OK, make these changes to the program and recompile. You should also include IsAddop in your baseline copy of the cradle. We'll be needing it again later. Now try the input -1 again. Wow! The efficiency of the code is pretty poor ... six lines of code just for loading a simple constant ... but at least it's correct. Remember, we're not trying to replace Turbo Pascal here.

At this point we're just about finished with the structure of our expression parser. This version of the program should correctly parse and compile just about any expression you care to throw at it. It's still limited in that we can only handle factors involving single decimal digits. But I hope that by now you're starting to get the message that we can accommodate further extensions with just some minor changes to the parser. You probably won't be surprised to hear that a variable or even a function call is just another kind of a factor.
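To give you a taste, here is a sketch (mine, not the version we'll actually develop next time) of how Factor might admit a single-character variable name as another kind of factor. The PC-relative addressing in the emitted MOVE is an assumption about how we'd fetch a variable's value on the 68000:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor (sketch: variables allowed) }

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then            { a name is a factor too }
      EmitLn('MOVE ' + GetName + '(PC),D0')
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}

The only change is the one new else-if arm; everything else in the parser stays exactly as it is.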
In the next session, I'll show you just how easy it is to extend our parser to take care of these things too, and I'll also show you just how easily we can accommodate multicharacter numbers and variable names. So you see, we're not far at all from a truly useful parser.

A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints as to how we can improve the quality of the generated code. As I said, the production of tight code is not the main purpose of this series of articles. But you need to at least know that we aren't just wasting our time here ... that we can indeed modify the parser further to make it produce better code, without throwing away everything we've done to date. As usual, it turns out that SOME optimization is not that difficult to do ... it simply takes some extra code in the parser. There are two basic approaches we can take:

 o  Try to fix up the code after it's generated

    This is the concept of "peephole" optimization. The general idea is that we know what combinations of instructions the compiler is going to generate, and we also know which ones are pretty bad (such as the code for -1, above). So all we do is to scan the produced code, looking for those combinations, and replacing them by better ones. It's sort of a macro expansion, in reverse, and a fairly straightforward exercise in pattern-matching. The only complication, really, is that there may be a LOT of such combinations to look for. It's called peephole optimization simply because it only looks at a small group of instructions at a time. Peephole optimization can have a dramatic effect on the quality of the code, with little change to the structure of the compiler itself. There is a price to pay, though, in the speed, size, and complexity of the compiler. Looking for all those combinations calls for a lot of IF tests, each one of which is a source of error. And, of course, it takes time.
In the classical implementation of a peephole optimizer, it's done as a second pass to the compiler. The output code is written to disk, and then the optimizer reads and processes the disk file again. As a matter of fact, you can see that the optimizer could even be a separate PROGRAM from the compiler proper. Since the optimizer only looks at the code through a small "window" of instructions (hence the name), a better implementation would be to simply buffer up a few lines of output, and scan the buffer after each EmitLn. o Try to generate better code in the first place This approach calls for us to look for special cases BEFORE we Emit them. As a trivial example, we should be able to identify a constant zero, and Emit a CLR instead of a load, or even do nothing at all, as in an add of zero, for example. Closer to home, if we had chosen to recognize the unary minus in Factor instead of in Expression, we could treat constants like -1 as ordinary constants, rather than generating them from positive ones. None of these things are difficult to deal with ... they only add extra tests in the code, which is why I haven't included them in our program. The way I see it, once we get to the point that we have a working compiler, generating useful code that executes, we can always go back and tweak the thing to tighten up the code produced. That's why there are Release 2.0's in the world. There IS one more type of optimization worth mentioning, that seems to promise pretty tight code without too much hassle. It's my "invention" in the sense that I haven't seen it suggested in print anywhere, though I have no illusions that it's original with me. This is to avoid such a heavy use of the stack, by making better use of the CPU registers. Remember back when we were doing only addition and subtraction, that we used registers D0 and D1, rather than the stack? It worked, because with only those two operations, the "stack" never needs more than two entries. 
Well, the 68000 has eight data registers. Why not use them as a privately managed stack? The key is to recognize that, at any point in its processing, the parser KNOWS how many items are on the stack, so it can indeed manage it properly. We can define a private "stack pointer" that keeps track of which stack level we're at, and addresses the corresponding register. Procedure Factor, for example, would not cause data to be loaded into register D0, but into whatever the current "top-of-stack" register happened to be. What we're doing in effect is to replace the CPU's RAM stack with a locally managed stack made up of registers. For most expressions, the stack level will never exceed eight, so we'll get pretty good code out. Of course, we also have to deal with those odd cases where the stack level DOES exceed eight, but that's no problem either. We simply let the stack spill over into the CPU stack. For levels beyond eight, the code is no worse than what we're generating now, and for levels less than eight, it's considerably better. For the record, I have implemented this concept, just to make sure it works before I mentioned it to you. It does. In practice, it turns out that you can't really use all eight levels ... you need at least one register free to reverse the operand order for division (sure wish the 68000 had an XTHL, like the 8080!). For expressions that include function calls, we would also need a register reserved for them. Still, there is a nice improvement in code size for most expressions. So, you see, getting better code isn't that difficult, but it does add complexity to our translator ... complexity we can do without at this point. For that reason, I STRONGLY suggest that we continue to ignore efficiency issues for the rest of this series, secure in the knowledge that we can indeed improve the code quality without throwing away what we've done. Next lesson, I'll show you how to deal with variable factors and function calls. 
I'll also show you just how easy it is to handle multicharacter tokens and embedded white space. LET'S BUILD A COMPILER! By Jack W. Crenshaw, Ph.D. 4 Aug 1988 Part III: MORE EXPRESSIONS Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. INTRODUCTION In the last installment, we examined the techniques used to parse and translate a general math expression. We ended up with a simple parser that could handle arbitrarily complex expressions, with two restrictions: o No variables were allowed, only numeric factors o The numeric factors were limited to single digits In this installment, we'll get rid of those restrictions. We'll also extend what we've done to include assignment statements and function calls. Remember, though, that the second restriction was mainly self-imposed ... a choice of convenience on our part, to make life easier and to let us concentrate on the fundamental concepts. As you'll see in a bit, it's an easy restriction to get rid of, so don't get too hung up about it. We'll use the trick when it serves us to do so, confident that we can discard it when we're ready to. VARIABLES Most expressions that we see in practice involve variables, such as b * b + 4 * a * c No parser is much good without being able to deal with them. Fortunately, it's also quite easy to do. Remember that in our parser as it currently stands, there are two kinds of factors allowed: integer constants and expressions within parentheses. In BNF notation, <factor> ::= <number> | (<expression>) The '|' stands for "or", meaning of course that either form is a legal form for a factor. 
Remember, too, that we had no trouble knowing which was which ... the lookahead character is a left paren '(' in one case, and a digit in the other. It probably won't come as too much of a surprise that a variable is just another kind of factor. So we extend the BNF above to read: <factor> ::= <number> | (<expression>) | <variable> Again, there is no ambiguity: if the lookahead character is a letter, we have a variable; if a digit, we have a number. Back when we translated the number, we just issued code to load the number, as immediate data, into D0. Now we do the same, only we load a variable. A minor complication in the code generation arises from the fact that most 68000 operating systems, including the SK*DOS that I'm using, require the code to be written in "position-independent" form, which basically means that everything is PC-relative. The format for a load in this language is MOVE X(PC),D0 where X is, of course, the variable name. Armed with that, let's modify the current version of Factor to read: {---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Expression; Forward; procedure Factor; begin if Look = '(' then begin Match('('); Expression; Match(')'); end else if IsAlpha(Look) then EmitLn('MOVE ' + GetName + '(PC),D0') else EmitLn('MOVE #' + GetNum + ',D0'); end; {--------------------------------------------------------------} I've remarked before how easy it is to add extensions to the parser, because of the way it's structured. You can see that this still holds true here. This time it cost us all of two extra lines of code. Notice, too, how the if-else-else structure exactly parallels the BNF syntax equation. OK, compile and test this new version of the parser. That didn't hurt too badly, did it? FUNCTIONS There is only one other common kind of factor supported by most languages: the function call. It's really too early for us to deal with functions well, because we haven't yet addressed the issue of parameter passing. 
What's more, a "real" language would include a mechanism to support more than one type, one of which should be a function type. We haven't gotten there yet, either. But I'd still like to deal with functions now for a couple of reasons. First, it lets us finally wrap up the parser in something very close to its final form, and second, it brings up a new issue which is very much worth talking about. Up till now, we've been able to write what is called a "predictive parser." That means that at any point, we can know by looking at the current lookahead character exactly what to do next. That isn't the case when we add functions. Every language has some naming rules for what constitutes a legal identifier. For the present, ours is simply that it is one of the letters 'a'..'z'. The problem is that a variable name and a function name obey the same rules. So how can we tell which is which? One way is to require that they each be declared before they are used. Pascal takes that approach. The other is that we might require a function to be followed by a (possibly empty) parameter list. That's the rule used in C. Since we don't yet have a mechanism for declaring types, let's use the C rule for now. Since we also don't have a mechanism to deal with parameters, we can only handle empty lists, so our function calls will have the form x() . Since we're not dealing with parameter lists yet, there is nothing to do but to call the function, so we need only to issue a BSR (call) instead of a MOVE. Now that there are two possibilities for the "If IsAlpha" branch of the test in Factor, let's treat them in a separate procedure. 
Modify Factor to read: {---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Expression; Forward; procedure Factor; begin if Look = '(' then begin Match('('); Expression; Match(')'); end else if IsAlpha(Look) then Ident else EmitLn('MOVE #' + GetNum + ',D0'); end; {--------------------------------------------------------------} and insert before it the new procedure {---------------------------------------------------------------} { Parse and Translate an Identifier } procedure Ident; var Name: char; begin Name := GetName; if Look = '(' then begin Match('('); Match(')'); EmitLn('BSR ' + Name); end else EmitLn('MOVE ' + Name + '(PC),D0') end; {---------------------------------------------------------------} OK, compile and test this version. Does it parse all legal expressions? Does it correctly flag badly formed ones? The important thing to notice is that even though we no longer have a predictive parser, there is little or no complication added with the recursive descent approach that we're using. At the point where Factor finds an identifier (letter), it doesn't know whether it's a variable name or a function name, nor does it really care. It simply passes it on to Ident and leaves it up to that procedure to figure it out. Ident, in turn, simply tucks away the identifier and then reads one more character to decide which kind of identifier it's dealing with. Keep this approach in mind. It's a very powerful concept, and it should be used whenever you encounter an ambiguous situation requiring further lookahead. Even if you had to look several tokens ahead, the principle would still work. MORE ON ERROR HANDLING As long as we're talking philosophy, there's another important issue to point out: error handling. Notice that although the parser correctly rejects (almost) every malformed expression we can throw at it, with a meaningful error message, we haven't really had to do much work to make that happen. 
In fact, in the whole parser per se (from Ident through Expression) there are only two calls to the error routine, Expected. Even those aren't necessary ... if you'll look again in Term and Expression, you'll see that those statements can't be reached. I put them in early on as a bit of insurance, but they're no longer needed. Why don't you delete them now? So how did we get this nice error handling virtually for free? It's simply that I've carefully avoided reading a character directly using GetChar. Instead, I've relied on the error handling in GetName, GetNum, and Match to do all the error checking for me. Astute readers will notice that some of the calls to Match (for example, the ones in Add and Subtract) are also unnecessary ... we already know what the character is by the time we get there ... but it maintains a certain symmetry to leave them in, and the general rule to always use Match instead of GetChar is a good one. I mentioned an "almost" above. There is a case where our error handling leaves a bit to be desired. So far we haven't told our parser what an end-of-line looks like, or what to do with embedded white space. So a space character (or any other character not part of the recognized character set) simply causes the parser to terminate, ignoring the unrecognized characters. It could be argued that this is reasonable behavior at this point. In a "real" compiler, there is usually another statement following the one we're working on, so any characters not treated as part of our expression will either be used for or rejected as part of the next one. But it's also a very easy thing to fix up, even if it's only temporary. All we have to do is assert that the expression should end with an end-of-line, i.e., a carriage return. To see what I'm talking about, try the input line 1+2 3+4 See how the space was treated as a terminator? 
Now, to make the compiler properly flag this, add the line if Look <> CR then Expected('Newline'); in the main program, just after the call to Expression. That catches anything left over in the input stream. Don't forget to define CR in the const statement: CR = ^M; As usual, recompile the program and verify that it does what it's supposed to. ASSIGNMENT STATEMENTS OK, at this point we have a parser that works very nicely. I'd like to point out that we got it using only 88 lines of executable code, not counting what was in the cradle. The compiled object file is a whopping 4752 bytes. Not bad, considering we weren't trying very hard to save either source code or object size. We just stuck to the KISS principle. Of course, parsing an expression is not much good without having something to do with it afterwards. Expressions USUALLY (but not always) appear in assignment statements, in the form <Ident> = <Expression>. We're only a breath away from being able to parse an assignment statement, so let's take that last step. Just after procedure Expression, add the following new procedure: {--------------------------------------------------------------} { Parse and Translate an Assignment Statement } procedure Assignment; var Name: char; begin Name := GetName; Match('='); Expression; EmitLn('LEA ' + Name + '(PC),A0'); EmitLn('MOVE D0,(A0)') end; {--------------------------------------------------------------} Note again that the code exactly parallels the BNF. And notice further that the error checking was painless, handled by GetName and Match. The reason for the two lines of assembler has to do with a peculiarity in the 68000, which requires this kind of construct for PC-relative code. Now change the call to Expression, in the main program, to one to Assignment. That's all there is to it. Son of a gun! We are actually compiling assignment statements. If those were the only kind of statements in a language, all we'd have to do is put this in a loop and we'd have a full-fledged compiler! 
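To see Assignment and Ident working together, here is a Python sketch of the statement level as it now stands (illustrative only; single-character names and digits, and a bare Factor standing in for the full Expression):

```python
def compile_statement(src):
    """Sketch of the statement level: Assignment at the top, with Ident
    using one extra character of lookahead to tell variables from calls."""
    out, pos = [], 0

    def look():
        return src[pos] if pos < len(src) else '\0'

    def match(c):
        nonlocal pos
        assert look() == c, f"'{c}' expected"
        pos += 1

    def ident():
        nonlocal pos
        name = look()
        pos += 1
        if look() == '(':            # one more char of lookahead: a call
            match('(')
            match(')')
            out.append('BSR ' + name)
        else:                        # otherwise a variable load
            out.append('MOVE ' + name + '(PC),D0')

    def factor():
        nonlocal pos
        if look().isalpha():
            ident()
        else:
            assert look().isdigit(), 'Integer expected'
            out.append('MOVE #' + look() + ',D0')
            pos += 1

    def assignment():
        nonlocal pos
        name = look()
        pos += 1
        match('=')
        factor()                     # stands in for the full Expression
        out.append('LEA ' + name + '(PC),A0')   # PC-relative store needs
        out.append('MOVE D0,(A0)')              # two instructions

    assignment()
    return out
```

Note how the two-instruction LEA/MOVE tail mirrors the 68000 peculiarity described above.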
Well, of course they're not the only kind. There are also little items like control statements (IFs and loops), procedures, declarations, etc. But cheer up. The arithmetic expressions that we've been dealing with are among the most challenging in a language. Compared to what we've already done, control statements will be easy. I'll be covering them in the fifth installment. And the other statements will all fall in line, as long as we remember to KISS. MULTI-CHARACTER TOKENS Throughout this series, I've been carefully restricting everything we do to single-character tokens, all the while assuring you that it wouldn't be difficult to extend to multi-character ones. I don't know if you believed me or not ... I wouldn't really blame you if you were a bit skeptical. I'll continue to use that approach in the sessions which follow, because it helps keep complexity away. But I'd like to back up those assurances, and wrap up this portion of the parser, by showing you just how easy that extension really is. In the process, we'll also provide for embedded white space. Before you make the next few changes, though, save the current version of the parser away under another name. I have some more uses for it in the next installment, and we'll be working with the single-character version. Most compilers separate out the handling of the input stream into a separate module called the lexical scanner. The idea is that the scanner deals with all the character-by-character input, and returns the separate units (tokens) of the stream. There may come a time when we'll want to do something like that, too, but for now there is no need. We can handle the multi-character tokens that we need by very slight and very local modifications to GetName and GetNum. The usual definition of an identifier is that the first character must be a letter, but the rest can be alphanumeric (letters or numbers). 
To deal with this, we need one other recognizer function {--------------------------------------------------------------} { Recognize an Alphanumeric } function IsAlNum(c: char): boolean; begin IsAlNum := IsAlpha(c) or IsDigit(c); end; {--------------------------------------------------------------} Add this function to your parser. I put mine just after IsDigit. While you're at it, might as well include it as a permanent member of Cradle, too. Now, we need to modify function GetName to return a string instead of a character: {--------------------------------------------------------------} { Get an Identifier } function GetName: string; var Token: string; begin Token := ''; if not IsAlpha(Look) then Expected('Name'); while IsAlNum(Look) do begin Token := Token + UpCase(Look); GetChar; end; GetName := Token; end; {--------------------------------------------------------------} Similarly, modify GetNum to read: {--------------------------------------------------------------} { Get a Number } function GetNum: string; var Value: string; begin Value := ''; if not IsDigit(Look) then Expected('Integer'); while IsDigit(Look) do begin Value := Value + Look; GetChar; end; GetNum := Value; end; {--------------------------------------------------------------} Amazingly enough, that is virtually all the changes required to the parser! The local variable Name in procedures Ident and Assignment was originally declared as "char", and must now be declared string[8]. (Clearly, we could make the string length longer if we chose, but most assemblers limit the length anyhow.) Make this change, and then recompile and test. _NOW_ do you believe that it's a simple change? WHITE SPACE Before we leave this parser for awhile, let's address the issue of white space. As it stands now, the parser will barf (or simply terminate) on a single space character embedded anywhere in the input stream. That's pretty unfriendly behavior. 
So let's "productionize" the thing a bit by eliminating this last restriction. The key to easy handling of white space is to come up with a simple rule for how the parser should treat the input stream, and to enforce that rule everywhere. Up till now, because white space wasn't permitted, we've been able to assume that after each parsing action, the lookahead character Look contains the next meaningful character, so we could test it immediately. Our design was based upon this principle. It still sounds like a good rule to me, so that's the one we'll use. This means that every routine that advances the input stream must skip over white space, and leave the next non-white character in Look. Fortunately, because we've been careful to use GetName, GetNum, and Match for most of our input processing, it is only those three routines (plus Init) that we need to modify. Not surprisingly, we start with yet another new recognizer routine: {--------------------------------------------------------------} { Recognize White Space } function IsWhite(c: char): boolean; begin IsWhite := c in [' ', TAB]; end; {--------------------------------------------------------------} We also need a routine that will eat white-space characters, until it finds a non-white one: {--------------------------------------------------------------} { Skip Over Leading White Space } procedure SkipWhite; begin while IsWhite(Look) do GetChar; end; {--------------------------------------------------------------} Now, add calls to SkipWhite to Match, GetName, and GetNum as shown below: {--------------------------------------------------------------} { Match a Specific Input Character } procedure Match(x: char); begin if Look <> x then Expected('''' + x + '''') else begin GetChar; SkipWhite; end; end; {--------------------------------------------------------------} { Get an Identifier } function GetName: string; var Token: string; begin Token := ''; if not IsAlpha(Look) then Expected('Name'); while 
IsAlNum(Look) do begin Token := Token + UpCase(Look); GetChar; end; GetName := Token; SkipWhite; end; {--------------------------------------------------------------} { Get a Number } function GetNum: string; var Value: string; begin Value := ''; if not IsDigit(Look) then Expected('Integer'); while IsDigit(Look) do begin Value := Value + Look; GetChar; end; GetNum := Value; SkipWhite; end; {--------------------------------------------------------------} (Note that I rearranged Match a bit, without changing the functionality.) Finally, we need to skip over leading blanks where we "prime the pump" in Init: {--------------------------------------------------------------} { Initialize } procedure Init; begin GetChar; SkipWhite; end; {--------------------------------------------------------------} Make these changes and recompile the program. You will find that you will have to move Match below SkipWhite, to avoid an error message from the Pascal compiler. Test the program as always to make sure it works properly. 
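The whole rule ("every routine that advances the input leaves the next non-white character in the lookahead") is compact enough to sketch in Python. This is an illustration of the logic, not the tutorial's Pascal; positions are passed explicitly instead of using a global Look:

```python
def skip_white(src, pos):
    """Advance past blanks and tabs, so the caller's lookahead always
    rests on the next meaningful character (the rule adopted above)."""
    while pos < len(src) and src[pos] in ' \t':
        pos += 1
    return pos

def get_name(src, pos):
    """Multi-character GetName with trailing white space skipped:
    first character a letter, the rest alphanumeric, upper-cased."""
    assert pos < len(src) and src[pos].isalpha(), 'Name expected'
    token = ''
    while pos < len(src) and src[pos].isalnum():
        token += src[pos].upper()
        pos += 1
    return token, skip_white(src, pos)
```

Because get_name itself skips trailing white space, its caller can test the next character immediately, exactly as the Pascal version allows.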
Since we've made quite a few changes during this session, I'm reproducing the entire parser below: {--------------------------------------------------------------} program parse; {--------------------------------------------------------------} { Constant Declarations } const TAB = ^I; CR = ^M; {--------------------------------------------------------------} { Variable Declarations } var Look: char; { Lookahead Character } {--------------------------------------------------------------} { Read New Character From Input Stream } procedure GetChar; begin Read(Look); end; {--------------------------------------------------------------} { Report an Error } procedure Error(s: string); begin WriteLn; WriteLn(^G, 'Error: ', s, '.'); end; {--------------------------------------------------------------} { Report Error and Halt } procedure Abort(s: string); begin Error(s); Halt; end; {--------------------------------------------------------------} { Report What Was Expected } procedure Expected(s: string); begin Abort(s + ' Expected'); end; {--------------------------------------------------------------} { Recognize an Alpha Character } function IsAlpha(c: char): boolean; begin IsAlpha := UpCase(c) in ['A'..'Z']; end; {--------------------------------------------------------------} { Recognize a Decimal Digit } function IsDigit(c: char): boolean; begin IsDigit := c in ['0'..'9']; end; {--------------------------------------------------------------} { Recognize an Alphanumeric } function IsAlNum(c: char): boolean; begin IsAlNum := IsAlpha(c) or IsDigit(c); end; {--------------------------------------------------------------} { Recognize an Addop } function IsAddop(c: char): boolean; begin IsAddop := c in ['+', '-']; end; {--------------------------------------------------------------} { Recognize White Space } function IsWhite(c: char): boolean; begin IsWhite := c in [' ', TAB]; end; {--------------------------------------------------------------} { Skip Over Leading White 
Space } procedure SkipWhite; begin while IsWhite(Look) do GetChar; end; {--------------------------------------------------------------} { Match a Specific Input Character } procedure Match(x: char); begin if Look <> x then Expected('''' + x + '''') else begin GetChar; SkipWhite; end; end; {--------------------------------------------------------------} { Get an Identifier } function GetName: string; var Token: string; begin Token := ''; if not IsAlpha(Look) then Expected('Name'); while IsAlNum(Look) do begin Token := Token + UpCase(Look); GetChar; end; GetName := Token; SkipWhite; end; {--------------------------------------------------------------} { Get a Number } function GetNum: string; var Value: string; begin Value := ''; if not IsDigit(Look) then Expected('Integer'); while IsDigit(Look) do begin Value := Value + Look; GetChar; end; GetNum := Value; SkipWhite; end; {--------------------------------------------------------------} { Output a String with Tab } procedure Emit(s: string); begin Write(TAB, s); end; {--------------------------------------------------------------} { Output a String with Tab and CRLF } procedure EmitLn(s: string); begin Emit(s); WriteLn; end; {---------------------------------------------------------------} { Parse and Translate an Identifier } procedure Ident; var Name: string[8]; begin Name := GetName; if Look = '(' then begin Match('('); Match(')'); EmitLn('BSR ' + Name); end else EmitLn('MOVE ' + Name + '(PC),D0'); end; {---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Expression; Forward; procedure Factor; begin if Look = '(' then begin Match('('); Expression; Match(')'); end else if IsAlpha(Look) then Ident else EmitLn('MOVE #' + GetNum + ',D0'); end; {--------------------------------------------------------------} { Recognize and Translate a Multiply } procedure Multiply; begin Match('*'); Factor; EmitLn('MULS (SP)+,D0'); end; 
{-------------------------------------------------------------} { Recognize and Translate a Divide } procedure Divide; begin Match('/'); Factor; EmitLn('MOVE (SP)+,D1'); EmitLn('EXT.L D0'); EmitLn('DIVS D1,D0'); end; {---------------------------------------------------------------} { Parse and Translate a Math Term } procedure Term; begin Factor; while Look in ['*', '/'] do begin EmitLn('MOVE D0,-(SP)'); case Look of '*': Multiply; '/': Divide; end; end; end; {--------------------------------------------------------------} { Recognize and Translate an Add } procedure Add; begin Match('+'); Term; EmitLn('ADD (SP)+,D0'); end; {-------------------------------------------------------------} { Recognize and Translate a Subtract } procedure Subtract; begin Match('-'); Term; EmitLn('SUB (SP)+,D0'); EmitLn('NEG D0'); end; {---------------------------------------------------------------} { Parse and Translate an Expression } procedure Expression; begin if IsAddop(Look) then EmitLn('CLR D0') else Term; while IsAddop(Look) do begin EmitLn('MOVE D0,-(SP)'); case Look of '+': Add; '-': Subtract; end; end; end; {--------------------------------------------------------------} { Parse and Translate an Assignment Statement } procedure Assignment; var Name: string[8]; begin Name := GetName; Match('='); Expression; EmitLn('LEA ' + Name + '(PC),A0'); EmitLn('MOVE D0,(A0)') end; {--------------------------------------------------------------} { Initialize } procedure Init; begin GetChar; SkipWhite; end; {--------------------------------------------------------------} { Main Program } begin Init; Assignment; if Look <> CR then Expected('NewLine'); end. {--------------------------------------------------------------} Now the parser is complete. It's got every feature we can put in a one-line "compiler." Tuck it away in a safe place. Next time we'll move on to a new subject, but we'll still be talking about expressions for quite awhile. 
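As a cross-check on the finished parser's behavior, here is a compact Python rendering of the whole thing (illustrative, not the tutorial's Pascal; EOF stands in for the carriage return, and names are upper-cased as GetName does):

```python
def compile_program(src):
    """Python rendering of the finished one-statement parser: multi-
    character tokens, embedded white space, and the Newline check."""
    pos = 0
    out = []

    def look():
        return src[pos] if pos < len(src) else '\r'   # EOF acts as CR

    def get_char():
        nonlocal pos
        pos += 1

    def skip_white():
        while look() in ' \t':
            get_char()

    def match(c):
        if look() != c:
            raise SyntaxError(f"'{c}' Expected")
        get_char()
        skip_white()

    def get_name():
        if not look().isalpha():
            raise SyntaxError('Name Expected')
        token = ''
        while look().isalnum():
            token += look().upper()
            get_char()
        skip_white()
        return token

    def get_num():
        if not look().isdigit():
            raise SyntaxError('Integer Expected')
        value = ''
        while look().isdigit():
            value += look()
            get_char()
        skip_white()
        return value

    def ident():
        name = get_name()
        if look() == '(':
            match('('); match(')')
            out.append('BSR ' + name)
        else:
            out.append('MOVE ' + name + '(PC),D0')

    def factor():
        if look() == '(':
            match('('); expression(); match(')')
        elif look().isalpha():
            ident()
        else:
            out.append('MOVE #' + get_num() + ',D0')

    def term():
        factor()
        while look() in '*/':
            out.append('MOVE D0,-(SP)')
            if look() == '*':
                match('*'); factor()
                out.append('MULS (SP)+,D0')
            else:
                match('/'); factor()
                out.append('MOVE (SP)+,D1')
                out.append('EXT.L D0')
                out.append('DIVS D1,D0')

    def expression():
        if look() in '+-':
            out.append('CLR D0')
        else:
            term()
        while look() in '+-':
            out.append('MOVE D0,-(SP)')
            if look() == '+':
                match('+'); term()
                out.append('ADD (SP)+,D0')
            else:
                match('-'); term()
                out.append('SUB (SP)+,D0')
                out.append('NEG D0')

    skip_white()                      # Init "primes the pump"
    name = get_name()                 # Assignment
    match('=')
    expression()
    out.append('LEA ' + name + '(PC),A0')
    out.append('MOVE D0,(A0)')
    if look() != '\r':
        raise SyntaxError('Newline Expected')
    return out
```

Feeding it an input like x = b * b shows the full pipeline: two PC-relative loads, a multiply, and the two-instruction store.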
Next installment, I plan to talk a bit about interpreters as opposed to compilers, and show you how the structure of the parser changes a bit as we change what sort of action has to be taken. The information we pick up there will serve us in good stead later on, even if you have no interest in interpreters. See you next time. LET'S BUILD A COMPILER! By Jack W. Crenshaw, Ph.D. 24 July 1988 Part IV: INTERPRETERS Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. INTRODUCTION In the first three installments of this series, we've looked at parsing and compiling math expressions, and worked our way gradually and methodically from dealing with very simple one-term, one-character "expressions" up through more general ones, finally arriving at a very complete parser that could parse and translate complete assignment statements, with multi-character tokens, embedded white space, and function calls. This time, I'm going to walk you through the process one more time, only with the goal of interpreting rather than compiling object code. Since this is a series on compilers, why should we bother with interpreters? Simply because I want you to see how the nature of the parser changes as we change the goals. I also want to unify the concepts of the two types of translators, so that you can see not only the differences, but also the similarities. Consider the assignment statement x = 2 * y + 3 In a compiler, we want the target CPU to execute this assignment at EXECUTION time. The translator itself doesn't do any arithmetic ... 
it only issues the object code that will cause the CPU to do it when the code is executed. For the example above, the compiler would issue code to compute the expression and store the results in variable x. For an interpreter, on the other hand, no object code is generated. Instead, the arithmetic is computed immediately, as the parsing is going on. For the example, by the time parsing of the statement is complete, x will have a new value. The approach we've been taking in this whole series is called "syntax-driven translation." As you are aware by now, the structure of the parser is very closely tied to the syntax of the productions we parse. We have built Pascal procedures that recognize every language construct. Associated with each of these constructs (and procedures) is a corresponding "action," which does whatever makes sense to do once a construct has been recognized. In our compiler so far, every action involves emitting object code, to be executed later at execution time. In an interpreter, every action involves something to be done immediately. What I'd like you to see here is that the layout ... the structure ... of the parser doesn't change. It's only the actions that change. So if you can write an interpreter for a given language, you can also write a compiler, and vice versa. Yet, as you will see, there ARE differences, and significant ones. Because the actions are different, the procedures that do the recognizing end up being written differently. Specifically, in the interpreter the recognizing procedures end up being coded as FUNCTIONS that return numeric values to their callers. None of the parsing routines for our compiler did that. Our compiler, in fact, is what we might call a "pure" compiler. Each time a construct is recognized, the object code is emitted IMMEDIATELY. (That's one reason the code is not very efficient.) 
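The "same skeleton, different actions" point can be made concrete with a tiny Python sketch (invented for illustration): one parse loop over a digits-and-addops string, with the actions passed in as functions. Swapping the action set turns an interpreter into a compiler without touching the skeleton.

```python
def parse_sum(src, on_number, on_op):
    """Syntax-driven translation in miniature: a fixed parse skeleton
    for <digit> (<addop> <digit>)*, with pluggable actions."""
    state = on_number(int(src[0]))
    for i in range(1, len(src), 2):
        state = on_op(state, src[i], int(src[i + 1]))
    return state

# interpreter actions: do the arithmetic now
interp = parse_sum('1+2-4',
                   lambda n: n,
                   lambda acc, op, n: acc + n if op == '+' else acc - n)

# compiler actions: emit code to do it later
code = parse_sum('1+2-4',
                 lambda n: [f'MOVE #{n},D0'],
                 lambda acc, op, n:
                     acc + [f"{'ADD' if op == '+' else 'SUB'} #{n},D0"])
```

The interpreter's "state" is a number; the compiler's is a list of instructions. That is the whole difference.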
The interpreter we'll be building here is a pure interpreter, in the sense that there is no translation, such as "tokenizing," performed on the source code. These represent the two extremes of translation. In the real world, translators are rarely so pure, but tend to have bits of each technique. I can think of several examples. I've already mentioned one: most interpreters, such as Microsoft BASIC, for example, translate the source code (tokenize it) into an intermediate form so that it'll be easier to parse in real time. Another example is an assembler. The purpose of an assembler, of course, is to produce object code, and it normally does that on a one-to-one basis: one object instruction per line of source code. But almost every assembler also permits expressions as arguments. In this case, the expressions are always constant expressions, and so the assembler isn't supposed to issue object code for them. Rather, it "interprets" the expressions and computes the corresponding constant result, which is what it actually emits as object code. As a matter of fact, we could use a bit of that ourselves. The translator we built in the previous installment will dutifully spit out object code for complicated expressions, even though every term in the expression is a constant. In that case it would be far better if the translator behaved a bit more like an interpreter, and just computed the equivalent constant result. There is a concept in compiler theory called "lazy" translation. The idea is that you typically don't just emit code at every action. In fact, at the extreme you don't emit anything at all, until you absolutely have to. To accomplish this, the actions associated with the parsing routines typically don't just emit code. Sometimes they do, but often they simply return information back to the caller. Armed with such information, the caller can then make a better choice of what to do. 
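One concrete form that returned information can take is a constant value: if a routine reports which of its terms are compile-time constants, the caller can fold them instead of emitting code. A toy sketch (the representation is invented for illustration):

```python
def lazy_sum(terms):
    """Toy constant folding: each term is (sign, item), where item is
    either an int constant or a variable name. Constants fold at
    'compile time'; variables are passed through for code generation."""
    const = 0
    variables = []
    for sign, item in terms:
        if isinstance(item, int):
            const += sign * item
        else:
            variables.append((sign, item))
    return variables, const
```

When the folded constant comes out zero and one variable remains, the caller knows the whole sum is just a load of that variable, so no arithmetic code need be emitted at all.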
For example, given the statement

     x = x + 3 - 2 - (5 - 4) ,

our compiler will dutifully spit out a stream of 18 instructions to load each parameter into registers, perform the arithmetic, and store the result. A lazier evaluation would recognize that the arithmetic involving constants can be evaluated at compile time, and would reduce the expression to

     x = x + 0 .

An even lazier evaluation would then be smart enough to figure out that this is equivalent to

     x = x ,

which calls for no action at all. We could reduce 18 instructions to zero!

Note that there is no chance of optimizing this way in our translator as it stands, because every action takes place immediately.

Lazy expression evaluation can produce significantly better object code than we have been able to so far. I warn you, though: it complicates the parser code considerably, because each routine now has to make decisions as to whether to emit object code or not. Lazy evaluation is certainly not named that because it's easier on the compiler writer!

Since we're operating mainly on the KISS principle here, I won't go into much more depth on this subject. I just want you to be aware that you can get some code optimization by combining the techniques of compiling and interpreting. In particular, you should know that the parsing routines in a smarter translator will generally return things to their caller, and sometimes expect things as well. That's the main reason for going over interpretation in this installment.

THE INTERPRETER

OK, now that you know WHY we're going into all this, let's do it. Just to give you practice, we're going to start over with a bare cradle and build up the translator all over again. This time, of course, we can go a bit faster.

Since we're now going to do arithmetic, the first thing we need to do is to change function GetNum, which up till now has always returned a character (or string). Now, it's better for it to return an integer.
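To make the constant-folding idea concrete, here is a small sketch in Python (my own illustration, not the tutorial's Pascal; it leans on Python's ast module instead of our recursive-descent routines). It collapses any binary operator whose operands are both literals. Note that because "x + 3 - 2" groups left-to-right as "(x + 3) - 2", a folder this simple cannot reach "x + 0"; the even lazier reductions described above would also need to reassociate terms:

```python
import ast
import operator

# Binary operators we are willing to fold at "compile time".
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

def fold_constants(expr):
    """Collapse every operator whose operands are both literal constants."""
    class Folder(ast.NodeTransformer):
        def visit_BinOp(self, node):
            self.generic_visit(node)          # fold the children first
            if (isinstance(node.left, ast.Constant)
                    and isinstance(node.right, ast.Constant)
                    and type(node.op) in OPS):
                value = OPS[type(node.op)](node.left.value, node.right.value)
                return ast.copy_location(ast.Constant(value), node)
            return node

    tree = Folder().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(tree)
```

An all-constant expression folds down to a single literal; an expression containing a variable folds only its constant subexpressions.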
MAKE A COPY of the cradle (for goodness's sake, don't change the version in Cradle itself!!) and modify GetNum as follows:

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Ord(Look) - Ord('0');
   GetChar;
end;
{--------------------------------------------------------------}

Now, write the following version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: integer;
begin
   Expression := GetNum;
end;
{--------------------------------------------------------------}

Finally, insert the statement

     Writeln(Expression);

at the end of the main program. Now compile and test.

All this program does is to "parse" and translate a single integer "expression." As always, you should make sure that it does that with the digits 0..9, and gives an error message for anything else. Shouldn't take you very long!

OK, now let's extend this to include addops. Change Expression to read:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: integer;
var Value: integer;
begin
   if IsAddop(Look) then
      Value := 0
   else
      Value := GetNum;
   while IsAddop(Look) do begin
      case Look of
       '+': begin
               Match('+');
               Value := Value + GetNum;
            end;
       '-': begin
               Match('-');
               Value := Value - GetNum;
            end;
      end;
   end;
   Expression := Value;
end;
{--------------------------------------------------------------}

The structure of Expression, of course, parallels what we did before, so we shouldn't have too much trouble debugging it. There's been a SIGNIFICANT development, though, hasn't there? Procedures Add and Subtract went away! The reason is that the action to be taken requires BOTH arguments of the operation. I could have chosen to retain the procedures and pass into them the value of the expression to date, which is Value.
But it seemed cleaner to me to keep Value as strictly a local variable, which meant that the code for Add and Subtract had to be moved in line. This result suggests that, while the structure we had developed was nice and clean for our simple-minded translation scheme, it probably wouldn't do for use with lazy evaluation. That's a little tidbit we'll probably want to keep in mind for later.

OK, did the translator work? Then let's take the next step. It's not hard to figure out what procedure Term should now look like. Change every call to GetNum in function Expression to a call to Term, and then enter the following form for Term:

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

function Term: integer;
var Value: integer;
begin
   Value := GetNum;
   while Look in ['*', '/'] do begin
      case Look of
       '*': begin
               Match('*');
               Value := Value * GetNum;
            end;
       '/': begin
               Match('/');
               Value := Value div GetNum;
            end;
      end;
   end;
   Term := Value;
end;
{--------------------------------------------------------------}

Now, try it out. Don't forget two things: first, we're dealing with integer division, so, for example, 1/3 should come out zero. Second, even though we can output multi-digit results, our input is still restricted to single digits.

That seems like a silly restriction at this point, since we have already seen how easily function GetNum can be extended. So let's go ahead and fix it right now. The new version is

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
var Value: integer;
begin
   Value := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := 10 * Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}

If you've compiled and tested this version of the interpreter, the next step is to install function Factor, complete with parenthesized expressions.
We'll hold off a bit longer on the variable names. First, change the references to GetNum, in function Term, so that they call Factor instead. Now code the following version of Factor:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
   end
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}

That was pretty easy, huh? We're rapidly closing in on a useful interpreter.

A LITTLE PHILOSOPHY

Before going any further, there's something I'd like to call to your attention. It's a concept that we've been making use of in all these sessions, but I haven't explicitly mentioned it up till now. I think it's time, because it's a concept so useful, and so powerful, that it makes all the difference between a parser that's trivially easy, and one that's too complex to deal with.

In the early days of compiler technology, people had a terrible time figuring out how to deal with things like operator precedence ... the way that multiply and divide operators take precedence over add and subtract, etc. I remember a colleague of some thirty years ago, and how excited he was to find out how to do it. The technique used involved building two stacks, upon which you pushed each operator or operand. Associated with each operator was a precedence level, and the rules required that you only actually performed an operation ("reducing" the stack) if the precedence level showing on top of the stack was correct. To make life more interesting, an operator like ')' had different precedence levels, depending upon whether or not it was already on the stack. You had to give it one value before you put it on the stack, and another to decide when to take it off.
Just for the experience, I worked all of this out for myself a few years ago, and I can tell you that it's very tricky.

We haven't had to do anything like that. In fact, by now the parsing of an arithmetic statement should seem like child's play. How did we get so lucky? And where did the precedence stacks go?

A similar thing is going on in our interpreter above. You just KNOW that in order for it to do the computation of arithmetic statements (as opposed to the parsing of them), there have to be numbers pushed onto a stack somewhere. But where is the stack?

Finally, in compiler textbooks, there are a number of places where stacks and other structures are discussed. In the other leading parsing method (LR), an explicit stack is used. In fact, the technique is very much like the old way of doing arithmetic expressions. Another concept is that of a parse tree. Authors like to draw diagrams of the tokens in a statement, connected into a tree with operators at the internal nodes. Again, where are the trees and stacks in our technique? We haven't seen any.

The answer in all cases is that the structures are implicit, not explicit. In any computer language, there is a stack involved every time you call a subroutine. Whenever a subroutine is called, the return address is pushed onto the CPU stack. At the end of the subroutine, the address is popped back off and control is transferred there. In a recursive language such as Pascal, there can also be local data pushed onto the stack, and it, too, returns when it's needed.

For example, function Expression contains a local parameter called Value, which it fills by a call to Term. Suppose, in its next call to Term for the second argument, that Term calls Factor, which recursively calls Expression again. That "instance" of Expression gets another value for its copy of Value. What happens to the first Value? Answer: it's still on the stack, and will be there again when we return from our call sequence.
In other words, the reason things look so simple is that we've been making maximum use of the resources of the language. The hierarchy levels and the parse trees are there, all right, but they're hidden within the structure of the parser, and they're taken care of by the order with which the various procedures are called. Now that you've seen how we do it, it's probably hard to imagine doing it any other way. But I can tell you that it took a lot of years for compiler writers to get that smart. The early compilers were too complex to imagine. Funny how things get easier with a little practice.

The reason I've brought all this up is as both a lesson and a warning. The lesson: things can be easy when you do them right. The warning: take a look at what you're doing. If, as you branch out on your own, you begin to find a real need for a separate stack or tree structure, it may be time to ask yourself if you're looking at things the right way. Maybe you just aren't using the facilities of the language as well as you could be.

The next step is to add variable names. Now, though, we have a slight problem. For the compiler, we had no problem in dealing with variable names ... we just issued the names to the assembler and let the rest of the program take care of allocating storage for them. Here, on the other hand, we need to be able to fetch the values of the variables and return them as the return values of Factor. We need a storage mechanism for these variables.

Back in the early days of personal computing, Tiny BASIC lived. It had a grand total of 26 possible variables: one for each letter of the alphabet. This fits nicely with our concept of single-character tokens, so we'll try the same trick.
In the beginning of your interpreter, just after the declaration of variable Look, insert the line:

     Table: Array['A'..'Z'] of integer;

We also need to initialize the array, so add this procedure:

{---------------------------------------------------------------}
{ Initialize the Variable Area }

procedure InitTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      Table[i] := 0;
end;
{---------------------------------------------------------------}

You must also insert a call to InitTable, in procedure Init. DON'T FORGET to do that, or the results may surprise you!

Now that we have an array of variables, we can modify Factor to use it. Since we don't have a way (so far) to set the variables, Factor will always return zero values for them, but let's go ahead and extend it anyway. Here's the new version:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
   end
   else if IsAlpha(Look) then
      Factor := Table[GetName]
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}

As always, compile and test this version of the program. Even though all the variables are now zeros, at least we can correctly parse the complete expressions, as well as catch any badly formed expressions.

I suppose you realize the next step: we need to do an assignment statement so we can put something INTO the variables. For now, let's stick to one-liners, though we will soon be handling multiple statements.
The assignment statement parallels what we did before:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Table[Name] := Expression;
end;
{--------------------------------------------------------------}

To test this, I added a temporary write statement in the main program, to print out the value of A. Then I tested it with various assignments to it.

Of course, an interpretive language that can only accept a single line of program is not of much value. So we're going to want to handle multiple statements. This merely means putting a loop around the call to Assignment. So let's do that now. But what should be the loop exit criterion? Glad you asked, because it brings up a point we've been able to ignore up till now.

One of the trickiest things to handle in any translator is to determine when to bail out of a given construct and go look for something else. This hasn't been a problem for us so far because we've only allowed for a single kind of construct ... either an expression or an assignment statement. When we start adding loops and different kinds of statements, you'll find that we have to be very careful that things terminate properly. If we put our interpreter in a loop, we need a way to quit. Terminating on a newline is no good, because that's what sends us back for another line. We could always let an unrecognized character take us out, but that would cause every run to end in an error message, which certainly seems uncool.

What we need is a termination character. I vote for Pascal's ending period ('.'). A minor complication is that Turbo ends every normal line with TWO characters, the carriage return (CR) and line feed (LF). At the end of each line, we need to eat these characters before processing the next one.
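The variable table and the assignment statement together can be sketched like this in Python (an illustration with invented helper names, not the tutorial's code; a single digit or a variable name stands in for the full Expression):

```python
# The 26-entry variable store, initialized to zero (InitTable's job).
table = {chr(c): 0 for c in range(ord('A'), ord('Z') + 1)}

def assignment(stmt):
    """Execute a one-character assignment such as 'A=5' or 'B=A'."""
    name, eq, rest = stmt[0], stmt[1], stmt[2:]
    if not ('A' <= name <= 'Z'):
        raise SyntaxError('Name expected')
    if eq != '=':
        raise SyntaxError("'=' expected")
    # stand-in for Expression: a single digit, or a variable reference
    table[name] = table[rest] if rest in table else int(rest)
```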
A natural way to do this would be with procedure Match, except that Match's error message prints the character, which of course for the CR and/or LF won't look so great. What we need is a special procedure for this, which we'll no doubt be using over and over. Here it is:

{--------------------------------------------------------------}
{ Recognize and Skip Over a Newline }

procedure NewLine;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then GetChar;
   end;
end;
{--------------------------------------------------------------}

Insert this procedure at any convenient spot ... I put mine just after Match. Now, rewrite the main program to look like this:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Assignment;
      NewLine;
   until Look = '.';
end.
{--------------------------------------------------------------}

Note that the test for a CR is now gone, and that there are also no error tests within NewLine itself. That's OK, though ... whatever is left over in terms of bogus characters will be caught at the beginning of the next assignment statement.

Well, we now have a functioning interpreter. It doesn't do us a lot of good, however, since we have no way to read data in or write it out. Sure would help to have some I/O!

Let's wrap this session up, then, by adding the I/O routines. Since we're sticking to single-character tokens, I'll use '?' to stand for a read statement, and '!' for a write, with the character immediately following them to be used as a one-token "parameter list." Here are the routines:

{--------------------------------------------------------------}
{ Input Routine }

procedure Input;
begin
   Match('?');
   Read(Table[GetName]);
end;

{--------------------------------------------------------------}
{ Output Routine }

procedure Output;
begin
   Match('!');
   WriteLn(Table[GetName]);
end;
{--------------------------------------------------------------}

They aren't very fancy, I admit ...
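The CR/LF-eating logic is easy to get wrong, so here is a Python sketch of what NewLine does (my own transliteration, operating on a buffer and index instead of a lookahead character). Like the Pascal, it skips a lone CR or a CR/LF pair, and leaves anything else alone:

```python
CR, LF = '\r', '\n'

def skip_newline(buf, pos):
    """Skip a CR, or a CR followed by LF, returning the new position."""
    if pos < len(buf) and buf[pos] == CR:
        pos += 1
        if pos < len(buf) and buf[pos] == LF:
            pos += 1
    return pos
```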
no prompt character on input, for example ... but they get the job done. The corresponding changes in the main program are shown below. Note that we use the usual trick of a case statement based upon the current lookahead character, to decide what to do.

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      case Look of
       '?': Input;
       '!': Output;
      else Assignment;
      end;
      NewLine;
   until Look = '.';
end.
{--------------------------------------------------------------}

You have now completed a real, working interpreter. It's pretty sparse, but it works just like the "big boys." It includes three kinds of program statements (and can tell the difference!), 26 variables, and I/O statements. The only things that it lacks, really, are control statements, subroutines, and some kind of program editing function. The program editing part, I'm going to pass on. After all, we're not here to build a product, but to learn things. The control statements, we'll cover in the next installment, and the subroutines soon after. I'm anxious to get on with that, so we'll leave the interpreter as it stands.

I hope that by now you're convinced that the limitation of single-character names and the processing of white space are easily taken care of, as we did in the last session. This time, if you'd like to play around with these extensions, be my guest ... they're "left as an exercise for the student." See you next time.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                         19 August 1988

                  Part V: CONTROL CONSTRUCTS

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

INTRODUCTION

In the first four installments of this series, we've been concentrating on the parsing of math expressions and assignment statements. In this installment, we'll take off on a new and exciting tangent: that of parsing and translating control constructs such as IF statements.

This subject is dear to my heart, because it represents a turning point for me. I had been playing with the parsing of expressions, just as we have done in this series, but I still felt that I was a LONG way from being able to handle a complete language. After all, REAL languages have branches and loops and subroutines and all that. Perhaps you've shared some of the same thoughts. Awhile back, though, I had to produce control constructs for a structured assembler preprocessor I was writing. Imagine my surprise to discover that it was far easier than the expression parsing I had already been through. I remember thinking, "Hey! This is EASY!" After we've finished this session, I'll bet you'll be thinking so, too.

THE PLAN

In what follows, we'll be starting over again with a bare cradle, and as we've done twice before now, we'll build things up one at a time. We'll also be retaining the concept of single-character tokens that has served us so well to date. This means that the "code" will look a little funny, with 'i' for IF, 'w' for WHILE, etc. But it helps us get the concepts down pat without fussing over lexical scanning. Fear not ... eventually we'll see something looking like "real" code.

I also don't want to have us get bogged down in dealing with statements other than branches, such as the assignment statements we've been working on. We've already demonstrated that we can handle them, so there's no point carrying them around as excess baggage during this exercise.
So what I'll do instead is to use an anonymous statement, "other", to take the place of the non-control statements and serve as a place-holder for them. We have to generate some kind of object code for them (we're back into compiling, not interpretation), so for want of anything else I'll just echo the character input.

OK, then, starting with yet another copy of the cradle, let's define the procedure:

{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }

procedure Other;
begin
   EmitLn(GetName);
end;
{--------------------------------------------------------------}

Now include a call to it in the main program, thus:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Other;
end.
{--------------------------------------------------------------}

Run the program and see what you get. Not very exciting, is it? But hang in there, it's a start, and things will get better.

The first thing we need is the ability to deal with more than one statement, since a single-line branch is pretty limited. We did that in the last session on interpreting, but this time let's get a little more formal. Consider the following BNF:

     <program> ::= <block> END

     <block> ::= [ <statement> ]*

This says that, for our purposes here, a program is defined as a block, followed by an END statement. A block, in turn, consists of zero or more statements. We only have one kind of statement, so far.

What signals the end of a block? It's simply any construct that isn't an "other" statement. For now, that means only the END statement.

Armed with these ideas, we can proceed to build up our parser.
The code for a program (we have to call it DoProgram, or Pascal will complain) is:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block;
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;
{--------------------------------------------------------------}

Notice that I've arranged to emit an "END" command to the assembler, which sort of punctuates the output code, and makes sense considering that we're parsing a complete program here.

The code for Block is:

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e']) do begin
      Other;
   end;
end;
{--------------------------------------------------------------}

(From the form of the procedure, you just KNOW we're going to be adding to it in a bit!)

OK, enter these routines into your program. Replace the call to Block in the main program, by a call to DoProgram. Now try it and see how it works. Well, it's still not much, but we're getting closer.

SOME GROUNDWORK

Before we begin to define the various control constructs, we need to lay a bit more groundwork. First, a word of warning: I won't be using the same syntax for these constructs as you're familiar with from Pascal or C. For example, the Pascal syntax for an IF is:

     IF <condition> THEN <statement>

(where the statement, of course, may be compound). The C version is similar:

     IF ( <condition> ) <statement>

Instead, I'll be using something that looks more like Ada:

     IF <condition> <block> ENDIF

In other words, the IF construct has a specific termination symbol. This avoids the dangling-else of Pascal and C and also precludes the need for the brackets {} or begin-end. The syntax I'm showing you here, in fact, is that of the language KISS that I'll be detailing in later installments. The other constructs will also be slightly different. That shouldn't be a real problem for you.
Once you see how it's done, you'll realize that it really doesn't matter so much which specific syntax is involved. Once the syntax is defined, turning it into code is straightforward.

Now, all of the constructs we'll be dealing with here involve transfer of control, which at the assembler-language level means conditional and/or unconditional branches. For example, the simple IF statement

     IF <condition> A ENDIF B ....

must get translated into

          Branch if NOT condition to L
          A
     L:   B
          ...

It's clear, then, that we're going to need some more procedures to help us deal with these branches. I've defined two of them below. Procedure NewLabel generates unique labels. This is done via the simple expedient of calling every label 'Lnn', where nn is a label number starting from zero. Procedure PostLabel just outputs the labels at the proper place.

Here are the two routines:

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;
{--------------------------------------------------------------}

Notice that we've added a new global variable, LCount, so you need to change the VAR declarations at the top of the program to look like this:

     var Look  : char;              { Lookahead Character }
         LCount: integer;           { Label Counter }

Also, add the following extra initialization to Init:

     LCount := 0;

(DON'T forget that, or your labels can look really strange!)

At this point I'd also like to show you a new kind of notation. If you compare the form of the IF statement above with the assembler code that must be produced, you can see that there are certain actions associated with each of the keywords in the statement:

     IF:    First, get the condition and issue the code for it.
            Then, create a unique label and emit a branch if false.
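In Python, the two label helpers might be sketched as follows (illustrative names, my own; the global lcount plays the role of LCount, and post_label returns the text instead of printing it):

```python
lcount = 0   # label counter, as in the Pascal global LCount

def new_label():
    """Generate a unique label: L0, L1, L2, ..."""
    global lcount
    label = 'L' + str(lcount)
    lcount += 1
    return label

def post_label(label):
    """Render a label definition at the proper place in the output."""
    return label + ':'
```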
     ENDIF: Emit the label.

These actions can be shown very concisely if we write the syntax this way:

     IF
     <condition>    { Condition;
                      L = NewLabel;
                      Emit(Branch False to L); }
     <block>
     ENDIF          { PostLabel(L) }

This is an example of syntax-directed translation. We've been doing it all along ... we've just never written it down this way before. The stuff in curly brackets represents the ACTIONS to be taken. The nice part about this representation is that it not only shows what we have to recognize, but also the actions we have to perform, and in which order. Once we have this syntax, the code almost writes itself.

About the only thing left to do is to be a bit more specific about what we mean by "Branch if false." I'm assuming that there will be code executed for <condition> that will perform Boolean algebra and compute some result. It should also set the condition flags corresponding to that result. Now, the usual convention for a Boolean variable is to let 0000 represent "false," and anything else (some use FFFF, some 0001) represent "true."

On the 68000 the condition flags are set whenever any data is moved or calculated. If the data is a 0000 (corresponding to a false condition, remember), the zero flag will be set. The code for "Branch on zero" is BEQ. So for our purposes here,

     BEQ  <=>  Branch if false
     BNE  <=>  Branch if true

It's the nature of the beast that most of the branches we see will be BEQ's ... we'll be branching AROUND the code that's supposed to be executed when the condition is true.

THE IF STATEMENT

With that bit of explanation out of the way, we're finally ready to begin coding the IF-statement parser. In fact, we've almost already done it! As usual, I'll be using our single-character approach, with the character 'i' for IF, and 'e' for ENDIF (as well as END ... that dual nature causes no confusion). I'll also, for now, skip completely the character for the branch condition, which we still have to define.
The code for DoIf is:

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L: string;
begin
   Match('i');
   L := NewLabel;
   Condition;
   EmitLn('BEQ ' + L);
   Block;
   Match('e');
   PostLabel(L);
end;
{--------------------------------------------------------------}

Add this routine to your program, and change Block to reference it as follows:

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e']) do begin
      case Look of
       'i': DoIf;
      else Other;
      end;
   end;
end;
{--------------------------------------------------------------}

Notice the reference to procedure Condition. Eventually, we'll write a routine that can parse and translate any Boolean condition we care to give it. But that's a whole installment by itself (the next one, in fact). For now, let's just make it a dummy that emits some text. Write the following routine:

{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }

Procedure Condition;
begin
   EmitLn('<condition>');
end;
{--------------------------------------------------------------}

Insert this procedure in your program just before DoIf. Now run the program. Try a string like

     aibece

As you can see, the parser seems to recognize the construct and inserts the object code at the right places. Now try a set of nested IF's, like

     aibicedefe

It's starting to look real, eh?

Now that we have the general idea (and the tools such as the notation and the procedures NewLabel and PostLabel), it's a piece of cake to extend the parser to include other constructs. The first (and also one of the trickiest) is to add the ELSE clause to IF. The BNF is

     IF <condition> <block> [ ELSE <block> ] ENDIF

The tricky part arises simply because there is an optional part, which doesn't occur in the other constructs.
The corresponding output code should be

             <condition>
             BEQ L1
             <block>
             BRA L2
     L1:     <block>
     L2:     ...

This leads us to the following syntax-directed translation:

     IF
     <condition>    { L1 = NewLabel;
                      L2 = NewLabel;
                      Emit(BEQ L1) }
     <block>
     ELSE           { Emit(BRA L2);
                      PostLabel(L1) }
     <block>
     ENDIF          { PostLabel(L2) }

Comparing this with the case for an ELSE-less IF gives us a clue as to how to handle both situations. The code below does it. (Note that I use an 'l' for the ELSE, since 'e' is otherwise occupied):

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   Match('e');
   PostLabel(L2);
end;
{--------------------------------------------------------------}

There you have it. A complete IF parser/translator, in 19 lines of code. Give it a try now. Try something like

     aiblcede

Did it work? Now, just to be sure we haven't broken the ELSE-less case, try

     aibece

Now try some nested IF's. Try anything you like, including some badly formed statements. Just remember that 'e' is not a legal "other" statement.

THE WHILE STATEMENT

The next type of statement should be easy, since we already have the process down pat. The syntax I've chosen for the WHILE statement is

     WHILE <condition> <block> ENDWHILE

I know, I know, we don't REALLY need separate kinds of terminators for each construct ... you can see that by the fact that in our one-character version, 'e' is used for all of them. But I also remember MANY debugging sessions in Pascal, trying to track down a wayward END that the compiler obviously thought I meant to put somewhere else. It's been my experience that specific and unique keywords, although they add to the vocabulary of the language, give a bit of error-checking that is worth the extra work for the compiler writer.

Now, consider what the WHILE should be translated into.
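To see the IF/ELSE translation end to end, here is a Python sketch of the single-character parser (my own transliteration of DoIf and Block, not the tutorial's code). 'i' is IF, 'l' is ELSE, 'e' is ENDIF, any other letter is an "other" statement that echoes itself, and the dummy Condition just emits '<condition>':

```python
def translate(src):
    """Translate i/l/e constructs into a list of emitted lines."""
    pos, count, out = 0, 0, []

    def look():
        return src[pos] if pos < len(src) else ''

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(repr(ch) + ' expected')
        pos += 1

    def new_label():
        nonlocal count
        count += 1
        return 'L' + str(count - 1)

    def do_if():
        match('i')
        out.append('<condition>')          # dummy Condition
        l1 = new_label()
        l2 = l1                            # ELSE-less case: one label
        out.append('BEQ ' + l1)
        block()
        if look() == 'l':                  # optional ELSE clause
            match('l')
            l2 = new_label()
            out.append('BRA ' + l2)
            out.append(l1 + ':')
            block()
        match('e')
        out.append(l2 + ':')

    def block():
        nonlocal pos
        while look() not in ('', 'e', 'l'):
            if look() == 'i':
                do_if()
            else:
                out.append(look())         # "other" statement: echo it
                pos += 1

    block()
    return out
```

Running it on the two test strings from the text shows the branch and the labels landing in the right places.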
It should be:

L1:     <condition>
        BEQ L2
        <block>
        BRA L1
L2:

As before, comparing the two representations gives us the actions
needed at each point.

    WHILE          { L1 = NewLabel;
                     PostLabel(L1) }
    <condition>    { Emit(BEQ L2) }
    <block>
    ENDWHILE       { Emit(BRA L1);
                     PostLabel(L2) }

The code follows immediately from the syntax:

{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);
   Block;
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}

Since we've got a new statement, we have to add a call to it
within procedure Block:

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       'w': DoWhile;
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}

No other changes are necessary.

OK, try the new program.  Note that this time, the <condition>
code is INSIDE the upper label, which is just where we wanted it.
Try some nested loops.  Try some loops within IF's, and some IF's
within loops.  If you get a bit confused as to what you should
type, don't be discouraged: you write bugs in other languages,
too, don't you?  It'll look a lot more meaningful when we get
full keywords.

I hope by now that you're beginning to get the idea that this
really IS easy.  All we have to do to accommodate a new construct
is to work out the syntax-directed translation of it.  The code
almost falls out from there, and it doesn't affect any of the
other routines.  Once you've gotten the feel of the thing, you'll
see that you can add new constructs about as fast as you can
dream them up.


THE LOOP STATEMENT

We could stop right here, and have a language that works.
It's been shown many times that a high-order language with only
two constructs, the IF and the WHILE, is sufficient to write
structured code.  But we're on a roll now, so let's richen up the
repertoire a bit.

This construct is even easier, since it has no condition test at
all ... it's an infinite loop.  What's the point of such a loop?
Not much, by itself, but later on we're going to add a BREAK
command, that will give us a way out.  This makes the language
considerably richer than Pascal, which has no break, and also
avoids the funny WHILE(1) or WHILE TRUE of C and Pascal.

The syntax is simply

    LOOP <block> ENDLOOP

and the syntax-directed translation is:

    LOOP           { L = NewLabel;
                     PostLabel(L) }
    <block>
    ENDLOOP        { Emit(BRA L) }

The corresponding code is shown below.  Since I've already used
'l' for the ELSE, I've used the last letter, 'p', as the
"keyword" this time.

{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L: string;
begin
   Match('p');
   L := NewLabel;
   PostLabel(L);
   Block;
   Match('e');
   EmitLn('BRA ' + L);
end;
{--------------------------------------------------------------}

When you insert this routine, don't forget to add a line in
Block to call it.


REPEAT-UNTIL

Here's one construct that I lifted right from Pascal.  The syntax
is

    REPEAT <block> UNTIL <condition>,

and the syntax-directed translation is:

    REPEAT         { L = NewLabel;
                     PostLabel(L) }
    <block>
    UNTIL
    <condition>    { Emit(BEQ L) }

As usual, the code falls out pretty easily:

{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;
var L: string;
begin
   Match('r');
   L := NewLabel;
   PostLabel(L);
   Block;
   Match('u');
   Condition;
   EmitLn('BEQ ' + L);
end;
{--------------------------------------------------------------}

As before, we have to add the call to DoRepeat within Block.
This time, there's a difference, though.  I decided to use 'r'
for REPEAT (naturally), but I also decided to use 'u' for UNTIL.
This means that the 'u' must be added to the set of characters in
the while-test.  These are the characters that signal an exit
from the current block ... the "follow" characters, in compiler
jargon.

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf;
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}


THE FOR LOOP

The FOR loop is a very handy one to have around, but it's a bear
to translate.  That's not so much because the construct itself is
hard ... it's only a loop after all ... but simply because it's
hard to implement in assembler language.  Once the code is
figured out, the translation is straightforward enough.

C fans love the FOR-loop of that language (and, in fact, it's
easier to code), but I've chosen instead a syntax very much like
the one from good ol' BASIC:

    FOR <ident> = <expr1> TO <expr2> <block> ENDFOR

The translation of a FOR loop can be just about as difficult as
you choose to make it, depending upon the way you decide to
define the rules as to how to handle the limits.  Does <expr2>
get evaluated every time through the loop, for example, or is it
treated as a constant limit?  Do you always go through the loop
at least once, as in FORTRAN, or not?  It gets simpler if you
adopt the point of view that the construct is equivalent to:

    <ident> = <expr1>
    TEMP = <expr2>
    WHILE <ident> <= TEMP
    <block>
    ENDWHILE

Notice that with this definition of the loop, <block> will not be
executed at all if <expr1> is initially larger than <expr2>.

The 68000 code needed to do this is trickier than anything we've
done so far.  I had a couple of tries at it, putting both the
counter and the upper limit on the stack, both in registers, etc.
I finally arrived at a hybrid arrangement, in which the loop
counter is in memory (so that it can be accessed within the
loop), and the upper limit is on the stack.
The translated code came out like this:

    <ident>              get name of loop counter
    <expr1>              get initial value
    LEA <ident>(PC),A0   address the loop counter
    SUBQ #1,D0           predecrement it
    MOVE D0,(A0)         save it
    <expr2>              get upper limit
    MOVE D0,-(SP)        save it on stack

L1: LEA <ident>(PC),A0   address loop counter
    MOVE (A0),D0         fetch it to D0
    ADDQ #1,D0           bump the counter
    MOVE D0,(A0)         save new value
    CMP (SP),D0          check for range
    BGT L2               skip out if D0 > (SP)
    <block>
    BRA L1               loop for next pass
L2: ADDQ #2,SP           clean up the stack

Wow!  That seems like a lot of code ... the line containing
<block> seems to almost get lost.  But that's the best I could do
with it.  I guess it helps to keep in mind that it's really only
sixteen words, after all.  If anyone else can optimize this
better, please let me know.

Still, the parser routine is pretty easy now that we have the
code:

{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }

procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block;
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;
{--------------------------------------------------------------}

Since we don't have expressions in this parser, I used the same
trick as for Condition, and wrote the routine

{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ This version is a dummy }

procedure Expression;
begin
   EmitLn('<expr>');
end;
{--------------------------------------------------------------}

Give it a try.  Once again, don't forget to add the call in
Block.
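The net effect of the 68000 protocol above can be sanity-checked with a small simulation.  This Python model (hypothetical, for illustration only) mirrors the predecrement / bump / compare sequence and reports which counter values the loop body would see:

```python
def for_loop(init, limit):
    """Return the counter values the FOR body sees, per the protocol above."""
    counter = init - 1           # SUBQ #1,D0 : predecrement the counter
    stack_top = limit            # MOVE D0,-(SP) : limit parked on the stack
    seen = []
    while True:                  # L1:
        counter += 1             # ADDQ #1,D0 : bump the counter
        if counter > stack_top:  # CMP (SP),D0 / BGT L2
            break                # L2: (stack cleanup elided here)
        seen.append(counter)     # <block> runs with the counter in memory
    return seen

print(for_loop(1, 3))   # [1, 2, 3]
print(for_loop(5, 2))   # []  ... body skipped when init > limit
```

Note how the simulation confirms the stated semantics: the body is never entered when the initial value already exceeds the limit, just like the WHILE-based definition.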
Since we don't have any input for the dummy version of
Expression, a typical input line would look something like

    afi=bece

Well, it DOES generate a lot of code, doesn't it?  But at least
it's the RIGHT code.


THE DO STATEMENT

All this made me wish for a simpler version of the FOR loop.  The
reason for all the code above is the need to have the loop
counter accessible as a variable within the loop.  If all we need
is a counting loop to make us go through something a specified
number of times, but don't need access to the counter itself,
there is a much easier solution.  The 68000 has a "decrement and
branch nonzero" instruction built in which is ideal for counting.
For good measure, let's add this construct, too.  This will be
the last of our loop structures.

The syntax and its translation is:

    DO
    <expr>         { Emit(SUBQ #1,D0);
                     L = NewLabel;
                     PostLabel(L);
                     Emit(MOVE D0,-(SP)) }
    <block>
    ENDDO          { Emit(MOVE (SP)+,D0);
                     Emit(DBRA D0,L) }

That's quite a bit simpler!  The loop will execute <expr> times.
Here's the code:

{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }

procedure Dodo;
var L: string;
begin
   Match('d');
   L := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L);
   EmitLn('MOVE D0,-(SP)');
   Block;
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L);
end;
{--------------------------------------------------------------}

I think you'll have to agree, that's a whole lot simpler than the
classical FOR.  Still, each construct has its place.


THE BREAK STATEMENT

Earlier I promised you a BREAK statement to accompany LOOP.  This
is one I'm sort of proud of.  On the face of it a BREAK seems
really tricky.  My first approach was to just use it as an extra
terminator to Block, and split all the loops into two parts, just
as I did with the ELSE half of an IF.  That turns out not to
work, though, because the BREAK statement is almost certainly not
going to show up at the same level as the loop itself.
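The DBRA instruction decrements the counter and loops back until it reaches -1; combined with the initial SUBQ #1, the body runs exactly <expr> times (for <expr> >= 1).  A quick Python model of that countdown, offered only as a sketch of the counting behavior, not as the tutorial's code:

```python
def do_loop(n):
    """Return how many times the DO body executes for <expr> = n (n >= 1)."""
    d0 = n - 1       # SUBQ #1,D0 : preload count minus one
    passes = 0
    while True:      # L:
        # MOVE D0,-(SP) ... <block> ... MOVE (SP)+,D0
        passes += 1  # one trip through the body
        d0 -= 1      # DBRA D0,L decrements, then falls
        if d0 == -1: #   through once D0 reaches -1
            break
    return passes

print(do_loop(3))   # 3
```

(On real hardware DBRA works on a 16-bit word, so an <expr> of zero would wrap around and loop 65536 times; the sketch assumes a count of at least one.)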
The most likely place for a BREAK is right after an IF, which
would cause it to exit the IF construct, not the enclosing loop.
WRONG.  The BREAK has to exit the inner LOOP, even if it's nested
down into several levels of IFs.

My next thought was that I would just store away, in some global
variable, the ending label of the innermost loop.  That doesn't
work either, because there may be a break from an inner loop
followed by a break from an outer one.  Storing the label for the
inner loop would clobber the label for the outer one.  So the
global variable turned into a stack.  Things were starting to get
messy.

Then I decided to take my own advice.  Remember in the last
session when I pointed out how well the implicit stack of a
recursive descent parser was serving our needs?  I said that if
you begin to see the need for an external stack you might be
doing something wrong.  Well, I was.  It is indeed possible to
let the recursion built into our parser take care of everything,
and the solution is so simple that it's surprising.

The secret is to note that every BREAK statement has to occur
within a block ... there's no place else for it to be.  So all we
have to do is to pass into Block the exit address of the
innermost loop.  Then it can pass the address to the routine that
translates the break instruction.  Since an IF statement doesn't
change the loop level, procedure DoIf doesn't need to do anything
except pass the label into ITS blocks (both of them).  Since
loops DO change the level, each loop construct simply ignores
whatever label is above it and passes its own exit label along.

All this is easier to show you than it is to describe.
I'll demonstrate with the easiest loop, which is LOOP:

{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}

Notice that DoLoop now has TWO labels, not just one.  The second
is to give the BREAK instruction a target to jump to.  If there
is no BREAK within the loop, we've wasted a label and cluttered
up things a bit, but there's no harm done.

Note also that Block now has a parameter, which for loops will
always be the exit address.  The new version of Block is:

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}

Again, notice that all Block does with the label is to pass it
into DoIf and DoBreak.  The loop constructs don't need it,
because they are going to pass their own label anyway.

The new version of DoIf is:

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;


procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;
{--------------------------------------------------------------}

Here, the only thing that changes is the addition of the
parameter to procedure Block.  An IF statement doesn't change the
loop nesting level, so DoIf just passes the label along.
No matter how many levels of IF nesting we have, the same label
will be used.

Now, remember that DoProgram also calls Block, so it now needs to
pass it a label.  An attempt to exit the outermost block is an
error, so DoProgram passes a null label which is caught by
DoBreak:

{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);
begin
   Match('b');
   if L <> '' then
      EmitLn('BRA ' + L)
   else Abort('No loop to break from');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;
{--------------------------------------------------------------}

That ALMOST takes care of everything.  Give it a try, see if you
can "break" it.  Careful, though.  By this time we've used so
many letters, it's hard to think of characters that aren't now
representing reserved words.  Remember: before you try the
program, you're going to have to edit every occurrence of Block
in the other loop constructs to include the new parameter.  Do it
just like I did for LOOP.

I said ALMOST above.  There is one slight problem: if you take a
hard look at the code generated for DO, you'll see that if you
break out of this loop, the value of the loop counter is still
left on the stack.  We're going to have to fix that!  A shame ...
that was one of our smaller routines, but it can't be helped.
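Here is a hypothetical Python miniature of the label-passing trick, covering just LOOP, an ELSE-less IF (the ELSE arm is omitted for brevity), and BREAK.  Note how the loop substitutes its OWN exit label while IF passes the incoming one through, so a BREAK buried inside nested IFs still branches to the innermost loop's exit:

```python
class Parser:
    def __init__(self, src):
        self.src, self.pos, self.out, self.lcount = src, 0, [], 0

    def look(self):
        return self.src[self.pos] if self.pos < len(self.src) else ''

    def match(self, c):
        assert self.look() == c, f'"{c}" expected'
        self.pos += 1

    def new_label(self):
        self.lcount += 1
        return f'L{self.lcount - 1}'

    def emit(self, s):
        self.out.append(s)

    def do_loop(self):
        self.match('p')
        l1, l2 = self.new_label(), self.new_label()
        self.emit(l1 + ':')
        self.block(l2)             # loops pass their OWN exit label
        self.match('e')
        self.emit('BRA ' + l1)
        self.emit(l2 + ':')

    def do_if(self, exit_label):
        self.match('i')
        self.emit('<condition>')
        l1 = self.new_label()
        self.emit('BEQ ' + l1)
        self.block(exit_label)     # IF just passes the label through
        self.match('e')
        self.emit(l1 + ':')

    def do_break(self, exit_label):
        self.match('b')
        assert exit_label, 'No loop to break from'
        self.emit('BRA ' + exit_label)

    def block(self, exit_label):
        while self.look() not in ('e', ''):
            c = self.look()
            if c == 'p':   self.do_loop()
            elif c == 'i': self.do_if(exit_label)
            elif c == 'b': self.do_break(exit_label)
            else:
                self.emit(c.upper())
                self.pos += 1

p = Parser('pibece')   # LOOP IF <cond> BREAK ENDIF <other> ENDLOOP
p.block('')
print('\n'.join(p.out))
```

The BREAK inside the IF emits BRA L1, the loop's exit label, even though it sits one block level down; a BREAK at the program level trips the "No loop to break from" check instead.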
Here's a version that doesn't have the problem:

{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }

procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;
{--------------------------------------------------------------}

The two extra instructions, the SUBQ and ADDQ, take care of
leaving the stack in the right shape.


CONCLUSION

At this point we have created a number of control constructs ...
a richer set, really, than that provided by almost any other
programming language.  And, except for the FOR loop, it was
pretty easy to do.  Even that one was tricky only because it's
tricky in assembler language.

I'll conclude this session here.  To wrap the thing up with a red
ribbon, we really should have a go at having real keywords
instead of these mickey-mouse single-character things.  You've
already seen that the extension to multi-character words is not
difficult, but in this case it will make a big difference in the
appearance of our input code.  I'll save that little bit for the
next installment.  In that installment we'll also address Boolean
expressions, so we can get rid of the dummy version of Condition
that we've used here.  See you then.
For reference purposes, here is the completed parser for this
session:

{--------------------------------------------------------------}
program Branch;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    Lcount: integer;           { Label Counter }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }

procedure Condition;
begin
   EmitLn('<condition>');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('<expr>');
end;

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;


procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;
var L1, L2: string;
begin
   Match('r');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('u');
   Condition;
   EmitLn('BEQ ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }

procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;

{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }

procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);
begin
   Match('b');
   if L <> '' then
      EmitLn('BRA ' + L)
   else Abort('No loop to break from');
end;

{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }

procedure Other;
begin
   EmitLn(GetName);
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DoProgram;
end.
{--------------------------------------------------------------}


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************



                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                          31 August 1988


                  Part VI: BOOLEAN EXPRESSIONS


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

In Part V of this series, we took a look at control constructs,
and developed parsing routines to translate them into object
code.  We ended up with a nice, relatively rich set of
constructs.

As we left the parser, though, there was one big hole in our
capabilities: we did not address the issue of the branch
condition.  To fill the void, I introduced to you a dummy parse
routine called Condition, which only served as a place-keeper for
the real thing.
One of the things we'll do in this session is to plug that hole
by expanding Condition into a true parser/translator.


THE PLAN

We're going to approach this installment a bit differently than
any of the others.  In those other installments, we started out
immediately with experiments using the Pascal compiler, building
up the parsers from very rudimentary beginnings to their final
forms, without spending much time in planning beforehand.  That's
called coding without specs, and it's usually frowned upon.  We
could get away with it before because the rules of arithmetic are
pretty well established ... we know what a '+' sign is supposed
to mean without having to discuss it at length.  The same is true
for branches and loops.  But the ways in which programming
languages implement logic vary quite a bit from language to
language.  So before we begin serious coding, we'd better first
make up our minds what it is we want.  And the way to do that is
at the level of the BNF syntax rules (the GRAMMAR).


THE GRAMMAR

For some time now, we've been implementing BNF syntax equations
for arithmetic expressions, without ever actually writing them
down all in one place.  It's time that we did so.  They are:

    <expression> ::= <term> [<addop> <term>]*
    <term>       ::= <factor> [<mulop> <factor>]*
    <factor>     ::= <integer> | <variable> | ( <expression> )

(Remember, the nice thing about this grammar is that it enforces
the operator precedence hierarchy that we normally expect for
algebra.)

Actually, while we're on the subject, I'd like to amend this
grammar a bit right now.  The way we've handled the unary minus
is a bit awkward.  I've found that it's better to write the
grammar this way:

    <expression>    ::= <term> [<addop> <term>]*
    <term>          ::= <signed factor> [<mulop> <factor>]*
    <signed factor> ::= [<addop>] <factor>
    <factor>        ::= <integer> | <variable> | (<expression>)

This puts the job of handling the unary minus onto Factor, which
is where it really belongs.  This doesn't mean that you have to
go back and recode the programs you've already written, although
you're free to do so if you like.  But I will be using the new
syntax from now on.
Now, it probably won't come as a shock to you to learn that we
can define an analogous grammar for Boolean algebra.  A typical
set of rules is:

    <b-expression> ::= <b-term> [<orop> <b-term>]*
    <b-term>       ::= <not-factor> [AND <not-factor>]*
    <not-factor>   ::= [NOT] <b-factor>
    <b-factor>     ::= <b-literal> | <b-variable> | (<b-expression>)

Notice that in this grammar, the operator AND is analogous to
'*', and OR (and exclusive OR) to '+'.  The NOT operator is
analogous to a unary minus.  This hierarchy is not absolutely
standard ... some languages, notably Ada, treat all logical
operators as having the same precedence level ... but it seems
natural.

Notice also the slight difference between the way the NOT and the
unary minus are handled.  In algebra, the unary minus is
considered to go with the whole term, and so never appears but
once in a given term.  So an expression like

    a * -b

or worse yet,

    a - -b

is not allowed.  In Boolean algebra, though, the expression

    a AND NOT b

makes perfect sense, and the syntax shown allows for that.


RELOPS

OK, assuming that you're willing to accept the grammar I've shown
here, we now have syntax rules for both arithmetic and Boolean
algebra.  The sticky part comes in when we have to combine the
two.  Why do we have to do that?  Well, the whole subject came up
because of the need to process the "predicates" (conditions)
associated with control statements such as the IF.  The predicate
is required to have a Boolean value; that is, it must evaluate to
either TRUE or FALSE.  The branch is then taken or not taken,
depending on that value.  What we expect to see going on in
procedure Condition, then, is the evaluation of a Boolean
expression.

But there's more to it than that.  A pure Boolean expression can
indeed be the predicate of a control statement ... things like

    IF a AND NOT b THEN ....

But more often, we see Boolean algebra show up in such things as

    IF (x >= 0) and (x <= 100) THEN ...

Here, the two terms in parens are Boolean expressions, but the
individual terms being compared: x, 0, and 100, are NUMERIC in
nature.
The RELATIONAL OPERATORS >= and <= are the catalysts by which the
Boolean and the arithmetic ingredients get merged together.

Now, in the example above, the terms being compared are just
that: terms.  However, in general each side can be a math
expression.  So we can define a RELATION to be:

    <relation> ::= <expression> <relop> <expression>,

where the expressions we're talking about here are the old
numeric type, and the relops are any of the usual symbols

    =, <> (or !=), <, >, <=, and >=

If you think about it a bit, you'll agree that, since this kind
of predicate has a single Boolean value, TRUE or FALSE, as its
result, it is really just another kind of factor.  So we can
expand the definition of a Boolean factor above to read:

    <b-factor> ::= <b-literal>
                 | <b-variable>
                 | (<b-expression>)
                 | <relation>

THAT's the connection!  The relops and the relation they define
serve to wed the two kinds of algebra.  It is worth noting that
this implies a hierarchy where the arithmetic expression has a
HIGHER precedence than a Boolean factor, and therefore than all
the Boolean operators.  If you write out the precedence levels
for all the operators, you arrive at the following list:

    Level   Syntax Element     Operator
      0     factor             literal, variable
      1     signed factor      unary minus
      2     term               *, /
      3     expression         +, -
      4     b-factor           literal, variable, relop
      5     not-factor         NOT
      6     b-term             AND
      7     b-expression       OR, XOR

If we're willing to accept that many precedence levels, this
grammar seems reasonable.  Unfortunately, it won't work!  The
grammar may be great in theory, but it's no good at all in the
practice of a top-down parser.  To see the problem, consider the
code fragment:

    IF ((((((A + B + C) < 0 ) AND ....

When the parser is parsing this code, it knows after it sees the
IF token that a Boolean expression is supposed to be next.  So it
can set up to begin evaluating such an expression.  But the first
expression in the example is an ARITHMETIC expression, A + B + C.
What's worse, at the point that the parser has read this much of
the input line:

    IF ((((((A ,

it still has no way of knowing which kind of expression it's
dealing with.  That won't do, because we must have different
recognizers for the two cases.  The situation can be handled
without changing any of our definitions, but only if we're
willing to accept an arbitrary amount of backtracking to work our
way out of bad guesses.  No compiler writer in his right mind
would agree to that.

What's going on here is that the beauty and elegance of BNF
grammar has met face to face with the realities of compiler
technology.  To deal with this situation, compiler writers have
had to make compromises so that a single parser can handle the
grammar without backtracking.


FIXING THE GRAMMAR

The problem that we've encountered comes up because our
definitions of both arithmetic and Boolean factors permit the use
of parenthesized expressions.  Since the definitions are
recursive, we can end up with any number of levels of
parentheses, and the parser can't know which kind of expression
it's dealing with.

The solution is simple, although it ends up causing profound
changes to our grammar.  We can only allow parentheses in one
kind of factor.  The way to do that varies considerably from
language to language.  This is one place where there is NO
agreement or convention to help us.

When Niklaus Wirth designed Pascal, the desire was to limit the
number of levels of precedence (fewer parse routines, after all).
So the OR and exclusive OR operators are treated just like an
Addop and processed at the level of a math expression.
Similarly, the AND is treated like a Mulop and processed with
Term.  The precedence levels are

    Level   Syntax Element     Operator
      0     factor             literal, variable
      1     signed factor      unary minus, NOT
      2     term               *, /, AND
      3     expression         +, -, OR

Notice that there is only ONE set of syntax rules, applying to
both kinds of operators.
According to this grammar, then, expressions like

    x + (y AND NOT z) DIV 3

are perfectly legal.  And, in fact, they ARE ... as far as the
parser is concerned.  Pascal doesn't allow the mixing of
arithmetic and Boolean variables, and things like this are caught
at the SEMANTIC level, when it comes time to generate code for
them, rather than at the syntax level.

The authors of C took a diametrically opposite approach: they
treat the operators as different, and have something much more
akin to our seven levels of precedence.  In fact, in C there are
no fewer than 17 levels!  That's because C also has the operators
'=', '+=' and its kin, '<<', '>>', '++', '--', etc.

Ironically, although in C the arithmetic and Boolean operators
are treated separately, the variables are NOT ... there are no
Boolean or logical variables in C, so a Boolean test can be made
on any integer value.

We'll do something that's sort of in-between.  I'm tempted to
stick mostly with the Pascal approach, since that seems the
simplest from an implementation point of view, but it results in
some funnies that I never liked very much, such as the fact that,
in the expression

    IF (c >= 'A') and (c <= 'Z') then ...

the parens above are REQUIRED.  I never understood why before,
and neither my compiler nor any human ever explained it very
well, either.  But now, we can all see that the 'and' operator,
having the precedence of a multiply, has a higher one than the
relational operators, so without the parens the expression is
equivalent to

    IF c >= ('A' and c) <= 'Z' then

which doesn't make sense.

In any case, I've elected to separate the operators into
different levels, although not as many as in C.

    <b-expression>  ::= <b-term> [<orop> <b-term>]*
    <b-term>        ::= <not-factor> [AND <not-factor>]*
    <not-factor>    ::= [NOT] <b-factor>
    <b-factor>      ::= <b-literal> | <b-variable> | <relation>
    <relation>      ::= <expression> [<relop> <expression>]
    <expression>    ::= <term> [<addop> <term>]*
    <term>          ::= <signed factor> [<mulop> <factor>]*
    <signed factor> ::= [<addop>] <factor>
    <factor>        ::= <integer> | <variable> | (<expression>)

This grammar results in the same set of seven levels that I
showed earlier.  Really, it's almost the same grammar ...
I just removed the option of parenthesized b-expressions as a possible b-factor, and added the relation as a legal form of b-factor. There is one subtle but crucial difference, which is what makes the whole thing work. Notice the square brackets in the definition of a relation. This means that the relop and the second expression are OPTIONAL.

A strange consequence of this grammar (and one shared by C) is that EVERY expression is potentially a Boolean expression. The parser will always be looking for a Boolean expression, but will "settle" for an arithmetic one. To be honest, that's going to slow down the parser, because it has to wade through more layers of procedure calls. That's one reason why Pascal compilers tend to compile faster than C compilers. If it's raw speed you want, stick with the Pascal syntax.

THE PARSER

Now that we've gotten through the decision-making process, we can press on with development of a parser. You've done this with me several times now, so you know the drill: we begin with a fresh copy of the cradle, and begin adding procedures one by one. So let's do it.

We begin, as we did in the arithmetic case, by dealing only with Boolean literals rather than variables. This gives us a new kind of input token, so we're also going to need a new recognizer, and a new procedure to read instances of that token type. Let's start by defining the two new procedures:

{--------------------------------------------------------------}
{ Recognize a Boolean Literal }

function IsBoolean(c: char): Boolean;
begin
   IsBoolean := UpCase(c) in ['T', 'F'];
end;

{--------------------------------------------------------------}
{ Get a Boolean Literal }

function GetBoolean: Boolean;
begin
   if not IsBoolean(Look) then Expected('Boolean Literal');
   GetBoolean := UpCase(Look) = 'T';
   GetChar;
end;
{--------------------------------------------------------------}

Type these routines into your program.
You can test them by adding into the main program the print statement

     WriteLn(GetBoolean);

OK, compile the program and test it. As usual, it's not very impressive so far, but it soon will be.

Now, when we were dealing with numeric data we had to arrange to generate code to load the values into D0. We need to do the same for Boolean data. The usual way to encode Boolean variables is to let 0 stand for FALSE, and some other value for TRUE. Many languages, such as C, use an integer 1 to represent it. But I prefer FFFF hex (or -1), because a bitwise NOT also becomes a Boolean NOT. So now we need to emit the right assembler code to load those values. The first cut at the Boolean expression parser (BoolExpression, of course) is:

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   if not IsBoolean(Look) then Expected('Boolean Literal');
   if GetBoolean then
      EmitLn('MOVE #-1,D0')
   else
      EmitLn('CLR D0');
end;
{---------------------------------------------------------------}

Add this procedure to your parser, and call it from the main program (replacing the print statement you had just put there). As you can see, we still don't have much of a parser, but the output code is starting to look more realistic.

Next, of course, we have to expand the definition of a Boolean expression. We already have the BNF rule:

     <b-expression> ::= <b-term> [<orop> <b-term>]*

I prefer the Pascal versions of the "orops", OR and XOR. But since we are keeping to single-character tokens here, I'll encode those with '|' and '~'.
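The payoff of the -1 encoding is easy to demonstrate in a few lines of modern code (a Python sketch of the idea, not part of the tutorial's Pascal): with all bits set as TRUE, the plain bitwise operators double as the Boolean ones.

```python
# TRUE as -1 (all bits set) and FALSE as 0: the bitwise NOT, AND, and OR
# then behave exactly like their Boolean counterparts.
TRUE, FALSE = -1, 0

def b_not(x): return ~x       # what EOR #-1,D0 does on the 68000
def b_and(x, y): return x & y
def b_or(x, y): return x | y

# Had we chosen TRUE = 1, bitwise NOT would break: ~1 is -2, not 0.
```

That last comment is the whole argument for FFFF hex: with TRUE = 1, a compiler would need extra instructions to turn a bitwise complement back into a Boolean.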
The next version of BoolExpression is almost a direct copy of the arithmetic procedure Expression:

{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
   Match('|');
   BoolTerm;
   EmitLn('OR (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
   Match('~');
   BoolTerm;
   EmitLn('EOR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;
{---------------------------------------------------------------}

Note the new recognizer IsOrop, which is also a copy, this time of IsAddop:

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): Boolean;
begin
   IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}

OK, rename the old version of BoolExpression to BoolTerm, then enter the code above. Compile and test this version. At this point, the output code is starting to look pretty good. Of course, it doesn't make much sense to do a lot of Boolean algebra on constant values, but we'll soon be expanding the types of Booleans we deal with.

You've probably already guessed what the next step is: the Boolean version of Term. Rename the current procedure BoolTerm to NotFactor, and enter the following new version of BoolTerm. Note that it is much simpler than the numeric version, since there is no equivalent of division.
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      EmitLn('MOVE D0,-(SP)');
      Match('&');
      NotFactor;
      EmitLn('AND (SP)+,D0');
   end;
end;
{--------------------------------------------------------------}

Now, we're almost home. We are translating complex Boolean expressions, although only for constant values. The next step is to allow for the NOT. Write the following procedure:

{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with NOT }

procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      BoolFactor;
      EmitLn('EOR #-1,D0');
   end
   else
      BoolFactor;
end;
{--------------------------------------------------------------}

And rename the earlier procedure to BoolFactor. Now try that. At this point the parser should be able to handle any Boolean expression you care to throw at it. Does it? Does it trap badly formed expressions?

If you've been following what we did in the parser for math expressions, you know that what we did next was to expand the definition of a factor to include variables and parens. We don't have to do that for the Boolean factor, because those little items get taken care of by the next step. It takes just a one-line addition to BoolFactor to take care of relations:

{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor }

procedure BoolFactor;
begin
   if IsBoolean(Look) then
      if GetBoolean then
         EmitLn('MOVE #-1,D0')
      else
         EmitLn('CLR D0')
   else Relation;
end;
{--------------------------------------------------------------}

You might be wondering when I'm going to provide for Boolean variables and parenthesized Boolean expressions. The answer is, I'm NOT! Remember, we took those out of the grammar earlier. Right now all I'm doing is encoding the grammar we've already agreed upon.
The compiler itself can't tell the difference between a Boolean variable or expression and an arithmetic one ... all of those will be handled by Relation, either way.

Of course, it would help to have some code for Relation. I don't feel comfortable, though, adding any more code without first checking out what we already have. So for now let's just write a dummy version of Relation that does nothing except eat the current character, and write a little message:

{---------------------------------------------------------------}
{ Parse and Translate a Relation }

procedure Relation;
begin
   WriteLn('<Relation>');
   GetChar;
end;
{--------------------------------------------------------------}

OK, key in this code and give it a try. All the old things should still work ... you should be able to generate the code for ANDs, ORs, and NOTs. In addition, if you type any alphabetic character you should get a little <Relation> place-holder, where a Boolean factor should be. Did you get that?

Fine, then let's move on to the full-blown version of Relation. To get that, though, there is a bit of groundwork that we must lay first. Recall that a relation has the form

     <relation> ::= <expression> [<relop> <expression>]

Since we have a new kind of operator, we're also going to need a new recognizer for it. Because of the single-character limitation, I'm sticking to the four relops that can be encoded with one character ("not equals" is encoded by '#'):

{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): Boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}

Now, recall that we're using a zero or a -1 in register D0 to represent a Boolean value, and also that the loop constructs expect the flags to be set to correspond. In implementing all this on the 68000, things get a little bit tricky. Since the loop constructs operate only on the flags, it would be nice (and also quite efficient) just to set up those flags, and not load anything into D0 at all. This would be fine for the loops and branches, but remember that the relation can be used ANYWHERE a Boolean factor could be used. We may be storing its result to a Boolean variable. Since we can't know at this point how the result is going to be used, we must allow for BOTH cases.

Comparing numeric data is easy enough ... the 68000 has an operation for that ...
but it sets the flags, not a value. What's more, the flags will always be set the same way (zero if equal, etc.), while we need the zero flag set differently for each of the different relops.

The solution is found in the 68000 instruction Scc, which sets a byte value to 00 or FF hex (funny how that works!) depending upon the result of the specified condition. If we make the destination byte to be D0, we get the Boolean value needed.

Unfortunately, there's one final complication: unlike almost every other instruction in the 68000 set, Scc does NOT reset the condition flags to match the data being stored. So we have to do one last step, which is to test D0 and set the flags to match it. It must seem to be a trip around the moon to get what we want: we first perform the test, then test the flags to set data into D0, then test D0 to set the flags again. It is sort of roundabout, but it's the most straightforward way to get the flags right, and after all it's only a couple of instructions.

I might mention here that this area is, in my opinion, the one that represents the biggest difference between the efficiency of hand-coded assembler language and compiler-generated code. We have seen already that we lose efficiency in arithmetic operations, although later I plan to show you how to improve that a bit. We've also seen that the control constructs themselves can be done quite efficiently ... it's usually very difficult to improve on the code generated for an IF or a WHILE. But virtually every compiler I've ever seen generates terrible code, compared to assembler, for the computation of a Boolean function, and particularly for relations. The reason is just what I've hinted at above. When I'm writing code in assembler, I go ahead and perform the test the most convenient way I can, and then set up the branch so that it goes the way it should. In effect, I "tailor" every branch to the situation.
The compiler can't do that (practically), and it also can't know that we don't want to store the result of the test as a Boolean variable. So it must generate the code in a very strict order, and it often ends up loading the result as a Boolean that never gets used for anything.

In any case, we're now ready to look at the code for Relation. It's shown below with its companion procedures:

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }

procedure Equals;
begin
   Match('=');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SEQ D0');
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEquals;
begin
   Match('#');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SNE D0');
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Match('<');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SGT D0');    { CMP leaves second-minus-first in the flags,
                          so "greater" here means first < second }
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Match('>');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SLT D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Relation }

procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '=': Equals;
       '#': NotEquals;
       '<': Less;
       '>': Greater;
      end;
      EmitLn('TST D0');
   end;
end;
{---------------------------------------------------------------}

Now, that call to Expression looks familiar! Here is where the editor of your system comes in handy. We have already generated code for Expression and its buddies in previous sessions. You can copy them into your file now. Remember to use the single-character versions. Just to be certain, I've duplicated the arithmetic procedures below.
If you're observant, you'll also see that I've changed them a little to make them correspond to the latest version of the syntax. This change is NOT necessary, so you may prefer to hold off on that until you're sure everything is working.

{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
   end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
   end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }

procedure SignedFactor;
begin
   if Look = '+' then
      GetChar;
   if Look = '-' then begin
      GetChar;
      if IsDigit(Look) then
         EmitLn('MOVE #-' + GetNum + ',D0')
      else begin
         Factor;
         EmitLn('NEG D0');
      end;
   end
   else Factor;
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');   { first operand (the dividend) to D1 }
   EmitLn('EXG D1,D0');       { dividend to D0, divisor to D1 }
   EmitLn('EXT.L D0');        { sign-extend the dividend }
   EmitLn('DIVS D1,D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   SignedFactor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;

{---------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;
{---------------------------------------------------------------}

There you have it ... a parser that can handle both arithmetic AND Boolean algebra, and things that combine the two through the use of relops. I suggest you file away a copy of this parser in a safe place for future reference, because in our next step we're going to be chopping it up.

MERGING WITH CONTROL CONSTRUCTS

At this point, let's go back to the file we had previously built that parses control constructs. Remember those little dummy procedures called Condition and Expression? Now you know what goes in their places!

I warn you, you're going to have to do some creative editing here, so take your time and get it right. What you need to do is to copy all of the procedures from the logic parser, from Ident through BoolExpression, into the parser for control constructs. Insert them at the current location of Condition. Then delete that procedure, as well as the dummy Expression. Next, change every call to Condition to refer to BoolExpression instead. Finally, copy the procedures IsMulop, IsOrop, IsRelop, IsBoolean, and GetBoolean into place. That should do it.

Compile the resulting program and give it a try. Since we haven't used this program in a while, don't forget that we used single-character tokens for IF, WHILE, etc. Also don't forget that any letter not a keyword just gets echoed as a block. Try

     ia=bxlye

which stands for "IF a=b X ELSE Y ENDIF". What do you think? Did it work? Try some others.
ADDING ASSIGNMENTS

As long as we're this far, and we already have the routines for expressions in place, we might as well replace the "blocks" with real assignment statements. We've already done that before, so it won't be too hard. Before taking that step, though, we need to fix something else.

We're soon going to find that the one-line "programs" that we're having to write here will really cramp our style. At the moment we have no cure for that, because our parser doesn't recognize the end-of-line characters, the carriage return (CR) and the line feed (LF). So before going any further let's plug that hole.

There are a couple of ways to deal with the CR/LFs. One (the C/Unix approach) is just to treat them as additional white space characters and ignore them. That's actually not such a bad approach, but it does sort of produce funny results for our parser as it stands now. If it were reading its input from a source file as any self-respecting REAL compiler does, there would be no problem. But we're reading input from the keyboard, and we're sort of conditioned to expect something to happen when we hit the return key. It won't, if we just skip over the CR and LF (try it). So I'm going to use a different method here, which is NOT necessarily the best approach in the long run. Consider it a temporary kludge until we're further along. Instead of skipping the CR/LF, we'll let the parser go ahead and catch them, then introduce a special procedure, analogous to SkipWhite, that skips them only in specified "legal" spots.
Here's the procedure:

{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
end;
{--------------------------------------------------------------}

Now, add two calls to Fin in procedure Block, like this:

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      Fin;
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
      else Other;
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}

Now, you'll find that you can use multiple-line "programs." The only restriction is that you can't separate an IF or WHILE token from its predicate.

Now we're ready to include the assignment statements. Simply change that call to Other in procedure Block to a call to Assignment, and add the following procedure, copied from one of our earlier programs. Note that Assignment now calls BoolExpression, so that we can assign Boolean variables.

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   BoolExpression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}

With that change, you should now be able to write reasonably realistic-looking programs, subject only to our limitation on single-character tokens. My original intention was to get rid of that limitation for you, too. However, that's going to require a fairly major change to what we've done so far. We need a true lexical scanner, and that requires some structural changes. They are not BIG changes that require us to throw away all of what we've done so far ...
with care, it can be done with very minimal changes, in fact. But it does require that care. This installment has already gotten pretty long, and it contains some pretty heavy stuff, so I've decided to leave that step until next time, when you've had a little more time to digest what we've done and are ready to start fresh. In the next installment, then, we'll build a lexical scanner and eliminate the single-character barrier once and for all. We'll also write our first complete compiler, based on what we've done in this session. See you then. ***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * * * *****************************************************************  LET'S BUILD A COMPILER! By Jack W. Crenshaw, Ph.D. 7 November 1988 Part VII: LEXICAL SCANNING ***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * * * ***************************************************************** INTRODUCTION In the last installment, I left you with a compiler that would ALMOST work, except that we were still limited to single- character tokens. The purpose of this session is to get rid of that restriction, once and for all. This means that we must deal with the concept of the lexical scanner. Maybe I should mention why we need a lexical scanner at all ... after all, we've been able to manage all right without one, up till now, even when we provided for multi-character tokens. The ONLY reason, really, has to do with keywords. It's a fact of computer life that the syntax for a keyword has the same form as that for any other identifier. We can't tell until we get the complete word whether or not it IS a keyword. For example, the variable IFILE and the keyword IF look just alike, until you get to the third character. 
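The IFILE-versus-IF problem boils down to this: collect the whole word first, then consult a keyword table. A tiny sketch in Python (the table below is hypothetical, not KISS's actual keyword list):

```python
# A scanner can't classify IF vs. IFILE from the first characters alone;
# only a lookup on the completed word settles it.
KEYWORDS = {'IF', 'ELSE', 'ENDIF', 'WHILE', 'END'}

def classify(word):
    return 'keyword' if word.upper() in KEYWORDS else 'identifier'
```

The lookup is trivial; the point is WHEN it can happen: only after the whole token has been read.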
In the examples to date, we were always able to make a decision based upon the first character of the token, but that's no longer possible when keywords are present. We need to know that a given string is a keyword BEFORE we begin to process it. And that's why we need a scanner. In the last session, I also promised that we would be able to provide for normal tokens without making wholesale changes to what we have already done. I didn't lie ... we can, as you will see later. But every time I set out to install these elements of the software into the parser we have already built, I had bad feelings about it. The whole thing felt entirely too much like a band-aid. I finally figured out what was causing the problem: I was installing lexical scanning software without first explaining to you what scanning is all about, and what the alternatives are. Up till now, I have studiously avoided giving you a lot of theory, and certainly not alternatives. I generally don't respond well to the textbooks that give you twenty-five different ways to do something, but no clue as to which way best fits your needs. I've tried to avoid that pitfall by just showing you ONE method, that WORKS. But this is an important area. While the lexical scanner is hardly the most exciting part of a compiler, it often has the most profound effect on the general "look & feel" of the language, since after all it's the part closest to the user. I have a particular structure in mind for the scanner to be used with KISS. It fits the look & feel that I want for that language. But it may not work at all for the language YOU'RE cooking up, so in this one case I feel that it's important for you to know your options. So I'm going to depart, again, from my usual format. In this session we'll be getting much deeper than usual into the basic theory of languages and grammars. I'll also be talking about areas OTHER than compilers in which lexical scanning plays an important role. 
Finally, I will show you some alternatives for the structure of the lexical scanner. Then, and only then, will we get back to our parser from the last installment. Bear with me ... I think you'll find it's worth the wait. In fact, since scanners have many applications outside of compilers, you may well find this to be the most useful session for you.

LEXICAL SCANNING

Lexical scanning is the process of scanning the stream of input characters and separating it into strings called tokens. Most compiler texts start here, and devote several chapters to discussing various ways to build scanners. This approach has its place, but as you have already seen, there is a lot you can do without ever even addressing the issue, and in fact the scanner we'll end up with here won't look much like what the texts describe. The reason? Compiler theory and, consequently, the programs resulting from it, must deal with the most general kind of parsing rules. We don't. In the real world, it is possible to specify the language syntax in such a way that a pretty simple scanner will suffice. And as always, KISS is our motto.

Typically, lexical scanning is done in a separate part of the compiler, so that the parser per se sees only a stream of input tokens. Now, theoretically it is not necessary to separate this function from the rest of the parser. There is only one set of syntax equations that define the whole language, so in theory we could write the whole parser in one module. Why the separation? The answer has both practical and theoretical bases.

In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of grammars. They are:

     o Type 0: Unrestricted (e.g., English)
     o Type 1: Context-Sensitive
     o Type 2: Context-Free
     o Type 3: Regular

A few features of the typical programming language (particularly the older ones, such as FORTRAN) are Type 1, but for the most part all modern languages can be described using only the last two types, and those are all we'll be dealing with here.
The neat part about these two types is that there are very specific ways to parse them. It has been shown that any regular grammar can be parsed using a particular form of abstract machine called the state machine (finite automaton). We have already implemented state machines in some of our recognizers.

Similarly, Type 2 (context-free) grammars can always be parsed using a push-down automaton (a state machine augmented by a stack). We have also implemented these machines. Instead of implementing a literal stack, we have relied on the built-in stack associated with recursive coding to do the job, and that in fact is the preferred approach for top-down parsing.

Now, it happens that in real, practical grammars, the parts that qualify as regular expressions tend to be the lower-level parts, such as the definition of an identifier:

     <ident> ::= <letter> [ <letter> | <digit> ]*

Since it takes a different kind of abstract machine to parse the two types of grammars, it makes sense to separate these lower-level functions into a separate module, the lexical scanner, which is built around the idea of a state machine. The idea is to use the simplest parsing technique needed for the job.

There is another, more practical reason for separating scanner from parser. We like to think of the input source file as a stream of characters, which we process left to right without backtracking. In practice that isn't possible. Almost every language has certain keywords such as IF, WHILE, and END. As I mentioned earlier, we can't really know whether a given character string is a keyword, until we've reached the end of it, as defined by a space or other delimiter. So in that sense, we MUST save the string long enough to find out whether we have a keyword or not. That's a limited form of backtracking.

So the structure of a conventional compiler involves splitting up the functions of the lower-level and higher-level parsing.
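The identifier definition above is a regular (Type 3) rule, which is exactly why a finite state machine, or its textual notation the regular expression, is enough to recognize it. A quick illustration in Python:

```python
import re

# <ident> ::= <letter> [ <letter> | <digit> ]*  written as a regular
# expression; the regex engine compiles it into a state machine for us.
IDENT = re.compile(r'[A-Za-z][A-Za-z0-9]*')

def is_ident(s):
    return IDENT.fullmatch(s) is not None
```

No stack, no recursion: a single pass with a fixed amount of state suffices, which is the defining property of a regular rule.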
The lexical scanner deals with things at the character level, collecting characters into strings, etc., and passing them along to the parser proper as indivisible tokens. It's also considered normal to let the scanner have the job of identifying keywords. STATE MACHINES AND ALTERNATIVES I mentioned that the regular expressions can be parsed using a state machine. In most compiler texts, and indeed in most compilers as well, you will find this taken literally. There is typically a real implementation of the state machine, with integers used to define the current state, and a table of actions to take for each combination of current state and input character. If you write a compiler front end using the popular Unix tools LEX and YACC, that's what you'll get. The output of LEX is a state machine implemented in C, plus a table of actions corresponding to the input grammar given to LEX. The YACC output is similar ... a canned table-driven parser, plus the table corresponding to the language syntax. That is not the only choice, though. In our previous installments, you have seen over and over that it is possible to implement parsers without dealing specifically with tables, stacks, or state variables. In fact, in Installment V I warned you that if you find yourself needing these things you might be doing something wrong, and not taking advantage of the power of Pascal. There are basically two ways to define a state machine's state: explicitly, with a state number or code, and implicitly, simply by virtue of the fact that I'm at a certain place in the code (if it's Tuesday, this must be Belgium). We've relied heavily on the implicit approaches before, and I think you'll find that they work well here, too. In practice, it may not even be necessary to HAVE a well-defined lexical scanner. This isn't our first experience at dealing with multi-character tokens. In Installment III, we extended our parser to provide for them, and we didn't even NEED a lexical scanner. 
That was because in that narrow context, we could always tell, just by looking at the single lookahead character, whether we were dealing with a number, a variable, or an operator. In effect, we built a distributed lexical scanner, using procedures GetName and GetNum. With keywords present, we can't know anymore what we're dealing with, until the entire token is read. This leads us to a more localized scanner; although, as you will see, the idea of a distributed scanner still has its merits.

SOME EXPERIMENTS IN SCANNING

Before getting back to our compiler, it will be useful to experiment a bit with the general concepts. Let's begin with the two definitions most often seen in real programming languages:

     <ident>  ::= <letter> [ <letter> | <digit> ]*
     <number> ::= [<digit>]+

(Remember, the '*' indicates zero or more occurrences of the terms in brackets, and the '+', one or more.) We have already dealt with similar items in Installment III.

Let's begin (as usual) with a bare cradle. Not surprisingly, we are going to need a new recognizer:

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}

Using this, let's write the following two routines, which are very similar to those we've used before:

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var x: string[8];
begin
   x := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      x := x + UpCase(Look);
      GetChar;
   end;
   GetName := x;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var x: string[16];
begin
   x := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      x := x + Look;
      GetChar;
   end;
   GetNum := x;
end;
{--------------------------------------------------------------}

(Notice that this version of GetNum returns a
string, not an integer as before.) You can easily verify that these routines work by calling them from the main program, as in

   WriteLn(GetName);

This program will print any legal name typed in (maximum eight characters, since that's what we told GetName). It will reject anything else. Test the other routine similarly.

WHITE SPACE

We also have dealt with embedded white space before, using the two routines IsWhite and SkipWhite. Make sure that these routines are in your current version of the cradle, and add the line

   SkipWhite;

at the end of both GetName and GetNum. Now, let's define the new procedure:

{--------------------------------------------------------------}
{ Lexical Scanner }

Function Scan: string;
begin
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

We can call this from the new main program:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Token := Scan;
      writeln(Token);
   until Token = CR;
end.
{--------------------------------------------------------------}

(You will have to add the declaration of the string Token at the beginning of the program. Make it any convenient length, say 16 characters.) Now, run the program. Note how the input string is, indeed, separated into distinct tokens.

STATE MACHINES

For the record, a parse routine like GetName does indeed implement a state machine. The state is implicit in the current position in the code. A very useful trick for visualizing what's going on is the syntax diagram, or "railroad-track" diagram.
It's a little difficult to draw one in this medium, so I'll use them very sparingly, but the figure below should give you the idea:

          |-----> Other ---------------------------> Error
          |
  Start --+-----> Letter ---------------> Other -----> Finish
                     ^                      V
                     |                      |
                     |<----- Letter <-------|
                     |                      |
                     |<----- Digit  <-------|

As you can see, this diagram shows how the logic flows as characters are read. Things begin, of course, in the start state, and end when a character other than an alphanumeric is found. If the first character is not alpha, an error occurs. Otherwise the machine will continue looping until the terminating delimiter is found. Note that at any point in the flow, our position is entirely dependent on the past history of the input characters. At that point, the action to be taken depends only on the current state, plus the current input character. That's what makes this a state machine. Because of the difficulty of drawing railroad-track diagrams in this medium, I'll continue to stick to syntax equations from now on. But I highly recommend the diagrams to you for anything you do that involves parsing. After a little practice you can begin to see how to write a parser directly from the diagrams. Parallel paths get coded into guarded actions (guarded by IF's or CASE statements), serial paths into sequential calls. It's almost like working from a schematic. We didn't even discuss SkipWhite, which was introduced earlier, but it also is a simple state machine, as is GetNum. So is their parent procedure, Scan. Little machines make big machines. The neat thing that I'd like you to note is how painlessly this implicit approach creates these state machines. I personally prefer it a lot over the table-driven approach. It also results in a small, tight, and fast scanner.

NEWLINES

Moving right along, let's modify our scanner to handle more than one line.
As I mentioned last time, the most straightforward way to do this is to simply treat the newline characters, carriage return and line feed, as white space. This is, in fact, the way the C standard library routine isspace works. We didn't actually try this before. I'd like to do it now, so you can get a feel for the results. To do this, simply modify the single executable line of IsWhite to read:

   IsWhite := c in [' ', TAB, CR, LF];

We need to give the main program a new stop condition, since it will never see a CR. Let's just use:

   until Token = '.';

OK, compile this program and run it. Try a couple of lines, terminated by the period. I used:

   now is the time
   for all good men.

Hey, what happened? When I tried it, I didn't get the last token, the period. The program didn't halt. What's more, when I pressed the 'enter' key a few times, I still didn't get the period. If you're still stuck in your program, you'll find that typing a period on a new line will terminate it. What's going on here? The answer is that we're hanging up in SkipWhite. A quick look at that routine will show that as long as we're typing null lines, we're going to just continue to loop. After SkipWhite encounters an LF, it tries to execute a GetChar. But since the input buffer is now empty, GetChar's read statement insists on having another line. Procedure Scan gets the terminating period, all right, but it calls SkipWhite to clean up, and SkipWhite won't return until it gets a non-null line. This kind of behavior is not quite as bad as it seems. In a real compiler, we'd be reading from an input file instead of the console, and as long as we have some procedure for dealing with end-of-files, everything will come out OK. But for reading data from the console, the behavior is just too bizarre. The fact of the matter is that the C/Unix convention is just not compatible with the structure of our parser, which calls for a lookahead character.
The code that the Bell wizards have implemented doesn't use that convention, which is why they need 'ungetc'. OK, let's fix the problem. To do that, we need to go back to the old definition of IsWhite (delete the CR and LF characters) and make use of the procedure Fin that I introduced last time. If it's not in your current version of the cradle, put it there now. Also, modify the main program to read: {--------------------------------------------------------------} { Main Program } begin Init; repeat Token := Scan; writeln(Token); if Token = CR then Fin; until Token = '.'; end. {--------------------------------------------------------------} Note the "guard" test preceding the call to Fin. That's what makes the whole thing work, and ensures that we don't try to read a line ahead. Try the code now. I think you'll like it better. If you refer to the code we did in the last installment, you'll find that I quietly sprinkled calls to Fin throughout the code, wherever a line break was appropriate. This is one of those areas that really affects the look & feel that I mentioned. At this point I would urge you to experiment with different arrangements and see how you like them. If you want your language to be truly free-field, then newlines should be transparent. In this case, the best approach is to put the following lines at the BEGINNING of Scan: while Look = CR do Fin; If, on the other hand, you want a line-oriented language like Assembler, BASIC, or FORTRAN (or even Ada... note that it has comments terminated by newlines), then you'll need for Scan to return CR's as tokens. It must also eat the trailing LF. The best way to do that is to use this line, again at the beginning of Scan: if Look = LF then Fin; For other conventions, you'll have to use other arrangements. In my example of the last session, I allowed newlines only at specific places, so I was somewhere in the middle ground. 
In the rest of these sessions, I'll be picking ways to handle newlines that I happen to like, but I want you to know how to choose other ways for yourselves.

OPERATORS

We could stop now and have a pretty useful scanner for our purposes. In the fragments of KISS that we've built so far, the only tokens that have multiple characters are the identifiers and numbers. All operators were single characters. The only exception I can think of is the relops <=, >=, and <>, but they could be dealt with as special cases. Still, other languages have multi-character operators, such as the ':=' of Pascal or the '++' and '>>' of C. So while we may not need multi-character operators, it's nice to know how to get them if necessary. Needless to say, we can handle operators very much the same way as the other tokens. Let's start with a recognizer:

{--------------------------------------------------------------}
{ Recognize Any Operator }

function IsOp(c: char): boolean;
begin
   IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;
{--------------------------------------------------------------}

It's important to note that we DON'T have to include every possible operator in this list. For example, the parentheses aren't included, nor is the terminating period. The current version of Scan handles single-character operators just fine as it is. The list above includes only those characters that can appear in multi-character operators. (For specific languages, of course, the list can always be edited.) Now, let's modify Scan to read:

{--------------------------------------------------------------}
{ Lexical Scanner }

Function Scan: string;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else if IsOp(Look) then
      Scan := GetOp
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

Try the program now.
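For readers following along in C rather than Turbo Pascal, here is a rough sketch of what the extended Scan amounts to. The names (scan, is_op) and the string-walking interface are my own invention, not part of the tutorial's code; only the logic and the operator set of IsOp are taken from the Pascal above.

```c
#include <ctype.h>
#include <string.h>

/* Same operator set as the Pascal IsOp; the c != '\0' guard keeps
   strchr from matching the terminator of the literal. */
static int is_op(char c) {
    return c != '\0' && strchr("+-*/<>:=", c) != NULL;
}

/* Pull one token from *pp into tok (capacity max), then advance *pp
   past any trailing blanks, mirroring Scan followed by SkipWhite. */
static void scan(const char **pp, char *tok, int max) {
    const char *p = *pp;
    int n = 0;
    if (isalpha((unsigned char)*p)) {                 /* GetName */
        while (isalnum((unsigned char)*p)) {
            if (n < max - 1) tok[n++] = (char)toupper((unsigned char)*p);
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {          /* GetNum */
        while (isdigit((unsigned char)*p)) {
            if (n < max - 1) tok[n++] = *p;
            p++;
        }
    } else if (is_op(*p)) {                           /* GetOp */
        while (is_op(*p)) {
            if (n < max - 1) tok[n++] = *p;
            p++;
        }
    } else if (*p != '\0') {                          /* any other character */
        tok[n++] = *p++;
    }
    tok[n] = '\0';
    while (*p == ' ' || *p == '\t') p++;              /* SkipWhite */
    *pp = p;
}
```

Feeding it `x:=12+(y)` yields the tokens X, :=, 12, +, (, Y, ) in turn; note how the run of operator characters `:=` comes back as a single token, exactly the behavior the modified Scan gives us.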
You will find that any code fragments you care to throw at it will be neatly broken up into individual tokens.

LISTS, COMMAS AND COMMAND LINES

Before getting back to the main thrust of our study, I'd like to get on my soapbox for a moment. How many times have you worked with a program or operating system that had rigid rules about how you must separate items in a list? (Try to remember the last time you used MSDOS!) Some programs require spaces as delimiters, and some require commas. Worst of all, some require both, in different places. Most are pretty unforgiving about violations of their rules. I think this is inexcusable. It's too easy to write a parser that will handle both spaces and commas in a flexible way. Consider the following procedure:

{--------------------------------------------------------------}
{ Skip Over a Comma }

procedure SkipComma;
begin
   SkipWhite;
   if Look = ',' then begin
      GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}

This eight-line procedure will skip over a delimiter consisting of any number (including zero) of spaces, with zero or one comma embedded in the string. TEMPORARILY, change the call to SkipWhite in Scan to a call to SkipComma, and try inputting some lists. Works nicely, eh? Don't you wish more software authors knew about SkipComma? For the record, I found that adding the equivalent of SkipComma to my Z80 assembler-language programs took all of 6 (six) extra bytes of code. Even in a 64K machine, that's not a very high price to pay for user-friendliness! I think you can see where I'm going here. Even if you never write a line of compiler code in your life, there are places in every program where you can use the concepts of parsing. Any program that processes a command line needs them. In fact, if you think about it for a bit, you'll have to conclude that any time you write a program that processes user inputs, you're defining a language.
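The SkipComma idea carries over to any language. Here is a C sketch (the name skip_comma and the pointer interface are mine, not from the tutorial): it steps past any blanks, at most one comma, and any blanks after it, and returns where scanning should resume.

```c
/* C sketch of SkipComma: advance past any number of blanks, then an
   optional single comma, then any blanks that follow it. */
static const char *skip_comma(const char *p) {
    while (*p == ' ' || *p == '\t') p++;      /* SkipWhite */
    if (*p == ',') {
        p++;                                  /* eat exactly one comma */
        while (*p == ' ' || *p == '\t') p++;  /* SkipWhite again */
    }
    return p;
}
```

Because only one comma is eaten, a doubled comma such as `a,,b` still shows up as an error at the second comma, which is usually what you want in a list of items.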
People communicate with languages, and the syntax implicit in your program defines that language. The real question is: are you going to define it deliberately and explicitly, or just let it turn out to be whatever the program ends up parsing? I claim that you'll have a better, more user-friendly program if you'll take the time to define the syntax explicitly. Write down the syntax equations or draw the railroad-track diagrams, and code the parser using the techniques I've shown you here. You'll end up with a better program, and it will be easier to write, to boot.

GETTING FANCY

OK, at this point we have a pretty nice lexical scanner that will break an input stream up into tokens. We could use it as it stands and have a serviceable compiler. But there are some other aspects of lexical scanning that we need to cover. The main consideration is efficiency. Remember, when we were dealing with single-character tokens, every test was a comparison of a single character, Look, with a byte constant. We also used the Case statement heavily. With the multi-character tokens being returned by Scan, all those tests now become string comparisons. Much slower. And not only slower, but more awkward, since there is no string equivalent of the Case statement in Pascal. It seems especially wasteful to test for what used to be single characters ... the '=', '+', and other operators ... using string comparisons. Using string comparison is not impossible ... Ron Cain used just that approach in writing Small C. Since we're sticking to the KISS principle here, we would be truly justified in settling for this approach. But then I would have failed to tell you about one of the key approaches used in "real" compilers. You have to remember: the lexical scanner is going to be called a _LOT_! Once for every token in the whole source program, in fact. Experiments have indicated that the average compiler spends anywhere from 20% to 40% of its time in the scanner routines.
If there were ever a place where efficiency deserves real consideration, this is it. For this reason, most compiler writers ask the lexical scanner to do a little more work, by "tokenizing" the input stream. The idea is to match every token against a list of acceptable keywords and operators, and return unique codes for each one recognized. In the case of ordinary variable names or numbers, we just return a code that says what kind of token they are, and save the actual string somewhere else. One of the first things we're going to need is a way to identify keywords. We can always do it with successive IF tests, but it surely would be nice if we had a general-purpose routine that could compare a given string with a table of keywords. (By the way, we're also going to need such a routine later, for dealing with symbol tables.) This usually presents a problem in Pascal, because standard Pascal doesn't allow for arrays of variable lengths. It's a real bother to have to declare a different search routine for every table. Standard Pascal also doesn't allow for initializing arrays, so you tend to see code like

   Table[1] := 'IF';
   Table[2] := 'ELSE';
   .
   .
   Table[n] := 'END';

which can get pretty old if there are many keywords. Fortunately, Turbo Pascal 4.0 has extensions that eliminate both of these problems. Constant arrays can be declared using TP's "typed constant" facility, and the variable dimensions can be handled with its C-like extensions for pointers. First, modify your declarations like this:

{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;
{--------------------------------------------------------------}

(The dimension used in SymTab is not real ...
no storage is allocated by the declaration itself, and the number need only be "big enough.") Now, just beneath those declarations, add the following:

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');
{--------------------------------------------------------------}

Next, insert the following new function:

{--------------------------------------------------------------}
{ Table Lookup }

{ If the input string matches a table entry, return the entry
  index.  If not, return a zero.  }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;
{--------------------------------------------------------------}

To test it, you can temporarily change the main program as follows:

{--------------------------------------------------------------}
{ Main Program }

begin
   ReadLn(Token);
   WriteLn(Lookup(Addr(KWList), Token, 4));
end.
{--------------------------------------------------------------}

Notice how Lookup is called: The Addr function sets up a pointer to KWList, which gets passed to Lookup. OK, give this a try. Since we're bypassing Scan here, you'll have to type the keywords in upper case to get any matches. Now that we can recognize keywords, the next thing is to arrange to return codes for them. So what kind of code should we return? There are really only two reasonable choices. This seems like an ideal application for the Pascal enumerated type. For example, you can define something like

   SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number,
              Operator);

and arrange to return a variable of this type. Let's give it a try. Insert the line above into your type definitions.
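The table search itself is language-neutral. A C rendering of Lookup might look like this (my naming; C makes the variable-length-table problem disappear, since an array of pointers and a count are all we need):

```c
#include <string.h>

/* C sketch of Lookup: return the 1-based index of s in
   table[0..n-1], or 0 if it is not there, just as the Pascal
   version does.  Searching from the high end mirrors the
   Pascal loop, which counts i down from n. */
static int lookup(const char *table[], const char *s, int n) {
    int i = n;
    while (i > 0 && strcmp(s, table[i - 1]) != 0)
        i--;
    return i;
}
```

The 1-based result with 0 meaning "not found" is a deliberate match for the Pascal convention, so the later trick of indexing a code string with the result carries over directly.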
Now, add the two variable declarations:

   Token: Symtype;            { Current Token }
   Value: String[16];         { String Token of Look }

Modify the scanner to read:

{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then begin
      Value := GetName;
      k := Lookup(Addr(KWlist), Value, 4);
      if k = 0 then
         Token := Ident
      else
         Token := SymType(k - 1);
      end
   else if IsDigit(Look) then begin
      Value := GetNum;
      Token := Number;
      end
   else if IsOp(Look) then begin
      Value := GetOp;
      Token := Operator;
      end
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

(Notice that Scan is now a procedure, not a function.) Finally, modify the main program to read:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Scan;
      case Token of
        Ident: write('Ident ');
        Number: Write('Number ');
        Operator: Write('Operator ');
        IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');
      end;
      Writeln(Value);
   until Token = EndSym;
end.
{--------------------------------------------------------------}

What we've done here is to replace the string Token used earlier with an enumerated type. Scan returns the type in variable Token, and returns the string itself in the new variable Value. OK, compile this and give it a whirl. If everything goes right, you should see that we are now recognizing keywords. What we have now is working right, and it was easy to generate from what we had earlier. However, it still seems a little "busy" to me. We can simplify things a bit by letting GetName, GetNum, GetOp, and Scan be procedures working with the global variables Token and Value, thereby eliminating the local copies. It also seems a little cleaner to move the table lookup into GetName.
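For the C-minded, the tokenizing scanner boils down to "classify the token, save its text." This sketch (my names and interface; the enum members and keyword list mirror the Pascal) returns the token type and copies the text into value:

```c
#include <ctype.h>
#include <string.h>

/* Mirrors the Pascal SymType: the first four members line up with
   the entries of KWlist, so a successful lookup at index k maps to
   member k-1, just like Token := SymType(k - 1). */
typedef enum { IfSym, ElseSym, EndifSym, EndSym,
               Ident, Number, Operator } SymType;

static const char *kwlist[] = { "IF", "ELSE", "ENDIF", "END" };

/* Classify one token from *pp, copy its text into value, skip
   trailing blanks, and return its type. */
static SymType scan(const char **pp, char *value, int max) {
    const char *p = *pp;
    int n = 0;
    SymType t = Operator;
    if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) {
            if (n < max - 1) value[n++] = (char)toupper((unsigned char)*p);
            p++;
        }
        value[n] = '\0';
        t = Ident;
        for (int k = 0; k < 4; k++)           /* the Lookup step */
            if (strcmp(value, kwlist[k]) == 0) t = (SymType)k;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) {
            if (n < max - 1) value[n++] = *p;
            p++;
        }
        value[n] = '\0';
        t = Number;
    } else {
        if (*p != '\0') value[n++] = *p++;    /* single-char operator */
        value[n] = '\0';
    }
    while (*p == ' ') p++;                    /* SkipWhite */
    *pp = p;
    return t;
}
```

The alignment between the enum and the keyword table is the whole trick: add a keyword to the table and a member to the front of the enum, and the mapping still falls out for free.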
The new form for the four procedures is, then:

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
var k: integer;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   k := Lookup(Addr(KWlist), Value, 4);
   if k = 0 then
      Token := Ident
   else
      Token := SymType(k-1);
end;

{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := Number;
end;

{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := Operator;
end;

{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

RETURNING A CHARACTER

Essentially every scanner I've ever seen that was written in Pascal used the mechanism of an enumerated type that I've just described. It is certainly a workable mechanism, but it doesn't seem the simplest approach to me. For one thing, the list of possible symbol types can get pretty long. Here, I've used just one symbol, "Operator," to stand for all of the operators, but I've seen other designs that actually return different codes for each one. There is, of course, another simple type that can be returned as a code: the character. Instead of returning the enumeration value 'Operator' for a '+' sign, what's wrong with just returning the character itself?
A character is just as good a variable for encoding the different token types, it can be used in case statements easily, and it's sure a lot easier to type. What could be simpler? Besides, we've already had experience with the idea of encoding keywords as single characters. Our previous programs are already written that way, so using this approach will minimize the changes to what we've already done. Some of you may feel that this idea of returning character codes is too mickey-mouse. I must admit it gets a little awkward for multi-character operators like '<='. If you choose to stay with the enumerated type, fine. For the rest, I'd like to show you how to change what we've done above to support that approach. First, you can delete the SymType declaration now ... we won't be needing that. And you can change the type of Token to char. Next, to replace SymType, add the following constant string: const KWcode: string[5] = 'xilee'; (I'll be encoding all idents with the single character 'x'.) 
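The KWcode trick translates directly: the lookup result (0 for "not a keyword", 1..4 for a match) indexes a little code string. A C sketch of just that mapping, with hypothetical names of my own:

```c
#include <string.h>

/* The character-code scheme: kwcode maps a lookup result of 0..4
   onto a one-character token code; 'x' stands for an ordinary
   identifier, and the keywords get i, l, e, e. */
static const char *kwlist[] = { "IF", "ELSE", "ENDIF", "END" };
static const char kwcode[]  = "xilee";

/* Equivalent of Token := KWcode[Lookup(...) + 1]; the "+ 1" in the
   Pascal is just its 1-based string indexing. */
static char name_token(const char *name) {
    int k = 0;                           /* 0 = not in the table */
    for (int i = 0; i < 4; i++)
        if (strcmp(name, kwlist[i]) == 0) k = i + 1;
    return kwcode[k];
}
```

Note that ENDIF and END deliberately share the code 'e': the parser treats either one as a block terminator, so the scanner doesn't need to tell them apart.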
Lastly, modify Scan and its relatives as follows:

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;

{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := '#';
end;

{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   if Length(Value) = 1 then
      Token := Value[1]
   else
      Token := '?';
end;

{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := '?';
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Scan;
      case Token of
        'x': write('Ident ');
        '#': Write('Number ');
        'i', 'l', 'e': Write('Keyword ');
      else Write('Operator ');
      end;
      Writeln(Value);
   until Value = 'END';
end.
{--------------------------------------------------------------}

This program should work the same as the previous version. A minor difference in structure, maybe, but it seems more straightforward to me.

DISTRIBUTED vs CENTRALIZED SCANNERS

The structure for the lexical scanner that I've just shown you is very conventional, and about 99% of all compilers use something very close to it. This is not, however, the only possible structure, or even always the best one.
The problem with the conventional approach is that the scanner has no knowledge of context. For example, it can't distinguish between the assignment operator '=' and the relational operator '=' (perhaps that's why both C and Pascal use different strings for the two). All the scanner can do is to pass the operator along to the parser, which can hopefully tell from the context which operator is meant. Similarly, a keyword like 'IF' has no place in the middle of a math expression, but if one happens to appear there, the scanner will see no problem with it, and will return it to the parser, properly encoded as an 'IF'. With this kind of approach, we are not really using all the information at our disposal. In the middle of an expression, for example, the parser "knows" that there is no need to look for keywords, but it has no way of telling the scanner that. So the scanner continues to do so. This, of course, slows down the compilation. In real-world compilers, the designers often arrange for more information to be passed between parser and scanner, just to avoid this kind of problem. But that can get awkward, and certainly destroys a lot of the modularity of the structure. The alternative is to seek some way to use the contextual information that comes from knowing where we are in the parser. This leads us back to the notion of a distributed scanner, in which various portions of the scanner are called depending upon the context. In KISS, as in most languages, keywords ONLY appear at the beginning of a statement. In places like expressions, they are not allowed. Also, with one minor exception (the multi-character relops) that is easily handled, all operators are single characters, which means that we don't need GetOp at all. So it turns out that even with multi-character tokens, we can still always tell from the current lookahead character exactly what kind of token is coming, except at the very beginning of a statement. 
Even at that point, the ONLY kind of token we can accept is an identifier. We need only to determine if that identifier is a keyword or the target of an assignment statement. We end up, then, still needing only GetName and GetNum, which are used very much as we've used them in earlier installments. It may seem at first to you that this is a step backwards, and a rather primitive approach. In fact, it is an improvement over the classical scanner, since we're using the scanning routines only where they're really needed. In places where keywords are not allowed, we don't slow things down by looking for them.

MERGING SCANNER AND PARSER

Now that we've covered all of the theory and general aspects of lexical scanning that we'll be needing, I'm FINALLY ready to back up my claim that we can accommodate multi-character tokens with minimal change to our previous work. To keep things short and simple I will restrict myself here to a subset of what we've done before; I'm allowing only one control construct (the IF) and no Boolean expressions. That's enough to demonstrate the parsing of both keywords and expressions. The extension to the full set of constructs should be pretty apparent from what we've already done. All the elements of the program to parse this subset, using single-character tokens, exist already in our previous programs. I built it by judicious copying of these files, but I wouldn't dare try to lead you through that process.
Instead, to avoid any confusion, the whole program is shown below: {--------------------------------------------------------------} program KISS; {--------------------------------------------------------------} { Constant Declarations } const TAB = ^I; CR = ^M; LF = ^J; {--------------------------------------------------------------} { Type Declarations } type Symbol = string[8]; SymTab = array[1..1000] of Symbol; TabPtr = ^SymTab; {--------------------------------------------------------------} { Variable Declarations } var Look : char; { Lookahead Character } Lcount: integer; { Label Counter } {--------------------------------------------------------------} { Read New Character From Input Stream } procedure GetChar; begin Read(Look); end; {--------------------------------------------------------------} { Report an Error } procedure Error(s: string); begin WriteLn; WriteLn(^G, 'Error: ', s, '.'); end; {--------------------------------------------------------------} { Report Error and Halt } procedure Abort(s: string); begin Error(s); Halt; end; {--------------------------------------------------------------} { Report What Was Expected } procedure Expected(s: string); begin Abort(s + ' Expected'); end; {--------------------------------------------------------------} { Recognize an Alpha Character } function IsAlpha(c: char): boolean; begin IsAlpha := UpCase(c) in ['A'..'Z']; end; {--------------------------------------------------------------} { Recognize a Decimal Digit } function IsDigit(c: char): boolean; begin IsDigit := c in ['0'..'9']; end; {--------------------------------------------------------------} { Recognize an AlphaNumeric Character } function IsAlNum(c: char): boolean; begin IsAlNum := IsAlpha(c) or IsDigit(c); end; {--------------------------------------------------------------} { Recognize an Addop } function IsAddop(c: char): boolean; begin IsAddop := c in ['+', '-']; end; {--------------------------------------------------------------} { 
Recognize a Mulop } function IsMulop(c: char): boolean; begin IsMulop := c in ['*', '/']; end; {--------------------------------------------------------------} { Recognize White Space } function IsWhite(c: char): boolean; begin IsWhite := c in [' ', TAB]; end; {--------------------------------------------------------------} { Skip Over Leading White Space } procedure SkipWhite; begin while IsWhite(Look) do GetChar; end; {--------------------------------------------------------------} { Match a Specific Input Character } procedure Match(x: char); begin if Look <> x then Expected('''' + x + ''''); GetChar; SkipWhite; end; {--------------------------------------------------------------} { Skip a CRLF } procedure Fin; begin if Look = CR then GetChar; if Look = LF then GetChar; SkipWhite; end; {--------------------------------------------------------------} { Get an Identifier } function GetName: char; begin while Look = CR do Fin; if not IsAlpha(Look) then Expected('Name'); Getname := UpCase(Look); GetChar; SkipWhite; end; {--------------------------------------------------------------} { Get a Number } function GetNum: char; begin if not IsDigit(Look) then Expected('Integer'); GetNum := Look; GetChar; SkipWhite; end; {--------------------------------------------------------------} { Generate a Unique Label } function NewLabel: string; var S: string; begin Str(LCount, S); NewLabel := 'L' + S; Inc(LCount); end; {--------------------------------------------------------------} { Post a Label To Output } procedure PostLabel(L: string); begin WriteLn(L, ':'); end; {--------------------------------------------------------------} { Output a String with Tab } procedure Emit(s: string); begin Write(TAB, s); end; {--------------------------------------------------------------} { Output a String with Tab and CRLF } procedure EmitLn(s: string); begin Emit(s); WriteLn; end; {---------------------------------------------------------------} { Parse and Translate an Identifier } 
procedure Ident; var Name: char; begin Name := GetName; if Look = '(' then begin Match('('); Match(')'); EmitLn('BSR ' + Name); end else EmitLn('MOVE ' + Name + '(PC),D0'); end; {---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Expression; Forward; procedure Factor; begin if Look = '(' then begin Match('('); Expression; Match(')'); end else if IsAlpha(Look) then Ident else EmitLn('MOVE #' + GetNum + ',D0'); end; {---------------------------------------------------------------} { Parse and Translate the First Math Factor } procedure SignedFactor; var s: boolean; begin s := Look = '-'; if IsAddop(Look) then begin GetChar; SkipWhite; end; Factor; if s then EmitLn('NEG D0'); end; {--------------------------------------------------------------} { Recognize and Translate a Multiply } procedure Multiply; begin Match('*'); Factor; EmitLn('MULS (SP)+,D0'); end; {-------------------------------------------------------------} { Recognize and Translate a Divide } procedure Divide; begin Match('/'); Factor; EmitLn('MOVE (SP)+,D1'); EmitLn('EXT.L D0'); EmitLn('DIVS D1,D0'); end; {---------------------------------------------------------------} { Completion of Term Processing (called by Term and FirstTerm) } procedure Term1; begin while IsMulop(Look) do begin EmitLn('MOVE D0,-(SP)'); case Look of '*': Multiply; '/': Divide; end; end; end; {---------------------------------------------------------------} { Parse and Translate a Math Term } procedure Term; begin Factor; Term1; end; {---------------------------------------------------------------} { Parse and Translate a Math Term with Possible Leading Sign } procedure FirstTerm; begin SignedFactor; Term1; end; {---------------------------------------------------------------} { Recognize and Translate an Add } procedure Add; begin Match('+'); Term; EmitLn('ADD (SP)+,D0'); end; {---------------------------------------------------------------} { Recognize and Translate a
Subtract } procedure Subtract; begin Match('-'); Term; EmitLn('SUB (SP)+,D0'); EmitLn('NEG D0'); end; {---------------------------------------------------------------} { Parse and Translate an Expression } procedure Expression; begin FirstTerm; while IsAddop(Look) do begin EmitLn('MOVE D0,-(SP)'); case Look of '+': Add; '-': Subtract; end; end; end; {---------------------------------------------------------------} { Parse and Translate a Boolean Condition } { This version is a dummy } Procedure Condition; begin EmitLn('Condition'); end; {---------------------------------------------------------------} { Recognize and Translate an IF Construct } procedure Block; Forward; procedure DoIf; var L1, L2: string; begin Match('i'); Condition; L1 := NewLabel; L2 := L1; EmitLn('BEQ ' + L1); Block; if Look = 'l' then begin Match('l'); L2 := NewLabel; EmitLn('BRA ' + L2); PostLabel(L1); Block; end; PostLabel(L2); Match('e'); end; {--------------------------------------------------------------} { Parse and Translate an Assignment Statement } procedure Assignment; var Name: char; begin Name := GetName; Match('='); Expression; EmitLn('LEA ' + Name + '(PC),A0'); EmitLn('MOVE D0,(A0)'); end; {--------------------------------------------------------------} { Recognize and Translate a Statement Block } procedure Block; begin while not(Look in ['e', 'l']) do begin case Look of 'i': DoIf; CR: while Look = CR do Fin; else Assignment; end; end; end; {--------------------------------------------------------------} { Parse and Translate a Program } procedure DoProgram; begin Block; if Look <> 'e' then Expected('END'); EmitLn('END') end; {--------------------------------------------------------------} { Initialize } procedure Init; begin LCount := 0; GetChar; end; {--------------------------------------------------------------} { Main Program } begin Init; DoProgram; end. 
{--------------------------------------------------------------} A couple of comments: (1) The form for the expression parser, using FirstTerm, etc., is a little different from what you've seen before. It's yet another variation on the same theme. Don't let it throw you ... the change is not required for what follows. (2) Note that, as usual, I had to add calls to Fin at strategic spots to allow for multiple lines. Before we proceed to adding the scanner, first copy this file and verify that it does indeed parse things correctly. Don't forget the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or ENDIF. If the program works, then let's press on. In adding the scanner modules to the program, it helps to have a systematic plan. In all the parsers we've written to date, we've stuck to a convention that the current lookahead character should always be a non-blank character. We preload the lookahead character in Init, and keep the "pump primed" after that. To keep the thing working right at newlines, we had to modify this a bit and treat the newline as a legal token. In the multi-character version, the rule is similar: The current lookahead character should always be left at the BEGINNING of the next token, or at a newline. The multi-character version is shown next. To get it, I've made the following changes: o Added the variables Token and Value, and the type definitions needed by Lookup. o Added the definitions of KWList and KWcode. o Added Lookup. o Replaced GetName and GetNum by their multi-character versions. (Note that the call to Lookup has been moved out of GetName, so that it will not be executed for calls within an expression.) o Created a new, vestigial Scan that calls GetName, then scans for keywords. o Created a new procedure, MatchString, that looks for a specific keyword. Note that, unlike Match, MatchString does NOT read the next keyword. o Modified Block to call Scan. o Changed the calls to Fin a bit. Fin is now called within GetName. 
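Before reading the full Pascal listing, the keyword-encoding trick described above can be seen in isolation. Here is a Python sketch (names are stand-ins for the Pascal Lookup, KWlist, and KWcode): Lookup returns the 1-based index of the name in the keyword table, or 0 if it isn't there, and that index selects a one-character token code from the string 'xilee', whose zeroth slot, 'x', means "ordinary identifier."

```python
# Sketch of the keyword lookup: a linear table search whose "not found"
# result (0) maps to token code 'x' for an ordinary identifier.

KWLIST = ['IF', 'ELSE', 'ENDIF', 'END']
KWCODE = 'xilee'                      # index 0 = not a keyword

def lookup(table, s):
    """Return the 1-based index of s in table, or 0 if absent."""
    i = len(table)
    while i > 0 and s != table[i - 1]:
        i -= 1
    return i

def scan(value):
    """Encode an already-fetched name into a one-character token."""
    return KWCODE[lookup(KWLIST, value)]

print(scan('ENDIF'))   # -> e
print(scan('COUNT'))   # -> x
```

ENDIF and END deliberately share the code 'e', since either one can terminate a block.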
Here is the program in its entirety: {--------------------------------------------------------------} program KISS; {--------------------------------------------------------------} { Constant Declarations } const TAB = ^I; CR = ^M; LF = ^J; {--------------------------------------------------------------} { Type Declarations } type Symbol = string[8]; SymTab = array[1..1000] of Symbol; TabPtr = ^SymTab; {--------------------------------------------------------------} { Variable Declarations } var Look : char; { Lookahead Character } Token : char; { Encoded Token } Value : string[16]; { Unencoded Token } Lcount: integer; { Label Counter } {--------------------------------------------------------------} { Definition of Keywords and Token Types } const KWlist: array [1..4] of Symbol = ('IF', 'ELSE', 'ENDIF', 'END'); const KWcode: string[5] = 'xilee'; {--------------------------------------------------------------} { Read New Character From Input Stream } procedure GetChar; begin Read(Look); end; {--------------------------------------------------------------} { Report an Error } procedure Error(s: string); begin WriteLn; WriteLn(^G, 'Error: ', s, '.'); end; {--------------------------------------------------------------} { Report Error and Halt } procedure Abort(s: string); begin Error(s); Halt; end; {--------------------------------------------------------------} { Report What Was Expected } procedure Expected(s: string); begin Abort(s + ' Expected'); end; {--------------------------------------------------------------} { Recognize an Alpha Character } function IsAlpha(c: char): boolean; begin IsAlpha := UpCase(c) in ['A'..'Z']; end; {--------------------------------------------------------------} { Recognize a Decimal Digit } function IsDigit(c: char): boolean; begin IsDigit := c in ['0'..'9']; end; {--------------------------------------------------------------} { Recognize an AlphaNumeric Character } function IsAlNum(c: char): boolean; begin IsAlNum := IsAlpha(c) 
or IsDigit(c); end; {--------------------------------------------------------------} { Recognize an Addop } function IsAddop(c: char): boolean; begin IsAddop := c in ['+', '-']; end; {--------------------------------------------------------------} { Recognize a Mulop } function IsMulop(c: char): boolean; begin IsMulop := c in ['*', '/']; end; {--------------------------------------------------------------} { Recognize White Space } function IsWhite(c: char): boolean; begin IsWhite := c in [' ', TAB]; end; {--------------------------------------------------------------} { Skip Over Leading White Space } procedure SkipWhite; begin while IsWhite(Look) do GetChar; end; {--------------------------------------------------------------} { Match a Specific Input Character } procedure Match(x: char); begin if Look <> x then Expected('''' + x + ''''); GetChar; SkipWhite; end; {--------------------------------------------------------------} { Skip a CRLF } procedure Fin; begin if Look = CR then GetChar; if Look = LF then GetChar; SkipWhite; end; {--------------------------------------------------------------} { Table Lookup } function Lookup(T: TabPtr; s: string; n: integer): integer; var i: integer; found: boolean; begin found := false; i := n; while (i > 0) and not found do if s = T^[i] then found := true else dec(i); Lookup := i; end; {--------------------------------------------------------------} { Get an Identifier } procedure GetName; begin while Look = CR do Fin; if not IsAlpha(Look) then Expected('Name'); Value := ''; while IsAlNum(Look) do begin Value := Value + UpCase(Look); GetChar; end; SkipWhite; end; {--------------------------------------------------------------} { Get a Number } procedure GetNum; begin if not IsDigit(Look) then Expected('Integer'); Value := ''; while IsDigit(Look) do begin Value := Value + Look; GetChar; end; Token := '#'; SkipWhite; end; {--------------------------------------------------------------} { Get an Identifier and Scan it for 
Keywords } procedure Scan; begin GetName; Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1]; end; {--------------------------------------------------------------} { Match a Specific Input String } procedure MatchString(x: string); begin if Value <> x then Expected('''' + x + ''''); end; {--------------------------------------------------------------} { Generate a Unique Label } function NewLabel: string; var S: string; begin Str(LCount, S); NewLabel := 'L' + S; Inc(LCount); end; {--------------------------------------------------------------} { Post a Label To Output } procedure PostLabel(L: string); begin WriteLn(L, ':'); end; {--------------------------------------------------------------} { Output a String with Tab } procedure Emit(s: string); begin Write(TAB, s); end; {--------------------------------------------------------------} { Output a String with Tab and CRLF } procedure EmitLn(s: string); begin Emit(s); WriteLn; end; {---------------------------------------------------------------} { Parse and Translate an Identifier } procedure Ident; begin GetName; if Look = '(' then begin Match('('); Match(')'); EmitLn('BSR ' + Value); end else EmitLn('MOVE ' + Value + '(PC),D0'); end; {---------------------------------------------------------------} { Parse and Translate a Math Factor } procedure Expression; Forward; procedure Factor; begin if Look = '(' then begin Match('('); Expression; Match(')'); end else if IsAlpha(Look) then Ident else begin GetNum; EmitLn('MOVE #' + Value + ',D0'); end; end; {---------------------------------------------------------------} { Parse and Translate the First Math Factor } procedure SignedFactor; var s: boolean; begin s := Look = '-'; if IsAddop(Look) then begin GetChar; SkipWhite; end; Factor; if s then EmitLn('NEG D0'); end; {--------------------------------------------------------------} { Recognize and Translate a Multiply } procedure Multiply; begin Match('*'); Factor; EmitLn('MULS (SP)+,D0'); end; 
{-------------------------------------------------------------} { Recognize and Translate a Divide } procedure Divide; begin Match('/'); Factor; EmitLn('MOVE (SP)+,D1'); EmitLn('EXT.L D0'); EmitLn('DIVS D1,D0'); end; {---------------------------------------------------------------} { Completion of Term Processing (called by Term and FirstTerm) } procedure Term1; begin while IsMulop(Look) do begin EmitLn('MOVE D0,-(SP)'); case Look of '*': Multiply; '/': Divide; end; end; end; {---------------------------------------------------------------} { Parse and Translate a Math Term } procedure Term; begin Factor; Term1; end; {---------------------------------------------------------------} { Parse and Translate a Math Term with Possible Leading Sign } procedure FirstTerm; begin SignedFactor; Term1; end; {---------------------------------------------------------------} { Recognize and Translate an Add } procedure Add; begin Match('+'); Term; EmitLn('ADD (SP)+,D0'); end; {---------------------------------------------------------------} { Recognize and Translate a Subtract } procedure Subtract; begin Match('-'); Term; EmitLn('SUB (SP)+,D0'); EmitLn('NEG D0'); end; {---------------------------------------------------------------} { Parse and Translate an Expression } procedure Expression; begin FirstTerm; while IsAddop(Look) do begin EmitLn('MOVE D0,-(SP)'); case Look of '+': Add; '-': Subtract; end; end; end; {---------------------------------------------------------------} { Parse and Translate a Boolean Condition } { This version is a dummy } procedure Condition; begin EmitLn('Condition'); end; {---------------------------------------------------------------} { Recognize and Translate an IF Construct } procedure Block; Forward; procedure DoIf; var L1, L2: string; begin Condition; L1 := NewLabel; L2 := L1; EmitLn('BEQ ' + L1); Block; if Token = 'l' then begin L2 := NewLabel; EmitLn('BRA ' + L2); PostLabel(L1); Block; end; PostLabel(L2); MatchString('ENDIF'); end; 
{--------------------------------------------------------------} { Parse and Translate an Assignment Statement } procedure Assignment; var Name: string; begin Name := Value; Match('='); Expression; EmitLn('LEA ' + Name + '(PC),A0'); EmitLn('MOVE D0,(A0)'); end; {--------------------------------------------------------------} { Recognize and Translate a Statement Block } procedure Block; begin Scan; while not (Token in ['e', 'l']) do begin case Token of 'i': DoIf; else Assignment; end; Scan; end; end; {--------------------------------------------------------------} { Parse and Translate a Program } procedure DoProgram; begin Block; MatchString('END'); EmitLn('END') end; {--------------------------------------------------------------} { Initialize } procedure Init; begin LCount := 0; GetChar; end; {--------------------------------------------------------------} { Main Program } begin Init; DoProgram; end. {--------------------------------------------------------------} Compare this program with its single-character counterpart. I think you will agree that the differences are minor. CONCLUSION At this point, you have learned how to parse and generate code for expressions, Boolean expressions, and control structures. You have now learned how to develop lexical scanners, and how to incorporate their elements into a translator. You have still not seen ALL the elements combined into one program, but on the basis of what we've done before you should find it a straightforward matter to extend our earlier programs to include scanners. We are very close to having all the elements that we need to build a real, functional compiler. There are still a few things missing, notably procedure calls and type definitions. We will deal with those in the next few sessions. Before doing so, however, I thought it would be fun to turn the translator above into a true compiler. That's what we'll be doing in the next installment. 
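As a recap of the label bookkeeping that DoIf performs in both listings, here is a Python sketch. The names and the emitted-text format are illustrative stand-ins for the Pascal NewLabel, EmitLn, and PostLabel, and the condition and blocks are reduced to placeholders: a false condition branches around the true block, and when an ELSE is present the true block ends with a jump past the false block.

```python
# Sketch of DoIf's branch/label scheme: BEQ skips the true block when
# the condition is false; with an ELSE, BRA jumps past the else block.

label_count = 0

def new_label():
    """Generate a unique label, like NewLabel in the Pascal version."""
    global label_count
    lbl = f'L{label_count}'
    label_count += 1
    return lbl

def do_if(true_block, false_block=None):
    """Return the assembly skeleton for IF ... [ELSE ...] ENDIF."""
    out = ['<condition>']
    l1 = new_label()
    l2 = l1                             # with no ELSE, one label serves both
    out.append(f'BEQ {l1}')             # condition false: skip true block
    out += true_block
    if false_block is not None:
        l2 = new_label()
        out.append(f'BRA {l2}')         # end of true block: jump past else
        out.append(f'{l1}:')
        out += false_block
    out.append(f'{l2}:')
    return out

print('\n'.join(do_if(['<true code>'], ['<else code>'])))
```

The one subtlety is the assignment L2 := L1: when there is no ELSE, the BEQ target and the end-of-construct label are the same place, so only one label is generated.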
Up till now, we've taken a rather bottom-up approach to parsing, beginning with low-level constructs and working our way up. In the next installment, I'll also be taking a look from the top down, and we'll discuss how the structure of the translator is altered by changes in the language definition. See you then. ***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * * * ***************************************************************** LET'S BUILD A COMPILER! By Jack W. Crenshaw, Ph.D. 2 April 1989 Part VIII: A LITTLE PHILOSOPHY ***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * * * ***************************************************************** INTRODUCTION This is going to be a different kind of session than the others in our series on parsing and compiler construction. For this session, there won't be any experiments to do or code to write. This once, I'd like to just talk with you for a while. Mercifully, it will be a short session, and then we can take up where we left off, hopefully with renewed vigor. When I was in college, I found that I could always follow a prof's lecture a lot better if I knew where he was going with it. I'll bet you were the same. So I thought maybe it's about time I told you where we're going with this series: what's coming up in future installments, and in general what all this is about. I'll also share some general thoughts concerning the usefulness of what we've been doing. THE ROAD HOME So far, we've covered the parsing and translation of arithmetic expressions, Boolean expressions, and combinations connected by relational operators. We've also done the same for control constructs. 
In all of this we've leaned heavily on the use of top-down, recursive descent parsing, BNF definitions of the syntax, and direct generation of assembly-language code. We also learned the value of such tricks as single-character tokens to help us see the forest through the trees. In the last installment we dealt with lexical scanning, and I showed you simple but powerful ways to remove the single-character barriers. Throughout the whole study, I've emphasized the KISS philosophy ... Keep It Simple, Sidney ... and I hope by now you've realized just how simple this stuff can really be. While there are for sure areas of compiler theory that are truly intimidating, the ultimate message of this series is that in practice you can just politely sidestep many of these areas. If the language definition cooperates or, as in this series, if you can define the language as you go, it's possible to write down the language definition in BNF with reasonable ease. And, as we've seen, you can crank out parse procedures from the BNF just about as fast as you can type. As our compiler has taken form, it's gotten more parts, but each part is quite small and simple, and very much like all the others. At this point, we have many of the makings of a real, practical compiler. As a matter of fact, we already have all we need to build a toy compiler for a language as powerful as, say, Tiny BASIC. In the next couple of installments, we'll go ahead and define that language. To round out the series, we still have a few items to cover. These include: o Procedure calls, with and without parameters o Local and global variables o Basic types, such as character and integer types o Arrays o Strings o User-defined types and structures o Tree-structured parsers and intermediate languages o Optimization These will all be covered in future installments. When we're finished, you'll have all the tools you need to design and build your own languages, and the compilers to translate them. 
I can't design those languages for you, but I can make some comments and recommendations. I've already sprinkled some throughout past installments. You've seen, for example, the control constructs I prefer. These constructs are going to be part of the languages I build. I have three languages in mind at this point, two of which you will see in installments to come: TINY - A minimal, but usable language on the order of Tiny BASIC or Tiny C. It won't be very practical, but it will have enough power to let you write and run real programs that do something worthwhile. KISS - The language I'm building for my own use. KISS is intended to be a systems programming language. It won't have strong typing or fancy data structures, but it will support most of the things I want to do with a higher-order language (HOL), except perhaps writing compilers. I've also been toying for years with the idea of a HOL-like assembler, with structured control constructs and HOL-like assignment statements. That, in fact, was the impetus behind my original foray into the jungles of compiler theory. This one may never be built, simply because I've learned that it's actually easier to implement a language like KISS, that only uses a subset of the CPU instructions. As you know, assembly language can be bizarre and irregular in the extreme, and a language that maps one-for-one onto it can be a real challenge. Still, I've always felt that the syntax used in conventional assemblers is dumb ... why is MOVE.L A,B better, or easier to translate, than B=A ? I think it would be an interesting exercise to develop a "compiler" that would give the programmer complete access to and control over the full complement of the CPU instruction set, and would allow you to generate programs as efficient as assembly language, without the pain of learning a set of mnemonics. Can it be done? I don't know. The real question may be, "Will the resulting language be any easier to write than assembly?" 
If not, there's no point in it. I think that it can be done, but I'm not completely sure yet how the syntax should look. Perhaps you have some comments or suggestions on this one. I'd love to hear them. You probably won't be surprised to learn that I've already worked ahead in most of the areas that we will cover. I have some good news: Things never get much harder than they've been so far. It's possible to build a complete, working compiler for a real language, using nothing but the same kinds of techniques you've learned so far. And THAT brings up some interesting questions. WHY IS IT SO SIMPLE? Before embarking on this series, I always thought that compilers were just naturally complex computer programs ... the ultimate challenge. Yet the things we have done here have usually turned out to be quite simple, sometimes even trivial. For a while, I thought it was simply because I hadn't yet gotten into the meat of the subject. I had only covered the simple parts. I will freely admit to you that, even when I began the series, I wasn't sure how far we would be able to go before things got too complex to deal with in the ways we have so far. But at this point I've already been down the road far enough to see the end of it. Guess what? THERE ARE NO HARD PARTS! Then, I thought maybe it was because we were not generating very good object code. Those of you who have been following the series and trying sample compiles know that, while the code works and is rather foolproof, its efficiency is pretty awful. I figured that if we were concentrating on turning out tight code, we would soon find all that missing complexity. To some extent, that one is true. In particular, my first few efforts at trying to improve efficiency introduced complexity at an alarming rate. But since then I've been tinkering around with some simple optimizations and I've found some that result in very respectable code quality, WITHOUT adding a lot of complexity. 
Finally, I thought that perhaps the saving grace was the "toy compiler" nature of the study. I have made no pretense that we were ever going to be able to build a compiler to compete with Borland and Microsoft. And yet, again, as I get deeper into this thing the differences are starting to fade away. Just to make sure you get the message here, let me state it flat out: USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE. Since the series began I've received some comments from you. Most of them echo my own thoughts: "This is easy! Why do the textbooks make it seem so hard?" Good question. Recently, I've gone back and looked at some of those texts again, and even bought and read some new ones. Each time, I come away with the same feeling: These guys have made it seem too hard. What's going on here? Why does the whole thing seem difficult in the texts, but easy to us? Are we that much smarter than Aho, Ullman, Brinch Hansen, and all the rest? Hardly. But we are doing some things differently, and more and more I'm starting to appreciate the value of our approach, and the way that it simplifies things. Aside from the obvious shortcuts that I outlined in Part I, like single-character tokens and console I/O, we have made some implicit assumptions and done some things differently from those who have designed compilers in the past. As it turns out, our approach makes life a lot easier. So why didn't all those other guys use it? You have to remember the context of some of the earlier compiler development. These people were working with very small computers of limited capacity. Memory was very limited, the CPU instruction set was minimal, and programs ran in batch mode rather than interactively. As it turns out, these caused some key design decisions that have really complicated the designs. 
Until recently, I hadn't realized how much of classical compiler design was driven by the available hardware. Even in cases where these limitations no longer apply, people have tended to structure their programs in the same way, since that is the way they were taught to do it. In our case, we have started with a blank sheet of paper. There is a danger there, of course, that you will end up falling into traps that other people have long since learned to avoid. But it also has allowed us to take different approaches that, partly by design and partly by pure dumb luck, have allowed us to gain simplicity. Here are the areas that I think have led to complexity in the past: o Limited RAM Forcing Multiple Passes I just read "Brinch Hansen on Pascal Compilers" (an excellent book, BTW). He developed a Pascal compiler for a PC, but he started the effort in 1981 with a 64K system, and so almost every design decision he made was aimed at making the compiler fit into RAM. To do this, his compiler has three passes, one of which is the lexical scanner. There is no way he could, for example, use the distributed scanner I introduced in the last installment, because the program structure wouldn't allow it. He also required not one but two intermediate languages, to provide the communication between phases. All the early compiler writers had to deal with this issue: Break the compiler up into enough parts so that it will fit in memory. When you have multiple passes, you need to add data structures to support the information that each pass leaves behind for the next. That adds complexity, and ends up driving the design. Lee's book, "The Anatomy of a Compiler," mentions a FORTRAN compiler developed for an IBM 1401. It had no fewer than 63 separate passes! Needless to say, in a compiler like this the separation into phases would dominate the design. Even in situations where RAM is plentiful, people have tended to use the same techniques because that is what they're familiar with. 
It wasn't until Turbo Pascal came along that we found how simple a compiler could be if you started with different assumptions. o Batch Processing In the early days, batch processing was the only choice ... there was no interactive computing. Even today, compilers run in essentially batch mode. In a mainframe compiler as well as many micro compilers, considerable effort is expended on error recovery ... it can consume as much as 30-40% of the compiler and completely drive the design. The idea is to avoid halting on the first error, but rather to keep going at all costs, so that you can tell the programmer about as many errors in the whole program as possible. All of that harks back to the days of the early mainframes, where turnaround time was measured in hours or days, and it was important to squeeze every last ounce of information out of each run. In this series, I've been very careful to avoid the issue of error recovery, and instead our compiler simply halts with an error message on the first error. I will frankly admit that it was mostly because I wanted to take the easy way out and keep things simple. But this approach, pioneered by Borland in Turbo Pascal, also has a lot going for it anyway. Aside from keeping the compiler simple, it also fits very well with the idea of an interactive system. When compilation is fast, and especially when you have an editor such as Borland's that will take you right to the point of the error, then it makes a lot of sense to stop there, and just restart the compilation after the error is fixed. o Large Programs Early compilers were designed to handle large programs ... essentially infinite ones. In those days there was little choice; the ideas of subroutine libraries and separate compilation were still in the future. Again, this assumption led to multi-pass designs and intermediate files to hold the results of partial processing. Brinch Hansen's stated goal was that the compiler should be able to compile itself. 
Again, because of his limited RAM, this drove him to a multi-pass design. He needed as little resident compiler code as possible, so that the necessary tables and other data structures would fit into RAM. I haven't stated this one yet, because there hasn't been a need ... we've always just read and written the data as streams, anyway. But for the record, my plan has always been that, in a production compiler, the source and object data should all coexist in RAM with the compiler, a la the early Turbo Pascals. That's why I've been careful to keep routines like GetChar and Emit as separate routines, in spite of their small size. It will be easy to change them to read from and write to memory. o Emphasis on Efficiency John Backus has stated that, when he and his colleagues developed the original FORTRAN compiler, they KNEW that they had to make it produce tight code. In those days, there was a strong sentiment against HOLs and in favor of assembly language, and efficiency was the reason. If FORTRAN didn't produce very good code by assembly standards, the users would simply refuse to use it. For the record, that FORTRAN compiler turned out to be one of the most efficient ever built, in terms of code quality. But it WAS complex! Today, we have CPU power and RAM size to spare, so code efficiency is not so much of an issue. By studiously ignoring this issue, we have indeed been able to Keep It Simple. Ironically, though, as I have said, I have found some optimizations that we can add to the basic compiler structure, without having to add a lot of complexity. So in this case we get to have our cake and eat it too: we will end up with reasonable code quality, anyway. o Limited Instruction Sets The early computers had primitive instruction sets. Things that we take for granted, such as stack operations and indirect addressing, came only with great difficulty. Example: In most compiler designs, there is a data structure called the literal pool. 
The compiler typically identifies all literals used in the program, and collects them into a single data structure. All references to the literals are done indirectly to this pool. At the end of the compilation, the compiler issues commands to set aside storage and initialize the literal pool. We haven't had to address that issue at all. When we want to load a literal, we just do it, in line, as in MOVE #3,D0 There is something to be said for the use of a literal pool, particularly on a machine like the 8086 where data and code can be separated. Still, the whole thing adds a fairly large amount of complexity with little in return. Of course, without the stack we would be lost. In a micro, both subroutine calls and temporary storage depend heavily on the stack, and we have used it even more than necessary to ease expression parsing. o Desire for Generality Much of the content of the typical compiler text is taken up with issues we haven't addressed here at all ... things like automated translation of grammars, or generation of LALR parse tables. This is not simply because the authors want to impress you. There are good, practical reasons why the subjects are there. We have been concentrating on the use of a recursive-descent parser to parse a deterministic grammar, i.e., a grammar that is not ambiguous and, therefore, can be parsed with one level of lookahead. I haven't made much of this limitation, but the fact is that this represents a small subset of possible grammars. In fact, there is an infinite number of grammars that we can't parse using our techniques. The LR technique is a more powerful one, and can deal with grammars that we can't. In compiler theory, it's important to know how to deal with these other grammars, and how to transform them into grammars that are easier to deal with. For example, many (but not all) ambiguous grammars can be transformed into unambiguous ones. 
The way to do this is not always obvious, though, and so many people have devoted years to develop ways to transform them automatically. In practice, these issues turn out to be considerably less important. Modern languages tend to be designed to be easy to parse, anyway. That was a key motivation in the design of Pascal. Sure, there are pathological grammars that you would be hard pressed to write unambiguous BNF for, but in the real world the best answer is probably to avoid those grammars! In our case, of course, we have sneakily let the language evolve as we go, so we haven't painted ourselves into any corners here. You may not always have that luxury. Still, with a little care you should be able to keep the parser simple without having to resort to automatic translation of the grammar. We have taken a vastly different approach in this series. We started with a clean sheet of paper, and developed techniques that work in the context that we are in; that is, a single-user PC with rather ample CPU power and RAM space. We have limited ourselves to reasonable grammars that are easy to parse, we have used the instruction set of the CPU to advantage, and we have not concerned ourselves with efficiency. THAT's why it's been easy. Does this mean that we are forever doomed to be able to build only toy compilers? No, I don't think so. As I've said, we can add certain optimizations without changing the compiler structure. If we want to process large files, we can always add file buffering to do that. These things do not affect the overall program design. And I think that's a key factor. By starting with small and limited cases, we have been able to concentrate on a structure for the compiler that is natural for the job. Since the structure naturally fits the job, it is almost bound to be simple and transparent. Adding capability doesn't have to change that basic structure. We can simply expand things like the file structure or add an optimization layer. 
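To make the literal-pool idea mentioned earlier concrete, here is a Python sketch. It is only an illustration of the bookkeeping, with an invented label and directive format, not code from any real compiler: each distinct constant is collected once into a pool, every use is an indirect reference to its pool label, and the pool's storage directives come out at the end of the compilation. Compare that machinery with simply emitting MOVE #3,D0 in line.

```python
# Sketch of a literal pool: collect each distinct literal once, hand
# out a label for every use, and dump storage directives at the end.
# Label and directive syntax (LITn, DC.W) is illustrative only.

class LiteralPool:
    def __init__(self):
        self.literals = []              # pool, in order of first use

    def ref(self, value):
        """Return the pool label for value, adding it if new."""
        if value not in self.literals:
            self.literals.append(value)
        return f'LIT{self.literals.index(value)}'

    def dump(self):
        """Emit the storage directives for the whole pool."""
        return [f'{self.ref(v)}: DC.W {v}' for v in list(self.literals)]

pool = LiteralPool()
code = [f'MOVE {pool.ref(3)}(PC),D0',   # indirect reference to the literal 3
        f'MOVE {pool.ref(7)}(PC),D1',
        f'MOVE {pool.ref(3)}(PC),D2']   # 3 is pooled only once
code += pool.dump()
print('\n'.join(code))
```

Even this toy version shows where the complexity creeps in: the pool is one more data structure that must persist across the whole compilation, which is exactly the kind of thing the in-line approach sidesteps.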
I guess my feeling is that, back when resources were tight, the structures people ended up with were artificially warped to make them work under those conditions, and weren't optimum structures for the problem at hand.

CONCLUSION

Anyway, that's my arm-waving guess as to how we've been able to keep things simple. We started with something simple and let it evolve naturally, without trying to force it into some traditional mold.

We're going to press on with this. I've given you a list of the areas we'll be covering in future installments. With those installments, you should be able to build complete, working compilers for just about any occasion, and build them simply. If you REALLY want to build production-quality compilers, you'll be able to do that, too.

For those of you who are chafing at the bit for more parser code, I apologize for this digression. I just thought you'd like to have things put into perspective a bit. Next time, we'll get back to the mainstream of the tutorial.

So far, we've only looked at pieces of compilers, and while we have many of the makings of a complete language, we haven't talked about how to put it all together. That will be the subject of our next two installments. Then we'll press on into the new subjects I listed at the beginning of this installment.

See you then.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                          16 April 1989

                       Part IX: A TOP VIEW

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

INTRODUCTION

In the previous installments, we have learned many of the techniques required to build a full-blown compiler.
We've covered assignment statements (with Boolean and arithmetic expressions), relational operators, and control constructs. We still haven't addressed procedure or function calls, but even so we could conceivably construct a mini-language without them. I've always thought it would be fun to see just how small a language one could build that would still be useful. We're ALMOST in a position to do that now. The problem is: though we know how to parse and translate the constructs, we still don't know quite how to put them all together into a language.

In those earlier installments, the development of our programs had a decidedly bottom-up flavor. In the case of expression parsing, for example, we began with the very lowest level constructs, the individual constants and variables, and worked our way up to more complex expressions. Most people regard the top-down design approach as being better than the bottom-up one. I do too, but the way we did it certainly seemed natural enough for the kinds of things we were parsing.

You mustn't get the idea, though, that the incremental approach that we've been using in all these tutorials is inherently bottom-up. In this installment I'd like to show you that the approach can work just as well when applied from the top down ... maybe better. We'll consider languages such as C and Pascal, and see how complete compilers can be built starting from the top.

In the next installment, we'll apply the same technique to build a complete translator for a subset of the KISS language, which I'll be calling TINY. But one of my goals for this series is that you will not only be able to see how a compiler for TINY or KISS works, but that you will also be able to design and build compilers for your own languages. The C and Pascal examples will help.
One thing I'd like you to see is that the natural structure of the compiler depends very much on the language being translated, so the simplicity and ease of construction of the compiler depends very much on letting the language set the program structure.

It's a bit much to produce a full C or Pascal compiler here, and we won't try. But we can flesh out the top levels far enough so that you can see how it goes. Let's get started.

THE TOP LEVEL

One of the biggest mistakes people make in a top-down design is failing to start at the true top. They think they know what the overall structure of the design should be, so they go ahead and write it down. Whenever I start a new design, I always like to do it at the absolute beginning. In program design language (PDL), this top level looks something like:

     begin
          solve the problem
     end

OK, I grant you that this doesn't give much of a hint as to what the next level is, but I like to write it down anyway, just to give me that warm feeling that I am indeed starting at the top.

For our problem, the overall function of a compiler is to compile a complete program. Any definition of the language, written in BNF, begins here. What does the top level BNF look like? Well, that depends quite a bit on the language to be translated. Let's take a look at Pascal.

THE STRUCTURE OF PASCAL

Most texts for Pascal include a BNF or "railroad-track" definition of the language. Here are the first few lines of one:

     <program> ::= <program-header> <block> '.'

     <program-header> ::= PROGRAM <ident>

     <block> ::= <declaration-part> <statement-part>

We can write recognizers to deal with each of these elements, just as we've done before. For each one, we'll use our familiar single-character tokens to represent the input, then flesh things out a little at a time. Let's begin with the first recognizer: the program itself.

To translate this, we'll start with a fresh copy of the Cradle. Since we're back to single-character names, we'll just use a 'p' to stand for 'PROGRAM.'
To a fresh copy of the Cradle, add the following code, and insert a call to it from the main program:

{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var Name: char;
begin
   Match('p');           { Handles program header part }
   Name := GetName;
   Prolog;
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}

The procedures Prolog and Epilog perform whatever is required to let the program interface with the operating system, so that it can execute as a program. Needless to say, this part will be VERY OS-dependent. Remember, I've been emitting code for a 68000 running under the OS I use, which is SK*DOS. I realize most of you are using PC's and would rather see something else, but I'm in this thing too deep to change now!

Anyhow, SK*DOS is a particularly easy OS to interface to. Here is the code for Prolog and Epilog:

{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   EmitLn('WARMST EQU $A01E');
end;

{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog(Name: char);
begin
   EmitLn('DC WARMST');
   EmitLn('END ' + Name);
end;
{--------------------------------------------------------------}

As usual, add this code and try out the "compiler." At this point, there is only one legal input:

     px.

(where x is any single letter, the program name)

Well, as usual our first effort is rather unimpressive, but by now I'm sure you know that things will get more interesting. There is one important thing to note: THE OUTPUT IS A WORKING, COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).

This is very important. The nice feature of the top-down approach is that at any stage you can compile a subset of the complete language and get a program that will run on the target machine. From here on, then, we need only add features by fleshing out the language constructs.
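For readers without a 68000 assembler handy, here is the same top-level recognizer sketched in Python rather than the tutorial's Pascal, so the flow can be run and inspected anywhere. The class packaging is my own, not the tutorial's; the emitted lines mirror the SK*DOS prolog and epilog above:

```python
# A sketch of Prog/Prolog/Epilog at this stage of the tutorial: parse
# 'p', a one-character program name, and '.', and emit a complete
# (assemblable) SK*DOS program around it.

class Compiler:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.output = []             # emitted assembler lines

    def look(self):
        return self.source[self.pos] if self.pos < len(self.source) else ''

    def match(self, c):
        if self.look() != c:
            raise SyntaxError(f"expected {c!r}, got {self.look()!r}")
        self.pos += 1

    def get_name(self):              # single-character names, as in the text
        name = self.look()
        self.pos += 1
        return name

    def emit_ln(self, line):
        self.output.append(line)

    def prolog(self):
        self.emit_ln('WARMST EQU $A01E')

    def epilog(self, name):
        self.emit_ln('DC WARMST')
        self.emit_ln('END ' + name)

    def prog(self):
        self.match('p')              # handles the program header part
        name = self.get_name()
        self.prolog()
        self.match('.')
        self.epilog(name)

c = Compiler('px.')
c.prog()
# c.output is now a complete, assemblable program:
# ['WARMST EQU $A01E', 'DC WARMST', 'END x']
```

Feeding it the one legal input, "px.", yields the three assembler lines; anything else raises a syntax error, much as Expected and Abort would in the Pascal version.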
It's all very similar to what we've been doing all along, except that we're approaching it from the other end.

FLESHING IT OUT

To flesh out the compiler, we only have to deal with language features one by one. I like to start with a stub procedure that does nothing, then add detail in incremental fashion. Let's begin by processing a block, in accordance with its BNF definition above. We can do this in two stages. First, add the null procedure:

{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}

and modify Prog to read:

{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var Name: char;
begin
   Match('p');
   Name := GetName;
   Prolog;
   DoBlock(Name);
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}

That certainly shouldn't change the behavior of the program, and it doesn't. But now the definition of Prog is complete, and we can proceed to flesh out DoBlock. That's done right from its BNF definition:

{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
   Declarations;
   PostLabel(Name);
   Statements;
end;
{--------------------------------------------------------------}

The procedure PostLabel was defined in the installment on branches. Copy it into your cradle.

I probably need to explain the reason for inserting the label where I have. It has to do with the operation of SK*DOS. Unlike some OS's, SK*DOS allows the entry point to the main program to be anywhere in the program. All you have to do is to give that point a name. The call to PostLabel puts that name just before the first executable statement in the main program. How does SK*DOS know which of the many labels is the entry point, you ask?
It's the one that matches the END statement at the end of the program.

OK, now we need stubs for the procedures Declarations and Statements. Make them null procedures as we did before. Does the program still run the same? Then we can move on to the next stage.

DECLARATIONS

The BNF for Pascal declarations is:

     <declarations> ::= (