diff --git a/tutorial/en/0-Preface.md b/tutorial/en/0-Preface.md
index cdd10ed..e07b121 100644
--- a/tutorial/en/0-Preface.md
+++ b/tutorial/en/0-Preface.md
@@ -1,104 +1,126 @@
-This series of articles is a tutorial for building a C compiler from scratch.
+# Preface
-I lied a little in the above sentence: it is actually an _interpreter_ instead
-of _compiler_. I lied because what the hell is a "C interpreter"? You will
-however, understand compilers better by building an interpreter.
+This is a multi-part tutorial on how to build a C compiler from scratch.
-Yeah, I wish you can get a basic understanding of how a compiler is
-constructed, and realize it is not that hard to build one. Good Luck!
+Well, I lied a little in the previous sentence: it's actually an _interpreter_,
+not a _compiler_. I had to lie, because what on earth is a "C interpreter"?
+You will however gain a better understanding of compilers by building an
+interpreter.
-Finally, this series is written in Chinese in the first place, feel free to
-correct me if you are confused by my English. And I would like it very much if
-you could teach me some "native" English :)
+Yeah, I want to provide you with a basic understanding of how a compiler is
+constructed, and help you realize that it's not that hard to build one, after all.
+Good Luck!
-We won't write any code in this chapter, feel free to skip it if you are
-desperate to see some code...
+This tutorial was originally written in Chinese, so feel free to correct me if
+you're confused by my English. Also, I would really appreciate it if you could
+teach me some "native" English. :smile:
-## Why you should care about compiler theory?
+We won't be writing any code in this chapter; so if you're eager to see some code, feel free to skip it.
-Because it is **COOL**!
-And it is very useful. Programs are built to do something for us, when they
-are used to translate some forms of data into another form, we can call them
-a compiler.
Thus by learning some compiler theory we are trying to master a very
-powerful technique of solving problems. Isn't that cool enough to you?
+## Why Should I Care about Compiler Theory?
+
+Because it's **COOL**!
+
+And it's also very useful. Programs are designed to do something for us; when
+they are used to translate some form of data into another form, we can call
+them compilers. Thus, by learning some compiler theory, we are trying to
+master a very powerful problem-solving technique. Doesn't this sound cool
+enough to you?
+
+People used to say that understanding how a compiler works would help you to
+write better code. Some would argue that modern compilers are so good at
+optimizing that you shouldn't care any more. Well, that's true, most people
+don't need to learn compiler theory to improve code performance — and by "most
+people" I mean _you_!
-People used to say understanding how a compiler works would help you to write
-better code. Some would argue that modern compilers are so good at
-optimization that you should not care any more. Well, that's true, most people
-don't need to learn compiler theory only to improve the efficency of the code.
-And by most people, I mean you!
 ## We Don't Like Theory Either
-I have always been in awe of compiler theory because that's what makes
-programing easy. Anyway can you imaging building a web browser in only
-assembly language? So when I got a chance to learn compiler theory in college,
-I was so excited! And then... I quit, not understanding what that it.
+I've always been in awe of compiler theory because that's what makes programming
+easy. Anyway, can you imagine building a web browser entirely in assembly
+language? So when I got a chance to learn compiler theory in college, I was so
+excited! And then ... I quit! And left without understanding what it's all
+about.
-Normally a course of compiler will cover:
+Normally a compiler course covers the following topics:
-1.
How to represent syntax (such as BNF, etc.)
-2. Lexer, with somewhat NFA(Nondeterministic Finite Automata),
-   DFA(Deterministic Finite Automata).
-3. Parser, such as recursive descent, LL(k), LALR, etc.
+1. How to represent syntax (e.g. BNF)
+2. Lexers, using NFA (Nondeterministic Finite Automata) and
+   DFA (Deterministic Finite Automata).
+3. Parsers, such as recursive descent, LL(k), LALR, etc.
 4. Intermediate Languages.
 5. Code generation.
 6. Code optimization.
-Perhaps more than 90% students will not care anything beyond the parser, and
-what's more, we still don't know how to build a compiler! Even after all the
-effort learning the theories. Well the main reason is that what "Compiler
-Thoery" trys to teach is "How to build a parser generator", namely a tool that
-consumes syntax gramer and generates a compiler for you. lex/yacc or
-flex/bison or things like that.
+Perhaps more than 90% of the students won't really care about anything beyond
+the parser, and what's more, we still won't know how to actually build a
+compiler, even after all the effort of learning the theory. Well, the
+main reason is that what "Compiler Theory" tries to teach is "how to build a
+parser generator" — i.e. a tool that consumes a syntax grammar and generates a
+compiler for you, like lex/yacc or flex/bison, or similar tools.
+
+These theories try to teach us how to solve the general challenges of
+generating compilers automatically. Once you've mastered them, you're able to
+deal with all kinds of grammars. They are indeed useful in the industry.
+Nevertheless, they are too powerful and too complicated for students and most
+programmers. If you try to read lex/yacc's source code you'll understand what
+I mean.
-These theories try to teach us how to solve the general problems of generating
-compilers automatically. That means once you've mastered them, you are able to
-deal with all kinds of grammars. They are indeed useful in industry.
-Nevertheless they are too powerful and too complicated for students and most
-programmers. You will understand that if you try to read lex/yacc's source
-code.
+The good news is that building a compiler can be much simpler than you ever
+imagined. I won't lie, it's not easy, but definitely not hard.
-Good news is building a compiler can be much simpler than you ever imagined.
-I won't lie, not easy, but definitely not hard.
-## Birth of this project
+## How This Project Began
-One day I came across the project [c4](https://github.com/rswier/c4) on
-Github. It is a small C interpreter which is claimed to be implemented by only
-4 functions. The most amazing part is that it is bootstrapping (that interpret
-itself). Also it is done with about 500 lines!
+One day I came across the project [c4] on GitHub, a small C interpreter
+claiming to be implemented with only 4 functions. The most amazing part is
+that it's [bootstrapping] (i.e. it can interpret itself). Furthermore, it's
+done in around 500 lines of code!
-Meanwhile I've read a lot of tutorials about compiler, they are either too
-simple(such as implementing a simple calculator) or using automation
-tools(such as flex/bison). c4 is however implemented all from scratch. The
-sad thing is that it try to be minimal, that makes the code quite a mess, hard
-to understand. So I started a new project to:
+Meanwhile, I've read many tutorials on compiler design, and found them to be
+either too simple (such as implementing a simple calculator) or using
+automation tools (such as flex/bison). [c4], however, is implemented entirely
+from scratch. The sad thing is that it aims to be "an exercise in minimalism,"
+which makes the code quite messy and hard to understand. So I started a new
+project, in order to:
-1. Implement a working C compiler(interpreter actually)
-2. Write a tutorial of how it is built.
+1. Implement a working C compiler (an interpreter, actually).
+2.
Write a step-by-step tutorial on how it was built.
-It took me 1 week to re-write it, resulting 1400 lines including comments. The
-project is hosted on Github: [Write a C Interpreter](https://github.com/lotabout/write-a-C-interpreter).
+It took me one week to re-write it, resulting in 1400 lines of code (including
+comments). The project is hosted on GitHub: [Write a C Interpreter].
-Thanks rswier for bringing us a wonderful project!
+Thanks to [@rswier] for sharing [c4] with us; it's such a wonderful project!
-## Before you go
-Implementing a compiler could be boring and it is hard to debug. So I hope you
-can spare enough time studying, as well as type the code. I am sure that you
-will feel a great sense of accomplishment just like I do.
+## Before You Begin
+
+Implementing a compiler can be boring and hard to debug. So I hope you can
+spare enough time studying, and typing code. I'm sure that you will feel a
+great sense of accomplishment, just like I do.
+
 ## Good Resources
-1. [Let’s Build a Compiler](http://compilers.iecc.com/crenshaw/): a very good
-   tutorial of building a compiler for fresh starters.
-2. [Lemon Parser Generator](http://www.hwaci.com/sw/lemon/): the parser
-   generator that is used in SQLite. Good to read if you want to understand
-   compiler theory with code.
+1. _[Let’s Build a Compiler]_: a very good tutorial on building a compiler,
+   written for beginners.
+2. [Lemon Parser Generator]: the parser generator used by SQLite.
+   Good to read if you want to understand compiler theory with code.
+
+In the end, I am just a person with an average level of expertise, so there
+will inevitably be some mistakes in my articles and code (and also in my
+English). Feel free to correct me!
+
+I hope you'll enjoy it.
-In the end, I am human with a general level, there will be inevitably wrong
-with the articles and codes(also my English). Feel free to correct me!
+
-Hope you enjoy it.
+[@rswier]: https://github.com/rswier "Visit @rswier's GitHub profile"
+[bootstrapping]: https://en.wikipedia.org/wiki/Bootstrapping_(compilers) "Wikipedia » Bootstrapping (compilers)"
+[c4]: https://github.com/rswier/c4 "Visit the c4 repository on GitHub"
+[Lemon Parser Generator]: http://www.hwaci.com/sw/lemon/ "Visit Lemon homepage"
+[Let’s Build a Compiler]: http://compilers.iecc.com/crenshaw/ "15-part tutorial series, by Jack Crenshaw"
+[Write a C Interpreter]: https://github.com/lotabout/write-a-C-interpreter "Visit the 'Write a C Interpreter' repository on GitHub"
diff --git a/tutorial/en/1-Skeleton.md b/tutorial/en/1-Skeleton.md
index bf59e3a..48d8322 100644
--- a/tutorial/en/1-Skeleton.md
+++ b/tutorial/en/1-Skeleton.md
@@ -1,66 +1,69 @@
-In this chapter we will have an overview of the compiler's structure.
+# 1. Skeleton
-Before we start, I'd like to restress that it is **interperter** that we want
-to build. That means we can run a C source file just like a script. It is
-chosen mainly for two reasons:
+In this chapter we'll present an overview of the compiler's structure.
-1. Interpreter differs from Compiler only in code generation phase, thus we'll
-   still learn all the core techniques of building a compiler(such as lexical
-   analyzing and parsing).
-2. We will build our own virtual machine and assembly instructions, that would
-   help us to understand how computers work.
+Before we start, let me stress again that we will be building an **interpreter**.
+This means we'll be able to run a C source file as if it were a script. The main
+reasons behind this choice are twofold:
-## Three Phases
+1. An interpreter differs from a compiler only in the code generation phase,
+   thus we'll still learn all the core techniques of building a compiler
+   (such as lexical analysis and parsing).
+2. We will build our own virtual machine and [assembly instruction set];
+   this will help us understand how computers work.
-Given a source file, normally the compiler will cast three phases of
-processing:
-1. Lexical Analysis: converts source strings into internal token stream.
-2. Parsing: consumes token stream and constructs syntax tree.
-3. Code Generation: walk through the syntax tree and generate code for target
-   platform.
+## The Three Phases of Compiling
-Compiler Construction had been so mature that part 1 & 2 can be done by
-automation tools. For example, flex can be used for lexical analysis, bison
-for parsing. They are powerful but do thousands of things behind the scene. In
-order to fully understand how to build a compiler, we are going to build them
-all from scratch.
+Given a source file, the compiler usually carries out three processing phases:
-Thus we will build our interpreter in the following steps:
+1. **Lexical Analysis**:
+   converts source strings into an internal stream of tokens.
+2. **Parsing**: consumes the token stream and constructs a syntax tree.
+3. **Code Generation**:
+   walks through the syntax tree and generates code for the target platform.
-1. Build our own virtual machine and instruction set. This is the target
-   platform that will be using in our code generation phase.
-2. Build our own lexer for C compiler.
-3. Write a recusion descent parser on our own.
+Compiler Construction is so mature that phases one and two can be done by
+automation tools. For example, flex can be used for lexical analysis, bison for
+parsing. These are powerful tools, which do thousands of things behind the
+scenes. In order to fully understand how to build a compiler, we're going to
+handcraft all three phases, from scratch.
-## Skeleton of our compiler
+Therefore, we'll build our interpreter in the following steps:
+1. Build our own virtual machine and instruction set.
+   This will be our target platform in the code generation phase.
+2. Build our own lexer for the C compiler.
+3. Write a [recursive descent parser] on our own.
-Modeling after c4, our compiler includes 4 main functions:
-1. `next()` for lexical analysis; get the next token; will ignore spaces tabs
-   etc.
-2. `program()` main entrance for parser.
-3. `expression(level)`: parser expression; level will be explained in later
-   chapter.
-4. `eval()`: the entrance for virtual machine; used to interpret target
-   instructions.
+## The Skeleton of Our Compiler
-Why would `expression` exist when we have `program` for parser? That's because
-the parser for expressions is relatively independent and complex, so we put it
-into a single module(function).
+Modeled after [c4], our compiler includes four main functions:
-The code is as following:
+1. `next()` —
+   for lexical analysis; fetches the next token; ignores spaces, tabs, etc.
+2. `program()` — parser main entry point.
+3. `expression(level)` —
+   expression parser; `level` will be explained in a later chapter.
+4. `eval()` —
+   virtual machine entry point; used to interpret target instructions.
+
+Why do we need `expression()` when we already have `program()` for the parser?
+That's because the expression parser is relatively independent and complex,
+so we put it into a single module (function).
+
+The code is as follows:
 ```c
 #include <stdio.h>
 #include <stdlib.h>
 #include <memory.h>
 #include <string.h>
-#define int long long // work with 64bit target
+#define int long long // work with 64-bit target
 int token; // current token
-char *src, *old_src; // pointer to source code string;
+char *src, *old_src; // pointer to source code string
 int poolsize; // default size of text/data/stack
 int line; // line number
@@ -119,34 +122,46 @@ int main(int argc, char **argv)
 }
 ```
-That's quite some code for the first chapter of the article. Nevertheless it
-is actually simple enough. The code tries to reads in a source file, character
-by character and print them out.
+That's quite a bit of code for the first chapter of the tutorial. Nevertheless it's
+actually quite simple.
The code reads a source file, character by
+character, and prints each one out.
-Currently the lexer `next()` does nothing but returning the characters as they
-are in the source file. The parser `program()` doesn't take care of its job
-either, no syntax trees are generated, no target codes are generated.
+Currently, the lexer function `next()` does nothing except return the
+characters as they are encountered in the source file. The parser function `program()`
+doesn't take care of its job either — it doesn't generate any syntax trees, nor
+target code.
 The important thing here is to understand the meaning of these functions and
-how they are hooked together as they are the skeleton of our interpreter.
-We'll fill them out step by step in later chapters.
+how they are hooked together, since they constitute the skeleton of our
+interpreter. We'll fill them out step by step, in the upcoming chapters.
-## Code
+
+## Source Code
 The code for this chapter can be downloaded from
-[Github](https://github.com/lotabout/write-a-C-interpreter/tree/step-0), or
-clone by:
+[GitHub](https://github.com/lotabout/write-a-C-interpreter/tree/step-0),
+or cloned via:
 ```
 git clone -b step-0 https://github.com/lotabout/write-a-C-interpreter
 ```
-Note that I might fix bugs later, and if there is any incosistance between the
-artical and the code branches, follow the article. I would only update code in
-the master branch.
+> **NOTE** — I might fix bugs later; if you notice any inconsistencies between
+the tutorial and the code branches, follow the tutorial. I will only update
+code in the master branch.
+
 ## Summary
-After some boring typing, we have the simplest compiler: a do-nothing
-compiler. In next chapter, we will implement the `eval` function, i.e. our own
+After some boring typing, we now have the simplest compiler: a do-nothing
+compiler. In the next chapter, we'll implement the `eval` function, i.e. our own
 virtual machine. See you then.
+
+
+
+
+[assembly instruction set]: https://en.wikipedia.org/wiki/Instruction_set_architecture "Wikipedia » Instruction set architecture"
+[c4]: https://github.com/rswier/c4 "Visit the c4 repository on GitHub"
+[recursive descent parser]: https://en.wikipedia.org/wiki/Recursive_descent_parser "Wikipedia » Recursive descent parser"
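
Review note on the `1-Skeleton.md` hunk: the code listing in the diff is cut off at the hunk boundary, so the control flow the prose describes (a lexer stub feeding a parser stub, followed by a virtual machine stub) is not visible here. The sketch below is one way that flow might look. It is not the tutorial's actual code: every `sketch_*` name is hypothetical, plain `int` is used instead of the tutorial's `#define int long long`, and the source text comes from a string rather than from a file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical globals, loosely mirroring the tutorial's `src` and `token`. */
static const char *sketch_src; /* cursor into the source text */
static int sketch_token;       /* current token; for now, one raw character */

/* Lexer stub: hands back the next character unchanged, 0 at end of input. */
static void sketch_next(void) {
    sketch_token = (unsigned char)*sketch_src;
    if (sketch_token != 0) {
        sketch_src++;
    }
}

/* Parser stub: builds no syntax tree and emits no code; it only pulls
 * tokens from the lexer and echoes them. Returns the number of tokens. */
static int sketch_program(FILE *out) {
    int count = 0;
    sketch_next();
    while (sketch_token > 0) {
        fputc(sketch_token, out);
        count++;
        sketch_next();
    }
    return count;
}

/* Virtual machine stub: there are no instructions to run yet. */
static int sketch_eval(void) {
    return 0;
}

/* Driver: plays the role that main() has in the tutorial (assumed shape). */
int sketch_run(const char *source, FILE *out) {
    assert(source != NULL);
    assert(out != NULL);
    sketch_src = source;
    int consumed = sketch_program(out);
    sketch_eval();
    return consumed;
}
```

Calling `sketch_run("int x;", stdout)` from a `main` of your own echoes the six characters and returns 6. Keeping the lexer, parser and virtual machine as separate stubs from the start means each one can later be filled in without touching the driver.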