From 96506c2463adc6749d450a5f5cf3d0121744b0d3 Mon Sep 17 00:00:00 2001
From: isuckatcs <65320245+isuckatcs@users.noreply.github.com>
Date: Tue, 30 Jul 2024 03:12:21 +0200
Subject: [PATCH] [www] use a spell-checker that catches mistakes that the previous one didn't catch

---
 www/index.html   | 39 ++++++++++++-----------
 www/lexing.html  | 64 +++++++++++++++++++-------------------
 www/parsing.html | 80 ++++++++++++++++++++++++------------------------
 3 files changed, 91 insertions(+), 92 deletions(-)

diff --git a/www/index.html b/www/index.html
index 4a0960b..f113882 100644
--- a/www/index.html
+++ b/www/index.html
@@ -49,8 +49,7 @@

How to Compile Your Language

This guide is intended to be a practical introduction to how to design your language and implement a modern - compiler for it. The source code of the compiler is - available on + compiler for it. The compiler's source code is available on How to Compile Your Language >.

- When designing a language it helps if there is an idea what - the language is going to be used for. Is it indented to be + When designing a language, it helps if there is an idea of + what the language will be used for. Is it intended to make systems programming safer like Rust? Is it targeting AI developers like Mojo?

- In this case the goal of the language is to showcase various - algorithms and techniques that are used in the + In this case, the goal of the language is to showcase + various algorithms and techniques that are used in the implementation of some of the most popular languages like - C++, Kotlin or Rust. + C++, Kotlin, or Rust.

- The guide also covers how to create a platform specific + The guide also covers how to create a platform-specific executable with the help of the LLVM compiler infrastructure, which all of the previously mentioned languages use for the same purpose. Yes, even Kotlin can be @@ -82,10 +81,10 @@

What Does Every Language Have in Common?

When creating a new language, the first question is how to get started. There is something that every existing language and your language must define too, which is the entry - point from which the execution starts. + point from which the execution begins.

- In scripting languages like JavaScript the execution of the + In scripting languages like JavaScript, the execution of the code usually starts from the first line of the source file, while most programming languages including your language treat the main() function @@ -99,16 +98,16 @@

What Does Every Language Have in Common?

already popular language.

- In the past 50 years the syntax of a function declaration + In the past 50 years, the syntax of a function declaration was the name of the function followed by the list of arguments enclosed by ( and ). At - first glance it is tempting to introduce some new exotic + first glance, it is tempting to introduce some new exotic syntax like main<> {}, but in many popular languages <> might mean something completely - different, in this case a generic argument list. Using such - syntax for a function definition would probably cause - confusion for developers who try to get familiar with this - new language, which is something to keep in mind. + different, in this case, a generic argument list. Using such + syntax for a function definition would probably confuse + developers who are trying to get familiar with this new + language, which is something to keep in mind.

How Is This Text Turned into an Executable?

@@ -121,7 +120,7 @@

How Is This Text Turned into an Executable?

The frontend contains the actual implementation of the language; it is responsible for ensuring that the program written in the specific language doesn't contain any - errors, and reporting every issue it finds to the developer. + errors and reporting every issue it finds to the developer.

After validating the program, it turns it into an @@ -141,9 +140,9 @@

How Is This Text Turned into an Executable?

Is It Possible to Learn All These Topics?

- Yes, with enough time. However there is no need to learn all - of them to create a successful language. In fact even a lot - of modern popular languages like C++, + Yes, with enough time. However, there is no need to learn + all of them to create a successful language. In fact, even a + lot of modern popular languages like C++, Rust, Swift, Haskell or Kotlin/Native rely on LLVM for optimization and code generation.
diff --git a/www/lexing.html b/www/lexing.html
index 2d72967..6836d56 100644
--- a/www/lexing.html
+++ b/www/lexing.html
@@ -50,7 +50,7 @@

Tokenization

The first step of the compilation process is to take the - textual representation of the program and brake it down into + textual representation of the program and break it down into a list of tokens. Just as sentences in spoken languages are composed of nouns, verbs, adjectives, etc., programming languages are similarly composed of a set of tokens. @@ -64,8 +64,8 @@

Tokenization

be named anything else like foo or bar. One thing these names have in common is that each of them uniquely identifies the given function, so - the token that represent such piece of source code is called - the Identifier token. + the token that represents such a piece of source code is + called the Identifier token.

enum class TokenKind : char {
   Identifier
@@ -79,8 +79,8 @@ 

Tokenization

functions called fn or void.

- Each keyword gets it's own unique token, so that it's easy - to differentiate between them. + Each keyword gets its own unique token so that it's easy to + differentiate between them.

enum class TokenKind : char {
   ...
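  // a sketch of how the keyword enumerators might look; the exact names
  // (KwFn, KwVoid) are assumptions based on the keywords mentioned above
  KwFn,
  KwVoid,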
@@ -99,7 +99,7 @@ 

Tokenization

The rest of the tokens, including EOF, are composed of a single character. To make creating them easier, each of these tokens is placed into an array and - their respective enumerator values are the ascii code of + their respective enumerator values are the ASCII code of their corresponding character.

constexpr char singleCharTokens[] = {'\0', '(', ')', '{', '}', ':'};
@@ -115,10 +115,10 @@ 

Tokenization

Colon = singleCharTokens[5], };

- It might happen that a developer writes something in the - source code that cannot be represented by any of the known - tokens. In such cases an Unk token is used, - that represents every unknown piece of source code. + A developer might write something in the source code that + cannot be represented by any of the known tokens. In such + cases, an Unk token is used, which represents + every unknown piece of source code.

enum class TokenKind : char {
   Unk = -128,
@@ -153,12 +153,12 @@ 

The Lexer

The lexer is the part of the compiler that is responsible for producing the tokens. It iterates over a source file - character by character and does it's best to select the + character by character and does its best to select the correct token for each piece of code.

- Within the compiler a source file is represented by it's - path and a buffer filled with it's content. + Within the compiler, a source file is represented by its + path and a buffer filled with its content.

struct SourceFile {
   std::string_view path;
@@ -171,7 +171,7 @@ 

The Lexer

traverses the buffer. Because initially none of the characters in the source file is processed, the lexer points to the first character of the buffer and starts at the - position of line 1 column 0, or with other words, before the + position of line 1 column 0, or in other words, before the first character of the first line. The next Token is returned on demand by the getNextToken() method. @@ -194,7 +194,7 @@

The Lexer

eatNextChar() helper methods are introduced. The former returns which character is to be processed next, while the latter returns that character and advances the - lexer to the next character, while updating the correct line + lexer to the next character while updating the correct line and column position in the source file.

class Lexer {
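  // a rough sketch of the state and helpers described above; every name
  // except peekNextChar() and eatNextChar() is an assumption
  const SourceFile *source;
  size_t idx = 0;
  int line = 1;
  int column = 0;

  char peekNextChar() const { return source->buffer[idx]; }
  char eatNextChar() {
    char c = source->buffer[idx++];

    if (c == '\n') {
      ++line;
      column = 0;
    } else {
      ++column;
    }

    return c;
  }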
@@ -267,14 +267,14 @@ 

The Lexer

... }

- A for loop is used to iterate over the single - character tokens array and if the current character matches - one of them, the corresponding token is returned. This is - the benefit of storing the characters in an array and making - their corresponding TokenKind have the value of - the ascii code of the character the token represents. This - way the TokenKind can immediately be returned - with a simple cast. + A for loop is used to iterate over the + single-character tokens array and if the current character + matches one of them, the corresponding token is returned. + This is the benefit of storing the characters in an array + and making their corresponding TokenKind have + the value of the ASCII code of the character the token + represents. This way the TokenKind can + immediately be returned with a simple cast.

Token Lexer::getNextToken() {
   ...
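  // a sketch of the loop described above; 'currentChar' and the exact shape
  // of Token are assumptions based on the surrounding snippets
  for (char c : singleCharTokens)
    if (c == currentChar)
      return Token{tokenStartLocation, static_cast<TokenKind>(c)};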
@@ -288,7 +288,7 @@ 

The Lexer

Design Note

- In production grade compilers single character tokens + In production-grade compilers, single-character tokens are usually handled using hardcoded branches, as that will lead to the fastest running code in general.

@@ -310,7 +310,7 @@

Design Note

if (currentChar == '\0') return Token{tokenStartLocation, TokenKind::eof};

- In this compiler the goal is to use a representation + In this compiler, the goal is to use a representation that takes as little boilerplate code to implement and extend as possible.

@@ -352,7 +352,7 @@

Design Note

While comments are not important for this compiler, other compilers that convert one language to another (e.g.: Java to Kotlin) or formatting tools do need to - know about them. In such cases the lexer might return a + know about them. In such cases, the lexer might return a dedicated Comment token with the contents of the comment.

@@ -360,8 +360,8 @@

Design Note

Identifiers and Keywords

Identifiers consist of multiple characters in the form of - (a-z|A-Z)(a-z|A-Z|0-9)*. Initially keywords are - also lexed as identifiers but later their corresponding + (a-z|A-Z)(a-z|A-Z|0-9)*. Initially, keywords + are also lexed as identifiers but later their corresponding TokenKind is looked up from the map and the correct token representing them is returned.
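As a rough sketch, the identifier branch of getNextToken() might look something like this, assuming a keywords map from spelling to TokenKind, an isAlnum() helper, and a Token that stores the spelling of identifiers:

if (isAlpha(peekNextChar())) {
  std::string value{eatNextChar()};

  while (isAlnum(peekNextChar()))
    value += eatNextChar();

  // keywords are lexed as identifiers first, then looked up in the map
  if (auto it = keywords.find(value); it != keywords.end())
    return Token{tokenStartLocation, it->second, std::move(value)};

  return Token{tokenStartLocation, TokenKind::Identifier, std::move(value)};
}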

@@ -390,12 +390,12 @@

Identifiers and Keywords

}

Notice how isSpace, isAlpha, etc. - are all custom functions, when the C++ standard library also + are all custom functions when the C++ standard library also provides std::isspace, std::isalpha, etc.

- These functions are dependant on the current locale, so if + These functions are dependent on the current locale, so if for example 'a' is not considered alphabetic in the current locale, the lexer will no longer work as expected. @@ -403,8 +403,8 @@
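The custom helpers mentioned above sidestep the locale issue by hardcoding the ASCII ranges. A minimal sketch of what they might look like:

bool isSpace(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v';
}
bool isAlpha(char c) { return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z'); }
bool isNum(char c) { return '0' <= c && c <= '9'; }
bool isAlnum(char c) { return isAlpha(c) || isNum(c); }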

Identifiers and Keywords

If none of the above conditions matches the current character and the end of the function is reached, the lexer - wasn't able to figure out which token represents the piece - of code starting at the current character, so an + can't figure out which token represents the piece of code + starting at the current character, so an Unk token is returned.

Token Lexer::getNextToken() {
diff --git a/www/parsing.html b/www/parsing.html
index 7eebbbe..d60c931 100644
--- a/www/parsing.html
+++ b/www/parsing.html
@@ -54,7 +54,7 @@ 

The Abstract Syntax Tree

building blocks (nouns, verbs, etc.) of sentences in a spoken language. The "This section talks about the parser." sentence - is valid in the english language, because the mentioned + is valid in the English language because the mentioned building blocks follow each other in the correct order. Similarly fn main(): void {} is a valid function declaration in your language for the same

The Abstract Syntax Tree

virtual void dump(size_t level = 0) const = 0; };

- Currently the only Decl in the language is the - FunctionDecl, which additionally to what every + Currently, the only Decl in the language is the + FunctionDecl, which in addition to what every declaration has in common, also has a return type and a body.
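Based on that description, the node might look roughly like this; the exact member names and types are assumptions:

struct FunctionDecl : public Decl {
  Type type;
  std::unique_ptr<Block> body;

  void dump(size_t level = 0) const override;
};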

@@ -121,7 +121,7 @@

The Abstract Syntax Tree

To make the dumping of the node easier, the indent() helper is introduced, which returns the indentation of a given level. For the indentation of - each level 2 spaces are used. + each level, 2 spaces are used.

std::string indent(size_t level) { return std::string(level * 2, ' '); }

@@ -153,7 +153,7 @@

The Abstract Syntax Tree

};

Because a Block doesn't have any child nodes, - it's textual representation only includes the name of the + its textual representation only includes the name of the node.

void Block::dump(size_t level) const {
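  // a sketch of the body implied by the text above; the output stream and
  // exact formatting are assumptions
  std::cerr << indent(level) << "Block\n";
}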
@@ -162,7 +162,7 @@ 

The Abstract Syntax Tree

Design Note

- Lately some compiler engineers started using + Lately, some compiler engineers started using std::variant instead of inheritance to model the AST, where the variant acts as a union of nodes. @@ -199,8 +199,8 @@
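As a sketch of the idea, with made-up node names:

struct NumberLiteral {
  double value;
};

struct DeclRefExpr {
  std::string identifier;
};

// the variant acts as a union of the possible expression nodes
using Expr = std::variant<NumberLiteral, DeclRefExpr>;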

Design Note

Expr *innerExpr; };

- In this case the question is, who owns the memory for - the innerExpr field. Who allocates it, who + In this case, the question is, who owns the memory for + the innerExpr field? Who allocates it, who is responsible for freeing it, etc. The workaround for this problem is to use a std::unique_ptr.

@@ -208,9 +208,9 @@

Design Note

std::unique_ptr<Expr> innerExpr; };

- Now it's clear that the node is the owner of it's child - node. However to know the current type of the variant, - innerExpr needs to be type checked. The + Now it's clear that the node is the owner of its child + node. However, to know the current type of the variant, + innerExpr needs to be type-checked. The same type checking however could also be performed on the pointer itself if Expr was a polymorphic base class. To avoid complexities, this @@ -270,8 +270,8 @@
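A sketch of what that type check might look like with the variant, reusing the made-up NumberLiteral node from above:

if (auto *numberLiteral = std::get_if<NumberLiteral>(innerExpr.get())) {
  // the inner expression is known to be a number literal here
}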

Types

Design Note

- Theoretically a function is also a separate type, so in - a more complex language with a more complex type system + Theoretically, a function is also a separate type, so in + a more complex language with a more complex type system, this should also be encapsulated somehow.

@@ -280,7 +280,7 @@

Design Note

function type. To be able to model the complexity of C++ types precisely, Clang uses a layer-based type system, where - each layer is a different higher level type. + each layer is a different higher-level type.

An int * is represented using 2 layers, one @@ -327,7 +327,7 @@

The Parser

nextToken(lexer.getNextToken()) {} };

- Once the parser finished processing the next token, it calls + Once the parser finishes processing the next token, it calls the eatNextToken() helper, which consumes it and calls the lexer for the following one.
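A minimal sketch of the helper, assuming the parser stores the lexer and the next token in members called lexer and nextToken:

void eatNextToken() { nextToken = lexer.getNextToken(); }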

@@ -440,9 +440,9 @@

The Parser

... }

- It might happen that the source code is invalid and the - parser fails to process it completely. In that case the AST - is incomplete, which is marked by the + The source code might be invalid and the parser fails to + process it completely. In that case, the AST is incomplete, + which is marked by the incompleteAST flag.

class Parser {
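  // a sketch of the flag described above; the rest of the members are
  // omitted here
  bool incompleteAST = false;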
@@ -552,9 +552,9 @@ 

Parsing Functions

return report(nextToken.location, msg);

The parseFunctionDecl() method expects the - current token to be KwFn, saves it's location - as the beginning of the function and checks if the rest of - the tokens are in the correct order. + current token to be KwFn, saves its location as + the beginning of the function and checks if the rest of the + tokens are in the correct order.

// <functionDecl>
 //  ::= 'fn' <identifier> '(' ')' ':' <type> <block>
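// a rough sketch of how the method might begin, pieced together from the
// description above; the SourceLocation type name, the 'kind' field and the
// error message are assumptions
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  SourceLocation location = nextToken.location;
  eatNextToken(); // eat 'fn'

  if (nextToken.kind != TokenKind::Identifier)
    return report(nextToken.location, "expected identifier");
  ...
}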
@@ -583,7 +583,7 @@ 

Parsing Functions

}

The next tokens denoting the start and end of the argument - list are single character tokens, which don't require any + list are single-character tokens, which don't require any special handling.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
@@ -612,7 +612,7 @@ 

Parsing Functions

... }

- Finally the Block is parsed by the + Finally, the Block is parsed by the parseBlock() method. Similarly to the current method, parseBlock() also expects the first token to be the start of the block, so that token is checked @@ -627,7 +627,7 @@

Parsing Functions

... }

- If everything was successful, the + If everything is successful, the FunctionDecl node is returned.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
@@ -637,9 +637,9 @@ 

Parsing Functions

}

Parsing the type has been extracted into a dedicated helper - method, so that it can be reused later when the language is + method so that it can be reused later when the language is extended. The number type is handled in a later - chapter as so far there is no token that represents it. + chapter as so far no token can represent it.

This method checks if the current token is @@ -713,10 +713,10 @@

Parsing Functions

}

If main() is not found and the AST is complete, - an error is reported. In case of an incomplete AST it might - have been parsing the main() function that - caused the syntax error, so nothing is reported to avoid - false positives. + an error is reported. In the case of an incomplete AST it + might have been parsing the main() function + that caused the syntax error, so nothing is reported to + avoid false positives.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
 Parser::parseSourceFile() {
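  // a sketch of the check described above; 'functions', the 'identifier'
  // member and the wording of the error are assumptions
  bool hasMainFunction = false;
  for (auto &&fn : functions)
    hasMainFunction |= fn->identifier == "main";

  if (!hasMainFunction && !incompleteAST)
    report(nextToken.location, "main function not found");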
@@ -808,7 +808,7 @@ 

Language Design

the syntax of a language. It might be tempting to introduce a certain syntax, but it can easily increase the difficulty of parsing that language and can even make expanding a - grammar rule dependant on the semantics of the source code. + grammar rule dependent on the semantics of the source code.

As an example take a look at the function declaration syntax @@ -817,7 +817,7 @@

Language Design

int foo(int); declares a function named foo, which returns an int and - accepts an int as parameter. + accepts an int as a parameter. int foo(0); is also valid C++ code that declares an int variable and initializes it to 0. @@ -826,7 +826,7 @@

Language Design

The issue arises when int foo(x); is encountered by the parser. Since C++ allows the creation of user-defined types, - x can either be a type, or a value. If + x can either be a type or a value. If x is a type, the above sequence of tokens is a function declaration, if x is a value, it is a variable declaration. @@ -845,8 +845,8 @@

Language Design

When the same sequence of symbols can have a different meaning based on what context they appear in, the grammar is called ambiguous. C++ is known to have multiple ambiguities - in it's grammar, though some are inherited from C such as - the pointer syntax. + in its grammar, though some are inherited from C such as the + pointer syntax.

typedef char a;
 a * b; // declares 'b', a pointer to 'a'
@@ -864,7 +864,7 @@ 

Language Design

A well-known source of ambiguity in programming languages is the generic syntax. Consider the following generic function call, which can appear in both C++ and Kotlin - function<type>(argument). For the parser + function<type>(argument). For the parser, this is a sequence of Identifier, <, Identifier, >, (, Identifier and ). @@ -883,13 +883,13 @@

Language Design

The source of the problem is that < can - either mean the start of a generic argument list, or the + either mean the start of a generic argument list or the less-than operator. Rust resolved this ambiguity by introducing the turbofish (::<>). The Rust parser knows that < always means the - less-than operator in confusing situations, because a - generic argument list must begin with - :: followed by the <. + less-than operator in confusing situations because a generic + argument list must begin with :: followed by + the <.

fn f<T>() {}