From 96506c2463adc6749d450a5f5cf3d0121744b0d3 Mon Sep 17 00:00:00 2001
From: isuckatcs <65320245+isuckatcs@users.noreply.github.com>
Date: Tue, 30 Jul 2024 03:12:21 +0200
Subject: [PATCH] [www] use a spell-checker that catches mistakes that the
 previous one didn't catch
---
 www/index.html   | 39 ++++++++++++-----------
 www/lexing.html  | 64 +++++++++++++++++++-------------------
 www/parsing.html | 80 ++++++++++++++++++++++++------------------------
 3 files changed, 91 insertions(+), 92 deletions(-)

diff --git a/www/index.html b/www/index.html
index 4a0960b..f113882 100644
--- a/www/index.html
+++ b/www/index.html
@@ -49,8 +49,7 @@
This guide is intended to be a practical introduction to
how to design your language and implement a modern
- compiler for it. The source code of the compiler is
- available on
+ compiler for it. The compiler's source code is available on
How to Compile Your Language
>.

- When designing a language it helps if there is an idea what
- the language is going to be used for. Is it indented to be
+ When designing a language it helps if there is an idea of
+ what the language will be used for. Is it intended to be
making systems programming safer like Rust? Is it targeting
AI developers like Mojo?

- In this case the goal of the language is to showcase various
- algorithms and techniques that are used in the
+ In this case, the goal of the language is to showcase
+ various algorithms and techniques that are used in the
implementation of some of the most popular languages like
- C++, Kotlin or Rust.
+ C++, Kotlin, or Rust.

- The guide also covers how to create a platform specific
+ The guide also covers how to create a platform-specific
executable with the help of the LLVM
compiler infrastructure, which all of the previously
mentioned languages use for the same purpose. Yes, even
Kotlin can be
@@ -82,10 +81,10 @@
- In scripting languages like JavaScript the execution of the
+ In scripting languages like JavaScript, the execution of the
code usually starts from the first line of the source file,
while most programming languages including
your language treat the main()
function
@@ -99,16 +98,16 @@
- In the past 50 years the syntax of a function declaration
+ In the past 50 years, the syntax of a function declaration
was the name of the function followed by the list of
arguments enclosed by (
and )
. At
- first glance it is tempting to introduce some new exotic
+ first glance, it is tempting to introduce some new exotic
syntax like main<> {}
, but in many popular
languages <>
might mean something completely
- different, in this case a generic argument list. Using such
- syntax for a function definition would probably cause
- confusion for developers who try to get familiar with this
- new language, which is something to keep in mind.
+ different, in this case, a generic argument list. Using such
+ syntax for a function definition would probably confuse
+ developers who are trying to get familiar with this new
+ language, which is something to keep in mind.
@@ -121,7 +120,7 @@
frontend
contains the actual implementation
of the language; it is responsible for ensuring that the
program written in the specific language doesn't contain any
- errors, and reporting every issue it finds to the developer.
+ errors and reporting every issue it finds to the developer.
After validating the program, it turns it into an
@@ -141,9 +140,9 @@
- Yes, with enough time. However there is no need to learn all
- of them to create a successful language. In fact even a lot
- of modern popular languages like C++
,
+ Yes, with enough time. However, there is no need to learn
+ all of them to create a successful language. In fact, even a
+ lot of modern popular languages like C++
,
Rust
, Swift
,
Haskell
or Kotlin/Native
rely on
LLVM
for optimization and code generation.
diff --git a/www/lexing.html b/www/lexing.html
index 2d72967..6836d56 100644
--- a/www/lexing.html
+++ b/www/lexing.html
@@ -50,7 +50,7 @@
The first step of the compilation process is to take the
- textual representation of the program and brake it down into
+ textual representation of the program and break it down into
a list of tokens. Like spoken languages have sentences that
are composed of nouns, verbs, adjectives, etc., programming
languages similarly are composed of a set of tokens.
@@ -64,8 +64,8 @@
foo
or
bar
. One thing these names have in common is
that each of them uniquely identifies the given function, so
- the token that represent such piece of source code is called
- the Identifier
token.
+ the token that represents such a piece of source code is
+ called the Identifier
token.
enum class TokenKind : char {
Identifier
@@ -79,8 +79,8 @@ Tokenization
functions called fn
or void
.
- Each keyword gets it's own unique token, so that it's easy
- to differentiate between them.
+ Each keyword gets its unique token so that it's easy to
+ differentiate between them.
enum class TokenKind : char {
...
@@ -99,7 +99,7 @@ Tokenization
The rest of the tokens, including EOF
are
tokens composed of a single character. To make creating them
easier, each of these tokens is placed into an array and
- their respective enumerator values are the ascii code of
+ their respective enumerator values are the ASCII code of
their corresponding character.
constexpr char singleCharTokens[] = {'\0', '(', ')', '{', '}', ':'};
@@ -115,10 +115,10 @@ Tokenization
Colon = singleCharTokens[5],
};
- It might happen that a developer writes something in the
- source code that cannot be represented by any of the known
- tokens. In such cases an Unk
token is used,
- that represents every unknown piece of source code.
+ A developer might write something in the source code that
+ cannot be represented by any of the known tokens. In such
+ cases, an Unk
token is used, which represents
+ every unknown piece of source code.
enum class TokenKind : char {
Unk = -128,
@@ -153,12 +153,12 @@ The Lexer
The lexer is the part of the compiler that is responsible
for producing the tokens. It iterates over a source file
- character by character and does it's best to select the
+ character by character and does its best to select the
correct token for each piece of code.
- Within the compiler a source file is represented by it's
- path and a buffer filled with it's content.
+ Within the compiler, a source file is represented by its
+ path and a buffer filled with its content.
struct SourceFile {
std::string_view path;
@@ -171,7 +171,7 @@ The Lexer
traverses the buffer. Because initially none of the
characters in the source file is processed, the lexer points
to the first character of the buffer and starts at the
- position of line 1 column 0, or with other words, before the
+ position of line 1 column 0, or in other words, before the
first character of the first line. The next
Token
is returned on demand by the
getNextToken()
method.
@@ -194,7 +194,7 @@ The Lexer
eatNextChar()
helper methods are introduced.
The former returns which character is to be processed next,
while the latter returns that character and advances the
- lexer to the next character, while updating the correct line
+ lexer to the next character while updating the correct line
and column position in the source file.
class Lexer {
@@ -267,14 +267,14 @@ The Lexer
...
}
- A for
loop is used to iterate over the single
- character tokens array and if the current character matches
- one of them, the corresponding token is returned. This is
- the benefit of storing the characters in an array and making
- their corresponding TokenKind
have the value of
- the ascii code of the character the token represents. This
- way the TokenKind
can immediately be returned
- with a simple cast.
+ A for
loop is used to iterate over the
+ single-character tokens array and if the current character
+ matches one of them, the corresponding token is returned.
+ This is the benefit of storing the characters in an array
+ and making their corresponding TokenKind
have
+ the value of the ASCII code of the character the token
+ represents. This way the TokenKind
can
+ immediately be returned with a simple cast.
Token Lexer::getNextToken() {
...
@@ -288,7 +288,7 @@ The Lexer
Design Note
- In production grade compilers single character tokens
+ In production-grade compilers, single-character tokens
are usually handled using hardcoded branches, as that
will lead to the fastest running code in general.
@@ -310,7 +310,7 @@ Design Note
if (currentChar == '\0')
return Token{tokenStartLocation, TokenKind::eof};
- In this compiler the goal is to use a representation
+ In this compiler, the goal is to use a representation
that takes as little boilerplate code to implement and
extend as possible.
@@ -352,7 +352,7 @@ Design Note
While comments are not important for this compiler,
other compilers that convert one language to another
(e.g.: Java to Kotlin) or formatting tools do need to
- know about them. In such cases the lexer might return a
+ know about them. In such cases, the lexer might return a
dedicated Comment
token with the contents
of the comment.
@@ -360,8 +360,8 @@ Design Note
Identifiers and Keywords
Identifiers consist of multiple characters in the form of
- (a-z|A-Z)(a-z|A-Z|0-9)*
. Initially keywords are
- also lexed as identifiers but later their corresponding
+ (a-z|A-Z)(a-z|A-Z|0-9)*
. Initially, keywords
+ are also lexed as identifiers but later their corresponding
TokenKind
is looked up from the map and the
correct token representing them is returned.
@@ -390,12 +390,12 @@ Identifiers and Keywords
}
Notice how isSpace
, isAlpha
, etc.
- are all custom functions, when the C++ standard library also
+ are all custom functions, even though the C++ standard library also
provides std::isspace
,
std::isalpha
, etc.
- These functions are dependant on the current locale, so if
+ These functions are dependent on the current locale, so if
for example
'a'
is not considered alphabetic in the current
locale, the lexer will no longer work as expected.
@@ -403,8 +403,8 @@
Identifiers and Keywords
If none of the above conditions matches the current
character and the end of the function is reached, the lexer
- wasn't able to figure out which token represents the piece
- of code starting at the current character, so an
+ can't figure out which token represents the piece of code
+ starting at the current character, so an
Unk
token is returned.
Token Lexer::getNextToken() {
diff --git a/www/parsing.html b/www/parsing.html
index 7eebbbe..d60c931 100644
--- a/www/parsing.html
+++ b/www/parsing.html
@@ -54,7 +54,7 @@ The Abstract Syntax Tree
building blocks (nouns, verbs, etc.) of sentences in a
spoken language. The
"This section talks about the parser."
sentence
- is valid in the english language, because the mentioned
+ is valid in the English language because the mentioned
building blocks follow each other in the correct order.
Similarly fn main(): void {}
is a valid
function declaration in your language for the same
@@ -98,8 +98,8 @@ The Abstract Syntax Tree
virtual void dump(size_t level = 0) const = 0;
};
- Currently the only Decl
in the language is the
- FunctionDecl
, which additionally to what every
+ Currently, the only Decl
in the language is the
+ FunctionDecl
, which in addition to what every
declaration has in common, also has a return type and a
body.
@@ -121,7 +121,7 @@ The Abstract Syntax Tree
To make the dumping of the node easier the
indent()
helper is introduced, which returns
the indentation of a given level. For the indentation of
- each level 2 spaces are used.
+ each level, 2 spaces are used.
std::string indent(size_t level) { return std::string(level * 2, ' '); }
@@ -153,7 +153,7 @@
The Abstract Syntax Tree
};
Because a Block
doesn't have any child nodes,
- it's textual representation only includes the name of the
+ its textual representation only includes the name of the
node.
void Block::dump(size_t level) const {
@@ -162,7 +162,7 @@ The Abstract Syntax Tree
Design Note
- Lately some compiler engineers started using
+ Lately, some compiler engineers started using
std::variant
instead of inheritance to
model the AST, where the variant acts as a union of
nodes.
@@ -199,8 +199,8 @@
Design Note
Expr *innerExpr;
};
- In this case the question is, who owns the memory for
- the innerExpr
field. Who allocates it, who
+ In this case, the question is, who owns the memory for
+ the innerExpr
field? Who allocates it, who
is responsible for freeing it, etc. The workaround for
this problem is to use a std::unique_ptr
.
@@ -208,9 +208,9 @@ Design Note
std::unique_ptr<Expr> innerExpr;
};
- Now it's clear that the node is the owner of it's child
- node. However to know the current type of the variant,
- innerExpr
needs to be type checked. The
+ Now it's clear that the node is the owner of its child
+ node. However, to know the current type of the variant,
+ innerExpr
needs to be type-checked. The
same type checking, however, could also be performed on
the pointer itself if Expr
was a
polymorphic base class. To avoid complexities, this
@@ -270,8 +270,8 @@
Types
Design Note
- Theoretically a function is also a separate type, so in
- a more complex language with a more complex type system
+ Theoretically, a function is also a separate type, so in
+ a more complex language with a more complex type system,
this should also be encapsulated somehow.
@@ -280,7 +280,7 @@
Design Note
function type. To be able to model the complexity of C++
types precisely,
Clang
uses a layer-based type system, where
- each layer is a different higher level type.
+ each layer is a different higher-level type.
An int *
is represented using 2 layers, one
@@ -327,7 +327,7 @@
The Parser
nextToken(lexer.getNextToken()) {}
};
- Once the parser finished processing the next token, it calls
+ Once the parser finishes processing the next token, it calls
the eatNextToken()
helper, which consumes it
and calls the lexer for the following one.
- It might happen that the source code is invalid and the
- parser fails to process it completely. In that case the AST
- is incomplete, which is marked by the
+ The source code might be invalid, and the parser might fail
+ to process it completely. In that case, the AST is incomplete,
+ which is marked by the
incompleteAST
flag.
class Parser {
@@ -552,9 +552,9 @@ Parsing Functions
return report(nextToken.location, msg);
The parseFunctionDecl()
method expects the
- current token to be KwFn
, saves it's location
- as the beginning of the function and checks if the rest of
- the tokens are in the correct order.
+ current token to be KwFn
, saves its location as
+ the beginning of the function and checks if the rest of the
+ tokens are in the correct order.
// <functionDecl>
// ::= 'fn' <identifier> '(' ')' ':' <type> <block>
@@ -583,7 +583,7 @@ Parsing Functions
}
The next tokens denoting the start and end of the argument
- list are single character tokens, which don't require any
+ list are single-character tokens, which don't require any
special handling.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
@@ -612,7 +612,7 @@ Parsing Functions
...
}
- Finally the Block
is parsed by the
+ Finally, the Block
is parsed by the
parseBlock()
method. Similarly to the current
method, parseBlock()
also expects the first
token to be the start of the block, so that token is checked
@@ -627,7 +627,7 @@
- If everything was successful, the
+ If everything is successful, the
FunctionDecl
node is returned.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
@@ -637,9 +637,9 @@ Parsing Functions
}
Parsing the type has been extracted into a dedicated helper
- method, so that it can be reused later when the language is
+ method so that it can be reused later when the language is
extended. The number
type is handled in a later
- chapter as so far there is no token that represents it.
+ chapter, since so far no token exists to represent it.
This method checks if the current token is
@@ -713,10 +713,10 @@
If main()
is not found and the AST is complete
- an error is reported. In case of an incomplete AST it might
- have been parsing the main()
function that
- caused the syntax error, so nothing is reported to avoid
- false positives.
+ an error is reported. In the case of an incomplete AST, it
+ might have been parsing the main()
function
+ that caused the syntax error, so nothing is reported to
+ avoid false positives.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
@@ -808,7 +808,7 @@ Language Design
the syntax of a language. It might be tempting to introduce
a certain syntax, but it can easily increase the difficulty
of parsing that language and can even make expanding a
- grammar rule dependant on the semantics of the source code.
+ grammar rule dependent on the semantics of the source code.
As an example take a look at the function declaration syntax
@@ -817,7 +817,7 @@
Language Design
int foo(int);
declares a function named
foo
, which returns an int
and
- accepts an int
as parameter.
+ accepts an int
as a parameter.
int foo(0);
is also valid C++ code that
declares an int
variable and initializes it to
0
.
@@ -826,7 +826,7 @@
Language Design
The issue arises when
int foo(x);
is encountered by the parser. Since
C++ allows the creation of user-defined types,
- x
can either be a type, or a value. If
+ x
can either be a type or a value. If
x
is a type, the above sequence of tokens is a
function declaration, if x
is a value, it is a
variable declaration.
@@ -845,8 +845,8 @@ Language Design
When the same sequence of symbols can have a different
meaning based on what context they appear in, the grammar is
called ambiguous. C++ is known to have multiple ambiguities
- in it's grammar, though some are inherited from C such as
- the pointer syntax.
+ in its grammar, though some are inherited from C such as the
+ pointer syntax.
typedef char a;
a * b; // declares 'b', a pointer to 'a'
@@ -864,7 +864,7 @@ Language Design
A well-known source of ambiguity in programming languages is
the generic syntax. Consider the following generic function
call, which can appear in both C++ and Kotlin
- function<type>(argument)
. For the parser
+ function<type>(argument)
. For the parser,
this is a sequence of Identifier
,
<
, Identifier
, >
,
(
, Identifier
and )
.
@@ -883,13 +883,13 @@ Language Design
The source of the problem is that <
can
- either mean the start of a generic argument list, or the
+ either mean the start of a generic argument list or the
less-than operator. Rust resolved this ambiguity by
introducing the turbofish (::<>
). The Rust
parser knows that <
always means the
- less-than operator in confusing situations, because a
- generic argument list must begin with
- ::
followed by the <
.
+ less-than operator in confusing situations because a generic
+ argument list must begin with ::
followed by
+ the <
.
fn f<T>() {}