[www] use a spell-checker that catches mistakes that the previous one didn't catch
isuckatcs committed Jul 30, 2024
1 parent 5372513 commit 96506c2
Showing 3 changed files with 91 additions and 92 deletions.
39 changes: 19 additions & 20 deletions www/index.html
@@ -49,28 +49,27 @@ <h1>How to Compile Your Language</h1>
<p>
This guide is intended to be a practical introduction to how
to design <i>your language</i> and implement a modern
-compiler for it. The source code of the compiler is
-available on
+compiler for it. The compiler's source code is available on
<a
href="https://github.com/isuckatcs/how-to-compile-your-language"
target="_blank"
>GitHub</a
>.
</p>
<p>
-When designing a language it helps if there is an idea what
-the language is going to be used for. Is it indented to be
+When designing a language it helps if there is an idea of
+what the language will be used for. Is it intended to be
making systems programming safer like Rust? Is it targeting
AI developers like Mojo?
</p>
<p>
-In this case the goal of the language is to showcase various
-algorithms and techniques that are used in the
+In this case, the goal of the language is to showcase
+various algorithms and techniques that are used in the
implementation of some of the most popular languages like
-C++, Kotlin or Rust.
+C++, Kotlin, or Rust.
</p>
<p>
-The guide also covers how to create a platform specific
+The guide also covers how to create a platform-specific
executable with the help of the LLVM compiler
infrastructure, which all of the previously mentioned
languages use for the same purpose. Yes, even Kotlin can be
@@ -82,10 +81,10 @@ <h2>What Does Every Language Have in Common?</h2>
When creating a new language, the first question is how to
get started. There is something that every existing language
and <i>your language</i> must define too, which is the entry
-point from which the execution starts.
+point from which the execution begins.
</p>
<p>
-In scripting languages like JavaScript the execution of the
+In scripting languages like JavaScript, the execution of the
code usually starts from the first line of the source file,
while most programming languages including
<i>your language</i> treat the <code>main()</code> function
@@ -99,16 +98,16 @@ <h2>What Does Every Language Have in Common?</h2>
already popular language.
</p>
<p>
-In the past 50 years the syntax of a function declaration
+In the past 50 years, the syntax of a function declaration
was the name of the function followed by the list of
arguments enclosed by <code>(</code> and <code>)</code>. At
-first glance it is tempting to introduce some new exotic
+first glance, it is tempting to introduce some new exotic
syntax like <code>main<> {}</code>, but in many popular
languages <code><></code> might mean something completely
-different, in this case a generic argument list. Using such
-syntax for a function definition would probably cause
-confusion for developers who try to get familiar with this
-new language, which is something to keep in mind.
+different, in this case, a generic argument list. Using such
+syntax for a function definition would probably confuse
+developers who are trying to get familiar with this new
+language, which is something to keep in mind.
</p>
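<p>
To make the clash concrete, here is a small C++ sketch, not taken from the
repository, in which <code><></code> already denotes a generic (template)
argument list:
</p>
<pre><code>// Illustrative C++ only: "<>" is an explicit, empty template argument list.
template <typename T = int>
T zero() {
  return T{};
}

int x = zero<>(); // the same "<>" that an exotic "main<> {}" would collide with
</code></pre>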
<h2>How Is This Text Turned into an Executable?</h2>
<p>
@@ -121,7 +120,7 @@ <h2>How Is This Text Turned into an Executable?</h2>
The <code>frontend</code> contains the actual implementation
of the language, it is responsible for ensuring that the
program written in the specific language doesn't contain any
-errors, and reporting every issue it finds to the developer.
+errors and reporting every issue it finds to the developer.
</p>
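<p>
As a rough illustration of that responsibility, a frontend usually attaches
a source position to every diagnostic it emits. A minimal sketch (the names
below are assumptions, not taken from the repository) could look like this:
</p>
<pre><code>#include <iostream>
#include <string_view>

struct SourceLocation {
  std::string_view filepath;
  int line;
  int col;
};

// Print the diagnostic in the common "file:line:col: error: message" form.
void report(SourceLocation location, std::string_view message) {
  std::cerr << location.filepath << ':' << location.line << ':'
            << location.col << ": error: " << message << '\n';
}
</code></pre>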
<p>
After validating the program, it turns it into an
@@ -141,9 +140,9 @@ <h2>How Is This Text Turned into an Executable?</h2>
</p>
<h2>Is It Possible to Learn All These Topics?</h2>
<p>
-Yes, with enough time. However there is no need to learn all
-of them to create a successful language. In fact even a lot
-of modern popular languages like <code>C++</code>,
+Yes, with enough time. However, there is no need to learn
+all of them to create a successful language. In fact, even a
+lot of modern popular languages like <code>C++</code>,
<code>Rust</code>, <code>Swift</code>,
<code>Haskell</code> or <code>Kotlin/Native</code> rely on
<code>LLVM</code> for optimization and code generation.
64 changes: 32 additions & 32 deletions www/lexing.html
@@ -50,7 +50,7 @@
<h1>Tokenization</h1>
<p>
The first step of the compilation process is to take the
-textual representation of the program and brake it down into
+textual representation of the program and break it down into
a list of tokens. Like spoken languages have sentences that
are composed of nouns, verbs, adjectives, etc., programming
languages similarly are composed of a set of tokens.
@@ -64,8 +64,8 @@ <h1>Tokenization</h1>
be named anything else like <code>foo</code> or
<code>bar</code>. One thing these names have in common is
that each of them uniquely identifies the given function, so
-the token that represent such piece of source code is called
-the <code>Identifier</code> token.
+the token that represents such a piece of source code is
+called the <code>Identifier</code> token.
</p>
<pre><code>enum class TokenKind : char {
Identifier
@@ -79,8 +79,8 @@ <h1>Tokenization</h1>
functions called <code>fn</code> or <code>void</code>.
</p>
<p>
-Each keyword gets it's own unique token, so that it's easy
-to differentiate between them.
+Each keyword gets its unique token so that it's easy to
+differentiate between them.
</p>
<pre><code>enum class TokenKind : char {
...
@@ -99,7 +99,7 @@ <h1>Tokenization</h1>
The rest of the tokens, including <code>EOF</code> are
tokens composed of a single character. To make creating them
easier, each of these tokens is placed into an array and
-their respective enumerator values are the ascii code of
+their respective enumerator values are the ASCII code of
their corresponding character.
</p>
<pre><code>constexpr char singleCharTokens[] = {'\0', '(', ')', '{', '}', ':'};
@@ -115,10 +115,10 @@ <h1>Tokenization</h1>
Colon = singleCharTokens[5],
};</code></pre>
<p>
-It might happen that a developer writes something in the
-source code that cannot be represented by any of the known
-tokens. In such cases an <code>Unk</code> token is used,
-that represents every unknown piece of source code.
+A developer might write something in the source code that
+cannot be represented by any of the known tokens. In such
+cases an <code>Unk</code> token is used, that represents
+every unknown piece of source code.
</p>
<pre><code>enum class TokenKind : char {
Unk = -128,
@@ -153,12 +153,12 @@ <h2>The Lexer</h2>
<p>
The lexer is the part of the compiler that is responsible
for producing the tokens. It iterates over a source file
-character by character and does it's best to select the
+character by character and does its best to select the
correct token for each piece of code.
</p>
<p>
-Within the compiler a source file is represented by it's
-path and a buffer filled with it's content.
+Within the compiler, a source file is represented by its
+path and a buffer filled with its content.
</p>
<pre><code>struct SourceFile {
std::string_view path;
@@ -171,7 +171,7 @@ <h2>The Lexer</h2>
traverses the buffer. Because initially none of the
characters in the source file is processed, the lexer points
to the first character of the buffer and starts at the
-position of line 1 column 0, or with other words, before the
+position of line 1 column 0, or in other words, before the
first character of the first line. The next
<code>Token</code> is returned on demand by the
<code>getNextToken()</code> method.
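<p>
A minimal sketch of that state (the field names below are assumptions made
for this example):
</p>
<pre><code>class Lexer {
  const SourceFile *source;
  size_t idx = 0;

  // The position of the last processed character: line 1, column 0 means
  // nothing has been processed yet.
  int line = 1;
  int column = 0;

public:
  explicit Lexer(const SourceFile &source) : source(&source) {}

  Token getNextToken();
};
</code></pre>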
@@ -194,7 +194,7 @@ <h2>The Lexer</h2>
<code>eatNextChar()</code> helper methods are introduced.
The former returns which character is to be processed next,
while the latter returns that character and advances the
-lexer to the next character, while updating the correct line
+lexer to the next character while updating the correct line
and column position in the source file.
</p>
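<p>
Continuing the sketch above, and assuming the two helpers are private
members of <code>Lexer</code> and that <code>SourceFile</code> has a
<code>buffer</code> field holding the contents of the file, they could be
implemented roughly like this:
</p>
<pre><code>char Lexer::peekNextChar() const { return source->buffer[idx]; }

char Lexer::eatNextChar() {
  char c = source->buffer[idx++];

  // Keep the position in sync with the character that was just consumed.
  ++column;
  if (c == '\n') {
    ++line;
    column = 0;
  }

  return c;
}
</code></pre>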
<pre><code>class Lexer {
@@ -267,14 +267,14 @@ <h2>The Lexer</h2>
...
}</code></pre>
<p>
-A <code>for</code> loop is used to iterate over the single
-character tokens array and if the current character matches
-one of them, the corresponding token is returned. This is
-the benefit of storing the characters in an array and making
-their corresponding <code>TokenKind</code> have the value of
-the ascii code of the character the token represents. This
-way the <code>TokenKind</code> can immediately be returned
-with a simple cast.
+A <code>for</code> loop is used to iterate over the
+single-character tokens array and if the current character
+matches one of them, the corresponding token is returned.
+This is the benefit of storing the characters in an array
+and making their corresponding <code>TokenKind</code> have
+the value of the ASCII code of the character the token
+represents. This way the <code>TokenKind</code> can
+immediately be returned with a simple cast.
</p>
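<p>
In other words, the lookup the paragraph describes boils down to a few lines
(a sketch only; <code>currentChar</code> and
<code>tokenStartLocation</code> are assumed to be locals of
<code>getNextToken()</code>):
</p>
<pre><code>for (char c : singleCharTokens)
  if (c == currentChar)
    return Token{tokenStartLocation, static_cast<TokenKind>(c)};
</code></pre>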
<pre><code>Token Lexer::getNextToken() {
...
@@ -288,7 +288,7 @@ <h2>The Lexer</h2>
<blockquote>
<h3>Design Note</h3>
<p>
-In production grade compilers single character tokens
+In production-grade compilers, single-character tokens
are usually handled using hardcoded branches, as that
will lead to the fastest running code in general.
</p>
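<p>
For comparison, the hardcoded-branch approach mentioned here could look
roughly like this (the <code>Lpar</code>/<code>Rpar</code> enumerator names
are assumptions for the sketch):
</p>
<pre><code>switch (currentChar) {
case '(':
  return Token{tokenStartLocation, TokenKind::Lpar};
case ')':
  return Token{tokenStartLocation, TokenKind::Rpar};
case ':':
  return Token{tokenStartLocation, TokenKind::Colon};
// ... one case per single-character token
}
</code></pre>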
@@ -310,7 +310,7 @@ <h3>Design Note</h3>
if (currentChar == '\0')
return Token{tokenStartLocation, TokenKind::eof};</code></pre>
<p>
-In this compiler the goal is to use a representation
+In this compiler, the goal is to use a representation
that takes as little boilerplate code to implement and
extend as possible.
</p>
@@ -352,16 +352,16 @@ <h3>Design Note</h3>
While comments are not important for this compiler,
other compilers that convert one language to another
(e.g.: Java to Kotlin) or formatting tools do need to
-know about them. In such cases the lexer might return a
+know about them. In such cases, the lexer might return a
dedicated <code>Comment</code> token with the contents
of the comment.
</p>
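<p>
A sketch of that alternative (both <code>TokenKind::Comment</code> and a
token field carrying the text are assumptions made for this example):
</p>
<pre><code>if (currentChar == '/' && peekNextChar() == '/') {
  std::string text = "/";

  while (peekNextChar() != '\n' && peekNextChar() != '\0')
    text += eatNextChar();

  return Token{tokenStartLocation, TokenKind::Comment, text};
}
</code></pre>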
</blockquote>
<h2>Identifiers and Keywords</h2>
<p>
Identifiers consist of multiple characters in the form of
-<code>(a-z|A-Z)(a-z|A-Z|0-9)*</code>. Initially keywords are
-also lexed as identifiers but later their corresponding
+<code>(a-z|A-Z)(a-z|A-Z|0-9)*</code>. Initially, keywords
+are also lexed as identifiers but later their corresponding
<code>TokenKind</code> is looked up from the map and the
correct token representing them is returned.
</p>
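<p>
The keyword lookup can be as small as a map from spelling to token kind; a
sketch (the <code>KwFn</code>/<code>KwVoid</code> enumerator names are
assumptions):
</p>
<pre><code>const std::unordered_map<std::string_view, TokenKind> keywords = {
    {"fn", TokenKind::KwFn},
    {"void", TokenKind::KwVoid},
};

// After lexing an identifier, check whether its spelling is a keyword;
// otherwise an Identifier token is returned as usual.
if (auto it = keywords.find(identifier); it != keywords.end())
  return Token{tokenStartLocation, it->second};
</code></pre>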
@@ -390,21 +390,21 @@ <h2>Identifiers and Keywords</h2>
}</code></pre>
<p>
Notice how <code>isSpace</code>, <code>isAlpha</code>, etc.
-are all custom functions, when the C++ standard library also
+are all custom functions when the C++ standard library also
provides <code>std::isspace</code>,
<code>std::isalpha</code>, etc.
</p>
<p>
-These functions are dependant on the current locale, so if
+These functions are dependent on the current locale, so if
for example
<code>'a'</code> is not considered alphabetic in the current
locale, the lexer will no longer work as expected.
</p>
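<p>
Locale-independent replacements are only a few lines each; a minimal sketch
of what such helpers might look like:
</p>
<pre><code>bool isSpace(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\v' || c == '\f' ||
         c == '\r';
}

bool isAlpha(char c) { return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z'); }

bool isNum(char c) { return '0' <= c && c <= '9'; }

bool isAlnum(char c) { return isAlpha(c) || isNum(c); }
</code></pre>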
<p>
If none of the above conditions matches the current
character and the end of the function is reached, the lexer
-wasn't able to figure out which token represents the piece
-of code starting at the current character, so an
+can't figure out which token represents the piece of code
+starting at the current character, so an
<code>Unk</code> token is returned.
</p>
<pre><code>Token Lexer::getNextToken() {