A language evolves by the introduction of new features (optimizations, new primitives, etc). If you want to use such a novel language feature in its own implementation, then you need to bootstrap it:
1. First, implement support for it in your compiler and/or eval, and produce an executable that can already compile and/or eval this new version of the language.
2. After that, you can start using the feature. You may now even rewrite the implementation of this very feature so that it uses/assumes the feature itself.
The process is confusing enough that it makes sense to fork the codebase at the point between 1) and 2). Strictly speaking, it would be enough to `git checkout` and build a specific prior commit to obtain an executable that can host the bootstrap process, but it's better to have separate branches.
Once it's working fine, the old branch becomes irrelevant/stale, except for:
- Didactic purposes: to make it easier to understand how a self-hosted language grows.
- Aesthetics: cherry-picking or backporting changes wouldn't be possible without having separate branches.
- "Oh God, we have lost all the executables!" -- bootstrap again all the way up from a C implementation.
- Sometimes the implementation of the new feature simply requires two parallel, wildly diverging instances of the codebase, until the new feature is fully implemented/debugged/bootstrapped.
- Secure computing requires, above all, trusting your compiler. Reproducible builds, and being able to bootstrap your compiler on top of multiple platforms, are useful for achieving it. See here, and here for more.
NOTE: do not confuse our notion of a stage (as in 'developmental stages') with e.g. the 3 bootstrap stages while compiling GCC. Our notion is an endless iterative process of evolving the language. Suggestions for a better nomenclature are welcome!
The bootstrap process in general is the following:
1. Stage `n` checks out and builds its parent/hosting stage under `build/` (typically stage `n-1` of the same language) to acquire an `eval` executable.
2. Using that executable and the compiler of the previous stage, it compiles a version of itself that can already load and compile the codebase in stage `n`, but the resulting executable may not be fully functional yet (in this phase the `evolving?` variable is true). It's called `eval0` in the build process.

   Note that this phase is not always necessary, depending on the nature of the new features that are being bootstrapped, and in some earlier stages it is not done. It's useful to enjoy the benefits of the new features of this stage, and it's necessary when we introduce a new feature that the compiler itself needs to be aware of (either because its implementation relies on or uses this feature, or e.g. in the case of the introduction of modules it needs to reach through module boundaries during the compilation process).

   Note that, to speed up development, `eval0` is not automatically rebuilt when its source files change. You can rebuild it using `make eval0`.
3. Then it uses the resulting, potentially only semi-functional `eval0` executable to now compile itself using its own compiler, which will yield the final, fully functional `eval1` executable.
4. Optionally, the `test-bootstrap` makefile target runs one more cycle to produce `eval2`, and checks if the compiler's output is identical with that of the previous step.
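The `eval0`/`eval1`/`eval2` chain above can be illustrated with a toy model. This is only a sketch in Python, not how the Maru build is implemented: the `Executable` record and the `compile_stage` and `bootstrap` functions are hypothetical names, and an executable is reduced to two numbers (which stage's sources it implements, and which stage's compiler emitted it).

```python
from typing import NamedTuple, Tuple


class Executable(NamedTuple):
    implements: int    # stage whose sources this executable was built from
    generated_by: int  # stage of the compiler that emitted its code


def compile_stage(host: Executable, source_stage: int) -> Executable:
    """Toy model: the host compiles the sources of `source_stage`,
    using the compiler that the host itself implements."""
    return Executable(implements=source_stage, generated_by=host.implements)


def bootstrap(previous_eval: Executable, n: int) -> Tuple[Executable, Executable, Executable]:
    eval0 = compile_stage(previous_eval, n)  # stage n sources, old compiler
    eval1 = compile_stage(eval0, n)          # stage n sources, stage n compiler
    eval2 = compile_stage(eval1, n)          # one more cycle, as in test-bootstrap
    return eval0, eval1, eval2


# Assumed starting point: a fully bootstrapped executable of stage 1.
previous_eval = Executable(implements=1, generated_by=1)
eval0, eval1, eval2 = bootstrap(previous_eval, 2)
```

In this model `eval0` differs from `eval1` because it was still emitted by the old compiler, while `eval1` and `eval2` are equal. That equality is the fixed point that the extra cycle checks for.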
The `boot.l` and `emit.l` files are kept in the same branch with the `eval.l` whose semantics they are assuming, i.e. the `eval1` executable of the `maru.2` stage is built by the `eval` executable, the `boot.l`, and the `emit.l` files of the previous, `maru.1` stage.
The developmental stages of the language are kept in separate git branches. When a new stage needs to be opened, the readme is replaced in the branch that became stale to only document what's new/relevant for that specific stage (i.e. if you switch branches on the GitHub website you'll see it displayed).
Naming convention of the branches (no `main` branch): `[language name].[bootstrap stage]`, e.g. `maru.1`.

Optionally, and typically for the first stage, it may also include the name of the parent language, from which this "bootstrap sprout" grows out: `[language name].[bootstrap stage].[parent language]`, e.g. `maru.1.c99`, which holds the bootstrap implementation written in C.
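As an illustration of the naming convention, here is a small Python sketch that splits a branch name into its parts. It is not part of the repository. `StageBranch` and `parse_branch` are hypothetical names, and the sketch assumes that the stage is a decimal integer and that neither language name contains a dot.

```python
import re
from typing import NamedTuple, Optional


class StageBranch(NamedTuple):
    language: str
    stage: int
    parent_language: Optional[str]  # None when the branch name omits it


# [language name].[bootstrap stage] with an optional .[parent language]
_BRANCH_RE = re.compile(r"^([^.]+)\.(\d+)(?:\.([^.]+))?$")


def parse_branch(name: str) -> StageBranch:
    match = _BRANCH_RE.match(name)
    if match is None:
        raise ValueError(f"not a stage branch name: {name!r}")
    language, stage, parent = match.groups()
    return StageBranch(language, int(stage), parent)
```

With these assumptions, `parse_branch("maru.1.c99")` yields the language `maru`, stage `1` and parent language `c99`, while a name such as `main` is rejected.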
During the build the previous stage is `git checkout`'ed locally under `./build/`, and its own build process is invoked in that directory. Note that this potentially becomes a recursive process until a stage is reached that can be built using some external dependency. This may happen by reaching an `eval.c` in the bottom stage/branch called `maru.1.c99` that can be built using a C compiler, or by reaching a stage that has its build artifacts checked into the git repo (e.g. an `eval.s` or `eval.ll`).
Starting with `maru.5`, the LLVM IR output (`eval2.ll`) is committed into the repo under `build/`. This effectively short-circuits the recursive bootstrap process by straight away producing an executable from the checked-in `eval2.ll` using `llc` (see `make eval-llvm`). Deleting these files (note: `make clean` retains them! see `make veryclean`), or touching the sources will force a normal bootstrap process hosted by the previous stage.
It's possible to skip these shortcuts and run the bootstrap procedure all the way from the/a bottom stage by `make PREVIOUS_STAGE_EXTRA_TARGETS=veryclean veryclean test-bootstrap`.
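The recursion described in the last three paragraphs can be modelled roughly as follows. This is a Python sketch under stated assumptions, not the actual makefile logic: the `lang.*` branch names, the `Stage` record and the `build_plan` function are all made up for illustration, and the `use_shortcuts` flag only mimics the effect of ignoring checked-in artifacts at every level.

```python
from typing import Dict, List, NamedTuple, Optional


class Stage(NamedTuple):
    parent: Optional[str]                      # hosting stage, None for the bottom stage
    checked_in_artifact: Optional[str] = None  # e.g. "eval2.ll", if committed


# Hypothetical stage graph, NOT the real branch layout of the repository.
STAGES: Dict[str, Stage] = {
    "lang.1.c99": Stage(parent=None),
    "lang.1": Stage(parent="lang.1.c99"),
    "lang.2": Stage(parent="lang.1"),
    "lang.3": Stage(parent="lang.2", checked_in_artifact="eval2.ll"),
    "lang.4": Stage(parent="lang.3"),
}


def build_plan(branch: str, use_shortcuts: bool = True) -> List[str]:
    """Return the build steps, bottom-most first, needed to get an eval for `branch`."""
    stage = STAGES[branch]
    if use_shortcuts and stage.checked_in_artifact is not None:
        # A checked-in artifact ends the recursion early.
        return [f"{branch}: produce eval from checked-in {stage.checked_in_artifact}"]
    if stage.parent is None:
        # The bottom stage only needs an external dependency.
        return [f"{branch}: build eval.c with a C compiler"]
    # Otherwise build the hosting stage first, then bootstrap on top of it.
    return build_plan(stage.parent, use_shortcuts) + [f"{branch}: bootstrap, hosted by {stage.parent}"]
```

In this model `build_plan("lang.4")` has two steps because the recursion stops at the checked-in artifact of `lang.3`, whereas `build_plan("lang.4", use_shortcuts=False)` walks all five stages down to the C implementation.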
In the bootstrap process most abstractions are present twice: the old versions in the host env, and the new versions loaded into the slave env. At certain parts of the codebase these potentially incompatible definitions can mix:
- The compiler is running in the host's environment, but compiles the definitions of the slave. Thus, it inherently needs to cross the host-slave boundary (ideally, always in a controlled and explicit way, guarded by asserts that fail early and loud).
- A lot of the `form`s (macros) of the slave must be executed/expanded while building up the set of definitions that will be level-shifted by the compiler to the target universe. These forms sometimes need to deal with the lexical environment (of type `<env>`) that is instantiated by the host. The object layout of these `<env>` objects will be the one that was specified in the host's codebase at the time an eval executable was generated from it. (The constructor function of `<env>`s is called `environment` in `eval.l`. When `eval.l` is compiled, it "captures" the object layout through the slot-index literals in the expansion of the accessor forms. Accessors expand to `oop-at` forms with literal indexes, and these are then directly compiled to machine instructions.)
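The way a compiled accessor "captures" an object layout can be illustrated with a small Python analogy. This is not Maru code: the slot names and both layouts below are invented for the example and do not reflect the real `<env>` definition. `compile_accessor` stands in for an accessor form expanding to an indexed access with a literal index.

```python
from typing import Any, Callable, List, Sequence

# Hypothetical layouts: the slave's codebase inserted a new slot.
HOST_ENV_LAYOUT = ["parent", "level", "bindings"]
SLAVE_ENV_LAYOUT = ["parent", "name", "level", "bindings"]


def compile_accessor(layout: Sequence[str], slot: str) -> Callable[[List[Any]], Any]:
    index = layout.index(slot)     # resolved once, at "compile time"
    return lambda obj: obj[index]  # the compiled code only knows the literal index


def compile_checked_accessor(layout: Sequence[str], slot: str) -> Callable[[List[Any]], Any]:
    index, size = layout.index(slot), len(layout)

    def accessor(obj: List[Any]) -> Any:
        # Guard the host-slave boundary: fail early and loud.
        assert len(obj) == size, "object layout mismatch across the host/slave boundary"
        return obj[index]

    return accessor


def host_make_env(parent: Any, level: int, bindings: dict) -> List[Any]:
    """Objects instantiated by the host always follow the host's layout."""
    return [parent, level, bindings]


env = host_make_env(None, 0, {"x": 1})
host_env_level = compile_accessor(HOST_ENV_LAYOUT, "level")
slave_env_level = compile_accessor(SLAVE_ENV_LAYOUT, "level")
checked_slave_env_level = compile_checked_accessor(SLAVE_ENV_LAYOUT, "level")
```

Here `host_env_level(env)` reads the intended slot, while `slave_env_level(env)` silently reads the bindings instead, because its literal index was computed from a different layout. The checked variant turns the same mistake into an immediate assertion failure.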
A list of types and occasions where such leakage happens (meant to be exhaustive, but it's probably not yet):
- objects in the slave's source code: `<pair>`, `<long>`, `<string>`, `<symbol>`, `()`, i.e. objects that are created by the host's reader while parsing the slave's codebase into an object graph.
- `<primitive-function>`, `<expr>`, `<env>`, `<fixed>`: the source code of the slave gets `encode`d by the host, therefore it may also contain objects of these types besides the list above.
- `<env>`: whenever environments are passed to slave code, e.g. the forms defined in the slave will receive instances of the host's `<env>` type.
- `<type>`, `<record>`: if we want to dispatch on the slave types while the host executable is bringing the slave to life, then the slave types need to integrate into those of the host. What this means is that in the bootstrap process the slave does not create its own `<type>` and `<record>` instances, but "borrows" them from the host.