This project proposes a file format for programs encoded in binary
lambda calculus (BLC). Its main goal is to minimize redundancy by
incorporating reference-based term sharing, while remaining easy to work
with. The included bloc
tool converts BLC to BLoC and greatly reduces
its size.
The additional tool blocade
(found in
BLoCade) can compile .bloc
files to various targets, including a shared pure BLC representation.
Before explaining the format or its optimization techniques, let me first show you some results:
- x86 C compiler 8cc translated to
lambda calculus:
- the original expression takes ~5M bytes in bit-encoded BLC
- the same expression in BLoC needs only ~640K bytes (which is around 2x the original 8cc!)
- The bruijn expression
fac (+30)
, wherefac
is the factorial implementation fromstd/Math
:- the original expression takes 1200 bytes in bit-encoded BLC
- the same expression in BLoC needs only 349 bytes
- My
solution
for the “Advent of Code” challenge
2022/01 in
bruijn:
- the original expression takes 6258 bytes in bit-encoded BLC
- the same expression in BLoC needs only 935 bytes
You can find these examples in test/
.
It’s assumed that all bit-encoded strings are padded with zeroes at the end.
from | to | content |
---|---|---|
0x00 | 0x04 | identifier: “BLoC” |
0x04 | 0x06 | number of entries |
0x06 | 0x?? | entries |
This reflects the basic structure of an expression. It uses the following derivation of normal bit-encoded BLC:
prefix | content |
---|---|
010M | abstraction of expression M |
00MN | application of M and N |
1i+10 | bruijn index i |
011I | index* to an entry |
(*): The encoding of indices is quite special: src/{build,parse}.c
).
The final program will be in the last entry. The indices start counting from the number of entries down to 0.
Let E
be some kind of expression like E=\x.(((M (\y.N)) M) N)
, where
M
and N
are both arbitrary expressions of length 16.
The raw BLC expression of E
would then be E=00010101M00NMN
. This
obviously has the drawback of redundant repetition of M
and N
.
A possible encoding in BLoC:
from | to | content |
---|---|---|
0x00 | 0x04 | “BLoC” |
0x04 | 0x06 | number of entries: 3 |
0x06 | 0x17 | encoded M : gets a bit longer due to different encoding |
0x17 | 0x28 | encoded N : gets a bit longer due to different encoding |
0x28 | 0x34 | 00010010010011<M>00011<N>011<M>011<N> , where <M>=00{1} and <N>=00{0} are indices with length of 1.25 byte |
Even in this small example BLoC uses less space than BLC (0x34 vs. 0x42
bytes). Depending on the content of M
and N
, this could have
potentially been compressed even more.
You can dump the bloc table using the -d/--dump
flag of bloc
. Some
additional dumps for testing can also be found in test/
.
The optimizer converts a normal BLC expression to the BLoC format.
In order to find the largest repeating sub-expressions, the expression first gets converted to a hashed tree (similar to a Merkle tree).
The repeating sub-trees then get sorted by length and inserted into the
final table, while replacing them with references in the original
expression (see src/tree.c
for more).
As of right now, expressions don’t get beta-reduced or manipulated in any other way. As an idea for the future, long expressions could get reduced using different techniques/depths and then get replaced with the shortest one (as fully reduced expressions aren’t necessarily shorter).