Initial motivation for this project was to learn the Mojo programming language and the best is to learn by doing. Since I've been involved in the Apache Arrow project for a while, I thought it would be a good idea to implement the Arrow specification in Mojo.
The implementation is far from being complete or usable in practice, but I prefer to share it its early stage so others can join the effort.
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.
Mojo is promising new programming language built on top of MLIR providing the expressiveness of Python, with the performance of systems programming languages.
I find the Mojo lanauge really promising and Arrow should be a first-class citizen in Mojo's ecosystem. Since the language itself is still in its early stages, under heavy development, this Arrow implementation is still in an experimental phase.
Buffer
providing the memory management for contiguous memory regions.DataType
for defining theArrow
data types.ArrayData
as the common layout for allArrow
arrays.- Typed array views for primitive, string and nested arrow arrays providing more convenient and efficient access to the underlying
ArrayData
. - Arrow C Data Interface to exchange arrow data between other implementations in a zero-copy manner, but only one direction is implemented for now.
from firebolt.arrays import array, StringArray, ListArray, Int64Array
from firebolt.dtypes import int8, bool_, list_
var a = array[int8](1, 2, 3, 4)
var b = array[bool_](True, False, True)
var s = StringArray()
s.unsafe_append("hello")
s.unsafe_append("world")
More convenient APIs are planned to be added in the future.
var ints = Int64Array()
var lists = ListArray(ints)
ints.append(1)
ints.append(2)
ints.append(3)
lists.unsafe_append(True)
assert_equal(len(lists), 1)
assert_equal(lists.data.dtype, list_(int64))
For more details see the Arrow C Data Interface.
var pa = Python.import_module("pyarrow")
var pyarr = pa.array(
[1, 2, 3, 4, 5], mask=[False, False, False, False, True]
)
var c_array = CArrowArray.from_pyarrow(pyarr)
var c_schema = CArrowSchema.from_pyarrow(pyarr.type)
var dtype = c_schema.to_dtype()
assert_equal(dtype, int64)
assert_equal(c_array.length, 5)
assert_equal(c_array.null_count, 1)
assert_equal(c_array.offset, 0)
assert_equal(c_array.n_buffers, 2)
assert_equal(c_array.n_children, 0)
var data = c_array.to_array(dtype)
var array = data.as_int64()
assert_equal(array.bitmap[].size, 64)
assert_equal(array.is_valid(0), True)
assert_equal(array.is_valid(1), True)
assert_equal(array.is_valid(2), True)
assert_equal(array.is_valid(3), True)
assert_equal(array.is_valid(4), False)
assert_equal(array.unsafe_get(0), 1)
assert_equal(array.unsafe_get(1), 2)
assert_equal(array.unsafe_get(2), 3)
assert_equal(array.unsafe_get(3), 4)
assert_equal(array.unsafe_get(4), 0)
array.unsafe_set(0, 10)
assert_equal(array.unsafe_get(0), 10)
assert_equal(str(pyarr), "[\n 10,\n 2,\n 3,\n 4,\n null\n]")
So far the implementation has been focused to provide a solid foundation for further development, not for memory efficiency, performance or completeness.
A couple of notable limitations:
-
The chosen abstractions may not be ideal, but:
- mojo lacks support for dynamic dispatch at the moment
- variant elements must be copyable
- references and lifetimes are not hardened yet
- expressing nested data types is not straightforward
Due to these reasons polymorphism is achieved by defining a common layout for type hierarchies and providing specialized views for each child type. This approach seems to work well for nested
DataType
andArray
types and the implementation can be continued whileMojo
gains the necessary features to rethink theses abstractions. -
The
C Data Interface
doesn't call the release callbacks yet and only consuming arrow data is implemented for now because aMojo
callback cannot be passed to aC
function yet. As mojo matures, this limitation will be certainly addressed. -
Testing of the conformance against the
Arrow
specification is done by reading arrow data from the python implementationPyArrow
sinceMojo
can already call python functions. If the project manages to evolve further, it should be wired into the arrow integration testing suite, but first that requires aJSON
library inMojo
. -
Only boolean, numeric, string, list and struct datatypes are supported for now since these cover most of the implementation complexity. Support for the rest of the arrow data types can be added incrementally.
-
A convenient API hasn't been designed yet, preferably that should be tackled once the implementation is more mature.
-
No
ChunkedArray
s,RecordBatch
es,Table
s are implemented yet, but soon they will be. -
No CI has been set up yet, but it is going to be in focus really soon.
I shared the implementation it its current state so others can join the effort. If the project manages to evolve, ideally it should be donated to the upstream Apache Arrow project.
Given an existing Mojo installation the tests can be run with:
cd firebolt
mojo test firebolt -I .
Tested with nightly Mojo
:
$ mojo --version
mojo 2024.7.1805 (0a697965)