(He's little). A pure Lua rewrite of Apache Spark 2.2, designed for embedding and edge computing.
To install on an operating system:
$ luarocks install stuart
To load into a web page:
<html>
<body>
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/fengari-web@0.1.2/dist/fengari-web.js"></script>
<script type="application/lua" src="https://cdn.jsdelivr.net/npm/lua-stuart@2.0.4-0/stuart.lua"></script>
<script type="application/lua">
local stuart = require 'stuart'
local sc = stuart.NewContext('local[1]', 'My Spark job')
...
</script>
</body>
</html>
To use in Redis, see examples/redis:
$ redis-cli --eval SparkPi-with-dependencies.lua 0,0
"Pi is roughly 3.1332956664783"
Create a "Stuart Context", then count the number of lines in this README:
$ lua
Lua 5.2.4 Copyright (C) 1994-2015 Lua.org, PUC-Rio
stuart = require 'stuart'
sc = stuart.NewContext()
rdd = sc:textFile('README.md')
print(rdd:count())
151
rdd = sc:parallelize({1,2,3,4,5,6,7,8,9,10}, 3)
filtered = rdd:filter(function(x) return x % 2 == 0 end)
print('evens: ' .. table.concat(filtered:collect(), ','))
evens: {2,4,6,8,10}
rdd = sc:parallelize({{4,'Gnu'}, {4,'Yak'}, {5,'Mouse'}, {4,'Dog'}})
countsByKey = rdd:countByKey()
print(countsByKey[4])
3
print(countsByKey[5])
1
Start by installing network support:
$ luarocks install luasocket
Next, start a local network service with netcat:
$ nc -lk 9999
Start a Spark Streaming job to read from the network service:
local stuart = require 'stuart'
local sc = stuart.NewContext()
local ssc = stuart.NewStreamingContext(sc, 1.5)
local dstream = ssc:socketTextStream('localhost', 9999)
dstream:foreachRDD(function(rdd)
print('Received:', string.format('%s {%s}', tostring(rdd), table.concat(rdd:collect(),',')))
end)
ssc:start()
ssc:awaitTerminationOrTimeout(10)
ssc:stop()
Then type some input into netcat:
abc
def
123
The output shows lines received within the 1.5 second batch interval:
INFO Running Stuart (Embedded Spark 2.2.0)
INFO Connected to localhost:9999
Received RDD: RDD[1] {abc}
Received RDD: RDD[2] {def,123}
This custom receiver acts like a SocketInputDStream
, and reads lines of text from a socket.
local now = require 'stuart.interface'.now
local Receiver = require 'stuart.streaming.Receiver'
local socket = require 'socket'
local stuart = require 'stuart'
-- MyReceiver ------------------------------
local MyReceiver = stuart.class(Receiver)
function MyReceiver:_init(ssc, hostname, port)
Receiver._init(self, ssc)
self.hostname = hostname
self.port = port or 0
end
function MyReceiver:onStart()
self.conn = socket.connect(self.hostname, self.port)
end
function MyReceiver:onStop()
if self.conn ~= nil then self.conn:close() end
end
function MyReceiver:poll(durationBudget)
local startTime = now()
local data = {}
local minWait = 0.02
while true do
local elapsed = now() - startTime
if elapsed > durationBudget then break end
self.conn:settimeout(math.max(minWait, durationBudget - elapsed))
local line, err = self.conn:receive('*l')
if not err then
data[#data+1] = line
end
end
return self.ssc.sc:makeRDD(data)
end
-- Spark Streaming Job ------------------------------
sc = stuart.NewContext()
ssc = stuart.NewStreamingContext(sc, 0.5)
local receiver = MyReceiver.new(ssc, 'localhost', 9999)
local dstream = ssc:receiverStream(receiver)
dstream:foreachRDD(function(rdd)
print('Received RDD: ' .. rdd:collect())
end)
ssc:start()
ssc:awaitTerminationOrTimeout(10)
ssc:stop()
The stuart.interface
module provide interfaces to hardware or a host OS, designed to make it easy for you to preload your own custom module that is specific to your host application or device.
To embed Stuart into a Go app, use:
To embed Stuart into an eLua image, see stuart-elua.
See the stuart-hardware project for edge hardware specific integration guides.
Stuart is compatible with:
- eLua (a C-based 5.1 baremetal VM that runs on a breadth of microcontroller families in as little as 32k of RAM)
- Fengari (a JavaScript-based Lua 5.3 VM)
- GopherLua (a Go-based Lua 5.1 VM)
- Lua 5.1, 5.2, 5.3
- LuaJIT 2.0, 2.1
- Redis 2.6+ (a Lua 5.1 VM)
- stuart-ml : A Lua port of Spark MLlib
- stuart-sql : A Lua port of Spark SQL
- stuart-redis : Extended Redis support
- Improve streaming support and test coverage
- Build out Stuart ML to support more models
- Build out Stuart SQL to support DataFrames
- Improve Stuart SQL's Parquet interop with Lua 5.3 support and more codecs
Stuart is designed for real-time and embedding, and so it follows some rules:
- It does not perform deferred evaluation of anything; all compute costs are paid upfront for predictable throughput.
- It uses pure Lua and does not include native C code. This maximizes portability and opportunity to be cross-compiled. Any potential C code optimizations are externally sourced through the module loader. For example, Stuart links to
lunajson
, but it also detects and usescjson
when that native module is present. - It does not execute programs (like
ls
ordir
to list files), because there may not even be an OS. - It does not make use of coroutines, in order to ensure easy transpiling to C.
- It does not use upvalues or metatables in module scripts, so that module tables can be burned into ROM and chipsets (see eLua LTR)
- It should be able to eventually do everything that Apache Spark does.
While many frameworks deliver streaming analytics capabilities, Spark leads the pack in numbers of trained data scientists, numbers of SaaS environments where Spark models can be built and trained, numbers of contributors moving the platform forward, numbers of universities teaching it, and net commercial investment.
Depoyment. Amalgamated Lua jobs with inlined module dependencies solves the Spark job deployment problem, and obviates the need for any shared filesystem or brittle classpath coordination. Redis Scripting showcases the power of SHA1 content hashing for Lua job distribution.
Packaging. Lua jobs, like JavaScript, are easy to minify, and statically analyze to strip out unused modules and function calls. Your job script only need be as large as the number of Spark capabilities it makes use of.
Portability. Because Lua is a tiny language that elegantly supports classes and closures, it serves as a better source of truth for functional algorithms than Scala. This makes it relatively easy for Stuart jobs to be transpiled into Scala, Java, Python, Go, C, or maybe even CUDA, or to be interpreted by a VM in any of those same environments, which significantly extends Spark's reach by divorcing it from the JVM.
Embedding. Lua is arguably one of the most crash-proof language runtimes, making it attractive for industrial automation, sensors, wearables, and microcontrollers. Whereas JVM-based analytics tend to require an operator.
GPUs. If you are thinking about pushing closures into a GPU, Lua seems like a reasonable choice, and one of the easier languages to transpile into OpenCL or CUDA.
Torch. Torch is the original deep-learning library ecosystem, 15+ years mature, and with deep ties to university and leading commercial interests. It runs on mobile phones, and serves as a fantastic case in point for why Lua makes sense for analytics jobs. A data scientist should be able to use Spark and Torch side-by-side, and maybe even from the same Spark Streaming control loop.
The LuaRocks built-in build system is used for packaging.
$ luarocks make stuart-<version>.rockspec
stuart <version> is now built and installed in /usr/local (license: Apache 2.0)
Testing with lua-cjson
:
$ luarocks install busted
$ luarocks install lua-cjson
$ luarocks intall moses
$ busted -v --defer-print
17/11/12 08:46:51 INFO Running Stuart (Embedded Spark) version 2.2.0
...
177 successes / 0 failures / 0 errors / 0 pending : 8.026833 seconds
Testing with lunajson
:
$ luarocks remove lua-cjson
$ busted -v --defer-print
17/11/12 08:46:51 INFO Running Stuart (Embedded Spark) version 2.2.0
...
175 successes / 0 failures / 0 errors / 2 pending : 8.026833 seconds
Pending → ...
util.json can decode a scalar using cjson
... cjson not installed
Pending → ...
util.json can decode an object using cjson
... cjson not installed
Testing with a WebHDFS endpoint:
$ WEBHDFS_URL=webhdfs://localhost:50075/webhdfs busted -v --defer-print
Various Dockerfiles are made available in the root directory to provide a specific Lua VM for the test suite:
Test-Lua5.1.Dockerfile
Test-Lua5.2.Dockerfile
Test-Lua5.3.Dockerfile
$ docker build -f Test-Lua5.3.Dockerfile -t test .
$ docker run -it test busted -v --defer-print
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
172 successes / 0 failures / 0 errors / 5 pending : 10.246418 seconds