At Indix we collect and process lots of data. Most of our processing was initially done as MapReduce (henceforth MR) jobs, but as our data grew in size we moved towards stream processing. We monitor the behaviour of our systems by collecting business metrics. It was relatively easy to write stats jobs on our MR output, but things got tricky when we moved to stream-based processing.
Our key learnings over the years have been:
- Approximate stats now > Accurate stats tomorrow
- Our metrics were just aggregates (counts / uniques) with rollups
- Existing open source systems were more for system monitoring than business metrics
- Model aggregates as Commutative Monoids using Algebird's typeclasses
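To make the last point concrete, here is a minimal sketch of counts and uniques modelled as commutative monoids and rolled up from minutes to hours. It is an illustration under stated assumptions, not Abel's implementation: `CMonoid`, `RollupSketch`, and `rollup` are hypothetical names, the trait is hand-written (Algebird's own `Monoid` typeclass similarly exposes `zero` and `plus`), and an exact `Set` stands in for an approximate structure such as HyperLogLog.

```scala
// Hypothetical minimal typeclass for illustration; not Algebird's actual trait.
// Instances are assumed to be associative and commutative, with `zero` as identity.
trait CMonoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

object CMonoid {
  // Counts: addition over Long.
  implicit val longSum: CMonoid[Long] = new CMonoid[Long] {
    val zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Uniques: set union (exact here; a sketch like HyperLogLog merges the same way).
  implicit def setUnion[T]: CMonoid[Set[T]] = new CMonoid[Set[T]] {
    def zero: Set[T] = Set.empty[T]
    def plus(x: Set[T], y: Set[T]): Set[T] = x union y
  }

  // Fold any collection of aggregates into one.
  def sum[A](xs: Iterable[A])(implicit m: CMonoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)
}

object RollupSketch {
  // Roll per-minute aggregates up to per-hour aggregates.
  // Keys are minute offsets; the result is keyed by hour offset.
  def rollup[A: CMonoid](perMinute: Map[Int, A]): Map[Int, A] =
    perMinute
      .groupBy { case (minute, _) => minute / 60 }
      .map { case (hour, group) => hour -> CMonoid.sum(group.values) }
}
```

Because `plus` is associative and commutative, partial aggregates can be merged in any order and grouping, which is what makes this model a natural fit for out-of-order streams. The same `rollup` works unchanged for counts (`Map[Int, Long]`) and uniques (`Map[Int, Set[String]]`).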
We put all our learnings together and built a system called Abel, which solved this for us. It aggregates a million events in ~15 seconds on a single box.
Ashwanth Kumar is a Principal Engineer at Indix. His major interest lies in building and operating large data systems. When not dealing with data, he spends his time reading research papers on similar topics.
Abel as an idea was conceptualised by Vinoth Kumar while working at Indix as part of the Ingestion team.
- Evolution of stats @ Indix from Vinothkumar Raman
- HyperLogLog Overview by Andrew Sy
- Add ALL the Things: Abstract Algebra Meets Analytics by Avi Bryant. One of the best introductions to Monoids and the Algebird library.
- Video by @brewkode on Suuchi - Toolkit to build distributed systems at Fifth Elephant, 2017.
- Zach Tellman - Everything Will Flow
- Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics by Michael G. Noll
- Functional Programming in Scala by P. Chiusano and R. Bjarnason, published by Manning. Includes chapters on monoids and monads, and how to implement them in Scala.
- https://github.com/ashwanthkumar/suuchi
- https://github.com/twitter/algebird
- http://riemann.io/
- http://kafka.apache.org/
- http://hadoop.apache.org/
- http://spark.apache.org/
- https://github.com/facebook/rocksdb
- Twitter by aguycalledgary from the Noun Project
- team by Wilson Joseph from the Noun Project
- Earth by To Uyen from the Noun Project
- database by Kevin Woodland from the Noun Project
- options by Mert Güler from the Noun Project
- SQL File by Viktor Vorobyev from the Noun Project
- database by ✦ Shmidt Sergey ✦ from the Noun Project
- sigma by Davo Sime from the Noun Project
- tools by Viktor Vorobyev from the Noun Project
- Info by Lance Weisser from the Noun Project
- Heart by i cons from the Noun Project
Using Monoids for Large Scale Business Stats by Ashwanth Kumar is licensed under a Creative Commons Attribution 4.0 International License.