Lambda Stream

The surest sign you have invented something worthwhile is when several other people invent it too. That means the creative pressure that gave birth to the idea is more general than your particular situation.

Lambda ArchitectureEven when faced with the same pressures, people will approach an idea in different ways. When Jay Kreps was developing Kafka at LinkedIn, he called it The Log. Facebook (being Facebook) created several independent implementations of “stream-oriented processing”, including Puma and TailerSwift. Twitter has the adorably-named Summingbird. The jargon we seem to be converging on for these kinds of systems is the Lambda Architecture. [1]

Lambda in a Nutshell

The gist of Lambda is to model all of the stuff that goes on in a complex computing system as an ordered, immutable log of events. Processing the data (say, totting up the number of website visitors) is done as a series of transformations that output to new tables or streams.

It is important to keep the input unchanged. By breaking data processing into independent pieces, each with a defined input and output, you get closer to the ideal of purely functional programming. Writing and testing each piece is made simpler and parallelization can be automated. Parts of the dataflow can be replayed (say, when code changes or machines fail) and tinker-toyed together with other flows.

This is a very nice property to build into your system. A long time ago, people who did 3D modeling would “carve” digital blocks into the shapes they wanted. If they wanted to undo something 10 steps back, they were largely out of luck. Then 3DStudio introduced a brilliant feature it called the “transform stack”. The stack records every change to an object separately, and applies them in real time. This allows the modeler to modify, add, remove, and even reorder their changes on the fly.
3d Modeling

So far, all of this is just good data engineering hygiene. Any well-run batch processing or map/reduce system will follow the same principles. There’s nothing special about stream-processing that makes immutable data flows work better.

The Trick of Lambda

The special trick that makes Lambda “Lambda” is the technique of writing data to two places. That’s one reason why the logo for it is the symbol “λ”. In effect, one half of a Lambda system optimizes for space and the other optimizes for time. Lambda systems incorporate a slower, high-capacity batch-processing system, and a faster stream processing track. This allows existing map/reduce systems to be upgraded with a new fast track. It also leaves the system of record untouched, which is the main selling point for CIOs looking to improve the responsiveness of their data flows.

Lambda Architecture

This is an old and venerable technique. Document search engines of a certain age (eg, Yahoo’s Vespa) often have a “slow” index that is very compact but difficult to update. To compensate they will also have a “fast” index, perhaps in memory, where changes are cached until the next index rebuild. Under the hood a search will consult both indexes and merge the results.

The thing is, design feature was an evolution on top of the slower batched index. It is not certain that you would do it that way if you were building from scratch. Lucene, for example, uses an incremental index for everything. Jay Kreps, in a thoughtful critique of Lambda, points out that you need two implementations of the same queries and data flow. And of course, you need two copies of the data. If you had a better streaming system, one that could “read a table” simply by replaying a stream, why would you need both kinds of system?

The Lambda Architecture Isn’t

The Lambda Architecture isn’t. What it is is a sensible set of data engineering practices, which you should be doing anyway, plus a clever (but transitional) double-write approach to help people add a low-latency fast track to existing big data systems. I actually think it is good to have all of these ideas wrapped up under a catchy rubric, as long as it’s not fetishized too much.


[1] The term originally comes from the Greek word “λάμδα”, which means “to write to two places” No, I’m kidding. Lambda is actually the Greek word for “monoid in the category of endofunctors”.