Insights & Data Blog


Big Data Architectures – From Lambda Architecture to Streaming ETL Architecture

The term “Lambda Architecture” was first coined by Nathan Marz, a Big Data Engineer working at Twitter at the time. The architecture enables real-time data pipelines with low-latency reads and high-frequency updates. It was well received by the Big Data community and led to the book “Big Data: Principles and best practices of scalable realtime data systems” by Nathan Marz and James Warren. The Lambda Architecture has three main components (see Figure 1):

  • The Batch Layer creates views with a batch-oriented framework over the complete, immutable data set.
  • The Speed Layer creates incremental views with a real-time data processing framework over only the most recent data.
  • The Serving Layer provides a unified view, built from the views created by the Batch and Speed Layers, which applications can query.

NB: an immutable data set means that all (subsequent) versions of a record are stored in a distributed data store; no logic is implemented to modify or overwrite a record with a newer version. A minimal sketch of the three layers follows below.
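To make the split between the three layers concrete, here is a minimal, self-contained Scala sketch for a hypothetical page-view counting use case. The case class, function names and numbers are illustrative, not taken from the book or from this post: the Batch Layer recomputes totals over the entire immutable data set, the Speed Layer counts only events newer than the last batch horizon, and the Serving Layer merges both at query time.

```scala
// Minimal sketch of the three Lambda layers for a hypothetical
// page-view counting use case. All names and values are illustrative.
object LambdaSketch {
  // Immutable master data set: every event is appended, never updated.
  final case class PageView(url: String, timestampMs: Long)

  // Batch Layer: recompute the view from the *entire* immutable data set.
  def batchView(allEvents: Seq[PageView]): Map[String, Long] =
    allEvents.groupBy(_.url).map { case (url, evs) => url -> evs.size.toLong }

  // Speed Layer: incremental view over only the events newer than the
  // last completed batch run.
  def speedView(recentEvents: Seq[PageView], batchHorizonMs: Long): Map[String, Long] =
    recentEvents.filter(_.timestampMs > batchHorizonMs)
      .groupBy(_.url).map { case (url, evs) => url -> evs.size.toLong }

  // Serving Layer: answer queries by merging the batch and speed views.
  def query(url: String, batch: Map[String, Long], speed: Map[String, Long]): Long =
    batch.getOrElse(url, 0L) + speed.getOrElse(url, 0L)

  def main(args: Array[String]): Unit = {
    val master = Seq(PageView("/home", 1000L), PageView("/home", 2000L), PageView("/docs", 1500L))
    val recent = Seq(PageView("/home", 3000L))
    val batch  = batchView(master)
    val speed  = speedView(recent, batchHorizonMs = 2500L)
    println(query("/home", batch, speed)) // 3: two from the batch view, one from the speed view
  }
}
```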

Figure 1: Lambda Architecture Main Components

The Batch Layer runs periodic batch jobs, typically daily or every few hours, over the whole immutable data set to recompute its views. This introduces a technical challenge: the Speed Layer views must be discarded once the Batch Layer has updated its views with the most recent data (see Figure 2).

Figure 2: Update Process of Batch and Speed Layer Views
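As a rough illustration of this hand-over, the sketch below (again with illustrative names, building on the page-view example above) drops all speed-layer state derived from events that the latest batch run has already absorbed:

```scala
// Sketch of the hand-over shown in Figure 2: once a batch run has absorbed
// all events up to newBatchHorizonMs, the speed-layer state covering those
// events is discarded so they are not counted twice. Names are illustrative.
object SpeedViewHandover {
  // Speed-layer state keyed by (url, start of the minute the events fell in).
  final case class SpeedState(countsByUrlAndMinute: Map[(String, Long), Long])

  def discardAbsorbed(state: SpeedState, newBatchHorizonMs: Long): SpeedState =
    SpeedState(state.countsByUrlAndMinute.filter {
      case ((_, minuteStartMs), _) => minuteStartMs > newBatchHorizonMs
    })
}
```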

Furthermore, this architecture requires you to implement your functionality twice, in two distinct layers: the Speed Layer and the Batch Layer. Thankfully, frameworks like Twitter Summingbird and Apache Spark largely prevent you from having to write the same piece of functionality twice for the Batch and Speed Layers. Still, this architecture means maintaining two layers that perform the same functionality with different periodicities.

In the past years, the advancements around real-time data pipelines have been huge. Apache Spark, one of the most active open source projects in the world, introduced the Spark Streaming framework, which now includes fault tolerance, data replication and exactly-once processing of the data. The argument that pure real-time streaming pipelines are too faulty or unreliable to be used standalone no longer holds. It is therefore now possible to build ETL pipelines on streaming technology that update the views of the Serving Layer as the data comes in. With the advent of new technologies, new architectures are also born ☺ (see Figure 3).

Figure 3: Streaming ETL Architecture
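To give an impression of what such a streaming ETL pipeline can look like with Spark Streaming, here is a minimal Scala sketch. The socket source, the checkpoint directory and the println “sink” are placeholders of my own choosing for a real source (for example Kafka) and a real serving store; this is a sketch of the idea under those assumptions, not a production pipeline.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal Spark Streaming ETL sketch: events arrive on a socket, are
// aggregated per micro-batch and pushed straight into the serving-layer
// view. Source, checkpoint path and "sink" are illustrative placeholders.
object StreamingEtlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-etl-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Keep running totals per key across micro-batches (requires a checkpoint dir).
    ssc.checkpoint("/tmp/streaming-etl-checkpoint")
    val events = ssc.socketTextStream("localhost", 9999) // one URL per line

    val runningCounts = events
      .map(url => (url, 1L))
      .updateStateByKey[Long]((newHits: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newHits.sum))

    // In a real pipeline this foreachRDD would upsert into the serving store.
    runningCounts.foreachRDD(rdd => rdd.take(10).foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In this setup the streaming job maintains the running totals itself, so the periodic batch recomputation and the discard step from Figure 2 disappear: the views in the Serving Layer are simply updated as events arrive.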

References 

  • Big Data: Principles and best practices of scalable realtime data systems (Nathan Marz, James Warren): https://www.manning.com/books/big-data
  • Official Lambda Architecture website: http://lambda-architecture.net
  • How to beat the CAP theorem (Nathan Marz): http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
  • The Lambda Architecture: principles for architecting realtime Big Data systems (James Kinley): http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for
  • Questioning the Lambda Architecture (Jay Kreps): http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  • Twitter Summingbird: https://github.com/twitter/summingbird/wiki
  • Apache Spark: http://spark.apache.org

About the author

Mark Vervuurt
Senior Big Data Scientist
