Hadoop Architecture

I do not know about you, but when I look at any building or technology, I always wonder what was going through the minds of its designers and architects, and why they decided to pick one route over another. With this in mind, in the next few posts I will be exploring Hadoop architecture.

As we all know, Hadoop is a framework for large-scale data processing. It helps businesses answer relevant business questions by looking into data they could not process before. When developers write a framework, the first order of business is to write down guidelines, which help the team stay focused and resolve conflicts as they come up during development. Below is a list of principles I put together to advance our conversation. By no means is this an all-inclusive list; feel free to add your own. If you look at the list, you will see that the principles are all related to each other.

  • Use commodity hardware, so more people can use it
  • Focus on easy recovery, so failure is not expensive
  • Use replication for data redundancy, so you can recover lost blocks
  • Support a very large distributed file system
  • Provide streaming data access, so you can read data quickly
  • Enable fast compute to handle large amounts of data in a timely manner
  • Optimize for batch processing, not for transactions
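
To make a couple of these principles concrete, here is a minimal Java sketch (my own illustration, not something from the Hadoop documentation) that reads the replication and block-size settings through Hadoop's standard Configuration API. The class name ShowHdfsDefaults is made up, and the sketch assumes the Hadoop client libraries and your cluster's configuration files are on the classpath.

    import org.apache.hadoop.conf.Configuration;

    public class ShowHdfsDefaults {
        public static void main(String[] args) {
            // Plain Configuration loads core-default.xml and core-site.xml;
            // the HDFS-specific settings live in hdfs-site.xml, so add it.
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");

            // dfs.replication: how many copies of each block HDFS keeps.
            // This redundancy is what makes commodity-hardware failure cheap.
            System.out.println("Replication factor: "
                    + conf.getInt("dfs.replication", 3));

            // dfs.blocksize: large blocks favor streaming reads over
            // random access.
            System.out.println("Block size (bytes): "
                    + conf.getLongBytes("dfs.blocksize", 134217728L));
        }
    }

The fallback values shown (three replicas, 128 MB blocks) match the stock Hadoop defaults; your cluster's hdfs-site.xml may well override them.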

If you look for the theme, you will find that all of these principles come from two requirements: process a large amount of data quickly, and do it on commodity hardware. Commodity hardware will fail, so plan for recovery; to support recovery, you need replication. For fast processing, you need to read data quickly and compute near the data, to avoid moving large amounts of data across the network. The combination of these principles has made Hadoop one of the most successful frameworks in recent history.
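
To see the "compute near the data" idea in action, here is another small sketch, again my own and only a sketch: it asks the NameNode where each block of a file physically lives. The path /data/example.txt is hypothetical, but FileSystem.getFileBlockLocations is a standard Hadoop client call, and this block-to-host mapping is exactly what lets a scheduler run each task on a machine that already holds the data instead of shipping the data across the network.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; point this at any real file in your cluster.
            Path file = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(file);

            // One entry per block; each lists the hosts holding a replica.
            for (BlockLocation block
                    : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + block.getOffset()
                        + " lives on: " + String.join(", ", block.getHosts()));
            }
        }
    }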

In the next post, we will look at Hadoop's core components. All comments and questions are welcome.