In this post, we will look at the core components of Hadoop and the role each one plays. There are two major concepts to understand in Hadoop. Once you understand these, you will start to see why Hadoop works the way it does. These concepts are HDFS and MapReduce. At a very high level, HDFS handles storage and MapReduce handles the computation.
But why do we need a Distributed File System (DFS)? Believe it or not, I/O channels are still the bottleneck for reading data quickly, so there is a limit on how much data you can read from one machine and how fast. To process a large amount of data, you want to spread the reads across multiple machines. Suppose we want to read 1 TB of data, which is not much for a big data project. If we take one server with four I/O channels, each with 100 MB/s of bandwidth, the server reads at 400 MB/s, and it takes around 44 minutes to read the data once. If we take 10 such servers and spread the data evenly across them, we can read the same data in one tenth of the time, a little over 4 minutes.
So until disk I/O constraints are removed, we need a DFS to read large volumes of data.
The HDFS component of the Hadoop architecture is Hadoop's implementation of a DFS.
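To make the read-time estimate concrete, here is a small sketch of the arithmetic. It assumes 1 TB means 1,048,576 MB (binary units) and that reads parallelize perfectly across channels and servers, which a real cluster will only approximate. The function name and parameters are made up for this illustration.

```python
def read_time_minutes(data_mb, channels_per_server, mb_per_sec_per_channel, servers=1):
    """Estimate minutes to read data_mb once, assuming perfectly parallel reads."""
    aggregate_mb_per_sec = servers * channels_per_server * mb_per_sec_per_channel
    return data_mb / aggregate_mb_per_sec / 60


# Assumption: binary units, so 1 TB = 1024 * 1024 MB.
ONE_TB_IN_MB = 1024 * 1024

single = read_time_minutes(ONE_TB_IN_MB, 4, 100)
cluster = read_time_minutes(ONE_TB_IN_MB, 4, 100, servers=10)

print(f"1 server:   {single:.1f} min")   # about 43.7 minutes
print(f"10 servers: {cluster:.1f} min")  # about 4.4 minutes
```

With decimal units (1 TB = 1,000,000 MB) the single-server figure drops to just under 42 minutes, so "around 44 minutes" is a fair round number either way.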
Earlier we talked about how HDFS handles large storage for big data projects. Now we will look at MapReduce. In any big data project, you have a large volume of data. If you can't process that data in a reasonable time frame, the value of your result is diminished. So other than buying one large, powerful machine, what does the Hadoop architecture do for us? Here is something clever the Hadoop team has done for you. Instead of moving the data to the processing machine, the architecture moves the processing to the machine that holds the data. This saves both time and network bandwidth.
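The idea of moving the processing to the data can be sketched with a toy word count. This is a single-process illustration, not the Hadoop API: the node names, the sample text, and the three helper functions are all invented for the example. The point is that the map step runs against each node's own block, and only the small intermediate pairs would need to travel over the network.

```python
from collections import defaultdict

# Hypothetical cluster: each node name maps to the text block stored on it.
blocks_by_node = {
    "node1": "big data needs big storage",
    "node2": "hadoop moves processing to data",
    "node3": "data stays where the storage is",
}


def map_phase(block):
    """Runs where the block lives and emits (word, 1) pairs."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Groups the intermediate pairs by word, as if routing them to reducers."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sums the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Each map call only touches the block held by "its" node.
mapped = [map_phase(block) for block in blocks_by_node.values()]
word_counts = reduce_phase(shuffle(mapped))

print(word_counts["data"])  # 3
print(word_counts["big"])   # 2
```

In a real cluster the map calls would run in parallel on different machines, and the raw blocks would never leave the nodes that store them.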
Here is a high-level picture of the Hadoop components. By now, we know that HDFS manages storage and MapReduce drives the computing. Inside HDFS, you have the NameNode and the DataNodes. The NameNode, also called the admin node, manages the activity on the DataNodes. In MapReduce, you have the JobTracker and the TaskTrackers. The JobTracker manages the TaskTrackers.
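To show how the NameNode and DataNodes split the work, here is a toy model. It is only a sketch under simple assumptions: the `ToyNameNode` class, the round-robin placement, and the tiny block size are all made up for illustration, and real HDFS placement is more involved. What it does show is the division of labor: the NameNode keeps only the bookkeeping about where blocks live, while the DataNodes hold the actual bytes.

```python
class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self, data_nodes, block_size):
        self.data_nodes = data_nodes  # node name -> {block id -> block content}
        self.block_size = block_size
        self.block_map = {}           # file name -> list of (block id, node name)

    def write(self, name, content):
        """Splits content into blocks and spreads them round-robin over the nodes."""
        node_names = sorted(self.data_nodes)
        placements = []
        for start in range(0, len(content), self.block_size):
            block_id = f"{name}#{start // self.block_size}"
            node_name = node_names[len(placements) % len(node_names)]
            self.data_nodes[node_name][block_id] = content[start:start + self.block_size]
            placements.append((block_id, node_name))
        self.block_map[name] = placements

    def read(self, name):
        """Looks up each block's location and reassembles the file in order."""
        return "".join(
            self.data_nodes[node_name][block_id]
            for block_id, node_name in self.block_map[name]
        )


data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}
name_node = ToyNameNode(data_nodes, block_size=4)
name_node.write("log.txt", "abcdefghij")

print(name_node.block_map["log.txt"])
# [('log.txt#0', 'dn1'), ('log.txt#1', 'dn2'), ('log.txt#2', 'dn3')]
print(name_node.read("log.txt"))  # abcdefghij
```

The JobTracker and TaskTrackers follow the same manager-and-workers shape on the compute side.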
You can build a cluster of 1 to N nodes with this design pattern. As always, all questions and comments are welcome.