Before we start, I want to give credit to the author of the book "Hadoop: The Definitive Guide" for the read and write diagrams. They are among the best diagrams available for showing the read and write operations in Hadoop.
Take a look at the picture below and follow the sequence. It describes a read in HDFS. The first step is to open the file, and the second is to get the block locations from the name node. The third is to send the read request to the FSDataInputStream, which reads from as many data nodes as required. At the end, the client closes the FSDataInputStream. One important point to note here is the direct connection between the client and the data node during the read. Although the name node manages the data nodes, the read does not go through it, for greater efficiency: the client reads the data directly from the data node. If a data node fails during the read, the read moves to the next-closest replica of the block being read.
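The read sequence above can be sketched as a small toy model. This is not the Hadoop API: the `DataNode`, `NameNode`, and `read_file` names, the integer `distance` field, and the `alive` flag are all assumptions made up for illustration. The sketch only shows that the name node hands out block locations while the bytes come straight from the data nodes, with a fallback to the next-closest replica.

```python
from dataclasses import dataclass, field


class DataNodeDown(Exception):
    """Raised when a (simulated) data node cannot serve a block."""


@dataclass
class DataNode:
    name: str
    distance: int                                # hypothetical: smaller means closer to the client
    blocks: dict = field(default_factory=dict)   # block id -> bytes
    alive: bool = True

    def read_block(self, block_id):
        if not self.alive:
            raise DataNodeDown(self.name)
        return self.blocks[block_id]


@dataclass
class NameNode:
    # file path -> ordered list of (block id, [data nodes holding a replica])
    files: dict

    def get_block_locations(self, path):
        # Metadata only: replicas sorted closest-first, no file data is returned.
        return [(block_id, sorted(replicas, key=lambda dn: dn.distance))
                for block_id, replicas in self.files[path]]


def read_file(name_node, path):
    """Return (file contents, names of the data nodes that served each block)."""
    data, served_by = b"", []
    for block_id, replicas in name_node.get_block_locations(path):
        for dn in replicas:
            try:
                data += dn.read_block(block_id)   # direct client-to-data-node read
                served_by.append(dn.name)
                break
            except DataNodeDown:
                continue                          # fall back to the next-closest replica
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data, served_by
```

Calling `read_file` returns both the file contents and the list of nodes that served each block, so marking a node as not `alive` and reading again shows which replica took over.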
Similar to reading, you can follow the sequence for a write in the picture as well. Remember that Hadoop is optimized for batch processing. Files in HDFS are write-once, and a file has only one writer at a time. This is one of the major differences from transactional systems, where you have to allow for multiple writers at any time. Once you have multiple writers, you need locking to maintain the integrity of the data you write. All of this makes writing slow in a transactional system. Hadoop does not consider a write complete until it receives an acknowledgement from the data nodes. If you have a pipeline of data nodes, each data node forwards the data to the next one in the pipeline. At the end, the process is completed by updating the name node.
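The write path can be sketched in the same toy style. Again, this is not the Hadoop API: `DataNode.receive`, `NameNode.create`, `NameNode.complete`, and `write_file` are hypothetical names, and real HDFS pipeline recovery is left out. The sketch shows three ideas from the paragraph above: a single writer per file, packets forwarded node to node, and the name node updated only after every packet has been acknowledged.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataNode:
    name: str
    alive: bool = True
    stored: list = field(default_factory=list)
    downstream: Optional["DataNode"] = None

    def receive(self, packet):
        """Store the packet, forward it, and return True only if this node and
        every node after it in the pipeline acknowledged the packet."""
        if not self.alive:
            return False
        self.stored.append(packet)
        if self.downstream is None:
            return True                          # last node: the ack starts here
        return self.downstream.receive(packet)   # the ack travels back up the chain


class NameNode:
    def __init__(self):
        self.files = {}       # path -> packet count, completed files only
        self.writers = set()  # paths currently open for writing

    def create(self, path):
        # Write-once, single writer: refuse existing or already-open paths.
        if path in self.files or path in self.writers:
            raise FileExistsError(path)
        self.writers.add(path)

    def complete(self, path, n_packets):
        self.writers.remove(path)
        self.files[path] = n_packets


def write_file(name_node, path, packets, pipeline):
    # Chain the data nodes so each one forwards to the next.
    for upstream, downstream in zip(pipeline, pipeline[1:]):
        upstream.downstream = downstream
    name_node.create(path)
    for packet in packets:
        if not pipeline[0].receive(packet):
            raise IOError("packet was not acknowledged by the whole pipeline")
    name_node.complete(path, len(packets))       # final step: update the name node
```

Because `create` rejects a path that already exists or is already open, a second `write_file` call on the same path fails immediately, which is why no locking between writers is needed in this model.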
This is the end of the introduction to Hadoop architecture. For questions or comments, leave a reply or send us an email at firstname.lastname@example.org