Date: May 31, 2016
HDFS = Hadoop Distributed File System
HDFS is the filesystem used by Hadoop and its subcomponents. It hides the complexity of having multiple disks, racks, partitions, and filesystems, so that the user thinks of the data as being stored in a single "drive".
We have the concept of a cluster, which is a network of connected computers. These computers have storage devices, processors, and memory that Hadoop uses to process information quickly. HDFS is the filesystem that allows these computers to work with the data stored on each of their drives as if it all lived in one place.
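As a rough illustration of that single-drive view, here is a minimal Java sketch using Hadoop's FileSystem API to list a directory. The NameNode URI and the path are hypothetical placeholders, not values from these notes.

```java
// Minimal sketch of the "single drive" view: client code talks to one
// FileSystem handle and never sees individual disks, racks, or partitions.
// The NameNode URI and path below are hypothetical; adjust to your cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);
        // One logical namespace, no matter how many machines back it.
        for (FileStatus f : fs.listStatus(new Path("/user/justin"))) {
            System.out.println(f.getPath() + "  " + f.getLen() + " bytes");
        }
        fs.close();
    }
}
```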
Upon storing a file, HDFS splits it into blocks, 64 MB by default (128 MB in Hadoop 2.x). These blocks are distributed with redundancy across different machines called DataNodes. The blocks that together compose a single file do not necessarily have to be stored on the same node.
This redundancy is a feature that keeps the data readily available even when some nodes fail. It also improves concurrency and allows the data to scale horizontally across commodity hardware.
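To see the splitting and redundancy in practice, the sketch below asks HDFS where each block of a file lives; every block reports its offset, length, and the DataNodes holding a replica. The file path is a hypothetical example.

```java
// Sketch: asking HDFS where a file's blocks (and their replicas) live.
// getFileBlockLocations returns, for each block, the offset, length,
// and the DataNodes holding a copy. The path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/justin/big.log"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(st, 0, st.getLen());
        for (BlockLocation b : blocks) {
            // Each block may live on a different set of nodes.
            System.out.printf("offset=%d len=%d hosts=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```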
The NameNode is the node that keeps track of the locations of data blocks. Think of it as an address book for blocks: it maintains a mapping from each block to the DataNodes (and racks) that hold its replicas.
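A conceptual sketch of that address book (this is not Hadoop source code, just the idea reduced to a map):

```java
// Conceptual sketch only: the NameNode's "address book" is essentially
// a map from block ID to the nodes/racks holding replicas of that block.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BlockDirectory {
    // block ID -> locations ("rack/datanode") of its replicas
    private final Map<Long, List<String>> blockToNodes = new HashMap<>();

    void record(long blockId, List<String> replicaLocations) {
        blockToNodes.put(blockId, replicaLocations);
    }

    List<String> lookup(long blockId) {
        return blockToNodes.get(blockId);
    }
}
```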
Every so often (the time interval is configurable), the NameNode receives a heartbeat from each of the DataNodes. If heartbeats from a DataNode stop for long enough, the NameNode assumes the node is dead and re-replicates its blocks onto healthy nodes.
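The sketch below shows, again conceptually rather than as Hadoop's actual implementation, how such a heartbeat timeout might be tracked. Hadoop's real defaults are roughly a 3-second heartbeat and about a 10-minute dead-node timeout; the constant here just mirrors the latter.

```java
// Conceptual sketch only: how a NameNode-like monitor might decide a
// DataNode is dead. The timeout value is illustrative.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10 * 60 * 1000; // ~10 minutes
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a DataNode checks in.
    void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // A node silent past the timeout is presumed dead; its blocks
    // would then be re-replicated onto healthy nodes.
    boolean isDead(String nodeId) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null
                || System.currentTimeMillis() - last > TIMEOUT_MS;
    }
}
```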
- Hadoop Diagram
- Hadoop Diagram 2
- Justin Abrantes