How to configure a multi-node Apache Hadoop cluster

A multi-node Apache Hadoop cluster fails in uneven ways when hostnames, worker lists, and daemon roles do not agree. The NameNode, ResourceManager, DataNode, and NodeManager processes need the same logical cluster settings before HDFS formatting or service startup begins.

The active configuration is the set of XML files under $HADOOP_CONF_DIR on every node. Treat the first master host as the configuration source, copy the same files to workers, and verify that each node resolves the same names before starting daemons.

The example uses one NameNode and one ResourceManager with two workers. Production clusters usually add high availability, Kerberos, rack awareness, and service supervision after this baseline is already working.

Steps to configure a multi-node Apache Hadoop cluster:

  1. Confirm hostname resolution from the master host.
    $ getent hosts master01 worker01 worker02
    10.10.20.11 master01.example.net master01
    10.10.20.21 worker01.example.net worker01
    10.10.20.22 worker02.example.net worker02
  2. Set the default HDFS URI in core-site.xml.
    core-site.xml
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master01.example.net:9000</value>
    </property>
  3. Set the NameNode and DataNode storage paths in hdfs-site.xml.
    hdfs-site.xml
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/hadoop/hdfs/name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///data/hadoop/hdfs/data</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  4. Set the YARN ResourceManager host in yarn-site.xml.
    yarn-site.xml
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>master01.example.net</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
  5. List the worker hosts in the workers file.
    workers
    worker01.example.net
    worker02.example.net
  6. Copy the configuration to every worker.
    $ rsync -a $HADOOP_CONF_DIR/ worker01.example.net:$HADOOP_CONF_DIR/
    core-site.xml
    hdfs-site.xml
    yarn-site.xml
    mapred-site.xml
    workers
  7. Format HDFS once from the NameNode host.
    $ hdfs namenode -format -clusterId hadoop-lab01
    Storage directory /data/hadoop/hdfs/name has been successfully formatted.

    Formatting an existing NameNode destroys namespace metadata for that cluster. Run it only before the first startup or after a tested recovery plan.

  8. Start HDFS and YARN from the master host.
    $ start-dfs.sh
    Starting namenodes on [master01.example.net]
    Starting datanodes
    Starting secondary namenodes [master01.example.net]
  9. Verify both DataNodes joined the cluster.
    $ hdfs dfsadmin -report
    Configured Capacity: 214748364800 (200 GB)
    Present Capacity: 209715200000 (195 GB)
    Live datanodes (2):
    Name: worker01.example.net:9866
    Name: worker02.example.net:9866