Hadoop’s MapReduce framework is a powerful tool for processing large-scale data in a distributed fashion. In this guide, we walk through setting up a Hadoop cluster, debugging the errors we hit along the way, and running a word-count MapReduce job over Shakespeare’s complete works. The guide captures real-world troubleshooting steps, so the approach is practical and error-resilient.

Setting Up the Hadoop Cluster Manually

Before we can run MapReduce, we need a fully operational Hadoop environment. The first step is to check whether the Hadoop NameNode is running: the NameNode manages the filesystem namespace and must be active for HDFS commands to work. To check, run the following:

hdfs dfsadmin -safemode get

In our case, the command fails with:

safemode: Call From master/172.31.0.216 to master:9000 failed on connection exception: java.net.ConnectException: Connection refused

Debugging the Issue

A quick check using:

jps

shows the following:

885 Master
9655 Jps
6591 QuorumPeerMain

The NameNode process is missing, meaning HDFS isn’t functional.

Manually start the NameNode

hdfs --daemon start namenode

Check again with jps; you should now see:

9840 NameNode
885 Master
9886 Jps
6591 QuorumPeerMain

Now, HDFS is operational.

Exiting Safe Mode

On startup, the NameNode enters safe mode, a read-only state that prevents modifications to the filesystem. We confirm this by running:

hdfs dfsadmin -safemode get
Safe mode is ON

Safe mode is indeed on. To leave it, run:

hdfs dfsadmin -safemode leave

and verify again with:

hdfs dfsadmin -safemode get

Start DataNodes

hdfs --daemon start datanode

Start the ResourceManager

YARN’s ResourceManager schedules MapReduce jobs on the cluster, so it must be running. To start it, type the following on the console:

yarn --daemon start resourcemanager

Then verify with jps:

jps
786 QuorumPeerMain
6773 NameNode
15127 ResourceManager
15367 Jps
7930 DataNode

Start NodeManager

yarn --daemon start nodemanager
jps
17040 Jps
786 QuorumPeerMain
16915 NodeManager
6773 NameNode
15127 ResourceManager
7930 DataNode

Preparing the Input Directory

Create an input directory in HDFS:

hdfs dfs -mkdir -p /user/vivekb/input
hdfs dfs -ls /user/vivekb/

Output should confirm the directory exists:

Found 1 items
drwxr-xr-x   - centos hadoop 0 2025-03-15 21:57 /user/vivekb/input

Checking Cluster Health

Before running MapReduce, it’s crucial to ensure HDFS is working as expected:

hdfs dfsadmin -report
Configured Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
Missing blocks: 54

The report shows no active DataNodes, so start the DataNode manually:

hdfs --daemon start datanode

Check the running services once again:

jps
9840 NameNode
885 Master
13544 DataNode
6591 QuorumPeerMain

Now that a DataNode is running, we check the status again:

hdfs dfsadmin -report
Configured Capacity: 39.99 GB
DFS Remaining: 24.50 GB
Missing blocks: 9

Some blocks are still missing. To identify the missing blocks, run the following:

hdfs fsck / | grep -i 'MISSING'

The output shows multiple missing blocks related to HBase metadata. Since the affected files belong to HBase, start HBase manually:

cd /opt/hbase-2.4.15/bin
./start-hbase.sh

Check the list of running services once more:

jps
9840 NameNode
885 Master
13544 DataNode
14696 HMaster
6591 QuorumPeerMain

Transferring Project Files to the Cluster

Transferring project files to the cluster turned out to be more challenging than expected. Methods like SCP and rsync ran into issues, making them unreliable for our setup. To avoid these problems, we decided to use Git for a more seamless and version-controlled transfer.

By cloning the repository directly onto the cluster, we ensure the latest files are always just a git pull away:

git clone git@github.com:vivekbhadra/BDS_Assignment.git

This approach simplifies file management and keeps the cluster copy in step with the repository.

Uploading the Data File to HDFS

We now need to place the data file for the experiment into the input directory created earlier, /user/vivekb/input. To do that, type the following commands on the console:

[centos@master ~]$ cd BDS_Assignment/
[centos@master BDS_Assignment]$ ls -la 
total 5500
drwxrwxr-x   3 centos centos      76 Mar 16 05:45 .
drwx------. 29 centos centos    4096 Mar 16 05:45 ..
drwxrwxr-x   8 centos centos     163 Mar 16 05:45 .git
-rwxrwxr-x   1 centos centos     498 Mar 16 05:45 mapper.py
-rwxrwxr-x   1 centos centos     405 Mar 16 05:45 reducer.py
-rw-rw-r--   1 centos centos 5618733 Mar 16 05:45 shakespeare.txt
[centos@master BDS_Assignment]$ hdfs dfs -put shakespeare.txt /user/vivekb/input/
[centos@master BDS_Assignment]$ hdfs dfs -ls /user/vivekb/input/
Found 1 items
-rw-r--r--   3 centos hadoop    5618733 2025-03-16 05:58 /user/vivekb/input/shakespeare.txt
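
The repository also provides the mapper.py and reducer.py scripts listed above. Their exact contents belong to the assignment repository, but a typical Hadoop Streaming word-count mapper is only a few lines of Python along these lines (a minimal sketch for reference, not necessarily the version in the repository):

#!/usr/bin/env python3
# Minimal word-count mapper sketch for Hadoop Streaming.
# Reads text from standard input and emits tab-separated "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")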

Running the MapReduce Job

Now we are ready to run the MapReduce job on the cluster using Hadoop Streaming. Type the following in the console:

[centos@master BDS_Assignment]$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.4.jar \
    -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -input /user/vivekb/input/shakespeare.txt \
    -output /user/vivekb/output1
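
Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so all counts for a given word arrive on consecutive lines of standard input. A matching reducer sketch (again illustrative, assuming the tab-separated format emitted by the mapper sketch above) simply sums those consecutive counts:

#!/usr/bin/env python3
# Minimal word-count reducer sketch for Hadoop Streaming.
# Sums the counts for each word, relying on the framework's sort by key.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the final word.
if current_word is not None:
    print(f"{current_word}\t{current_count}")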

Checking the Job Output

To inspect the output of your MapReduce job, read the result file from HDFS:

[centos@master BDS_Assignment]$ hdfs dfs -cat /user/vivekb/output1/part-00000

Automating the Process

After manually setting up and running a Hadoop MapReduce job, the next logical step is to automate the entire workflow. This ensures that the cluster is always correctly configured before execution, that missing input or output directories do not cause errors, and that jobs can be rerun easily.

By using a shell script, we can streamline the following tasks:

  • Starting essential Hadoop services (if not already running)
  • Ensuring the HDFS input directory exists and contains the necessary data
  • Removing old output directories before execution
  • Submitting the MapReduce job
  • Retrieving and displaying results efficiently

Here’s a script, run_mapreduce.sh, that automates the whole workflow:

#!/bin/bash

# Set environment variables (modify paths as necessary)
HDFS_INPUT_DIR="/user/vivekb/input"
HDFS_OUTPUT_DIR="/user/vivekb/output1"
LOCAL_INPUT_FILE="shakespeare.txt"
MAPPER_SCRIPT="mapper.py"
REDUCER_SCRIPT="reducer.py"

# Start Hadoop Services (Only if necessary)
echo "====== Starting Hadoop Services ======"
sudo systemctl start hadoop.service  # Modify as per your setup

# Ensure HDFS is out of safe mode
hdfs dfsadmin -safemode leave

# Check if the input directory exists, create if not
hdfs dfs -test -d "$HDFS_INPUT_DIR"
if [ $? -ne 0 ]; then
    echo "Creating HDFS input directory: $HDFS_INPUT_DIR"
    hdfs dfs -mkdir -p "$HDFS_INPUT_DIR"
fi

# Upload input file if not already in HDFS
hdfs dfs -test -e "$HDFS_INPUT_DIR/$LOCAL_INPUT_FILE"
if [ $? -ne 0 ]; then
    echo "Uploading input file to HDFS..."
    hdfs dfs -put "$LOCAL_INPUT_FILE" "$HDFS_INPUT_DIR/"
fi

# Ensure the output directory does not exist (delete if it does)
hdfs dfs -test -d "$HDFS_OUTPUT_DIR"
if [ $? -eq 0 ]; then
    echo "Removing previous output directory..."
    hdfs dfs -rm -r "$HDFS_OUTPUT_DIR"
fi

# Run the Hadoop Streaming MapReduce Job
echo "====== Running Hadoop MapReduce Job ======"
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files "$MAPPER_SCRIPT","$REDUCER_SCRIPT" \
    -mapper "python3 $MAPPER_SCRIPT" \
    -reducer "python3 $REDUCER_SCRIPT" \
    -input "$HDFS_INPUT_DIR/shakespeare.txt" \
    -output "$HDFS_OUTPUT_DIR"

# Check if the job completed successfully
if [ $? -eq 0 ]; then
    echo "====== Job Completed Successfully! ======"
    echo "====== Output Preview ======"
    hdfs dfs -cat "$HDFS_OUTPUT_DIR/part-00000" | head -20
else
    echo "====== Job Failed. Check Hadoop Logs. ======"
fi

How This Automates the Process

  • Starts Hadoop services automatically if needed.
  • Checks for missing directories and creates them before running the job.
  • Ensures the input data is present in HDFS before execution.
  • Deletes the previous output directory to prevent job failures.
  • Submits the Hadoop MapReduce job and checks whether it completed successfully.
  • Displays a preview of the output to confirm success.

Running the Script

After saving the script as run_mapreduce.sh, grant execute permissions and run it:

chmod +x run_mapreduce.sh
./run_mapreduce.sh
