How to Install Hadoop on Linux

Apache Hadoop is an open-source, distributed computing framework designed to process large volumes of data across clusters of computers. It is particularly useful for big data processing and analysis.

In this step-by-step guide, we will show you how to install Hadoop on a Linux system, specifically on Ubuntu. By following these steps, you will be able to set up a single-node Hadoop cluster for development and testing purposes.

Step 1: Install Java Development Kit (JDK)

Hadoop requires Java to run, so you need to install the Java Development Kit (JDK) on your system. For this guide, we will install OpenJDK 8.

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

After installation, verify that Java is installed correctly by checking the version:

java -version
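
If you want to confirm where the JDK is installed (the same path is used for JAVA_HOME in Step 4), you can resolve the location of the java binary:

readlink -f /usr/bin/java
# On Ubuntu this typically resolves to a path under /usr/lib/jvm/java-8-openjdk-amd64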

Step 2: Create a Dedicated Hadoop User

For security purposes, it’s a good practice to create a dedicated user for Hadoop:

sudo adduser --gecos "" hadoop

Set a strong password for the new user when prompted.
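
Note that a user created with adduser has no sudo privileges by default on Ubuntu. The mv command in Step 3 writes to /usr/local, which requires root, so either run that one command from your regular administrative account or, if you prefer, grant the hadoop user sudo rights (optional; shown here only as one possible approach):

sudo usermod -aG sudo hadoop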

Step 3: Download and Install Hadoop

As the newly created Hadoop user, download and install Hadoop:

Switch to the Hadoop user:

sudo su - hadoop

Download Hadoop from the official Apache archive (adjust the version number in the link if you want a newer release):

wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz

Extract the downloaded archive:

tar -xzf hadoop-3.3.0.tar.gz

Move the extracted directory to its permanent location, such as /usr/local/hadoop (this command requires root privileges):

sudo mv hadoop-3.3.0 /usr/local/hadoop

Step 4: Configure Hadoop Environment Variables

To configure Hadoop environment variables, edit the ~/.bashrc file and add the following lines at the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Save the changes and load the new environment variables:

source ~/.bashrc
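
To confirm that the Hadoop binaries are now on your PATH, you can print the installed version (a quick sanity check):

hadoop version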

Step 5: Configure Hadoop

To configure Hadoop for a single-node setup, edit the following configuration files located in the $HADOOP_HOME/etc/hadoop directory:

core-site.xml: Set the Hadoop temporary directory and the default file system.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
</configuration>
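
The directory given for hadoop.tmp.dir must exist and be writable by the hadoop user. Assuming you keep the /app/hadoop/tmp value shown above, you can create it from an account with sudo rights:

sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop:hadoop /app/hadoop/tmp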

hdfs-site.xml: Configure the Hadoop Distributed File System (HDFS) replication factor.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml: Configure the MapReduce framework. In Hadoop 3.x this file already exists in $HADOOP_HOME/etc/hadoop; only on older Hadoop 2.x releases do you first need to create it from the template:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

Edit mapred-site.xml and set the MapReduce framework to YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml: Configure the YARN resource manager.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
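
In addition to the XML files above, the Hadoop start scripts do not always pick up JAVA_HOME from your login shell, so it is common to set it explicitly in hadoop-env.sh as well (a minimal sketch, assuming the OpenJDK 8 path used in Step 4):

echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh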

Step 6: Format HDFS

Before starting Hadoop for the first time, format the Hadoop Distributed File System (HDFS) NameNode as the hadoop user. This only needs to be done once; reformatting an existing installation erases HDFS metadata:

hdfs namenode -format

Step 7: Start Hadoop Services

Start the Hadoop services, including the HDFS NameNode, DataNode, YARN ResourceManager, and NodeManager:

start-dfs.sh
start-yarn.sh
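
Note that the start scripts use SSH to launch each daemon, even on a single node. If they fail with permission-denied or connection-refused errors, configure passwordless SSH to localhost for the hadoop user and run them again (a minimal sketch, assuming openssh-server is installed):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost   # should now log in without a password prompt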

Verify that Hadoop services are running:

jps

You should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (the Jps command itself will also appear in the list).

Commands Mentioned:

  • apt-get update – Update package repositories
  • apt-get install – Install specified packages and their dependencies
  • adduser – Add a new user to the system
  • wget – Download files from the internet
  • tar – Extract or create compressed files
  • mv – Move or rename files and directories
  • source – Load a file’s contents into the current shell environment
  • cp – Copy files and directories
  • hdfs – Interact with the Hadoop Distributed File System
  • start-dfs.sh – Start HDFS services
  • start-yarn.sh – Start YARN services
  • jps – Display Java Virtual Machine processes

Conclusion

In this guide, we’ve demonstrated how to install Hadoop on a Linux system, specifically on Ubuntu. You should now have a single-node Hadoop cluster for development and testing purposes. To further explore Hadoop, you can try running MapReduce jobs, creating HDFS directories, and uploading or downloading files to and from HDFS.
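
For example, once the services are running you can create a home directory in HDFS, copy a local file into it, and run one of the bundled example MapReduce jobs (a quick sanity check, assuming the Hadoop 3.3.0 paths used in this guide):

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /etc/hosts /user/hadoop/
hdfs dfs -ls /user/hadoop
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10

If the example job fails with a class-not-found error for MRAppMaster, a commonly needed fix on Hadoop 3.x is to set HADOOP_MAPRED_HOME for the MapReduce containers (for example via the yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env properties in mapred-site.xml).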

Please note that the single-node setup presented in this guide is not suitable for production environments. For a production-ready Hadoop cluster, you will need to configure a multi-node setup, which involves configuring multiple machines and adjusting Hadoop settings to distribute data and processing tasks across the cluster.

When setting up a production environment, it’s essential to consider factors such as hardware requirements, network configuration, and security measures like user authentication and authorization. Additionally, you may want to look into various Hadoop ecosystem components, such as Apache Hive, Apache HBase, Apache Spark, and Apache Pig, to enhance your data processing and analysis capabilities.

We hope this guide has helped you successfully install Hadoop on your Linux system and provided a foundation for further exploration of the Hadoop ecosystem. If you have any questions, comments, or suggestions for improvement, please feel free to share your thoughts in the comments section below.
