Apache Hadoop is an open-source, distributed computing framework designed to process large volumes of data across clusters of computers. It is particularly useful for big data processing and analysis.
In this step-by-step guide, we will show you how to install Hadoop on a Linux system, specifically on Ubuntu. By following these steps, you will be able to set up a single-node Hadoop cluster for development and testing purposes.
Step 1: Install Java Development Kit (JDK)
Hadoop requires Java to run, so you need to install the Java Development Kit (JDK) on your system. For this guide, we will install OpenJDK 8.
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
After installation, verify that Java is installed correctly by checking the version:
java -version
Step 2: Create a Dedicated Hadoop User
For security purposes, it’s a good practice to create a dedicated user for Hadoop:
sudo adduser --gecos "" hadoop
Set a strong password for the new user when prompted.
Step 3: Download and Install Hadoop
As the newly created Hadoop user, download and install Hadoop:
Switch to the Hadoop user:
sudo su - hadoop
Download Hadoop from the official Apache website (replace the version number in the link with the latest release; older releases such as 3.3.0 are served from the Apache archive):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Extract the downloaded archive:
tar -xzf hadoop-3.3.0.tar.gz
Move the extracted folder to the desired location, such as /usr/local/:
sudo mv hadoop-3.3.0 /usr/local/hadoop
Note that the hadoop user created in Step 2 is not in the sudo group by default. If this command fails, run the mv from your administrative account, then transfer ownership with sudo chown -R hadoop:hadoop /usr/local/hadoop.
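The download, extract, and move steps above can be parameterized by version so the same commands track newer releases. This is a sketch only; the archive mirror URL is an assumption you should verify against the Apache Hadoop downloads page.

```shell
# Sketch: build the release URL from a version number (the mirror URL is an
# assumption -- confirm it on the Apache Hadoop downloads page).
HADOOP_VERSION="3.3.0"
HADOOP_TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/${HADOOP_TARBALL}"

echo "Tarball: ${HADOOP_TARBALL}"
echo "URL:     ${HADOOP_URL}"

# To actually fetch and install, uncomment:
# wget "${HADOOP_URL}"
# tar -xzf "${HADOOP_TARBALL}"
# sudo mv "hadoop-${HADOOP_VERSION}" /usr/local/hadoop
```

Changing HADOOP_VERSION is then the only edit needed when a new release comes out.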
Step 4: Configure Hadoop Environment Variables
To configure Hadoop environment variables, edit the ~/.bashrc file and add the following lines at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Save the changes and load the new environment variables into the current shell:
source ~/.bashrc
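Before moving on, it is worth confirming that the variables actually resolved. The sketch below uses bash's indirect expansion to report each one; the paths match this guide, so adjust JAVA_HOME if your JDK lives elsewhere.

```shell
# Sketch: sanity-check the environment before continuing. Paths follow the
# guide; adjust JAVA_HOME if your JDK is installed in a different location.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

check_var() {
  # Report whether the named variable is set to a non-empty value.
  local name="$1"
  if [ -n "${!name}" ]; then
    echo "$name=${!name}"
  else
    echo "WARNING: $name is not set"
  fi
}

check_var HADOOP_HOME
check_var JAVA_HOME
```

A WARNING line here usually means the exports were added to the wrong user's ~/.bashrc or the file was not re-sourced.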
Step 5: Configure Hadoop
To configure Hadoop for a single-node setup, edit the following configuration files located in the $HADOOP_HOME/etc/hadoop directory:
core-site.xml: Set the default file system and the Hadoop temporary directory. Create the temporary directory beforehand and make sure the hadoop user owns it (e.g. sudo mkdir -p /app/hadoop/tmp && sudo chown hadoop:hadoop /app/hadoop/tmp).
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
</configuration>
hdfs-site.xml: Configure the Hadoop Distributed File System (HDFS) replication factor.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml: Configure the MapReduce framework. In Hadoop 3.x this file already exists in the configuration directory; only in Hadoop 2.x must you first create it from the bundled template:
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
Then edit mapred-site.xml and set the MapReduce framework:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml: Configure YARN's NodeManager auxiliary services so the MapReduce shuffle is available.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
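The edits above can also be applied non-interactively with heredocs, which is handy when scripting the setup. The sketch below writes two of the four files into a scratch directory so you can inspect the result before copying anything into $HADOOP_HOME/etc/hadoop; the other two files follow the same pattern.

```shell
# Sketch: generate core-site.xml and hdfs-site.xml into a scratch directory.
# Inspect the output, then copy the files into $HADOOP_HOME/etc/hadoop.
CONF_OUT="$(mktemp -d)"

cat > "${CONF_OUT}/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
</configuration>
EOF

cat > "${CONF_OUT}/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "Wrote config files to ${CONF_OUT}"
```

The quoted heredoc delimiter ('EOF') prevents the shell from expanding anything inside the XML, so the files land byte-for-byte as written.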
Step 6: Format HDFS
Before starting Hadoop for the first time, format the Hadoop Distributed File System (HDFS) as the hadoop user. Do this only once; reformatting an existing cluster erases all HDFS metadata:
hdfs namenode -format
Step 7: Start Hadoop Services
Start the Hadoop services, including the HDFS NameNode, DataNode, YARN ResourceManager, and NodeManager:
start-dfs.sh
start-yarn.sh
Verify that Hadoop services are running:
jps
You should see the following processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.
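Eyeballing jps output works, but the check is easy to script. The sketch below is a hedged helper: it assumes jps is on your PATH (it is, if the JDK from Step 1 is installed) and simply reports each expected daemon as running or missing.

```shell
# Sketch: report which expected Hadoop daemons appear in jps output.
# If jps is unavailable or nothing is running, every daemon is reported
# as NOT running rather than failing with an error.
check_daemons() {
  local jps_output proc
  jps_output="$(jps 2>/dev/null || true)"
  for proc in NameNode DataNode ResourceManager NodeManager; do
    if printf '%s\n' "$jps_output" | grep -qw "$proc"; then
      echo "$proc: running"
    else
      echo "$proc: NOT running"
    fi
  done
}

check_daemons
```

A daemon that repeatedly shows NOT running usually has a useful stack trace in $HADOOP_HOME/logs.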
Here is a quick reference for the commands used in this guide:
- apt-get update – Update package repositories
- apt-get install – Install specified packages and their dependencies
- adduser – Add a new user to the system
- wget – Download files from the internet
- tar – Extract or create compressed files
- mv – Move or rename files and directories
- source – Load a file’s contents into the current shell environment
- cp – Copy files and directories
- hdfs – Interact with the Hadoop Distributed File System
- start-dfs.sh – Start HDFS services
- start-yarn.sh – Start YARN services
- jps – Display Java Virtual Machine processes
In this guide, we’ve demonstrated how to install Hadoop on a Linux system, specifically on Ubuntu. You should now have a single-node Hadoop cluster for development and testing purposes. To further explore Hadoop, you can try running MapReduce jobs, creating HDFS directories, and uploading or downloading files to and from HDFS.
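As a concrete next step, here is a hedged sketch of a first smoke test. It only prints the commands rather than executing them, since it assumes the cluster from this guide is up and the 3.3.0 examples jar is at the path shown; change the run() helper to execute for real.

```shell
# Dry-run sketch of a first MapReduce smoke test. run() only prints each
# command; replace the echo with "$@" to actually execute against a
# running cluster. The examples jar path assumes the 3.3.0 install above.
run() { echo "+ $*"; }

EXAMPLES_JAR="/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar"

run hdfs dfs -mkdir -p /user/hadoop/input
run hdfs dfs -put /usr/local/hadoop/etc/hadoop/core-site.xml /user/hadoop/input
run hadoop jar "$EXAMPLES_JAR" wordcount /user/hadoop/input /user/hadoop/output
run hdfs dfs -cat /user/hadoop/output/part-r-00000
```

Printing first and executing second is a cheap way to review a sequence of HDFS-mutating commands before committing to them.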
Please note that the single-node setup presented in this guide is not suitable for production environments. For a production-ready Hadoop cluster, you will need to configure a multi-node setup, which involves configuring multiple machines and adjusting Hadoop settings to distribute data and processing tasks across the cluster.
When setting up a production environment, it’s essential to consider factors such as hardware requirements, network configuration, and security measures like user authentication and authorization. Additionally, you may want to look into various Hadoop ecosystem components, such as Apache Hive, Apache HBase, Apache Spark, and Apache Pig, to enhance your data processing and analysis capabilities.
We hope this guide has helped you successfully install Hadoop on your Linux system and provided a foundation for further exploration of the Hadoop ecosystem. If you have any questions, comments, or suggestions for improvement, please feel free to share your thoughts in the comments section below. Your feedback is invaluable to us, and it helps us create better and more informative content for our users.