Run WordCount on a 3-Node YARN Cluster
This follows the guide How to Install and Set Up a 3-Node Hadoop Cluster.
Create three Ubuntu 16.04 hosts with Vagrant on a Lenovo ThinkCentre. Configure the network, hostname, and IP address in each Vagrantfile. Give each host at least 8GB of memory; otherwise you have to configure YARN's memory usage manually. The node-master Vagrantfile:
# -*- mode: ruby -*-
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/xenial64"
  config.ssh.username = "ubuntu"
  config.ssh.password = "3d7d18ebe09a49ff99028120"
  config.vm.define "yarnmaster"
  # hostname must match the /etc/hosts entries added below
  config.vm.hostname = "node-master"
  config.vm.network "private_network", ip: "192.0.2.1"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "8192"
  end
end
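The node1 and node2 Vagrantfiles differ only in the machine name, hostname, and IP address. A sketch for node1 (the machine name yarnnode1 is an assumption):
# -*- mode: ruby -*-
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/xenial64"
  config.ssh.username = "ubuntu"
  config.ssh.password = "3d7d18ebe09a49ff99028120"
  config.vm.define "yarnnode1"   # assumed machine name
  config.vm.hostname = "node1"   # must match /etc/hosts below
  config.vm.network "private_network", ip: "192.0.2.2"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "8192"
  end
end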
Add the following lines to /etc/hosts on each host:
192.0.2.1 node-master
192.0.2.2 node1
192.0.2.3 node2
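To verify that the names resolve and the hosts can reach each other:
ping -c 1 node1
ping -c 1 node2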
Install JDK on each host:
sudo apt update
sudo apt install -y openjdk-8-jdk
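Confirm the JDK is installed:
java -version # should report openjdk version 1.8.x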
Create a hadoop user on each host:
sudo useradd -m hadoop
sudo passwd hadoop # input password: hadoop
Configure passwordless SSH access. On node-master:
sudo su - hadoop
ssh-keygen
ssh-copy-id hadoop@node-master
ssh-copy-id hadoop@node1
ssh-copy-id hadoop@node2
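Verify passwordless login before continuing:
ssh node1 hostname # should print node1 without a password prompt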
Install Hadoop as user hadoop on node-master; it will be copied to the other nodes later:
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz
tar xf hadoop-2.9.0.tar.gz
mv hadoop-2.9.0 hadoop
echo 'PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH' >> $HOME/.profile
source $HOME/.profile # pick up the new PATH in the current shell
Modify the configuration files under ~/hadoop/etc/hadoop/; a sketch of the relevant files follows.
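A minimal sketch based on the referenced guide; the data directories and port are the guide's defaults, not verified against this cluster (for mapred-site.xml you may need to cp mapred-site.xml.template mapred-site.xml first).
In hadoop-env.sh, point JAVA_HOME at the JDK:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
core-site.xml (default filesystem):
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>
hdfs-site.xml (storage directories and replication factor):
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
mapred-site.xml (run MapReduce jobs on YARN):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml (point the NodeManagers at the ResourceManager):
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node-master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
slaves (one worker hostname per line):
node1
node2
Then duplicate the configured tree and the profile to each node: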
scp -r hadoop node1:~/
scp -r hadoop node2:~/
scp .profile node1:~/
scp .profile node2:~/
Note 1: Spark 2.2 is built for Hadoop 2.7+, so we download Hadoop 2.9.0.
Note 2: Each host has 8GB of memory, so I didn't modify the memory configurations.
Start HDFS. On node-master, run:
hdfs namenode -format
start-dfs.sh
jps # on node-master: NameNode and SecondaryNameNode; on node1/2: DataNode
hdfs dfsadmin -report
hdfs dfs -mkdir -p /user/hadoop
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
hdfs dfs -mkdir books
hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
# on node1:
hdfs dfs -get books/alice.txt
Start YARN and run a WordCount app:
start-yarn.sh
jps # on node-master: ResourceManager; on node1/2: NodeManager
yarn node -list # this works on both node-master and node1/2
yarn application -list
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar wordcount "books/*" output
hdfs dfs -ls output
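The counts are written to part files under output. Assuming the example's default single reducer, inspect the result with:
hdfs dfs -cat output/part-r-00000 | head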
Run Spark on YARN
This follows the guide Install, Configure, and Run Spark on Top of a Hadoop YARN Cluster.
Download and install Spark on each host (as user hadoop):
cd /home/hadoop
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar -xvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark
Add the following lines to ~/.profile:
export PATH=/home/hadoop/spark/bin:$PATH
export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export SPARK_HOME=/home/hadoop/spark
export LD_LIBRARY_PATH=/home/hadoop/hadoop/lib/native:$LD_LIBRARY_PATH
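The referenced guide also sets Spark's default master and container memory in $SPARK_HOME/conf/spark-defaults.conf. A minimal sketch; the 512m values are the guide's suggestion for small containers and are an assumption here:
spark.master            yarn
spark.driver.memory     512m
spark.yarn.am.memory    512m
spark.executor.memory   512m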
Running spark-shell on node-master raised some errors, and inside the shell the spark variable doesn't exist. (When SparkSession initialization fails, for example because YARN can't allocate the requested container memory, spark is never defined.)
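When the shell fails to create a session, the YARN application logs usually show the root cause. The application ID below is a placeholder; take the real one from the list:
yarn application -list -appStates ALL # find the failed application's ID
yarn logs -applicationId <application-id>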