Run WordCount on a 3-Node YARN Cluster
This follows the guide How to Install and Set Up a 3-Node Hadoop Cluster.
Create three Ubuntu 16.04 hosts with Vagrant on a Lenovo ThinkCentre. Configure the network, hostname, and IP address in each Vagrantfile. Give each host at least 8GB of memory; otherwise you have to configure YARN's memory usage manually. The node-master Vagrantfile:
# -*- mode: ruby -*-
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/xenial64"
  config.ssh.username = "ubuntu"
  config.ssh.password = "3d7d18ebe09a49ff99028120"
  config.vm.define "yarnmaster"
  # hostname must match the /etc/hosts entries added below
  config.vm.hostname = "node-master"
  config.vm.network "private_network", ip: "192.0.2.1"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "8192"
  end
end
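The node1 and node2 Vagrantfiles differ only in the machine name, hostname, and IP address. A sketch for node1 (the machine name yarnnode1 is an assumption):
# -*- mode: ruby -*-
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/xenial64"
  config.ssh.username = "ubuntu"
  config.ssh.password = "3d7d18ebe09a49ff99028120"
  config.vm.define "yarnnode1"   # assumed machine name
  config.vm.hostname = "node1"   # must match /etc/hosts below
  config.vm.network "private_network", ip: "192.0.2.2"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "8192"
  end
end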
Add the following lines to /etc/hosts on each host:
192.0.2.1 node-master
192.0.2.2 node1
192.0.2.3 node2
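To verify that the names resolve and the hosts can reach each other:
ping -c 1 node1
ping -c 1 node2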
Install JDK on each host:
sudo apt update
sudo apt install -y openjdk-8-jdk
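Confirm the JDK is installed:
java -version # should report openjdk version 1.8.x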
Create a hadoop user on each host:
sudo useradd -m hadoop
sudo passwd hadoop # input password: hadoop
Configure passwordless SSH access. On node-master:
sudo su - hadoop
ssh-keygen
ssh-copy-id hadoop@node-master
ssh-copy-id hadoop@node1
ssh-copy-id hadoop@node2
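Verify passwordless login before continuing:
ssh node1 hostname # should print node1 without a password prompt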
Install Hadoop as user hadoop on node-master; it will be copied to the other nodes later:
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz
tar xf hadoop-2.9.0.tar.gz
mv hadoop-2.9.0 hadoop
echo 'PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH' >> $HOME/.profile
source $HOME/.profile # pick up the new PATH in the current shell
Modify the configuration files under ~/hadoop/etc/hadoop/; a sketch of the relevant files follows.
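A minimal sketch based on the referenced guide; the data directories and port are the guide's defaults, not verified against this cluster (for mapred-site.xml you may need to cp mapred-site.xml.template mapred-site.xml first).
In hadoop-env.sh, point JAVA_HOME at the JDK:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
core-site.xml (default filesystem):
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>
hdfs-site.xml (storage directories and replication factor):
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
mapred-site.xml (run MapReduce jobs on YARN):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml (point the NodeManagers at the ResourceManager):
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node-master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
slaves (one worker hostname per line):
node1
node2
Then duplicate the configured tree and the profile to each node: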
scp -r hadoop node1:~/
scp -r hadoop node2:~/
scp .profile node1:~/
scp .profile node2:~/
Note 1: Spark 2.2 is built for Hadoop 2.7+, so we download Hadoop 2.9.0.
Note 2: Each host has 8GB of memory, so I didn't modify the memory configurations.
Start HDFS. On node-master, run:
hdfs namenode -format
start-dfs.sh
jps # on node-master: NameNode and SecondaryNameNode; on node1/2: DataNode
hdfs dfsadmin -report
hdfs dfs -mkdir -p /user/hadoop
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
hdfs dfs -mkdir books
hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
# on node1:
hdfs dfs -get books/alice.txt
Start YARN and run a WordCount app:
start-yarn.sh
jps # on node-master: ResourceManager; on node1/2: NodeManager
yarn node -list # this works on both node-master and node1/2
yarn application -list
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar wordcount "books/*" output
hdfs dfs -ls output
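The counts are written to part files under output. Assuming the example's default single reducer, inspect the result with:
hdfs dfs -cat output/part-r-00000 | head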
Run Spark on YARN
This follows the guide Install, Configure, and Run Spark on Top of a Hadoop YARN Cluster.
Download and install Spark on each host (as user hadoop):
cd /home/hadoop
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar -xvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark
Add the following lines to ~/.profile:
export PATH=/home/hadoop/spark/bin:$PATH
export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export SPARK_HOME=/home/hadoop/spark
export LD_LIBRARY_PATH=/home/hadoop/hadoop/lib/native:$LD_LIBRARY_PATH
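The referenced guide also sets Spark's default master and container memory in $SPARK_HOME/conf/spark-defaults.conf. A minimal sketch; the 512m values are the guide's suggestion for small containers and are an assumption here:
spark.master            yarn
spark.driver.memory     512m
spark.yarn.am.memory    512m
spark.executor.memory   512m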
Running spark-shell on node-master raised some errors, and inside the shell the spark variable doesn't exist. (When SparkSession initialization fails, for example because YARN can't allocate the requested container memory, spark is never defined.)
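When the shell fails to create a session, the YARN application logs usually show the root cause. The application ID below is a placeholder; take the real one from the list:
yarn application -list -appStates ALL # find the failed application's ID
yarn logs -applicationId <application-id>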