How to Set Up a Big Data Environment Manually in Linux Part 1 (HADOOP and HIVE)

İsa Eş
7 min read · Jan 17, 2021

Knowing how to set up a big data environment may be considered unnecessary, but in my opinion it is crucial to understand the whole system you are working on. For big data development, data analytics, and so on, you need to install and configure the Apache software at least once. You don't need to be a system admin to know this stuff.

There are companies that offer big data environments as a service and provide big data processing in the cloud. Small companies and low-budget teams generally rely on these services because it is cheaper than installing everything yourself from the bottom up. For learning purposes you can use preconfigured VMs like the Hortonworks Sandbox. But in my opinion, knowing how to install and configure the components is crucial for understanding Apache's big data ecosystem.

In my current job I generally don't need to deal with this configuration; I just need to write my Spark code or HiveQL scripts. But it is necessary for understanding production errors and other kinds of issues, and it helps when providing solutions to other teams and colleagues.

In this tutorial I will explain how to install Hadoop and Hive. By the end of it, you will be able to run a HiveQL script on Hadoop using the MapReduce execution engine.

MapReduce is deprecated as Hive's execution engine, but it remains the default for historical reasons. In the next tutorial we will also install Spark and run our Hive SQL with Spark.

Prerequisites

  • Ubuntu 20.04
  • PostgreSQL (we will use it for the Hive metastore; you can also use a Docker image for this)
  • SSH
  • OpenJDK

Installing Prerequisites

Installing Postgres and JDK and SSH

Install prerequisites.
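
A minimal sketch of these installs on Ubuntu 20.04 could look like the following; the exact package names (OpenJDK 8, the OpenSSH server, and PostgreSQL from the default repositories) are my assumptions, so adjust the JDK version to whatever you plan to point JAVA_HOME at.

sudo apt update
# Assumed packages: OpenJDK 8, the OpenSSH server, and PostgreSQL
sudo apt install -y openjdk-8-jdk openssh-server postgresql postgresql-contrib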

Final checks before starting the installation

Check that the ssh and postgresql services are running. We will need the PostgreSQL database later when configuring the Hive metastore.

service ssh status
service postgresql status
#or
service --status-all
If either service is not running, start it:
sudo service postgresql start
sudo service ssh start
# Alternatives
sudo /etc/init.d/ssh start
sudo /etc/init.d/postgresql start

After starting the ssh service, check whether you can SSH into localhost. You need to be able to SSH into localhost to continue the installation.

ssh localhost

If you cannot SSH into localhost, add your public key to the authorized keys; if you don't have a key pair yet, generate one first.

# Generate a key pair without a passphrase if you don't have one.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Add your public key to the authorized keys.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Check the Java version and JAVA_HOME.

java -version
echo $JAVA_HOME

Make sure all required services are running, and don't forget to check that they start at boot. This is not an issue for the Hadoop installation itself, but if they don't start automatically, Hadoop and Hive will not run properly after the next restart of your computer.
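
On Ubuntu 20.04 with systemd, enabling both services at boot could be done like this (a small sketch; the service names ssh and postgresql are the same ones used above):

sudo systemctl enable ssh
sudo systemctl enable postgresql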

Installing HADOOP

Download and extract Hadoop to the installation destination. We normally install our packages in /opt because /opt is the directory for unbundled packages. We will install all packages under the /opt/ folder.

wget https://ftp.itu.edu.tr/Mirror/Apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
# /opt is usually owned by root, so create the directory with sudo and take ownership of it
sudo mkdir -p /opt/hadoop/
sudo chown $USER: /opt/hadoop/
tar -xvzf hadoop-3.3.0.tar.gz -C /opt/hadoop

Open the following files and add the properties below to the configuration files.

Setting Up Environment Variables For Installation Folder

Do not forget to add these environment variables (for example to your ~/.bashrc). Change the paths according to your setup.

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_HOME=/opt/hadoop/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin

Editing hadoop-env.sh

Open and edit line 54 in hadoop-env.sh

vi +"set number" /opt/hadoop/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

Edit the JAVA_HOME variable; if it does not exist, add it as below.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

You can optionally set these user variables inside this file as well; we will run `hdfs namenode -format` after setting up the other configuration files.

export HDFS_NAMENODE_USER=isaes
export HDFS_DATANODE_USER=isaes
export HDFS_SECONDARYNAMENODE_USER=isaes
export YARN_RESOURCEMANAGER_USER=isaes
export YARN_NODEMANAGER_USER=isaes

Editing core-site.xml

vi  +"set number" /opt/hadoop/hadoop-3.3.0/etc/hadoop/core-site.xml

Delete lines 19 and 20 and add the config below, or just add the property between the configuration tags.

Replace `LINUXUSERNAME` with your own username before saving core-site.xml.
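
As a minimal sketch for a single-node setup, core-site.xml could look like the following. The fs.defaultFS port and the proxyuser entries (which HiveServer2 needs later to impersonate users) are assumptions on my part; replace LINUXUSERNAME with your own user.

<configuration>
  <!-- Default filesystem URI for HDFS on this single node (port 9000 assumed) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Let LINUXUSERNAME impersonate other users; HiveServer2 relies on this -->
  <property>
    <name>hadoop.proxyuser.LINUXUSERNAME.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.LINUXUSERNAME.groups</name>
    <value>*</value>
  </property>
</configuration>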

Editing hdfs-site.xml

vi /opt/hadoop/hadoop-3.3.0/etc/hadoop/hdfs-site.xml
hdfs-site.xml
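
A minimal hdfs-site.xml for a single-node cluster could look like this; the replication factor of 1 and the local storage paths are assumptions, so point them wherever you want HDFS to keep its data.

<configuration>
  <!-- Single node, so keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Assumed local directories for NameNode and DataNode storage -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/hdfs/datanode</value>
  </property>
</configuration>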

Editing mapred-site.xml

vi /opt/hadoop/hadoop-3.3.0/etc/hadoop/mapred-site.xml
mapred-site.xml
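
The essential part of mapred-site.xml is telling MapReduce to run on YARN; the HADOOP_MAPRED_HOME entries below assume the /opt/hadoop/hadoop-3.3.0 install path used above.

<configuration>
  <!-- Run MapReduce jobs on YARN instead of the local runner -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Make the MapReduce runtime visible to YARN containers -->
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.3.0</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.3.0</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.3.0</value>
  </property>
</configuration>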

Editing yarn-site.xml

vi /opt/hadoop/hadoop-3.3.0/etc/hadoop/yarn-site.xml
yarn-site.xml
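
A minimal yarn-site.xml along the lines of a standard single-node setup could be:

<configuration>
  <!-- Shuffle service that MapReduce jobs need -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Pass the usual Hadoop environment variables through to containers -->
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>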

Format the NameNode

If you did not set the PATH variable, you need to go into the `bin` folder inside the Hadoop installation directory and run the command from there.

hdfs namenode -format

Start the DataNode, NameNode, and YARN

We need to go into the `sbin` folder under the Hadoop installation path. We will start the Hadoop daemons and YARN with the scripts inside this folder.

cd /opt/hadoop/hadoop-3.3.0/sbin
./start-dfs.sh
./start-yarn.sh

Checking installation status and Web Interface

Run the following jps command to check the Hadoop and YARN services. You should see output similar to this:

1328 NodeManager
720 DataNode
944 SecondaryNameNode
1169 ResourceManager
534 NameNode
1706 Jps

You can visit the following URLs to check the status: with the default ports, the NameNode web UI is at http://localhost:9870 and the YARN ResourceManager UI is at http://localhost:8088.

INSTALLING HIVE

Download and extract Hive

wget https://kozyatagi.mirror.guzel.net.tr/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
# As with Hadoop, create the target directory with sudo and take ownership of it
sudo mkdir -p /opt/hive/
sudo chown $USER: /opt/hive/
tar -xvzf apache-hive-3.1.2-bin.tar.gz -C /opt/hive

Do not forget to add these environment variables as well. Change the paths according to your setup.

export HIVE_HOME=/opt/hive/apache-hive-3.1.2-bin
export PATH=$HIVE_HOME/bin:$PATH

Editing hive-site.xml

A template for hive-site.xml, named hive-default.xml.template, is in the /opt/hive/apache-hive-3.1.2-bin/conf/ folder. You can check this file for further configuration options.

vi /opt/hive/apache-hive-3.1.2-bin/conf/hive-site.xml
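
As a starting point, a minimal hive-site.xml could contain the warehouse location and the metastore URI; both values below are assumptions for a local single-node setup.

<configuration>
  <!-- Assumed HDFS location for managed tables -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- Standalone metastore service we will start later (default port 9083 assumed) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>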

Creating a user in Postgres for the Hive metastore

Create a user and a database for the Hive metastore, which manages the metadata of persistent relational entities for fast access. Hive uses Apache Derby for this by default, but we will rely on PostgreSQL instead.

sudo -u postgres psql
CREATE USER metaadmin WITH PASSWORD 'metapass';
CREATE DATABASE metastore OWNER metaadmin;

Addition to hive-site.xml

Replace DBUSERNAME and DBPASSWORD with yours, then add these properties between the configuration tags in hive-site.xml.
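
A sketch of the usual JDBC settings for a PostgreSQL-backed metastore looks like this; DBUSERNAME and DBPASSWORD are the placeholders from the text, and the database name metastore matches the javax.jdo.option.ConnectionURL property mentioned below.

<!-- JDBC connection to the PostgreSQL metastore database (named metastore) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://localhost:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>DBUSERNAME</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>DBPASSWORD</value>
</property>

Note that Hive does not bundle the PostgreSQL JDBC driver, so you may need to copy a postgresql JDBC jar into $HIVE_HOME/lib before running the schema tool.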

Initialize the PostgreSQL schema with the schema tool. This tool creates the schema inside the database named metastore, which is the database name we configured in the javax.jdo.option.ConnectionURL property in hive-site.xml.

$HIVE_HOME/bin/schematool -dbType postgres -initSchema
$HIVE_HOME/bin/init-hive-dfs.sh

Start the Hive metastore and hiveserver2 services and check the logs for any errors.

$HIVE_HOME/bin/hive --service metastore
$HIVE_HOME/bin/hive --service hiveserver2

If you get an error while starting the metastore service (with this Hadoop and Hive combination it is typically a Guava version conflict), apply the fix below.

Remove the old Guava jar from Hive, copy the newer one from the Hadoop path into the Hive path, and try again; this should fix the issue.

rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

Check the Beeline connection. It is important to have the correct read/write permissions on HDFS before you start using it.

# Try to connect with beeline
beeline -u jdbc:hive2://
# If you got an error, run the following command again.
$HIVE_HOME/bin/init-hive-dfs.sh
# Check the HDFS write permissions; if the error keeps occurring, try the following commands.
hdfs dfs -chmod 777 /tmp
hdfs dfs -chmod 777 /user/hive/warehouse

Connect with Beeline and try to create a table as a user who has read, write, and execute privileges on the Hadoop file system. This means that if you installed Hadoop with a particular user, you need to connect with that user and password.

  • -n for username
  • -p for password
beeline -u jdbc:hive2://localhost:10000 -n isaes -p passw
CREATE TABLE test_table (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/home/data.txt' INTO TABLE test_table;

Create log folders for the metastore service and hiveserver2, then run the services in the background.

sudo mkdir -p /var/opt/hive/logs/
sudo chown $USER: /var/opt/hive/logs/
touch /var/opt/hive/logs/hive.log
touch /var/opt/hive/logs/hiveserver2.log
nohup $HIVE_HOME/bin/hive --service metastore 1>/var/opt/hive/logs/hive.log 2>&1 &
nohup $HIVE_HOME/bin/hive --service hiveserver2 1>/var/opt/hive/logs/hiveserver2.log 2>&1 &

Running Example Query via Beeline

Create a table named ages; we will calculate the average age.

CREATE TABLE ages (id INT,name STRING,age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

Prepare the example data and load it into the table.

1|Angela|20
2|Kasia|26
3|Chen|24
4|Frank|30
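
Assuming you saved those rows to a local file, for example /home/ages.txt (a hypothetical path), loading them could look like this:

-- /home/ages.txt is a hypothetical path; point it at wherever you saved the sample rows
LOAD DATA LOCAL INPATH '/home/ages.txt' INTO TABLE ages;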

Imagine you are calculating the average age of your website's users.

SELECT AVG(age) as avg_age FROM ages;
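
With the four sample rows above, this should return avg_age = 25.0, since (20 + 26 + 24 + 30) / 4 = 25.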

We will install Spark in the second part and run this query with Spark there.
