Knowing how to set up a big data environment may seem unnecessary, but in my view it is crucial to understand the whole system you are working on. For big data development, data analytics, and similar work, you need to install and configure the Apache software stack at least once. You don't need to be a system administrator to know this stuff.
There are companies that offer big data environments as a service and provide big data processing in the cloud. Small companies and low-budget teams generally rely on these services because it is cheaper than installing everything yourself from the bottom up. For learning purposes you can use pre-configured VMs such as the Hortonworks Sandbox. Still, in my opinion, knowing how to install and configure these tools is crucial for understanding the Apache big data ecosystem.
In my current job I generally don't need this configuration knowledge; I just write my Spark code or HiveQL scripts. But it is necessary for understanding production errors and other kinds of failures, and it helps when providing solutions to other teams and colleagues.
In this tutorial I will explain how to install Hadoop and Hive. By the end of it you will be able to run a HiveQL script on Hadoop with the MapReduce execution engine.
MapReduce is deprecated as Hive's execution engine, but it is still the default for historical reasons. In the next tutorial we will also install Spark and run our Hive SQL on Spark.
- Ubuntu 20.04
- PostgreSQL (we will use it for the Hive metastore; you can also use a Docker image for this)
Installing PostgreSQL, the JDK, and SSH
Final checks before starting the installation
Check that the ssh and postgresql services are running. We will need the PostgreSQL database later when configuring the Hive metastore.
# check the service status
service ssh status
service postgresql status
# start the services if they are not running
sudo service postgresql start
sudo service ssh start
# or start them via the init scripts
sudo /etc/init.d/ssh start
sudo /etc/init.d/postgresql start
After starting the ssh service, check whether you can SSH to localhost. You need to be able to SSH to localhost to continue the installation.
If you cannot SSH to localhost, add your public key to the authorized keys; if you don't have a key pair, initialize one first.
# initialize a key pair without a passphrase if you don't have one
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# if you already have a key pair, just authorize the public key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Check the Java version and `JAVA_HOME`.
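A quick way to verify both (a sketch; the apt package name and JDK path are typical Ubuntu defaults, not taken from this tutorial):

```shell
# Print the active Java version; Hadoop 3.3 works with Java 8 or 11
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found - install it with: sudo apt install openjdk-8-jdk"
fi
# JAVA_HOME should point at the JDK root, e.g. /usr/lib/jvm/java-8-openjdk-amd64
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
```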
Make sure all required services are running, and don't forget to check that they start at boot (for example with `sudo systemctl enable ssh postgresql`). This is not an issue for the Hadoop installation itself, but after a reboot your Hadoop and Hive will not run properly if ssh and PostgreSQL are down.
Download Hadoop and extract it to the installation destination. We install our packages under /opt because /opt is the standard directory for add-on (unbundled) packages.
sudo mkdir -p /opt/hadoop
sudo tar -xvzf hadoop-3.3.0.tar.gz -C /opt/hadoop
Open the following files and add the properties below to the configuration files.
Setting Up Environment Variables For Installation Folder
Do not forget to add these environment variables. Change the paths according to your setup.
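The exact variables aren't listed above; a typical set for this layout, added to `~/.bashrc` (the paths are assumptions based on the /opt locations used in this tutorial), would be:

```shell
# Hadoop installation root used throughout this tutorial (adjust if yours differs)
export HADOOP_HOME=/opt/hadoop/hadoop-3.3.0
# Make the hadoop/hdfs/yarn binaries and the start/stop scripts available on PATH
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
echo "$HADOOP_HOME"
```

After editing, reload with `source ~/.bashrc`.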
Open hadoop-env.sh and edit line 54:
vi +"set number" /opt/hadoop/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
Edit the JAVA_HOME variable; if it does not exist, add it as below.
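The line to add or uncomment looks like this (the JDK path is an assumption; check yours with `update-alternatives --list java`):

```shell
# hadoop-env.sh, around line 54: point Hadoop at the JDK root
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```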
You can optionally set these variables inside this file, but we will run `hdfs namenode -format` only after setting up the other configuration files.
vi +"set number" /opt/hadoop/hadoop-3.3.0/etc/hadoop/core-site.xml
Delete lines 19 and 20 and add the configuration below, or just add the properties between the `<configuration>` tags.
Replace `LINUXUSERNAME` with your own username before saving core-site.xml.
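The original snippet isn't reproduced here; a typical pseudo-distributed core-site.xml, with proxy-user entries for the `LINUXUSERNAME` placeholder (HiveServer2 needs these to impersonate users), looks like this sketch:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.LINUXUSERNAME.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.LINUXUSERNAME.groups</name>
    <value>*</value>
  </property>
</configuration>
```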
Format the NameNode
If you didn't set the PATH variable, you need to go into the `bin` folder inside the Hadoop installation folder and run the command from there.
hdfs namenode -format
Start DataNodes, NameNode, and YARN
Go into the `sbin` folder under the Hadoop installation path. We will start the Hadoop nodes and YARN with the scripts inside this folder.
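The scripts in question are `start-dfs.sh` and `start-yarn.sh`, which ship with Hadoop; a guarded sketch assuming `$HADOOP_HOME` is set:

```shell
# start-dfs.sh launches the HDFS daemons (NameNode, DataNode, SecondaryNameNode);
# start-yarn.sh launches the YARN daemons (ResourceManager, NodeManager)
for script in start-dfs.sh start-yarn.sh; do
  if [ -x "$HADOOP_HOME/sbin/$script" ]; then
    "$HADOOP_HOME/sbin/$script"
  else
    echo "missing: $HADOOP_HOME/sbin/$script (is HADOOP_HOME set?)"
  fi
done
```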
Checking installation status and Web Interface
Run the following `jps` command to check the Hadoop and YARN services. You should see output listing processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.
You can visit the following URLs to check the status.
- HADOOP Summary: <http://localhost:9870/dfshealth.html>
- YARN resource manager: <http://localhost:8088/cluster>
Download and extract Hive
sudo mkdir -p /opt/hive
sudo tar -xvzf apache-hive-3.1.2-bin.tar.gz -C /opt/hive
Do not forget to set these environment variables. Change the paths according to your setup.
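As with Hadoop, the variables themselves aren't shown; a typical addition to `~/.bashrc` (paths assumed from the /opt layout used here) would be:

```shell
# Hive installation root used in this tutorial (adjust if yours differs)
export HIVE_HOME=/opt/hive/apache-hive-3.1.2-bin
# Make the hive, beeline, and schematool binaries available on PATH
export PATH="$PATH:$HIVE_HOME/bin"
echo "$HIVE_HOME"
```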
The template for hive-site.xml, named hive-default.xml.template, is inside the /opt/hive/apache-hive-3.1.2-bin/conf/ folder. You can check this file for further configuration options.
Creating a user in PostgreSQL for the Hive metastore
Create a user and a database for the Hive metastore, which manages the metadata of persistent relational entities for fast access. Hive uses Apache Derby for this by default, but we will rely on PostgreSQL instead.
sudo -u postgres psql
CREATE USER metaadmin WITH PASSWORD 'metapass';
CREATE DATABASE metastore OWNER metaadmin;
Additions to hive-site.xml
Replace DBUSERNAME and DBPASSWORD with your own values, then add these properties between the `<configuration>` tags in hive-site.xml.
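The property list isn't reproduced above; the standard JDBC properties for a PostgreSQL-backed metastore (database name metastore, matching the schematool step; DBUSERNAME/DBPASSWORD are the placeholders) are:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>DBUSERNAME</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>DBPASSWORD</value>
  </property>
</configuration>
```

Note that Hive also needs the PostgreSQL JDBC driver jar on its classpath; drop it into $HIVE_HOME/lib/.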
Initialize the PostgreSQL schema with the schema tool. This tool creates the schema tables in the database named metastore; we configured this database name in the javax.jdo.option.ConnectionURL property in hive-site.xml.
$HIVE_HOME/bin/schematool -dbType postgres -initSchema
Start the Hive metastore and hiveserver2 services and check the logs for any errors.
$HIVE_HOME/bin/hive --service metastore
$HIVE_HOME/bin/hive --service hiveserver2
If there is an error while starting the metastore server like the one in the image below (a Guava version conflict), copy the Guava jar from the Hadoop path to the Hive path and try again; that fixes the issue.
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
Check the beeline connection. It is important to start with the correct read/write permissions on HDFS.
# Try to connect with beeline
beeline -u jdbc:hive2://
# If you got an error, run the following command again.
$HIVE_HOME/bin/init-hive-dfs.sh
# If HDFS write-permission errors keep occurring, try the following commands.
hdfs dfs -chmod 777 /tmp
hdfs dfs -chmod 777 /user/hive/warehouse
Connect with beeline and try to create a table as a user who has read, write, and execute privileges on the Hadoop file system. This means that if you installed Hadoop as a particular user, you need to connect with that user and password.
beeline -u jdbc:hive2://localhost:10000 -n isaes -p passw

CREATE TABLE test_table (
  id INT,
  name STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/home/data.txt' INTO TABLE test_table;
Create log folders for the metastore and hiveserver2 services and run the services in the background.
sudo mkdir -p /var/opt/hive/logs/
sudo chown "$USER" /var/opt/hive/logs/
nohup $HIVE_HOME/bin/hive --service metastore 1>/var/opt/hive/logs/hive.log 2>&1 &
nohup $HIVE_HOME/bin/hive --service hiveserver2 1>/var/opt/hive/logs/hiveserver2.log 2>&1 &
Running Example Query via Beeline
Create a table named ages; we will calculate the average age.
CREATE TABLE ages (
  id INT,
  name STRING,
  age INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Prepare example data and load it into the table.
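For instance (the file name and values here are made up for illustration), a small pipe-delimited file matching the ages schema:

```shell
# Create a pipe-delimited sample file matching the ages table schema
cat > /tmp/ages.txt <<'EOF'
1|alice|34
2|bob|28
3|carol|45
EOF
# Copy it into HDFS so LOAD DATA INPATH can see it (requires the cluster running)
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put -f /tmp/ages.txt /tmp/ages.txt
fi
```

Then, from beeline: `LOAD DATA INPATH '/tmp/ages.txt' INTO TABLE ages;`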
Maybe you are calculating the average age of your website's users.
SELECT AVG(age) as avg_age FROM ages;
In the second part we will install Spark and run this same query on Spark.