As part of this section, we will see how to set up Hive in the Cloudera distribution. We will also understand important concepts related to Hive.

  • Setup Hive and Impala
  • Validating Hive and Impala
  • Components and Properties of Hive
  • Troubleshooting Hive issues
  • Hive Commands and Queries – Overview
  • Different Query Engines
  • Components and Properties of Impala
  • Running Queries using Impala – Overview

Hive QL is an abstraction over Map Reduce, like Pig, but its interface is closer to SQL. Impala is also a query engine, but instead of Map Reduce it uses its own approach to process data when running queries. We can accelerate application development in the Hadoop ecosystem by using Hive to write queries, which generate Map Reduce code to process the data.
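For example, a simple aggregation written in Hive QL (using the orders table we will create later during validation) is compiled into one or more Map Reduce jobs behind the scenes:

-- Hive compiles this into a Map Reduce job at runtime
SELECT order_status, count(1)
FROM orders
GROUP BY order_status;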

Cluster Topology

We are setting up the cluster on 7+1 nodes. We start with 7 nodes and then we will add one more node later.

  • Gateway(s) and Management Service
    • bigdataserver-1
  • Masters
    • bigdataserver-2
      • Zookeeper
      • Active/Standby Namenode
    • bigdataserver-3
      • Zookeeper
      • Active/Standby Namenode
      • Active/Standby Resource Manager
      • Impala State Store
    • bigdataserver-4
      • Zookeeper
      • Active/Standby Resource Manager
      • Job History Server
      • Spark History Server
      • Hive Server and Hive Metastore
      • Impala Catalog
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode, Node Manager, Impala Daemon
    • bigdataserver-6 – Datanode, Node Manager, Impala Daemon
    • bigdataserver-7 – Datanode, Node Manager, Impala Daemon

Learning Process

We need to follow the standard process while setting up any software.

  • Downloading and Installing – already taken care of as part of adding hosts to the cluster.
  • Configuration – we need to understand the architecture and plan for the configuration.
    • Hive has three components – DDL or Physical Modeling, copying data (Load and Insert) and querying data (Hive QL).
    • Configure the Hive Metastore Server – set up a database or use an existing one (a MySQL database was already created in the previous step).
    • Configuration Files – hive-site.xml and .hiverc
    • Impala has both master and slave components as it does not use Map Reduce. Impala State Store and Impala Catalog Server are master components, while Impala Daemons (impalad) are slaves.
  • Service logs
    • Wherever Hive is running, it will create a hive.log file under /tmp/<username>
    • Service log files are saved under /var/log/hive (see the snippet after this list for a quick way to check both locations)
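A quick way to check both log locations (a minimal sketch; these are the default paths mentioned above and may differ if overridden in log4j.properties):

# Query/CLI log created by Hive for the current OS user
tail -n 50 /tmp/$(whoami)/hive.log

# Service logs for Hive Metastore and HiveServer2 on the node where they run
ls -l /var/log/hive/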

Setup Hive and Impala

Let us see how to setup Hive on an existing cluster using Cloudera Manager.

Configure Hive

  • Go to Add Service and select Hive
  • Assign roles or select the hosts to install the components – Gateway, HiveServer2 and Hive Metastore Server. We will be configuring the Gateway on all nodes and HiveServer2 as well as the Hive Metastore Server on bigdataserver-4.
  • We need to make sure the MySQL Java (JDBC) connector is installed on bigdataserver-4 (see the snippet after this list).
  • Review important properties and customize if required.
  • Complete installation – Run the installer and start each component.
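One way to install and verify the connector on bigdataserver-4 (a sketch; the package name and jar location below are typical for RHEL/CentOS based installs and may differ in your environment):

# Install the MySQL JDBC connector (package name may vary by OS)
sudo yum install -y mysql-connector-java

# Confirm the jar is where Cloudera Manager expects to find it
ls -l /usr/share/java/mysql-connector-java.jar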

Configure Impala

  • Go to Add Service and select Impala
  • Assign roles or select the hosts to install the components – Impala State Store, Impala Catalog Server and Impala Daemons.
  • There are no Gateways with respect to Impala. We need to specify one of the servers on which an Impala Daemon is running while launching impala-shell to run commands or queries (see the example after this list).
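In our topology the Impala Daemons run on the worker nodes, so impala-shell can be pointed at any of them (the hostname is from our cluster; -i is the impala-shell option for specifying the daemon to connect to):

impala-shell -i bigdataserver-5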

Validating Hive and Impala

Now that Hive and Impala are configured on the cluster, let us validate them by running some Hive as well as Impala queries.

  • Both Hive and Impala have CLIs
  • We can launch the Hive CLI using the hive command, while we need to specify one of the servers on which an Impala Daemon is running to launch impala-shell
  • Create a sample table
  • Load data into the table
  • Run queries which use a Map Reduce job or an Impala job to process the data.
  • While running these commands or queries, let us also see what happens in the Hive Metastore Database as well as in HDFS when the table is being created.
  • When the table is created, there will be metadata associated with the table. That metadata will be stored in the database we have configured as part of the Hive Metastore (in our case the MySQL database named Hive running on bigdataserver-1).
  • We can connect to the MySQL database and review the list of tables used for the metastore (see the snippet after this list).
  • We will see how to troubleshoot in case of any issues at a later point in time.
  • In case we create a table in Hive and it is not visible in Impala, we need to run the INVALIDATE METADATA command from impala-shell.
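A quick way to inspect the metastore after creating a table (a sketch; the host and database name come from our setup, the MySQL user is a placeholder, and TBLS is one of the standard tables in the Hive metastore schema):

# Connect to the metastore database on bigdataserver-1 (credentials are placeholders)
mysql -h bigdataserver-1 -u hiveuser -p

-- Inside MySQL: list the metastore tables and the Hive tables registered in them
USE hive;  -- database name as configured for the metastore (Hive in our setup)
SHOW TABLES;
SELECT TBL_NAME, TBL_TYPE FROM TBLS;

-- In impala-shell: make a table created in Hive visible to Impala
INVALIDATE METADATA;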

Validate Hive by Running Commands and Queries

Validate Impala by Running Commands and Queries


cdh-validate-01-hive.sh

# Run this in Hive Console
CREATE DATABASE retail;
USE retail;

CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_customer_id INT,
  order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/itversity/data/retail_db/orders' INTO TABLE orders;

SELECT * FROM orders LIMIT 10;

SELECT order_status, count(1) FROM orders
GROUP BY order_status;

cdh-validate-02-Impala.sh

# Run the below statement in impala-shell
CREATE TABLE order_items (
  order_item_id INT,
  order_item_order_id INT,
  order_item_product_id INT,
  order_item_quantity INT,
  order_item_subtotal FLOAT,
  order_item_product_price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

# Run the below command in Hive CLI
LOAD DATA LOCAL INPATH '/home/itversity/data/retail_db/order_items' INTO TABLE order_items;

# Run the below queries as part of impala-shell to validate
SELECT * FROM order_items LIMIT 10;

SELECT order_item_order_id, sum(order_item_subtotal) FROM order_items
GROUP BY order_item_order_id;
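Because the data was loaded through the Hive CLI, Impala may not see it immediately; if the SELECT statements above return nothing, refreshing the table metadata in impala-shell should help (REFRESH is the standard Impala statement for this):

REFRESH order_items;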


Components and Properties of Hive

Hive uses HDFS for storage and different processing engines to process the data. As it works on HDFS and on processing engines such as Map Reduce, Spark and Tez running on YARN, Hive does not have any daemon processes of its own running on worker nodes.

  • Data storage – HDFS
  • Processing Engine – Map Reduce or Tez or Spark SQL
  • Metastore – Stores structure of tables. Metastore is used by non-Hive Query Engines as well.
    • Cloudera’s Impala
    • Hortonworks' Tez
    • MapR's Drill
    • Spark SQL
    • Presto
    • and more
  • Metastore Server – to connect to the Metastore Database
  • Hive Server – to facilitate external applications connecting to Hive and generating reports leveraging data in Hive tables (see the beeline example after this list).
  • Query engine – jar files that will be deployed on all the nodes added as gateways.
    • Generates Java code at runtime using the underlying distributed processing APIs
    • Compiles and builds a jar at runtime based on the underlying distributed engine
    • Submits one or more jobs to process the data using the underlying distributed engine
    • For now, we will be focusing on Map Reduce, but we can run with any of the frameworks mentioned earlier.
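For instance, external applications and users typically connect to HiveServer2 over JDBC; beeline is the client shipped with Hive for this purpose (the hostname is from our topology and 10000 is the default HiveServer2 port, so adjust both if your configuration differs):

beeline -u jdbc:hive2://bigdataserver-4:10000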

Configuration Files and Important Properties

Now let us look into important configuration files related to Hive and some important properties.

  • As Hive uses HDFS for storage and can work with different frameworks to process the data, Hive inherits behavior from core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, etc.
  • Also, Hive itself has a properties file called hive-site.xml.
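Below is a minimal sketch of the kind of properties typically found in hive-site.xml (the values are illustrative for our cluster; Cloudera Manager generates the actual file, so treat these as examples rather than the exact configuration):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://bigdataserver-4:9083</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>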

Troubleshooting Hive Issues

Now let us look into how we can troubleshoot Hive issues.

  • We have server components such as the Hive Metastore and HiveServer2 running on bigdataserver-4. If we need to troubleshoot any issues related to those, we need to log in to bigdataserver-4 and go to /var/log/hive. There will be a different log file for each of the services.
  • As with many of the other services, Hive Server logs are controlled by properties defined in log4j.properties.
  • Whenever we run a Hive query or command, information related to the query will be logged into a file under a subdirectory of /tmp. This subdirectory is named after the OS user who submitted the job, and the filename is hive.log.
  • We can change the location as well as the name of the log file using log4j.properties.
  • When we run Hive queries, Hive uses the underlying processing engine to run most of them (especially SELECT queries).
  • Many times the log file might only contain a very high-level exception and might not provide us with the actual details of the issue.
  • Based on the underlying processing engine, we need to get into the job logs to get details about the actual issue. For example, if the data is processed using Map Reduce, we need to check the job logs using the Job History Server (see the snippet after this list).
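For Map Reduce (or any YARN) jobs, the detailed job logs can also be pulled from the command line (the application id below is a placeholder you would copy from the Resource Manager UI or the Job History Server):

# Fetch aggregated logs for a completed job; replace the id with your own
yarn logs -applicationId application_1234567890123_0001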

Hive Commands and Queries

Hive Commands and Queries are very important for developers. However, it is important for Big Data Administrators to have some knowledge about these.

  • The Hive CLI can be launched using hive. We can also run individual Hive commands without getting into the Hive CLI using hive -e.
  • We can also use hive -f to run a series of commands from a script file.
  • You can check as well as update properties using the set command.
  • We can override properties in .hiverc to change the behavior of all Hive sessions initiated by the respective OS user (examples follow this list).
  • Hive supports DDL, DML as well as SQL – to create tables, to load data into tables as well as process data.
  • Hive uses HDFS for underlying storage which is good for larger files. Due to this, Hive has restrictions with respect to traditional DML.
  • There are no transaction management commands such as commit, rollback etc.
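A few illustrative invocations (the statements reuse the validation tables from earlier; the script file name is hypothetical):

# Run a single statement without entering the Hive CLI
hive -e "SELECT * FROM retail.orders LIMIT 10;"

# Run a series of statements from a script file
hive -f /home/itversity/scripts/sample_queries.hql

# Inside the Hive CLI: check a property, then override it for the current session
SET hive.execution.engine;
SET hive.execution.engine=mr;

# Statements placed in ~/.hiverc (one per line) are applied to every session of this OS user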

Different Query Engines

As emphasized earlier, queries against Hive databases can be run using different query engines.

  • Map Reduce which is the default
  • Tez which comes as part of Hortonworks distribution.
  • Impala which comes as part of Cloudera distribution.
  • Drill which comes as part of MapR distribution.
  • Presto, which was developed at Facebook
  • Spark SQL which comes as part of Spark.

Let us see a brief demo of running queries using some of the distributed processing engines – Map Reduce, Tez, Impala, Spark SQL, etc. (a quick way to switch the Hive execution engine is shown below).
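Within Hive, the engine used for a session can be switched with the hive.execution.engine property (a sketch; which values are actually available depends on the distribution – for example tez on Hortonworks and spark where Hive on Spark is configured – while Impala is a separate service with its own shell):

-- Inside the Hive CLI or beeline
SET hive.execution.engine=mr;     -- default: Map Reduce
SET hive.execution.engine=tez;    -- if Tez is installed
SET hive.execution.engine=spark;  -- if Hive on Spark is configured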

Components and Properties of Impala

Now let us understand more about the components of Impala.

  • Impala StateStore
  • Impala Catalog Server
  • Impala Daemon
    • Query Planner
    • Query Coordinator
    • Query Executor

Running Queries using Impala – Overview

Impala is a SQL engine, but instead of using Map Reduce for processing, it handles the processing in its own way.

  • Most of the Hive commands work in Impala
    • Hive CLI Commands such as show tables
    • DDL, DML, and Queries developed in Hive will work in Impala
    • It understands partitions and takes care of partition pruning
    • There might be a few issues with respect to some of the file formats and compression algorithms.
  • It does not use Map Reduce or YARN for processing. Instead, it uses the Impala Daemons (impalad) for processing.
  • We can launch impala-shell to run queries and also connect reporting tools such as Tableau to run queries.
  • Let us see a quick demo of how to use impala-shell and run queries (a few typical invocations are shown below).
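Some typical impala-shell invocations (the hostname is from our topology; -i, -q and -f are standard impala-shell options, and the script file name is hypothetical):

# Open an interactive shell against one of the Impala Daemons
impala-shell -i bigdataserver-5

# Run a single query non-interactively
impala-shell -i bigdataserver-5 -q "SELECT count(1) FROM order_items;"

# Run a file of statements
impala-shell -i bigdataserver-5 -f /home/itversity/scripts/validate.sql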