The Art of the Hive Query, Part 1

Hive is a data warehousing framework that lets you analyze data stored in Hadoop using SQL-like queries, which it executes as MapReduce jobs. Hive's query language is referred to as "HiveQL" and is not fully SQL-92 compliant: among other differences, it has only basic support for indexes, but it adds features such as multi-table inserts. Nevertheless, anyone familiar with the various ways of running SQL queries should have no problem adjusting to Apache Hive.
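
To give a flavor of the multi-table insert, here's a sketch of the syntax; the table names (staged_data, data_2010, data_2011) and the yr column are hypothetical, used only for illustration. One pass over the source table populates several destination tables at once:

hive> FROM staged_data
    > INSERT OVERWRITE TABLE data_2010 SELECT * WHERE yr = 2010
    > INSERT OVERWRITE TABLE data_2011 SELECT * WHERE yr = 2011;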

There are a number of ways to analyze and query big data with Hive - it can even be used for cleaning data!

Although you could install Hadoop and Hive locally if you have Linux on your system, you'll more realistically want to use a virtual machine (VM) image from Cloudera or Hortonworks.

Creating a Database

Creating a database in Hive is straightforward, although a "database" in this context is really just a catalog or namespace of tables. If you don't create a database, any tables you create will go into the "default" database. Simply use the CREATE DATABASE command, followed by the name of the database and then a semicolon (a semicolon will be used to end a statement in almost all cases), like this:

hive> CREATE DATABASE datasets;

An error will occur if a database with this name already exists; if you wish to avoid such error messages, you can use the following:

hive> CREATE DATABASE IF NOT EXISTS datasets;
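
Optionally, you can attach a description when creating the database, and then switch to it with "use" so that subsequent tables land there instead of in "default" (the comment text below is just an example):

hive> CREATE DATABASE IF NOT EXISTS datasets
    > COMMENT 'Tables for the population estimates tutorial';
hive> USE datasets;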

To see the databases available in Hive, use the following command:

hive> SHOW DATABASES;
OK
default
datasets

In this tutorial, we're going to be using the 2011 Population Estimates dataset from Socrata to create a table and run queries on it.

Creating a Table

Tables are more complex to create. In this case, we need to specify not only the CREATE TABLE command and the table name, but also the column names and the data types of their contents. In simple terms, the data type refers to the kind of information a column holds, such as numbers, strings, timestamps, etc. Data types are a large topic that won't be covered in depth here; more information can be found in the Hive documentation.
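
As a quick illustration, here's a hypothetical table (the name and columns are made up for this example) that mixes several of Hive's common primitive types:

hive> CREATE TABLE IF NOT EXISTS type_demo (
    > product string,
    > quantity int,
    > price decimal(10,2),
    > in_stock boolean,
    > last_updated timestamp);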

Since we're creating a table that will hold the 2011 Population Estimates data, we'll need to match the dataset's column names as well as choose suitable data types.

Here's how it will look:

hive> CREATE TABLE IF NOT EXISTS 2011_pop_est (
    > region string,
    > 2010_census int,
    > 2011_estimate int,
    > `change` int,
    > `percent` float)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.079 seconds

There are a few places where some liberties needed to be taken. The dataset's original column headers aren't legal Hive identifiers, so they've been adapted into lowercase names with underscores, and the "change" and "percent" columns are wrapped in backticks because both are HiveQL keywords (PERCENT is reserved in recent Hive versions). "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','" tells Hive that the fields within each row are separated by commas. The "tblproperties" line tells Hive to skip the first line of the input file, which contains the column headers rather than data.
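
To make those two settings concrete, the first lines of a matching comma-delimited input file would look something like this (the header text and values below are invented for illustration):

Region,2010 Census,2011 Estimate,Change,Percent
Example Region,1000000,1010000,10000,1.0

With "skip.header.line.count" set to "1", Hive ignores that first header line and reads only the data rows.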

To see details on the schema of the table, you can use "describe":

hive> describe 2011_pop_est;

To get even more detailed information, use "describe extended":

hive> describe extended 2011_pop_est;
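
There's also "describe formatted", which lays out much of the same metadata in a more readable, tabular form:

hive> describe formatted 2011_pop_est;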

Now that the table has been created, in the next post we'll import the data and start running queries against it.
