The Art of the Hive Query, Part 2
The previous post covered how to create a database and tables, along with an overview of the data types one sees in Hive. In this post, we're going to import the 2011 Population Estimates data into Hive and then run some queries against it.
Hive can import data from a variety of sources, but arguably the two most common are HDFS and the local filesystem. To transfer a file from the local filesystem to HDFS, use the "put" command, like so:
> hdfs dfs -put /path/to/file/2011_Population_Estimates.csv /user/data
To load the data into the Hive table, we use the "LOAD DATA INPATH" statement:
hive> LOAD DATA INPATH '/user/data/2011_Population_Estimates.csv'
> OVERWRITE INTO TABLE 2011_pop_est;
Loading data to table datasets.2011_pop_est
Table datasets.2011_pop_est stats: [numFiles=1, numRows=0, totalSize=2259, rawDataSize=0]
OK
Time taken: 0.209 seconds
If we were loading the file directly from the local filesystem, we would instead use 'LOAD DATA LOCAL INPATH.'
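For example, a local load would look something like this (the path here is illustrative). Note one difference in behavior: LOAD DATA LOCAL INPATH copies the file into the table's location, whereas the HDFS variant moves the source file:

hive> LOAD DATA LOCAL INPATH '/path/to/file/2011_Population_Estimates.csv'
    > OVERWRITE INTO TABLE 2011_pop_est;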
The 'SELECT' and 'FROM' Clauses
The 'SELECT' and 'FROM' clauses form the basis of most Hive queries. 'SELECT' retrieves data from specific columns, whereas 'FROM' identifies the table, view, or nested query. Take this example:
hive> select * from 2011_pop_est limit 10;
OK
United States 308745538 311591917 2846379 0.92
Northeast 55317240 55521598 204358 0.37
Midwest 66927001 67158835 231834 0.35
South 114555744 116046736 1490992 1.3
West 71945553 72864748 919195 1.28
.District of Columbia 601723 617996 16273 2.7
.Texas 25145561 25674681 529120 2.1
.Utah 2763885 2817222 53337 1.93
.Alaska 710231 722718 12487 1.76
.Colorado 5029196 5116796 87600 1.74
Time taken: 0.054 seconds, Fetched: 10 row(s)
This query selects data from all columns in the 2011_pop_est table. The "limit 10" qualifier tells Hive to return only the first 10 results, as returning the entirety of a large dataset is often undesirable. You can also select specific columns:
hive> SELECT region, percent from 2011_pop_est limit 10;
OK
United States 0.92
Northeast 0.37
Midwest 0.35
South 1.3
West 1.28
.District of Columbia 2.7
.Texas 2.1
.Utah 1.93
.Alaska 1.76
.Colorado 1.74
Time taken: 0.051 seconds, Fetched: 10 row(s)
Using SELECT statements, you can even perform calculations on the specified columns:
hive> SELECT upper(region), 2010_census, 2011_estimate,
> round(2011_estimate - 2010_census) FROM 2011_pop_est limit 10;
OK
UNITED STATES 308745538 311591917 2846379
NORTHEAST 55317240 55521598 204358
MIDWEST 66927001 67158835 231834
SOUTH 114555744 116046736 1490992
WEST 71945553 72864748 919195
.DISTRICT OF COLUMBIA 601723 617996 16273
.TEXAS 25145561 25674681 529120
.UTAH 2763885 2817222 53337
.ALASKA 710231 722718 12487
.COLORADO 5029196 5116796 87600
Time taken: 0.062 seconds, Fetched: 10 row(s)
The WHERE Clause
The 'WHERE' clause acts as a filter, returning only the records that satisfy a condition the user sets:
hive> SELECT * from 2011_pop_est WHERE percent > 2.0;
OK
.District of Columbia 601723 617996 16273 2.7
.Texas 25145561 25674681 529120 2.1
Time taken: 0.097 seconds, Fetched: 2 row(s)
You can also select specific rows, although with this example there aren't many matching records:
hive> select * from 2011_pop_est WHERE region = 'Midwest';
OK
Midwest 66927001 67158835 231834 0.35
Time taken: 0.097 seconds, Fetched: 1 row(s)
It's also possible to search within a string instead:
hive> select * from 2011_pop_est WHERE region LIKE ('%M%');
OK
Midwest 66927001 67158835 231834 0.35
.New Mexico 2059179 2082224 23045 1.12
.Maryland 5773552 5828289 54737 0.95
.Montana 989415 998199 8784 0.89
.Minnesota 5303925 5344861 40936 0.77
.Massachusetts 6547629 6587536 39907 0.61
.Mississippi 2967297 2978512 11215 0.38
.Missouri 5988927 6010688 21761 0.36
.Maine 1328361 1328188 -173 -0.01
.Michigan 9883640 9876187 -7453 -0.08
Time taken: 0.069 seconds, Fetched: 10 row(s)
A range of values can also be specified:
hive> SELECT * FROM 2011_pop_est WHERE change BETWEEN 10000 AND 20000;
OK
.District of Columbia 601723 617996 16273 2.7
.Alaska 710231 722718 12487 1.76
.North Dakota 672591 683932 11341 1.69
.Idaho 1567582 1584985 17403 1.11
.Hawaii 1360301 1374810 14509 1.07
.Nebraska 1826341 1842641 16300 0.89
.Kansas 2853118 2871238 18120 0.64
.Iowa 3046355 3062309 15954 0.52
.Mississippi 2967297 2978512 11215 0.38
Time taken: 0.073 seconds, Fetched: 9 row(s)
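Conditions can also be combined with AND and OR. For instance, this illustrative query (output omitted) would return regions that grew by more than one percent but added fewer than 100,000 people:

hive> SELECT * FROM 2011_pop_est WHERE percent > 1.0 AND change < 100000;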
ORDER BY and GROUP BY
The 'ORDER BY' clause performs a total ordering of the results on the specified columns, either ascending or descending. Hive also offers 'SORT BY', which sorts the data within each reducer rather than globally; the example below uses 'SORT BY':
hive> SELECT region, 2010_census, 2010_census, percent FROM 2011_pop_est
> SORT BY percent ASC, 2010_census DESC LIMIT 20;
OK
delta 29837 29837 NULL
delta 29837 29837 NULL
.Rhode Island 1052567 1052567 -0.12
.Michigan 9883640 9883640 -0.08
.Maine 1328361 1328361 -0.01
.Ohio 11536504 11536504 0.07
.Vermont 625741 625741 0.11
.West Virginia 1852994 1852994 0.13
.New Hampshire 1316470 1316470 0.13
.Connecticut 3574097 3574097 0.18
.Illinois 12830632 12830632 0.3
.Pennsylvania 12702379 12702379 0.32
.New Jersey 8791894 8791894 0.33
Midwest 66927001 66927001 0.35
.Missouri 5988927 5988927 0.36
Northeast 55317240 55317240 0.37
.Mississippi 2967297 2967297 0.38
.Wisconsin 5686986 5686986 0.44
.New York 19378102 19378102 0.45
.Alabama 4779736 4779736 0.48
Time taken: 57.484 seconds, Fetched: 20 row(s)
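For a fully ordered result set, a similar query can be written with 'ORDER BY'. Bear in mind that ORDER BY funnels all the data through a single reducer, so it can be slow on large tables:

hive> SELECT region, 2010_census, percent FROM 2011_pop_est
    > ORDER BY percent ASC LIMIT 20;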
Although 'GROUP BY' sounds like it might be similar in purpose to 'ORDER BY', it actually groups rows by one or more columns so that aggregate functions can be applied to each group. Examples of aggregate functions include COUNT, MAX, MIN, SUM, and AVG:
hive> SELECT AVG(2011_estimate), region
> from 2011_pop_est
> GROUP BY region LIMIT 15;
OK
4802740.0 .Alabama
722718.0 .Alaska
6482505.0 .Arizona
2937979.0 .Arkansas
3.7691912E7 .California
5116796.0 .Colorado
3580709.0 .Connecticut
907135.0 .Delaware
617996.0 .District of Columbia
1.9057542E7 .Florida
9815210.0 .Georgia
1374810.0 .Hawaii
1584985.0 .Idaho
1.2869257E7 .Illinois
6516922.0 .Indiana
Time taken: 29.915 seconds, Fetched: 15 row(s)
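Because each region appears only once in this table, AVG simply echoes each estimate; aggregates become more interesting when a group contains multiple rows. Using the orders table from the JOIN section below, this illustrative query would count orders per day:

hive> SELECT order_date, COUNT(*) FROM orders GROUP BY order_date;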
JOINs
JOIN statements, as the name implies, allow you to join two tables together based on a common column. Save the following data to a CSV file:
OrderID,CustomerID,OrderDate
24561,18,2013-01-01
24562,22,2013-01-02
24563,95,2013-01-02
24564,28,2013-01-03
24565,11,2013-01-05
24566,14,2013-01-05
24567,56,2013-01-05
24568,72,2013-01-05
24569,44,2013-01-06
Import it into a Hive table called "orders." The "OrderDate" column is of the data type "date."
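If you need a starting point, a table definition along these lines should work. This is a minimal sketch; the column names are assumptions chosen to match the join queries below, and the 'skip.header.line.count' property (available in newer Hive versions) skips the CSV header row:

hive> CREATE TABLE orders (order_id INT, customer_id INT, order_date DATE)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > TBLPROPERTIES ('skip.header.line.count'='1');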
Then do the same with the following and call the table "customers":
CustomerID,CustomerName,Country,Registered
18,Victor Carpenter,Germany,true
22,Mark Williams,USA,false
95,Rebecca Rhodes,Canada,true
28,Marion Sealy,Spain,true
11,Patrick Lindsay,England,false
14,Roger Marini,Germany,false
56,Thelma Barkley,Russia,true
72,Joan Roland,USA,false
44,James Roberts,Brazil,true
The "Registered" field is of type "boolean."
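A matching sketch for the customers table, under the same assumptions as the orders definition above:

hive> CREATE TABLE customers (customer_id INT, customer_name STRING,
    >                         country STRING, registered BOOLEAN)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > TBLPROPERTIES ('skip.header.line.count'='1');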
We're going to look at an "inner join," which returns only the rows that have matching values in both tables. You need to tell Hive which columns to select, as well as which column forms the basis of the join:
hive> SELECT a.order_id, a.order_date, b.customer_name
> FROM orders a JOIN customers b
> ON a.customer_id = b.customer_id;
OK
24565 2013-01-05 Patrick Lindsay
24566 2013-01-05 Roger Marini
24561 2013-01-01 Victor Carpenter
24562 2013-01-02 Mark Williams
24564 2013-01-03 Marion Sealy
24569 2013-01-06 James Roberts
24567 2013-01-05 Thelma Barkley
24568 2013-01-05 Joan Roland
24563 2013-01-02 Rebecca Rhodes
Time taken: 36.202 seconds, Fetched: 9 row(s)
Beyond this, there are also Left Outer Joins, Right Outer Joins, and Full Outer Joins. A Left Outer Join returns every row from the left table; where a left row has no match in the right table, the right table's columns are returned as NULL. The syntax is similar:
hive> SELECT a.order_id, a.order_date, b.customer_name
> FROM orders a LEFT OUTER JOIN customers b
> ON a.customer_id = b.customer_id;
A Right Outer Join follows the same principle in reverse: every row from the right table is returned, and the left table's columns are NULL wherever a right row has no match. The syntax is as follows:
hive> SELECT a.order_id, a.order_date, b.customer_name
> FROM orders a RIGHT OUTER JOIN customers b
> ON a.customer_id = b.customer_id;
Finally, a Full Outer Join returns all rows from both tables; wherever the JOIN condition finds no match on one side, that side's columns are returned as NULL. The syntax is the same, except using "FULL OUTER JOIN":
hive> SELECT a.order_id, a.order_date, b.customer_name
> FROM orders a FULL OUTER JOIN customers b
> ON a.customer_id = b.customer_id;
This concludes our posts on Hive queries; there is obviously much more to the subject, but this should serve as a basic introduction. For more detailed information about Hive queries and administration, I recommend the O'Reilly book "Programming Hive."