Setting Up A Data Transfer with Sqoop

According to its page on the Apache website, Sqoop is "a tool designed for efficiently transferring data between Apache Hadoop and structured datastores such as relational databases." This process is largely automated, due to the way that the database schema informs Sqoop about how the data is to be imported, as per the user guide. In this post, we're going to import data from a MySQL database and then transfer it to Hive.

Setting Up the MySQL Data


The first step will be to set up the MySQL databases and tables that will be used for the transfer. First we'll create a database called "sqoop_test":

mysql> CREATE DATABASE sqoop_test;
mysql> USE sqoop_test;


Next, the table "stocks" will be created, preparing it for a text file with stock data to be populated:

mysql> CREATE TABLE stocks (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
symbol VARCHAR(100),
quote_date VARCHAR(100),
open_price DOUBLE PRECISION,
high_price DOUBLE PRECISION,
low_price DOUBLE PRECISION,
close_price DOUBLE PRECISION,
volume INTEGER,
adj_close_price DOUBLE PRECISION
);

Next, the following stock information file will be used to populate the table:

mysql> LOAD DATA LOCAL INFILE '/root/TrainingOnHDP/dataset/stocks.txt'
INTO TABLE stocks
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(symbol, @quote_date, open_price, high_price, low_price, close_price, volume, adj_close_price)
SET quote_date = DATE_FORMAT(@quote_date, '%Y-%m-%d');

Now we can see the information present in the table:

mysql> select * from stocks limit 10;
+----+--------+------------+------------+------------+-----------+-------------+----------+-----------------+
| id | symbol | quote_date | open_price | high_price | low_price | close_price | volume   | adj_close_price |
+----+--------+------------+------------+------------+-----------+-------------+----------+-----------------+
|  1 | AAPL   | 2009-01-02 |      85.88 |      91.04 |     85.16 |       90.75 | 26643400 |           90.75 |
|  2 | AAPL   | 2008-01-02 |     199.27 |     200.26 |    192.55 |      194.84 | 38542100 |          194.84 |
|  3 | AAPL   | 2007-01-03 |      86.29 |      86.58 |      81.9 |        83.8 | 44225700 |            83.8 |
|  4 | AAPL   | 2006-01-03 |      72.38 |      74.75 |     72.25 |       74.75 | 28829800 |           74.75 |
|  5 | AAPL   | 2005-01-03 |      64.78 |      65.11 |      62.6 |       63.29 | 24714000 |           31.65 |
|  6 | AAPL   | 2004-01-02 |      21.55 |      21.75 |     21.18 |       21.28 |  5165800 |           10.64 |
|  7 | AAPL   | 2003-01-02 |      14.36 |      14.92 |     14.35 |        14.8 |  6479600 |             7.4 |
|  8 | AAPL   | 2002-01-02 |      22.05 |       23.3 |     21.96 |        23.3 | 18910600 |           11.65 |
|  9 | AAPL   | 2001-01-02 |      14.88 |      15.25 |     14.56 |       14.88 | 16161800 |            7.44 |
| 10 | AAPL   | 2000-01-03 |     104.87 |      112.5 |    101.69 |      111.94 | 19144400 |           27.99 |
+----+--------+------------+------------+------------+-----------+-------------+----------+-----------------+
10 rows in set (0.00 sec)

Using Sqoop


Getting a list of the database tables in Sqoop can be done by using the "list-tables" command:

[root@sandbox-hdp ~]# sqoop list-tables -connect jdbc:mysql://localhost:3306/sqoop_test --username hip_sqoop_user -P;
Warning: /usr/hdp/2.6.4.0-91/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/04/21 14:59:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.6.4.0-91
Enter password:
18/04/21 14:59:40 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
stocks

The following command will start a mapreduce job that will import the "stocks" table into HDFS:

sqoop import --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username etl_talk -P
Verify the contents of the imported content:

$ hdfs dfs -cat stocks/*
1,AAPL,2009-01-02,85.88,91.04,85.16,90.75,26643400,90.75
2,AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84
3,AAPL,2007-01-03,86.29,86.58,81.9,83.8,44225700,83.8
4,AAPL,2006-01-03,72.38,74.75,72.25,74.75,28829800,74.75
5,AAPL,2005-01-03,64.78,65.11,62.6,63.29,24714000,31.65
6,AAPL,2004-01-02,21.55,21.75,21.18,21.28,5165800,10.64
7,AAPL,2003-01-02,14.36,14.92,14.35,14.8,6479600,7.4
8,AAPL,2002-01-02,22.05,23.3,21.96,23.3,18910600,11.65
9,AAPL,2001-01-02,14.88,15.25,14.56,14.88,16161800,7.44
10,AAPL,2000-01-03,104.87,112.5,101.69,111.94,19144400,27.99
11,CSCO,2009-01-02,16.41,17.0,16.25,16.96,40980600,16.96
12,CSCO,2008-01-02,27.0,27.3,26.21,26.54,64338900,26.54
13,CSCO,2007-01-03,27.46,27.98,27.33,27.73,64226000,27.73
14,CSCO,2006-01-03,17.21,17.49,17.18,17.45,55426000,17.45
15,CSCO,2005-01-03,19.42,19.61,19.27,19.32,56725600,19.32
16,CSCO,2004-01-02,24.36,24.53,24.16,24.25,29955800,24.25
17,CSCO,2003-01-02,13.11,13.69,13.09,13.64,61335700,13.64
18,CSCO,2002-01-02,18.44,19.3,18.26,19.23,55376900,19.23
19,CSCO,2001-01-02,38.13,38.5,32.63,33.31,17384600,33.31
20,CSCO,2000-01-03,109.94,110.25,103.56,108.06,53076000,54.03
21,GOOG,2009-01-02,308.6,321.82,305.5,321.32,3610500,321.32
22,GOOG,2008-01-02,692.87,697.37,677.73,685.19,4306900,685.19
23,GOOG,2007-01-03,466.0,476.66,461.11,467.59,7706500,467.59
24,GOOG,2006-01-03,422.52,435.67,418.22,435.23,13121200,435.23
25,GOOG,2005-01-03,197.4,203.64,195.46,202.71,15844200,202.71
26,MSFT,2009-01-02,19.53,20.4,19.37,20.33,50084000,19.86
27,MSFT,2008-01-02,35.79,35.96,35.0,35.22,63004200,33.79
28,MSFT,2007-01-03,29.91,30.25,29.4,29.86,76935100,28.26
29,MSFT,2006-01-03,26.25,27.0,26.1,26.84,79973000,25.04
30,MSFT,2005-01-03,26.8,26.95,26.65,26.74,65002900,24.65
31,MSFT,2004-01-02,27.58,27.77,27.33,27.45,44487700,22.64
32,MSFT,2003-01-02,52.3,53.75,51.71,53.72,67025200,21.95
33,MSFT,2002-01-02,66.65,67.11,65.51,67.04,48124000,27.4
34,MSFT,2001-01-02,44.13,45.0,42.88,43.38,82413200,17.73
35,MSFT,2000-01-03,117.37,118.62,112.0,116.56,53228400,47.64
36,YHOO,2009-01-02,12.17,12.85,12.12,12.85,9514600,12.85
37,YHOO,2008-01-02,23.8,24.15,23.6,23.72,25671700,23.72
38,YHOO,2007-01-03,25.85,26.26,25.26,25.61,26352700,25.61
39,YHOO,2006-01-03,39.69,41.22,38.79,40.91,24227700,40.91
40,YHOO,2005-01-03,38.36,38.9,37.65,38.18,25482800,38.18
41,YHOO,2004-01-02,45.5,45.83,45.12,45.4,16480000,22.7
42,YHOO,2003-01-02,16.59,17.66,16.5,17.6,19640400,8.8
43,YHOO,2002-01-02,18.14,18.69,17.68,18.63,21903600,9.31
44,YHOO,2001-01-02,30.31,30.37,27.5,28.19,21939200,14.1
45,YHOO,2000-01-03,442.92,477.0,429.5,475.0,38469600,118.75

Moving It All To Hive


At this point, we will launch Hive and create a table using the same schema as the MySQL table:

hive> CREATE TABLE hiveSTOCKS (
    > id INT,
    > symbol STRING,
    > quote_date STRING,
    > open_price DOUBLE,
    > high_price DOUBLE,
    > low_price DOUBLE,
    > close_price DOUBLE,
    > volume INT,
    > adj_close_price DOUBLE)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE
    > LOCATION '/user/root/stocks';
OK
Time taken: 4.295 seconds

Verify that the table has been created:

hive> show tables;
OK
hivestocks
mtable_hbase_2
mytable_hbase
mytable_hbase_1
sample_07
sample_08
Time taken: 0.344 seconds, Fetched: 6 row(s)

Finally, we see that the data transferred correctly when we perform some queries:

hive> SELECT * FROM hivestocks limit 10;
OK
1       AAPL    2009-01-02      85.88   91.04   85.16   90.75   26643400        90.75
2       AAPL    2008-01-02      199.27  200.26  192.55  194.84  38542100        194.84
3       AAPL    2007-01-03      86.29   86.58   81.9    83.8    44225700        83.8
4       AAPL    2006-01-03      72.38   74.75   72.25   74.75   28829800        74.75
5       AAPL    2005-01-03      64.78   65.11   62.6    63.29   24714000        31.65
6       AAPL    2004-01-02      21.55   21.75   21.18   21.28   5165800 10.64
7       AAPL    2003-01-02      14.36   14.92   14.35   14.8    6479600 7.4
8       AAPL    2002-01-02      22.05   23.3    21.96   23.3    18910600        11.65
9       AAPL    2001-01-02      14.88   15.25   14.56   14.88   16161800        7.44
10      AAPL    2000-01-03      104.87  112.5   101.69  111.94  19144400        27.99

hive> SELECT * FROM hiveSTOCKS WHERE quote_date < '2003-01-02';
OK
8       AAPL    2002-01-02      22.05   23.3    21.96   23.3    18910600        11.65
9       AAPL    2001-01-02      14.88   15.25   14.56   14.88   16161800        7.44
10      AAPL    2000-01-03      104.87  112.5   101.69  111.94  19144400        27.99
18      CSCO    2002-01-02      18.44   19.3    18.26   19.23   55376900        19.23
19      CSCO    2001-01-02      38.13   38.5    32.63   33.31   17384600        33.31
20      CSCO    2000-01-03      109.94  110.25  103.56  108.06  53076000        54.03
33      MSFT    2002-01-02      66.65   67.11   65.51   67.04   48124000        27.4
34      MSFT    2001-01-02      44.13   45.0    42.88   43.38   82413200        17.73
35      MSFT    2000-01-03      117.37  118.62  112.0   116.56  53228400        47.64
43      YHOO    2002-01-02      18.14   18.69   17.68   18.63   21903600        9.31
44      YHOO    2001-01-02      30.31   30.37   27.5    28.19   21939200        14.1
45      YHOO    2000-01-03      442.92  477.0   429.5   475.0   38469600        118.75

hive> SELECT symbol, quote_date, volume, close_price FROM hivestocks WHERE close_price BETWEEN 30.0 and 90.0 ORDER BY symbol;
OK
AAPL    2007-01-03      44225700        83.8
AAPL    2006-01-03      28829800        74.75
AAPL    2005-01-03      24714000        63.29
CSCO    2001-01-02      17384600        33.31
MSFT    2008-01-02      63004200        35.22
MSFT    2003-01-02      67025200        53.72
MSFT    2002-01-02      48124000        67.04
MSFT    2001-01-02      82413200        43.38
YHOO    2006-01-03      24227700        40.91
YHOO    2005-01-03      25482800        38.18
YHOO    2004-01-02      16480000        45.4
Time taken: 14.893 seconds, Fetched: 11 row(s)

Comments

Popular posts from this blog

The Basics of IICS

Imperial to Metric Conversion (and vice-versa) Script

Real Estate Data Pipeline, Part 1