Setting Up A Data Transfer with Sqoop
According to its page on the Apache website, Sqoop is "a tool designed for efficiently transferring data between Apache Hadoop and structured datastores such as relational databases." This process is largely automated, due to the way that the database schema informs Sqoop about how the data is to be imported, as per the user guide. In this post, we're going to import data from a MySQL database and then transfer it to Hive.
The first step will be to set up the MySQL databases and tables that will be used for the transfer. First we'll create a database called "sqoop_test":
mysql> CREATE DATABASE sqoop_test;
mysql> USE sqoop_test;
Next, the table "stocks" will be created, preparing it for a text file with stock data to be populated:
mysql> CREATE TABLE stocks (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
symbol VARCHAR(100),
quote_date VARCHAR(100),
open_price DOUBLE PRECISION,
high_price DOUBLE PRECISION,
low_price DOUBLE PRECISION,
close_price DOUBLE PRECISION,
volume INTEGER,
adj_close_price DOUBLE PRECISION
);
Next, the following stock information file will be used to populate the table:
mysql> LOAD DATA LOCAL INFILE '/root/TrainingOnHDP/dataset/stocks.txt'
INTO TABLE stocks
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(symbol, @quote_date, open_price, high_price, low_price, close_price, volume, adj_close_price)
SET quote_date = DATE_FORMAT(@quote_date, '%Y-%m-%d');
Now we can see the information present in the table:
Getting a list of the database tables in Sqoop can be done by using the "list-tables" command:
The following command will start a mapreduce job that will import the "stocks" table into HDFS:
sqoop import --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username etl_talk -P
Verify the contents of the imported content:
$ hdfs dfs -cat stocks/*
1,AAPL,2009-01-02,85.88,91.04,85.16,90.75,26643400,90.75
2,AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84
3,AAPL,2007-01-03,86.29,86.58,81.9,83.8,44225700,83.8
4,AAPL,2006-01-03,72.38,74.75,72.25,74.75,28829800,74.75
5,AAPL,2005-01-03,64.78,65.11,62.6,63.29,24714000,31.65
6,AAPL,2004-01-02,21.55,21.75,21.18,21.28,5165800,10.64
7,AAPL,2003-01-02,14.36,14.92,14.35,14.8,6479600,7.4
8,AAPL,2002-01-02,22.05,23.3,21.96,23.3,18910600,11.65
9,AAPL,2001-01-02,14.88,15.25,14.56,14.88,16161800,7.44
10,AAPL,2000-01-03,104.87,112.5,101.69,111.94,19144400,27.99
11,CSCO,2009-01-02,16.41,17.0,16.25,16.96,40980600,16.96
12,CSCO,2008-01-02,27.0,27.3,26.21,26.54,64338900,26.54
13,CSCO,2007-01-03,27.46,27.98,27.33,27.73,64226000,27.73
14,CSCO,2006-01-03,17.21,17.49,17.18,17.45,55426000,17.45
15,CSCO,2005-01-03,19.42,19.61,19.27,19.32,56725600,19.32
16,CSCO,2004-01-02,24.36,24.53,24.16,24.25,29955800,24.25
17,CSCO,2003-01-02,13.11,13.69,13.09,13.64,61335700,13.64
18,CSCO,2002-01-02,18.44,19.3,18.26,19.23,55376900,19.23
19,CSCO,2001-01-02,38.13,38.5,32.63,33.31,17384600,33.31
20,CSCO,2000-01-03,109.94,110.25,103.56,108.06,53076000,54.03
21,GOOG,2009-01-02,308.6,321.82,305.5,321.32,3610500,321.32
22,GOOG,2008-01-02,692.87,697.37,677.73,685.19,4306900,685.19
23,GOOG,2007-01-03,466.0,476.66,461.11,467.59,7706500,467.59
24,GOOG,2006-01-03,422.52,435.67,418.22,435.23,13121200,435.23
25,GOOG,2005-01-03,197.4,203.64,195.46,202.71,15844200,202.71
26,MSFT,2009-01-02,19.53,20.4,19.37,20.33,50084000,19.86
27,MSFT,2008-01-02,35.79,35.96,35.0,35.22,63004200,33.79
28,MSFT,2007-01-03,29.91,30.25,29.4,29.86,76935100,28.26
29,MSFT,2006-01-03,26.25,27.0,26.1,26.84,79973000,25.04
30,MSFT,2005-01-03,26.8,26.95,26.65,26.74,65002900,24.65
31,MSFT,2004-01-02,27.58,27.77,27.33,27.45,44487700,22.64
32,MSFT,2003-01-02,52.3,53.75,51.71,53.72,67025200,21.95
33,MSFT,2002-01-02,66.65,67.11,65.51,67.04,48124000,27.4
34,MSFT,2001-01-02,44.13,45.0,42.88,43.38,82413200,17.73
35,MSFT,2000-01-03,117.37,118.62,112.0,116.56,53228400,47.64
36,YHOO,2009-01-02,12.17,12.85,12.12,12.85,9514600,12.85
37,YHOO,2008-01-02,23.8,24.15,23.6,23.72,25671700,23.72
38,YHOO,2007-01-03,25.85,26.26,25.26,25.61,26352700,25.61
39,YHOO,2006-01-03,39.69,41.22,38.79,40.91,24227700,40.91
40,YHOO,2005-01-03,38.36,38.9,37.65,38.18,25482800,38.18
41,YHOO,2004-01-02,45.5,45.83,45.12,45.4,16480000,22.7
42,YHOO,2003-01-02,16.59,17.66,16.5,17.6,19640400,8.8
43,YHOO,2002-01-02,18.14,18.69,17.68,18.63,21903600,9.31
44,YHOO,2001-01-02,30.31,30.37,27.5,28.19,21939200,14.1
45,YHOO,2000-01-03,442.92,477.0,429.5,475.0,38469600,118.75
At this point, we will launch Hive and create a table using the same schema as the MySQL table:
Verify that the table has been created:
Finally, we see that the data transferred correctly when we perform some queries:
Setting Up the MySQL Data
The first step will be to set up the MySQL databases and tables that will be used for the transfer. First we'll create a database called "sqoop_test":
mysql> CREATE DATABASE sqoop_test;
mysql> USE sqoop_test;
Next, the table "stocks" will be created, preparing it for a text file with stock data to be populated:
mysql> CREATE TABLE stocks (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
symbol VARCHAR(100),
quote_date VARCHAR(100),
open_price DOUBLE PRECISION,
high_price DOUBLE PRECISION,
low_price DOUBLE PRECISION,
close_price DOUBLE PRECISION,
volume INTEGER,
adj_close_price DOUBLE PRECISION
);
Next, the following stock information file will be used to populate the table:
mysql> LOAD DATA LOCAL INFILE '/root/TrainingOnHDP/dataset/stocks.txt'
INTO TABLE stocks
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(symbol, @quote_date, open_price, high_price, low_price, close_price, volume, adj_close_price)
SET quote_date = DATE_FORMAT(@quote_date, '%Y-%m-%d');
Now we can see the information present in the table:
mysql> select * from stocks limit 10;
+----+--------+------------+------------+------------+-----------+-------------+----------+-----------------+
| id | symbol | quote_date | open_price | high_price | low_price | close_price | volume | adj_close_price |
+----+--------+------------+------------+------------+-----------+-------------+----------+-----------------+
| 1 | AAPL | 2009-01-02 | 85.88 | 91.04 | 85.16 | 90.75 | 26643400 | 90.75 |
| 2 | AAPL | 2008-01-02 | 199.27 | 200.26 | 192.55 | 194.84 | 38542100 | 194.84 |
| 3 | AAPL | 2007-01-03 | 86.29 | 86.58 | 81.9 | 83.8 | 44225700 | 83.8 |
| 4 | AAPL | 2006-01-03 | 72.38 | 74.75 | 72.25 | 74.75 | 28829800 | 74.75 |
| 5 | AAPL | 2005-01-03 | 64.78 | 65.11 | 62.6 | 63.29 | 24714000 | 31.65 |
| 6 | AAPL | 2004-01-02 | 21.55 | 21.75 | 21.18 | 21.28 | 5165800 | 10.64 |
| 7 | AAPL | 2003-01-02 | 14.36 | 14.92 | 14.35 | 14.8 | 6479600 | 7.4 |
| 8 | AAPL | 2002-01-02 | 22.05 | 23.3 | 21.96 | 23.3 | 18910600 | 11.65 |
| 9 | AAPL | 2001-01-02 | 14.88 | 15.25 | 14.56 | 14.88 | 16161800 | 7.44 |
| 10 | AAPL | 2000-01-03 | 104.87 | 112.5 | 101.69 | 111.94 | 19144400 | 27.99 |
+----+--------+------------+------------+------------+-----------+-------------+----------+-----------------+
10 rows in set (0.00 sec)
Using Sqoop
Getting a list of the database tables in Sqoop can be done by using the "list-tables" command:
[root@sandbox-hdp ~]# sqoop list-tables -connect jdbc:mysql://localhost:3306/sqoop_test --username hip_sqoop_user -P;
Warning: /usr/hdp/2.6.4.0-91/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/04/21 14:59:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.6.4.0-91
Enter password:
18/04/21 14:59:40 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
stocks
sqoop import --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username etl_talk -P
Verify the contents of the imported content:
$ hdfs dfs -cat stocks/*
1,AAPL,2009-01-02,85.88,91.04,85.16,90.75,26643400,90.75
2,AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84
3,AAPL,2007-01-03,86.29,86.58,81.9,83.8,44225700,83.8
4,AAPL,2006-01-03,72.38,74.75,72.25,74.75,28829800,74.75
5,AAPL,2005-01-03,64.78,65.11,62.6,63.29,24714000,31.65
6,AAPL,2004-01-02,21.55,21.75,21.18,21.28,5165800,10.64
7,AAPL,2003-01-02,14.36,14.92,14.35,14.8,6479600,7.4
8,AAPL,2002-01-02,22.05,23.3,21.96,23.3,18910600,11.65
9,AAPL,2001-01-02,14.88,15.25,14.56,14.88,16161800,7.44
10,AAPL,2000-01-03,104.87,112.5,101.69,111.94,19144400,27.99
11,CSCO,2009-01-02,16.41,17.0,16.25,16.96,40980600,16.96
12,CSCO,2008-01-02,27.0,27.3,26.21,26.54,64338900,26.54
13,CSCO,2007-01-03,27.46,27.98,27.33,27.73,64226000,27.73
14,CSCO,2006-01-03,17.21,17.49,17.18,17.45,55426000,17.45
15,CSCO,2005-01-03,19.42,19.61,19.27,19.32,56725600,19.32
16,CSCO,2004-01-02,24.36,24.53,24.16,24.25,29955800,24.25
17,CSCO,2003-01-02,13.11,13.69,13.09,13.64,61335700,13.64
18,CSCO,2002-01-02,18.44,19.3,18.26,19.23,55376900,19.23
19,CSCO,2001-01-02,38.13,38.5,32.63,33.31,17384600,33.31
20,CSCO,2000-01-03,109.94,110.25,103.56,108.06,53076000,54.03
21,GOOG,2009-01-02,308.6,321.82,305.5,321.32,3610500,321.32
22,GOOG,2008-01-02,692.87,697.37,677.73,685.19,4306900,685.19
23,GOOG,2007-01-03,466.0,476.66,461.11,467.59,7706500,467.59
24,GOOG,2006-01-03,422.52,435.67,418.22,435.23,13121200,435.23
25,GOOG,2005-01-03,197.4,203.64,195.46,202.71,15844200,202.71
26,MSFT,2009-01-02,19.53,20.4,19.37,20.33,50084000,19.86
27,MSFT,2008-01-02,35.79,35.96,35.0,35.22,63004200,33.79
28,MSFT,2007-01-03,29.91,30.25,29.4,29.86,76935100,28.26
29,MSFT,2006-01-03,26.25,27.0,26.1,26.84,79973000,25.04
30,MSFT,2005-01-03,26.8,26.95,26.65,26.74,65002900,24.65
31,MSFT,2004-01-02,27.58,27.77,27.33,27.45,44487700,22.64
32,MSFT,2003-01-02,52.3,53.75,51.71,53.72,67025200,21.95
33,MSFT,2002-01-02,66.65,67.11,65.51,67.04,48124000,27.4
34,MSFT,2001-01-02,44.13,45.0,42.88,43.38,82413200,17.73
35,MSFT,2000-01-03,117.37,118.62,112.0,116.56,53228400,47.64
36,YHOO,2009-01-02,12.17,12.85,12.12,12.85,9514600,12.85
37,YHOO,2008-01-02,23.8,24.15,23.6,23.72,25671700,23.72
38,YHOO,2007-01-03,25.85,26.26,25.26,25.61,26352700,25.61
39,YHOO,2006-01-03,39.69,41.22,38.79,40.91,24227700,40.91
40,YHOO,2005-01-03,38.36,38.9,37.65,38.18,25482800,38.18
41,YHOO,2004-01-02,45.5,45.83,45.12,45.4,16480000,22.7
42,YHOO,2003-01-02,16.59,17.66,16.5,17.6,19640400,8.8
43,YHOO,2002-01-02,18.14,18.69,17.68,18.63,21903600,9.31
44,YHOO,2001-01-02,30.31,30.37,27.5,28.19,21939200,14.1
45,YHOO,2000-01-03,442.92,477.0,429.5,475.0,38469600,118.75
Moving It All To Hive
At this point, we will launch Hive and create a table using the same schema as the MySQL table:
hive> CREATE TABLE hiveSTOCKS (
> id INT,
> symbol STRING,
> quote_date STRING,
> open_price DOUBLE,
> high_price DOUBLE,
> low_price DOUBLE,
> close_price DOUBLE,
> volume INT,
> adj_close_price DOUBLE)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/user/root/stocks';
OK
Time taken: 4.295 seconds
hive> show tables;
OK
hivestocks
mtable_hbase_2
mytable_hbase
mytable_hbase_1
sample_07
sample_08
Time taken: 0.344 seconds, Fetched: 6 row(s)
hive> SELECT * FROM hivestocks limit 10;
OK
1 AAPL 2009-01-02 85.88 91.04 85.16 90.75 26643400 90.75
2 AAPL 2008-01-02 199.27 200.26 192.55 194.84 38542100 194.84
3 AAPL 2007-01-03 86.29 86.58 81.9 83.8 44225700 83.8
4 AAPL 2006-01-03 72.38 74.75 72.25 74.75 28829800 74.75
5 AAPL 2005-01-03 64.78 65.11 62.6 63.29 24714000 31.65
6 AAPL 2004-01-02 21.55 21.75 21.18 21.28 5165800 10.64
7 AAPL 2003-01-02 14.36 14.92 14.35 14.8 6479600 7.4
8 AAPL 2002-01-02 22.05 23.3 21.96 23.3 18910600 11.65
9 AAPL 2001-01-02 14.88 15.25 14.56 14.88 16161800 7.44
10 AAPL 2000-01-03 104.87 112.5 101.69 111.94 19144400 27.99
hive> SELECT * FROM hiveSTOCKS WHERE quote_date < '2003-01-02';
OK
8 AAPL 2002-01-02 22.05 23.3 21.96 23.3 18910600 11.65
9 AAPL 2001-01-02 14.88 15.25 14.56 14.88 16161800 7.44
10 AAPL 2000-01-03 104.87 112.5 101.69 111.94 19144400 27.99
18 CSCO 2002-01-02 18.44 19.3 18.26 19.23 55376900 19.23
19 CSCO 2001-01-02 38.13 38.5 32.63 33.31 17384600 33.31
20 CSCO 2000-01-03 109.94 110.25 103.56 108.06 53076000 54.03
33 MSFT 2002-01-02 66.65 67.11 65.51 67.04 48124000 27.4
34 MSFT 2001-01-02 44.13 45.0 42.88 43.38 82413200 17.73
35 MSFT 2000-01-03 117.37 118.62 112.0 116.56 53228400 47.64
43 YHOO 2002-01-02 18.14 18.69 17.68 18.63 21903600 9.31
44 YHOO 2001-01-02 30.31 30.37 27.5 28.19 21939200 14.1
45 YHOO 2000-01-03 442.92 477.0 429.5 475.0 38469600 118.75
hive> SELECT symbol, quote_date, volume, close_price FROM hivestocks WHERE close_price BETWEEN 30.0 and 90.0 ORDER BY symbol;
OK
AAPL 2007-01-03 44225700 83.8
AAPL 2006-01-03 28829800 74.75
AAPL 2005-01-03 24714000 63.29
CSCO 2001-01-02 17384600 33.31
MSFT 2008-01-02 63004200 35.22
MSFT 2003-01-02 67025200 53.72
MSFT 2002-01-02 48124000 67.04
MSFT 2001-01-02 82413200 43.38
YHOO 2006-01-03 24227700 40.91
YHOO 2005-01-03 25482800 38.18
YHOO 2004-01-02 16480000 45.4
Time taken: 14.893 seconds, Fetched: 11 row(s)
Comments
Post a Comment