Showing posts from May, 2018

The Wizard of Oozie

According to the official website, Apache Oozie is a "workflow scheduler system to manage Apache Hadoop jobs." It automates the running of Hadoop jobs through a workflow engine and a coordinator engine. Oozie was made to work with other common Hadoop tools such as Pig, Hive, and Sqoop, but it can also be extended to support custom Hadoop jobs.

We'll start by creating a database in MySQL called "oozie". For now, we won't create a table or populate it with data, since the jobs in the workflow will take care of that:

mysql> create database oozie;
mysql> use oozie;

What the Workflow is All About

The information used for this table will come from a file called business.csv, which will later be copied to a folder in HDFS, where it can be imported into Hive. A number of files make up the processes in this job. One file retrieves the CSV file from data.sfgov.org; there is another file that will