ETL Pipeline Demonstration Using Apache NiFi

Introduction: 

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination. It is data source agnostic, supporting disparate and distributed sources of differing formats, schemas, protocols, speeds, and sizes, such as machines, geolocation devices, click streams, files, social feeds, log files, videos, and more. It is configurable plumbing for moving data around, similar to how FedEx, UPS, or other courier services move parcels around. And just like those services, Apache NiFi allows you to trace your data in real time, just as you could trace a delivery.

Apache NiFi is based on technology previously called “Niagara Files” that was developed and used at scale within the NSA for eight years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program. As such, it was designed from the beginning to be field-ready: flexible, extensible, and suitable for a wide range of devices, from a small lightweight network edge device such as a Raspberry Pi to enterprise data clusters and the cloud. Apache NiFi is also able to dynamically adjust to fluctuating network connectivity that could impact communications and thus the delivery of data.

Prerequisites: 

To follow this tutorial, you will need:

  • Java installed, with JAVA_HOME exported (e.g. in your .bashrc file)

Installation of Apache NiFi:

  • NiFi can be downloaded from the NiFi Downloads Page
  • Decompress and untar into the desired installation directory.
  • Navigate to the NiFi installation directory.
  • Then run the command: $ sudo bin/nifi.sh start
  • You can check whether Apache NiFi has started with the following command:

$ sudo bin/nifi.sh status

  • For example, we downloaded Apache NiFi and uncompressed it under the Apache directory, as shown in the snapshot below.

Fig: Getting Started with Apache NiFi

  • After Apache NiFi has started, open a web browser and go to http://localhost:8080/nifi/
  • By default, Apache NiFi uses port 8080, but you can change the port number in the configuration file nifi.properties, located under the conf directory. The property name is nifi.web.http.port. In our case, we changed it from 8080 to 9000.
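The port change described above amounts to a one-line edit in conf/nifi.properties (the snippet below assumes this tutorial's choice of port 9000):

```
# conf/nifi.properties -- web UI port (default is 8080)
nifi.web.http.port=9000
```

Restart NiFi after editing the file for the change to take effect.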

Fig: Apache NiFi Dashboard

Problem Statement:

  1. Download a zip file from an HTTP server and save it to the local machine.
  2. Uncompress the downloaded zip file (note: it contains 3 CSV files), then take one CSV file and clean it, because it contains a few junk values.
  3. Send that CSV file on as a FlowFile and load all of its values into MySQL.
  4. Extract all the data stored in MySQL back into a CSV file.
  5. Upload the CSV file to an Amazon S3 bucket and to HDFS.
  6. After the file has been successfully saved to the desired locations, send a confirmation mail to a specified user.

Overview:

Fig: High-level ETL workflow in Apache NiFi

Solution:

  1. GetHTTP: Fetches data from an HTTP or HTTPS URL and writes the data to the content of a FlowFile. Drag and drop the GetHTTP processor and configure it as follows:

Fig: GetHTTP Configuration

Properties:

URL: Paste the URL from where you want to extract data. In our case, we have used http://api.worldbank.org/v2/countries/all/indicators/NE.EXP.GNFS.ZS?downloadformat=csv

Filename: The filename to assign to the file when pulled.

SSL Context Service: To fetch the file over a secure connection, you have to configure an SSL Context Service. In this case, we created a StandardRestrictedSSLContextService; its configuration is as follows:

StandardRestrictedSSLContextService:

Fig: StandardRestrictedSSLContextService Configuration

Properties:

Keystore Filename: The fully-qualified filename of the Keystore
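What GetHTTP does in this flow can be approximated in plain Python with the standard library, as a rough sketch: fetch a URL and write the response body to a file (NiFi writes it to the FlowFile's content instead).

```python
import urllib.request
from pathlib import Path

def fetch(url: str, dest: str) -> Path:
    """Download `url` and write the response body to `dest`.

    Roughly what the GetHTTP processor does, minus scheduling,
    SSL context handling, and FlowFile attributes.
    """
    path = Path(dest)
    with urllib.request.urlopen(url) as resp:
        path.write_bytes(resp.read())
    return path

# Example with the World Bank indicator URL used in this tutorial:
# fetch("http://api.worldbank.org/v2/countries/all/indicators/"
#       "NE.EXP.GNFS.ZS?downloadformat=csv", "indicators.zip")
```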

  2. PutFile: Writes the contents of a FlowFile to the local file system. Drag and drop the PutFile processor and configure it as follows:

Fig: PutFile Configuration

Properties:

Directory: The directory to which files should be written.

  3. ExecuteStreamCommand: Executes an external command, streaming the FlowFile contents to it. Drag and drop the ExecuteStreamCommand processor and configure it as follows:

Fig: ExecuteStreamCommand Configuration
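The tutorial does not show the cleaning command itself. As one possible sketch, here is a small Python routine (which ExecuteStreamCommand could invoke as an external script) that drops empty or all-blank "junk" rows; the real cleaning rules depend on what the downloaded World Bank CSV actually contains.

```python
import csv
import io

def clean_csv(text: str) -> str:
    """Illustrative cleaning step: drop rows that are empty or all-blank.

    The actual 'junk' criteria depend on the downloaded file; adjust
    the filter to match the data you see.
    """
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if any(cell.strip() for cell in row):  # keep rows with real content
            writer.writerow(row)
    return out.getvalue()
```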

  4. ExecuteScript: Executes a script (e.g. in Jython or Groovy) against incoming FlowFiles. Drag and drop the ExecuteScript processor and configure it as follows:

Fig: ExecuteScript Configuration

  5. PutDatabaseRecord: Uses a configured record reader to read records from a FlowFile and inserts them into a database. Drag and drop the PutDatabaseRecord processor and configure it as follows:

Fig: PutDatabaseRecord Configuration

DBCPConnectionPool:

Fig: DBCPConnectionPool Configuration
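For a MySQL target, the DBCPConnectionPool controller service typically needs values along these lines (the database name, user, and jar path below are hypothetical placeholders, not taken from the tutorial):

```
Database Connection URL      jdbc:mysql://localhost:3306/etl_demo
Database Driver Class Name   com.mysql.cj.jdbc.Driver
Database Driver Location(s)  /path/to/mysql-connector-java.jar
Database User                nifi_user
Password                     ********
```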

CSVReader: 

Fig: CSVReader Configuration

AvroSchemaRegistry:

Fig: AvroSchemaRegistry Configuration
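An AvroSchemaRegistry entry is a plain Avro schema in JSON. A minimal sketch for the indicator data might look like the following (the field names here are illustrative assumptions, not read from the actual file):

```
{
  "type": "record",
  "name": "indicator",
  "fields": [
    { "name": "country_name", "type": "string" },
    { "name": "country_code", "type": "string" },
    { "name": "year",         "type": "int" },
    { "name": "value",        "type": ["null", "double"] }
  ]
}
```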

  6. ExecuteSQL: Executes a SQL SELECT query and writes the result set to a FlowFile in Avro format. Drag and drop the ExecuteSQL processor and configure it as follows:

Fig: ExecuteSQL Configuration
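To export the whole table, the query in ExecuteSQL can be as simple as the following (the table name is a hypothetical placeholder):

```
SELECT * FROM indicator;
```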

  7. ConvertRecord: Converts records from one data format to another using the configured reader and writer. Drag and drop the ConvertRecord processor and configure it as follows:

Fig: ConvertRecord Configuration

AvroReader:

Fig: AvroReader Configuration

CSVRecordSetWriter:

Fig: CSVRecordSetWriter Configuration

  8. PutS3Object: Uploads the contents of a FlowFile to an Amazon S3 bucket. Drag and drop the PutS3Object processor and configure it as follows:

Fig: PutS3Object Configuration
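PutS3Object mainly needs the target bucket, object key, region, and AWS credentials. Typical entries look like the following (the bucket name is a hypothetical placeholder; ${filename} is the NiFi Expression Language reference to the FlowFile's filename attribute):

```
Bucket              my-etl-demo-bucket
Object Key          ${filename}
Region              us-east-1
Access Key ID       <your AWS access key>
Secret Access Key   <your AWS secret key>
```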

  9. PutHDFS: Writes FlowFile data to HDFS. Drag and drop the PutHDFS processor and configure it as follows:

Fig: PutHDFS Configuration

  10. PutEmail: Sends an email with configurable content once the flow completes. Drag and drop the PutEmail processor and configure it as follows:

Fig: PutEmail Configuration
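PutEmail needs your SMTP details; for a Gmail account the values would be along these lines (the addresses below are hypothetical placeholders):

```
SMTP Hostname   smtp.gmail.com
SMTP Port       587
SMTP TLS        true
SMTP Username   sender@gmail.com
SMTP Password   <app password>
From            sender@gmail.com
To              recipient@example.com
```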

In this process, we performed the data EXTRACTION via API calls. Then we applied various TRANSFORMATION operations to derive meaningful data. After that, we LOADED the data into a SQL table, completing the steps of the ETL pipeline. And finally, we SCHEDULED the operation using Apache NiFi.