Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination. It is data source agnostic, supporting disparate and distributed sources of differing formats, schemas, protocols, speeds, and sizes such as machines, geolocation devices, click streams, files, social feeds, log files and videos and more. It is configurable plumbing for moving data around, similar to how FedEx, UPS or other courier delivery services move parcels around. And just like those services, Apache NiFi allows you to trace your data in real-time, just like you could trace a delivery.
Apache NiFi is based on technology previously called “Niagara Files” that was in development and used at scale within the NSA for the last eight years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program. As such, it was designed from the beginning to be a field ready—flexible, extensible and suitable for a wide range of devices from a small lightweight network edge device such as a Raspberry Pi to enterprise data clusters and the cloud. Apache NiFi is also able to dynamically adjust to fluctuating network connectivity that could impact communications and thus the delivery of data.
Decompress and untar into the desired installation directory.
Navigate to the NiFi installation directory.
Then run the command: $ sudo bin/nifi.sh start
By following the command you can track Apache NiFi is started or not:
$ sudo bin/nifi.sh status
For example, we have downloaded the Apache NiFi and uncompressed under the Apache directory; please follow the below snapshot.
Fig: Getting Started with Apache NiFi
After started the Apache NiFi, Go to a web browser and hit http://localhost:8080/nifi/
By default, Apache NiFi uses 8080 port but you can change the port number from configuration file i.e. nifi.properties which is located under conf directory. Name of the property is nifi.web.http.port. In our case, we have changed from 8080 to 9000.
Fig: Apache NiFi Dashboard
Download a zip file from an HTTP server and save it into a local machine.
Then uncompress the downloaded zip file (note: downloaded zip file contains 3 csv file) and after that take a csv file and then clean the csv file because that csv file contains few junk values.
Then send that csv file through flowfile and dump the all values into MySQL.
Then extract the all data which is stored in MySQL into a csv file.
After that, upload the csv file in an Amazon S3 bucket and HDFS.
After successfully saving the file into desired locations it will send a confirmation mail to a specified user.
GetHTTP: Fetches data from an HTTP or HTTPS URL and writes the data to the content of a FlowFile.Drag and drop the GetHTTP processor and configure is as follows:
Keystore Filename: The fully-qualified filename of the Keystore
PutFile: It writes the contents of a FlowFile to the local file system. Drag and drop the PutFile Processor and configure it as follows:
Fig: PutFile Configuration
Directory: The directory to which files should be written.
ExecuteStreamCommand: Drag and drop the ExecuteStreamCommand Processor and configure is as follows:
Fig: ExecuteStreamCommand Configuration
ExecuteScript: Drag and drop the ExecuteScript Processor and configure is as follows:
Fig: ExecuteScript Configuration
PutDatabaseRecord: Drag and drop the PutDatabaseRecord Processor and configure is as follows:
Fig: PutDatabaseRecord Configuration
Fig: DBCPConnectionPool Configuration
Fig: CSVReader Configuration
Fig: AvroSchemaRegistry Configuration
ExecuteSQL: Drag and drop the ExecuteSQL Processor and configure is as follows:
Fig: ExecuteSQL Configuration
ConvertRecord: Drag and drop the ConvertRecord Processor and configure is as follows:
Fig: ConvertRecord Configuration
Fig: AvroReader Configuration
Fig: CSVRecordSetWriter Configuration
PutS3Object: Drag and drop the PutS3Object Processor and configure is as follows:
Fig: PutS3Object Configuration
PutHDFS: Drag and drop the PutHDFS Processor and configure is as follows:
Fig: PutHDFS Configuration
PutEmail: Drag and drop the PutEmail Processor and configure is as follows:
Fig: PutEmail Configuration
In this process, we have done the Data EXTRACTION by API calls. Then we performed various TRANSFORMATION operations to derive meaningful data from it. After that, we LOADED the data in a SQL table to complete the steps of ETL Pipeline. And finally, SCHEDULED the operation using Apache NiFi.