ETL on AWS

Date: August 4, 2018Author: kmandal0 Comments

(1) AWS Native ETL Service with AWS Glue

Recommended if we need to move data between different AWS services.

Use multi-user Zepplin via Glue Build to create PySpark scripts.

Set up scheduled Jobs , kick off saved Jobs with Lambdas, and even trigger Jobs with the completion of earlier Jobs, like a work chain. Logs are retained in CloudWatch.

Monitoring:

The AWS Glue provides status of each job and pushes all notifications to Amazon CloudWatch. You can setup SNS notifications via CloudWatch actions to be informed of job failures or completions.

Overall, Glue is great when we don’t want to manage the infrastructure.

Specific Example: Running Scheduled Glue Jobs with Step Functions

A state machine can be triggered through Amazon CloudWatch based on a schedule
Here we are using the long-running worker approach with activity tasks as opposed to Lambda Functions (limit ~ 15 mins)
ETL Runners are invoked by activity tasks in Step Functions
ETL Runners need to periodically poll the AWS Step Functions state machine for tasks
As soon as an ETL Runner receives a task, it starts the respective ETL job.
An ETL Runner maintains a state of active jobs in an Amazon DynamoDB table.
When an active ETL job completes, the ETL Runners notifies the AWS Step Functions state machine.
This allows the ETL workflow in AWS Step Functions to proceed to the next step.
A CloudWatch Events event handler is configured on an Amazon S3 bucket
So that when dataset files are uploaded to the bucket
Amazon S3 invokes an AWS Lambda function.
The Lambda function then signals the waiting state machine to exit the activity task corresponding to the uploaded dataset.
The subsequent ETL job is then invoked by the state machine.

Reference:

Open Questions:

Does Glue Native support for Cassandra ?
- Third-party Cassandra Support for Glue: https://www.cdata.com/kb/tech/cassandra-jdbc-aws-glue.rst
Does Glue support heavyweight Spark job in a managed cluster ?
Does Glue Spark Cluster support Auto-scale ?
Does Glue support Streaming Jobs ?

Per AWS docs, AWS Glue ETL is batch oriented, and one can schedule your ETL jobs at a minimum of 5 min intervals. While it can process micro-batches, it does not handle streaming data. If our use case requires to ETL data while we stream it in, we can perform the first leg of the ETL using Amazon Kinesis Data Firehose or Amazon Kinesis Data Analytics, and then store data to either S3 or DynamoDB or RDS or Redshift and trigger an AWS Glue ETL job to pick up that dataset and continue applying additional transformations to that data.

Having said that, following AWS Forum discussion claims that its possible to run a Spark Streaming Job through Glue https://forums.aws.amazon.com/thread.jspa?threadID=262269&tstart=0

(2) AWS ETL – Spark Jobs on EMR

(2.a) Submit Spark Jobs using Step Functions

It works if we want to develop a simple workflow (running a few spark jobs in a specific order) and if we don’t want to spend time orchestrating those jobs.

Specific Example: Create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.

CloudWatch Events supports AWS Step Functions as a target
Define CloudWatch rule for an S3 event
First stage in the state machine triggers an AWS Lambda
- AWS Lambda submits Spark Job through Apache Livy
State machine waits a few seconds before checking the Spark job status
The job finishes, and the state machine updates with its final status.
- A Lambda function sends an alert email
Submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions

What is Apache Livy ?

It is a service that enables easy interaction with a Spark cluster over a REST interface.
It lets you send simple Scala or Python code over REST API calls instead of having to manage and deploy large jar files
This helps because it scales data pipelines easily with multiple spark jobs running in parallel, rather than running them serially using EMR Step API

Reference:

(2.b) Launch ETL Cluster via AWS Lambda

A Scheduled Lambda (Lambda setup with scheduled event) or a Triggered Lambda can launch an EMR Cluster

Reference:

https://anchormen.nl/blog/big-data-services/event-driven-data-processing/

(2.c) AWS Data Pipeline

Just like AWS Glue – AWS Data Pipeline helps moving data between AWS-supported data nodes and mysql/postgress dbs. On the flip side, it doesn’t offer Serverless deployment.

(2.d) AirFlow Workflow Framework

— AirFLow is one of the most widely adopted ETL WorkFlow management Tool

— We can define dependency and hierarchy of jobs

Airflow server uses a LocalExecutor (tasks are executed as a subprocess), which helps to parallelize tasks locally (Dev)
For production workloads, you should consider scaling out with the CeleryExecutor on a cluster with multiple worker nodes.
One can define the steps in Airflow to submit spark job in EMR cluster

Reference:

Ability to perform aggressive Pre-Aggregation in Ingestion layer and fire Continuous Queries on real streams and historical dataset

(3) AWS Batch

AWS Batch, there is no need to install and manage batch computing software or server clusters.

Whilst not ‘serverless’ in the generally understood sense AWS Batch does do all the provisioning and scaling of compute resources automatically, allowing jobs to be efficiently scheduled and executed with minimal administration. Using a Lambda-like container you can schedule jobs in much the same way as the Lambda service does – with the advantage that they can run for as long as you like. It allows you to instead focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 and Spot Instances.

Its great for running some memory-intensive or cpu-bound computations on existing data with known without performing any aggregations or processing.

Deep learning, genomics analysis, financial risk models, Monte Carlo simulations, animation rendering, media transcoding, image processing, and engineering simulations are all excellent examples of batch computing applications

(4) Airflow as a complete Hybrid ETL Workflow Solution

Airflow can combine multiple tasks interacting with heterogeneous data sources and sinks.

Task1 submits Spark Job to EMR / Databricks
- https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
Task2 can use Athena
- https://datacenternotes.com/2018/07/21/build-a-data-pipeline-with-aws-athena-and-airflow-part-2/
Task3 talks to Redshift
- https://sonra.io/2018/01/01/using-apache-airflow-to-build-a-data-pipeline-on-aws/

Pros: AirFlow offers significant advantage over Cron

Applications get bloated using Cron
Performance of batch jobs improved prioritization
Data Engineers with different skill sets can build a pipeline together
Scale the scheduling of our batch jobs
Rerunning jobs , viewing log , job history become trivial
Easily generate Alerting levels
Monitor the external dependencies of their jobs so as not to impact the SLAs of other jobs

Ref: https://medium.com/videoamp/what-we-learned-migrating-off-cron-to-airflow-b391841a0da4

Cons: Provision and manage the Airflow server.

(5) AWS ETL – Databricks for managing Spark Jobs

Automatic scaling of compute

Autoscaling compute is a basic capability that many big data platforms provide today.

But most of these tools expect a static resource size allocated for a single job, which doesn’t take advantage of the elasticity of the cloud. Resource schedulers like YARN then take care of “coarse-grained” autoscaling between different jobs, releasing resources back only after a Spark job finishes. This suffers from two problems:

Estimating a good size for the job requires a lot of trial and error.
Users typically over-provision the resources for the maximum load based on time of day, the day of the week or some special occasions like black Friday.

Databricks autoscaling is more dynamic and based on fine-grained Spark tasks that are queued on the Spark scheduler.

This allows clusters to scale up and down more aggressively in response to load and improves the utilization of cluster resources automatically without the need for any complex setup from users.

Databricks autoscaling helps you save up to 30% of the cloud costs based on your workloads.

See the optimized autoscaling blog to learn more on how you save cloud costs by using Databricks autoscaling.

Databricks sizing on AWS : https://docs.databricks.com/user-guide/clusters/sizing.html

Cluster Sizing in Autopilot mode: https://databricks.com/blog/2018/07/17/apache-spark-clusters-in-autopilot-mode.html

Complete Job Management (Multi-User Access, Access Control, Configuration Management, Runtime Metrics, Production job scheduler)
Cluster auto-scaling
SQL, Python, Scala and R notebooks
Interactive dashboards
Integrations: GitHub, IDE, BI tools
- Notebooks connected to Github and Databricks Build Servers
- https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html
- https://docs.databricks.com/user-guide/notebooks/github-version-control.html
REST APIs
Notebook workflows
Role-based access control for notebooks, clusters and jobs
Single sign-on (SSO) with SAML 2.0 support
End-to-end encryption
JDBC/ODBC authentication
Native Support for S3 (preferred storage) , Support for AWS RDS and Redshift
Transparent autoscaling to take care of Data skew and disk utilization.
- Within a single job, some workers may require more disk space than others.
- https://databricks.com/blog/2017/12/01/transparent-autoscaling-of-instance-storage.html
Spark-EMR Jobs: Goal is to focus more on optimizing job code and scaling job not managing clusters
- For example, Facebook uses Databricks Spark-SQL and other features instead of working on writing scripts to query Hive tables
- https://databricks.com/blog/2017/02/28/voice-facebook-using-apache-spark-large-scale-language-model-training.html

Imp Reference: https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html

Moreover, ‘Airflow Databricks integration’ lets you take advantage of the the optimized Spark engine offered by Databricks with the scheduling features of Airflow.

(6) AWS ETL on Redshift

For terabytes of data , we either use Hive or Redshift as Data Warehouse.

It appears that in most cases, Redshift will be the cheaper option. In their own test, Airbnb’s data engineering team concluded that their setup would cost approximately 4 times more on Hadoop than using Redshift.

Matillion is the best ETL Tool for Redshift and Spectrum.

———————————————————————————————————————-

(7) AWS Streaming ETL

Streams can originate from many different external sources (IOT Gateways) and intermediate processed results (DynamoDB Streams / intermediate Kafka topics) and then Lambda function can submit long-running ETL job (Spark-EMR) or execute queries on micro-batches

Sample Example of Streaming ETL Assembly Lines

Common ETL Requirements

Multi-tenant Job Processing
Multi-user Job Script Management with ACL
Ability to define Job Dependency and groups of jobs per vertical
Ability to scale individual Job Group
Ability to maintain mix of Streaming and Batch Jobs
Ability to identify memory-intensive or cpu-intensive Jobs
Ability to automate submission of the jobs to the cluster
Ability to get complete visibility of the runtime metrics and log history of the jobs
Ability to scale the physical resources automatically
Ability to perform aggressive Pre-Aggregation in Ingestion Jobs and publish results to relevant streams
- Usually we should view all our jobs as follows:

(Ingestion Job , Migration Job) => [Streams]—(Real-time Analysis , Persistence Job)—[Streams, Storage] => (Historical Analysis Job)—[Streams, Storage ]

For example, when we capture business data from different sources, we should be able to run either continuous-query pipeline OR data-enrichment and processing pipeline to generate contextual streams like ‘time-series stream’ , ‘aggregation stream’ , ‘graph-stream’ , ‘event-alert’ streams etc.
Then another streaming Pipeline should be able to pick up the relevant streams and push to corresponding storage for further analysis.

ETL on AWS

(1) AWS Native ETL Service with AWS Glue

(2) AWS ETL – Spark Jobs on EMR

(2.a) Submit Spark Jobs using Step Functions

(2.b) Launch ETL Cluster via AWS Lambda

(2.c) AWS Data Pipeline

(2.d) AirFlow Workflow Framework

(3) AWS Batch

(4) Airflow as a complete Hybrid ETL Workflow Solution

(5) AWS ETL – Databricks for managing Spark Jobs

Automatic scaling of compute

(6) AWS ETL on Redshift

(7) AWS Streaming ETL

Common ETL Requirements

Primary Services

Pre-Built Applications

Data & AI Solutions

Get Exclusive Insights

Insights

Talk To Us

ETL on AWS

(1) AWS Native ETL Service with AWS Glue

(2) AWS ETL – Spark Jobs on EMR

(2.a) Submit Spark Jobs using Step Functions

(2.b) Launch ETL Cluster via AWS Lambda

(2.c) AWS Data Pipeline

(2.d) AirFlow Workflow Framework

(3) AWS Batch

(4) Airflow as a complete Hybrid ETL Workflow Solution

(5) AWS ETL – Databricks for managing Spark Jobs

Automatic scaling of compute

(6) AWS ETL on Redshift

(7) AWS Streaming ETL

Common ETL Requirements

Share This Article

Related Posts

Data Mesh VS Data Fabric: A Comparative Analysis of Modern Data Management Approaches

AWS Lambda

Amazon Kinesis

AWS Glue

Primary Services

Pre-Built Applications

Data & AI Solutions

Get Exclusive Insights

Insights

Talk To Us