ETL on AWS

(1) AWS Native ETL Service with AWS Glue

Recommended if we need to move data between different AWS services.

Use a multi-user Zeppelin notebook (via a Glue development endpoint) to author PySpark scripts.

Set up scheduled Jobs, kick off saved Jobs with Lambdas, and even trigger Jobs on the completion of earlier Jobs, like a work chain. Logs are retained in CloudWatch.
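
As a concrete illustration, here is a minimal sketch of a Lambda handler that kicks off a saved Glue job with boto3; the job name and argument key are hypothetical placeholders.

    # Minimal sketch: Lambda handler that starts a saved AWS Glue job.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # "nightly-orders-etl" is a placeholder job name; run-specific parameters
        # are passed through Arguments and show up as --input_path in the script.
        response = glue.start_job_run(
            JobName="nightly-orders-etl",
            Arguments={"--input_path": "s3://my-bucket/raw/orders/"},
        )
        return {"JobRunId": response["JobRunId"]}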

Monitoring:

AWS Glue provides the status of each job and pushes all notifications to Amazon CloudWatch. You can set up SNS notifications via CloudWatch to be informed of job failures or completions.
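
For example, a hedged sketch of wiring this up with a CloudWatch Events rule that forwards failed Glue job state changes to an SNS topic; the rule name and topic ARN are placeholders, and the topic policy must allow CloudWatch Events to publish to it.

    # Sketch: route failed Glue job state changes to an SNS topic for alerting.
    import json
    import boto3

    events = boto3.client("events")

    rule_name = "glue-job-failures"  # placeholder rule name
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps({
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {"state": ["FAILED", "TIMEOUT"]},
        }),
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify-ops",
                  "Arn": "arn:aws:sns:us-east-1:123456789012:etl-alerts"}],  # placeholder ARN
    )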

Overall, Glue is great when we don’t want to manage the infrastructure.

Specific Example: Running Scheduled Glue Jobs with Step Functions

  • A state machine can be triggered through Amazon CloudWatch on a schedule.
  • Here we use the long-running worker approach with activity tasks, as opposed to Lambda functions (which are limited to ~15 minutes).
  • ETL Runners are driven by activity tasks in Step Functions: they periodically poll the state machine for tasks (see the sketch after this list).
  • As soon as an ETL Runner receives a task, it starts the respective ETL job.
  • An ETL Runner maintains the state of its active jobs in an Amazon DynamoDB table.
  • When an active ETL job completes, the ETL Runner notifies the AWS Step Functions state machine, which allows the ETL workflow to proceed to the next step.
  • A CloudWatch Events handler is configured on an Amazon S3 bucket, so that when dataset files are uploaded to the bucket, Amazon S3 invokes an AWS Lambda function.
  • The Lambda function then signals the waiting state machine to exit the activity task corresponding to the uploaded dataset, and the state machine invokes the subsequent ETL job.
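
A minimal sketch of the ETL Runner polling loop described above, assuming a pre-created Step Functions activity; the activity ARN is a placeholder and run_etl_job is a hypothetical helper standing in for the actual job submission and DynamoDB state tracking.

    # Sketch of a long-running "ETL Runner" worker that polls an activity task.
    import boto3

    sfn = boto3.client("stepfunctions")
    ACTIVITY_ARN = "arn:aws:states:us-east-1:123456789012:activity:etl-runner"  # placeholder

    def run_etl_job(task_input: str) -> None:
        # Placeholder: start the actual ETL job here (e.g. a Glue job or an EMR
        # step), track it in DynamoDB, and block until it completes.
        pass

    def poll_for_tasks() -> None:
        while True:
            # Long-polls Step Functions (up to ~60 s) for the next activity task.
            task = sfn.get_activity_task(activityArn=ACTIVITY_ARN, workerName="etl-runner-1")
            token = task.get("taskToken")
            if not token:
                continue  # no task was scheduled within the polling window
            try:
                run_etl_job(task["input"])
                # Signal the state machine so the workflow can move to the next step.
                sfn.send_task_success(taskToken=token, output='{"status": "SUCCEEDED"}')
            except Exception as err:
                sfn.send_task_failure(taskToken=token, error="EtlJobFailed", cause=str(err))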


Open Questions:

Per the AWS docs, AWS Glue ETL is batch oriented, and you can schedule ETL jobs at a minimum interval of 5 minutes. While it can process micro-batches, it does not handle streaming data. If our use case requires ETL on data while we stream it in, we can perform the first leg of the ETL using Amazon Kinesis Data Firehose or Amazon Kinesis Data Analytics, store the data in S3, DynamoDB, RDS, or Redshift, and then trigger an AWS Glue ETL job to pick up that dataset and continue applying additional transformations (a Firehose sketch of the first leg follows).
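
A minimal sketch of that first leg, assuming a Kinesis Data Firehose delivery stream that buffers records and lands them in S3, where a Glue job can later pick them up; the stream name and record shape are placeholders.

    # Sketch: push streaming records into a Firehose delivery stream bound for S3.
    import json
    import boto3

    firehose = boto3.client("firehose")

    def send_event(event: dict) -> None:
        firehose.put_record(
            DeliveryStreamName="clickstream-to-s3",  # placeholder stream name
            Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
        )

    send_event({"user_id": 42, "action": "page_view", "ts": "2018-08-08T12:00:00Z"})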

That said, the following AWS forum discussion claims that it is possible to run a Spark Streaming job through Glue: https://forums.aws.amazon.com/thread.jspa?threadID=262269&tstart=0

(2) AWS ETL – Spark Jobs on EMR

(2.a) Submit Spark Jobs using Step Functions

This works well if we want a simple workflow (running a few Spark jobs in a specific order) and don’t want to spend much time orchestrating those jobs.

Specific Example: Create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.

What is Apache Livy?

  • It is a service that enables easy interaction with a Spark cluster over a REST interface.
  • It lets you submit simple Scala or Python code through REST API calls instead of having to manage and deploy large JAR files.
  • This helps scale data pipelines easily, with multiple Spark jobs running in parallel rather than serially through the EMR Step API (see the sketch after this list).
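
A sketch of submitting a Spark batch job to EMR through Livy's REST interface (Livy listens on port 8998 by default); the host, S3 path, class name, and arguments are placeholders.

    # Sketch: submit a Spark batch to Livy and check its state over REST.
    import json
    import requests

    LIVY_URL = "http://emr-master.example.internal:8998"  # placeholder host

    payload = {
        "file": "s3://my-bucket/jars/etl-assembly.jar",        # placeholder artifact
        "className": "com.example.etl.DailyAggregation",       # placeholder class
        "args": ["2018-08-08"],
    }
    resp = requests.post(
        f"{LIVY_URL}/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    batch_id = resp.json()["id"]

    # Poll the batch state until it finishes.
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    print(batch_id, state)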


(2.b) Launch ETL Cluster via AWS Lambda

  • A scheduled Lambda (a Lambda set up with a scheduled event) or a triggered Lambda can launch an EMR cluster, as sketched below.
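
A minimal sketch of a Lambda handler that launches a transient EMR cluster with a single Spark step; the cluster name, instance types, release label, and S3 script path are placeholders, and the default EMR roles are assumed to exist already.

    # Sketch: Lambda handler that launches a transient EMR cluster for one ETL run.
    import boto3

    emr = boto3.client("emr")

    def lambda_handler(event, context):
        response = emr.run_job_flow(
            Name="nightly-etl-cluster",
            ReleaseLabel="emr-5.20.0",
            Applications=[{"Name": "Spark"}],
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                ],
                "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
            },
            Steps=[{
                "Name": "run-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "s3://my-bucket/scripts/etl_job.py"],  # placeholder script
                },
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        return {"ClusterId": response["JobFlowId"]}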


(2.c) AWS Data Pipeline

Just like AWS Glue, AWS Data Pipeline helps move data between AWS-supported data nodes and MySQL/PostgreSQL databases. On the flip side, it does not offer a serverless deployment model.

(2.d) Airflow Workflow Framework

— Airflow is one of the most widely adopted ETL workflow management tools.

— We can define dependencies and hierarchies of jobs.

  • In development, the Airflow server can use the LocalExecutor (tasks are executed as subprocesses), which parallelizes tasks locally.
  • For production workloads, you should consider scaling out with the CeleryExecutor on a cluster with multiple worker nodes.
  • One can define steps in Airflow that submit Spark jobs to an EMR cluster (see the DAG sketch after this list).
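
A hedged sketch of such a DAG: it adds a Spark step to an existing EMR cluster and waits for it. Import paths assume a recent release of the Airflow Amazon provider package (they vary across versions); the cluster ID, script path, and connection ID are placeholders.

    # Sketch: Airflow DAG that submits a Spark step to EMR and waits for it.
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    SPARK_STEP = [{
        "Name": "daily-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/etl_job.py"],  # placeholder
        },
    }]

    with DAG("emr_spark_etl", start_date=datetime(2018, 8, 8),
             schedule_interval="@daily", catchup=False) as dag:
        add_step = EmrAddStepsOperator(
            task_id="add_step",
            job_flow_id="j-XXXXXXXXXXXXX",  # placeholder cluster ID
            steps=SPARK_STEP,
            aws_conn_id="aws_default",
        )
        wait_step = EmrStepSensor(
            task_id="wait_step",
            job_flow_id="j-XXXXXXXXXXXXX",  # placeholder cluster ID
            step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
            aws_conn_id="aws_default",
        )
        add_step >> wait_step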


Ability to perform aggressive pre-aggregation in the ingestion layer and fire continuous queries on real-time streams and historical datasets.

(3) AWS Batch

With AWS Batch, there is no need to install and manage batch computing software or server clusters.

While not ‘serverless’ in the generally understood sense, AWS Batch does all the provisioning and scaling of compute resources automatically, allowing jobs to be efficiently scheduled and executed with minimal administration. Using a Lambda-like container, you can schedule jobs in much the same way as the Lambda service does, with the advantage that they can run for as long as you like. This lets you focus on analyzing results and solving problems instead. AWS Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 and Spot Instances.

It’s great for running memory-intensive or CPU-bound computations on existing data, without performing any aggregations or stream processing.
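
A minimal sketch of submitting such a containerized job with boto3; the job name, job queue, and job definition are placeholders that must already be registered in AWS Batch.

    # Sketch: submit a containerized job to an existing AWS Batch queue.
    import boto3

    batch = boto3.client("batch")

    response = batch.submit_job(
        jobName="risk-model-run-2018-08-08",   # placeholder job name
        jobQueue="etl-spot-queue",             # placeholder queue
        jobDefinition="risk-model:3",          # placeholder job definition
        containerOverrides={
            "vcpus": 4,
            "memory": 16384,  # MiB
            "command": ["python", "run_model.py", "--date", "2018-08-08"],
        },
    )
    print(response["jobId"])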

Deep learning, genomics analysis, financial risk models, Monte Carlo simulations, animation rendering, media transcoding, image processing, and engineering simulations are all excellent examples of batch computing applications.

(4) Airflow as a complete Hybrid ETL Workflow Solution

Airflow can combine multiple tasks interacting with heterogeneous data sources and sinks.

Pros: Airflow offers significant advantages over cron

  • Applications get bloated when scheduling is done with cron
  • Performance of batch jobs improves with better prioritization
  • Data engineers with different skill sets can build a pipeline together
  • The scheduling of batch jobs can be scaled out
  • Rerunning jobs, viewing logs, and inspecting job history become trivial
  • Alerting levels are easy to configure
  • External dependencies of jobs can be monitored so as not to impact the SLAs of other jobs

Ref: https://medium.com/videoamp/what-we-learned-migrating-off-cron-to-airflow-b391841a0da4

Cons: You have to provision and manage the Airflow server yourself.

(5) AWS ETL – Databricks for managing Spark Jobs

Automatic scaling of compute

Autoscaling compute is a basic capability that many big data platforms provide today.

But most of these tools expect a static resource size allocated for a single job, which doesn’t take advantage of the elasticity of the cloud. Resource schedulers like YARN then take care of “coarse-grained” autoscaling between different jobs, releasing resources back only after a Spark job finishes. This suffers from two problems:

  • Estimating a good size for the job requires a lot of trial and error.
  • Users typically over-provision resources for the maximum load based on time of day, day of the week, or special occasions like Black Friday.

Databricks autoscaling is more dynamic and based on fine-grained Spark tasks that are queued on the Spark scheduler.

This allows clusters to scale up and down more aggressively in response to load and improves the utilization of cluster resources automatically without the need for any complex setup from users.
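
For illustration, here is a hedged sketch of requesting an autoscaling cluster through the Databricks Clusters REST API (2.0); the workspace URL, token, runtime version, node type, and worker bounds are placeholders chosen for the example.

    # Sketch: create an autoscaling Databricks cluster via the Clusters REST API.
    import requests

    WORKSPACE = "https://dbc-xxxxxxxx.cloud.databricks.com"  # placeholder workspace URL
    TOKEN = "dapiXXXXXXXXXXXX"                                # placeholder access token

    cluster_spec = {
        "cluster_name": "etl-autoscaling",
        "spark_version": "5.5.x-scala2.11",   # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder node type
        "autoscale": {"min_workers": 2, "max_workers": 20},
    }

    resp = requests.post(
        f"{WORKSPACE}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    print(resp.json())  # expected to contain the new cluster_id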

Databricks autoscaling helps you save up to 30% of the cloud costs based on your workloads.

See the optimized autoscaling blog to learn more on how you save cloud costs by using Databricks autoscaling.

Databricks sizing on AWS :  https://docs.databricks.com/user-guide/clusters/sizing.html

Cluster Sizing in Autopilot mode: https://databricks.com/blog/2018/07/17/apache-spark-clusters-in-autopilot-mode.html

Imp Reference: https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html

Moreover, the ‘Airflow Databricks integration’ lets you combine the optimized Spark engine offered by Databricks with the scheduling features of Airflow, as sketched below.
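
A hedged sketch of that integration, assuming the Airflow Databricks provider package is installed; the notebook path, runtime version, node type, and connection ID are placeholders.

    # Sketch: Airflow DAG that submits a one-time run to Databricks.
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

    run_spec = {
        "new_cluster": {
            "spark_version": "5.5.x-scala2.11",  # placeholder runtime version
            "node_type_id": "i3.xlarge",         # placeholder node type
            "autoscale": {"min_workers": 2, "max_workers": 8},
        },
        "notebook_task": {"notebook_path": "/Shared/etl/daily_aggregation"},  # placeholder
    }

    with DAG("databricks_etl", start_date=datetime(2018, 8, 8),
             schedule_interval="@daily", catchup=False) as dag:
        DatabricksSubmitRunOperator(
            task_id="submit_run",
            databricks_conn_id="databricks_default",
            json=run_spec,
        )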

(6) AWS ETL on Redshift

For terabytes of data, we typically use either Hive or Redshift as the data warehouse.

It appears that in most cases, Redshift will be the cheaper option. In their own tests, Airbnb’s data engineering team concluded that their setup would cost approximately 4 times more on Hadoop than on Redshift.

Matillion is the best ETL tool for Redshift and Spectrum.

———————————————————————————————————————-

(7) AWS Streaming ETL

Streams can originate from many external sources (IoT gateways) as well as from intermediate processed results (DynamoDB Streams, intermediate Kafka topics); a Lambda function can then submit a long-running ETL job (Spark on EMR) or execute queries on micro-batches, as sketched below.
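
As an illustration of this pattern (a minimal sketch, not a production design), a stream-triggered Lambda that submits a Spark step to an already-running EMR cluster; the cluster ID and script path are placeholders.

    # Sketch: stream-triggered Lambda that kicks off a Spark step on EMR.
    import boto3

    emr = boto3.client("emr")

    def lambda_handler(event, context):
        # event["Records"] holds the micro-batch delivered to this invocation
        # (e.g. from DynamoDB Streams or Kinesis).
        record_count = len(event.get("Records", []))
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
            Steps=[{
                "Name": f"stream-etl-{record_count}-records",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://my-bucket/scripts/stream_etl.py"],  # placeholder
                },
            }],
        )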

Sample Streaming ETL Assembly Lines

Common ETL Requirements

  • Multi-tenant Job Processing
  • Multi-user Job Script Management with ACL
  • Ability to define Job Dependency and groups of jobs per vertical
  • Ability to scale individual Job Group
  • Ability to maintain mix of Streaming and Batch Jobs
  • Ability to identify memory-intensive or CPU-intensive Jobs
  • Ability to automate submission of the jobs to the cluster
  • Ability to get complete visibility of the runtime metrics and log history of the jobs
  • Ability to scale the physical resources automatically
  • Ability to perform aggressive Pre-Aggregation in Ingestion Jobs and publish results to relevant streams
  • Usually we should view all our jobs as follows:

(Ingestion Job , Migration Job) => [Streams]—(Real-time Analysis , Persistence Job)—[Streams, Storage]  => (Historical Analysis Job)—[Streams, Storage ]

  • For example, when we capture business data from different sources, we should be able to run either a continuous-query pipeline or a data-enrichment and processing pipeline to generate contextual streams such as ‘time-series’, ‘aggregation’, ‘graph’, and ‘event-alert’ streams.
  • Then another streaming pipeline should be able to pick up the relevant streams and push them to the corresponding storage for further analysis.