Does Glue support heavyweight Spark jobs in a managed cluster?
Does the Glue Spark cluster support auto-scaling?
Does Glue support Streaming Jobs?
Per the AWS docs, AWS Glue ETL is batch-oriented, and you can schedule ETL jobs at a minimum of 5-minute intervals. While it can process micro-batches, it does not handle streaming data. If your use case requires ETL on data as it streams in, you can perform the first leg of the ETL using Amazon Kinesis Data Firehose or Amazon Kinesis Data Analytics, store the data in S3, DynamoDB, RDS, or Redshift, and then trigger an AWS Glue ETL job to pick up that dataset and apply additional transformations to it.
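As an illustration, here is a minimal sketch of that last step: a Lambda handler that starts the downstream Glue job once Firehose has delivered a batch to S3. The job name and argument keys are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by S3 PUT events on the Firehose delivery prefix;
    starts the downstream Glue ETL job for each delivered object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # 'raw-events-etl' is a hypothetical Glue job name; the arguments are
        # exposed to the job script as --SOURCE_BUCKET / --SOURCE_KEY.
        glue.start_job_run(
            JobName="raw-events-etl",
            Arguments={"--SOURCE_BUCKET": bucket, "--SOURCE_KEY": key},
        )
```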
Ability to perform aggressive Pre-Aggregation in the Ingestion layer and fire Continuous Queries on real-time streams and historical datasets.
(3) AWS Batch
With AWS Batch, there is no need to install and manage batch computing software or server clusters.
Whilst not 'serverless' in the generally understood sense, AWS Batch does all the provisioning and scaling of compute resources automatically, allowing jobs to be scheduled and executed efficiently with minimal administration. Using a Lambda-like container, you can schedule jobs in much the same way as the Lambda service does, with the advantage that they can run for as long as you like. That lets you focus on analyzing results and solving problems instead of managing infrastructure. AWS Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 and Spot Instances.
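As a rough sketch of what job submission looks like (the job queue, job definition, and command below are hypothetical and assume the compute environment and container image are already registered):

```python
import boto3

batch = boto3.client("batch")

# 'etl-job-queue' and 'etl-job-definition' are hypothetical AWS Batch resources.
response = batch.submit_job(
    jobName="nightly-aggregation",
    jobQueue="etl-job-queue",
    jobDefinition="etl-job-definition",
    containerOverrides={
        "command": ["python", "run_etl.py", "--date", "2021-01-01"],
        "environment": [{"name": "STAGE", "value": "prod"}],
    },
)
print("Submitted job:", response["jobId"])
```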
It's great for running memory-intensive or CPU-bound computations on existing data, without performing any real-time aggregations or stream processing.
Deep learning, genomics analysis, financial risk models, Monte Carlo simulations, animation rendering, media transcoding, image processing, and engineering simulations are all excellent examples of batch computing applications.
(4) Airflow as a complete Hybrid ETL Workflow Solution
Airflow can combine multiple tasks interacting with heterogeneous data sources and sinks.
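For example, a minimal Airflow DAG sketch combining heterogeneous sources and sinks (the task callables, schedule, and names are hypothetical placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_mysql(**context):
    # Pull incremental rows from an operational database (hypothetical helper).
    ...

def stage_to_s3(**context):
    # Write the extracted rows to an S3 staging prefix.
    ...

def load_into_redshift(**context):
    # COPY the staged files into the Redshift warehouse.
    ...

with DAG(
    dag_id="hybrid_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_mysql", python_callable=extract_from_mysql)
    stage = PythonOperator(task_id="stage_to_s3", python_callable=stage_to_s3)
    load = PythonOperator(task_id="load_into_redshift", python_callable=load_into_redshift)

    extract >> stage >> load
```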
Autoscaling compute is a basic capability that many big data platforms provide today.
But most of these tools expect a static resource size allocated for a single job, which doesn’t take advantage of the elasticity of the cloud. Resource schedulers like YARN then take care of “coarse-grained” autoscaling between different jobs, releasing resources back only after a Spark job finishes. This suffers from two problems:
Estimating a good size for the job requires a lot of trial and error.
Users typically over-provision resources for the maximum load based on the time of day, the day of the week, or special occasions like Black Friday.
Databricks autoscaling is more dynamic and based on fine-grained Spark tasks that are queued on the Spark scheduler.
This allows clusters to scale up and down more aggressively in response to load and improves the utilization of cluster resources automatically without the need for any complex setup from users.
Databricks autoscaling can help you save up to 30% of your cloud costs, depending on your workloads.
Moreover, the 'Airflow Databricks integration' lets you take advantage of the optimized Spark engine offered by Databricks together with the scheduling features of Airflow.
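A sketch of that integration using the DatabricksSubmitRunOperator from the Airflow Databricks provider; the connection ID, notebook path, node type, and autoscale bounds are assumptions for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="databricks_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The ephemeral job cluster relies on Databricks autoscaling (2 to 8 workers).
    transform = DatabricksSubmitRunOperator(
        task_id="run_transform_notebook",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "autoscale": {"min_workers": 2, "max_workers": 8},
        },
        notebook_task={"notebook_path": "/Shared/etl/transform_orders"},
    )
```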
(6) AWS ETL on Redshift
For terabytes of data, we typically use either Hive or Redshift as the Data Warehouse.
It appears that in most cases, Redshift will be the cheaper option. In their own test, Airbnb’s data engineering team concluded that their setup would cost approximately 4 times more on Hadoop than using Redshift.
Matillion is one of the best ETL tools for Redshift and Spectrum.
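Whatever tool orchestrates it, the core Redshift load is usually a COPY from S3. A minimal sketch via the Redshift Data API (the cluster, database, table, bucket, and IAM role are hypothetical):

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY staged Parquet files from S3 into a Redshift table.
response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="warehouse",
    DbUser="etl_user",
    Sql=(
        "COPY sales.orders "
        "FROM 's3://my-staging-bucket/orders/2021-01-01/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
        "FORMAT AS PARQUET;"
    ),
)
print("Statement id:", response["Id"])
```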
(7) AWS Streaming ETL
Streams can originate from many different external sources (e.g., IoT gateways) as well as intermediate processed results (DynamoDB Streams, intermediate Kafka topics); a Lambda function can then submit a long-running ETL job (Spark on EMR) or execute queries on the micro-batches.
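A sketch of that Lambda-to-EMR handoff (the cluster ID and script location are hypothetical; in practice the cluster would be looked up by tag or configuration):

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    """Triggered by a stream event (e.g., DynamoDB Streams); submits a
    long-running Spark step to an already-running EMR cluster."""
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical target cluster ID
        Steps=[
            {
                "Name": "streaming-micro-batch-etl",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://my-etl-scripts/process_micro_batch.py",
                    ],
                },
            }
        ],
    )
```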
Sample Streaming ETL Assembly Line
Common ETL Requirements
Multi-tenant Job Processing
Multi-user Job Script Management with ACL
Ability to define Job Dependencies and groups of jobs per vertical
Ability to scale individual Job Group
Ability to maintain a mix of Streaming and Batch Jobs
Ability to identify memory-intensive or cpu-intensive Jobs
Ability to automate submission of the jobs to the cluster
Ability to get complete visibility of the runtime metrics and log history of the jobs
Ability to scale the physical resources automatically
Ability to perform aggressive Pre-Aggregation in Ingestion Jobs and publish results to relevant streams
For example, when we capture business data from different sources, we should be able to run either a continuous-query pipeline or a data-enrichment and processing pipeline to generate contextual streams such as 'time-series', 'aggregation', 'graph', and 'event-alert' streams.
Then another streaming pipeline should be able to pick up the relevant streams and push them to the corresponding storage for further analysis.
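A minimal sketch of that fan-out (the stream names and routing rule are hypothetical): a Lambda consumer reads records from a source Kinesis stream and republishes each one to the contextual stream that downstream pipelines subscribe to.

```python
import base64
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical mapping from record type to its contextual output stream.
CONTEXTUAL_STREAMS = {
    "metric": "time-series-stream",
    "order": "aggregation-stream",
    "relationship": "graph-stream",
    "alert": "event-alert-stream",
}

def lambda_handler(event, context):
    """Consumes the source Kinesis stream and routes each record to a
    contextual stream for further storage and analysis."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        target = CONTEXTUAL_STREAMS.get(payload.get("type"))
        if target is None:
            continue  # unknown record types are skipped in this sketch
        kinesis.put_record(
            StreamName=target,
            Data=json.dumps(payload),
            PartitionKey=payload.get("id", "default"),
        )
```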