[SAA] 32. Data Engineering
AWS Batch Overview
- Run batch jobs as Docker images
- Dynamic provisioning of the instances (EC2 & Spot Instances) - in VPC
- Optimal quantity and type based on volume and requirements
- No need to manage clusters; fully managed by AWS
- You pay only for the underlying EC2 instances
- Example: batch-processing images, running thousands of concurrent jobs (see the sketch below)
- Schedule Batch jobs using CloudWatch Events (now EventBridge)
- Orchestrate Batch jobs using AWS Step Functions
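As a rough illustration, a minimal boto3 sketch of submitting a Batch job; the job name, queue, job definition, and command are all hypothetical placeholders:

```python
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="resize-images-0001",     # hypothetical job name
    jobQueue="my-job-queue",          # hypothetical job queue
    jobDefinition="image-resizer:1",  # hypothetical job definition (points at a Docker image)
    containerOverrides={
        "command": ["python", "resize.py", "--bucket", "my-input-bucket"],
    },
)
print(response["jobId"])
```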
Lambda vs Batch
Lambda
- Time limit: 15 mins
- Limited language runtimes
- Limited temporary disk space
- Serverless
Batch
- No time limit
- Any runtime, as long as it's packaged as a Docker image (see the sketch after this list)
- Relies on EBS / instance store for disk space
- Relies on EC2 instances (which can be managed by AWS)
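To make the contrast concrete, here is a sketch of a worker that could run for hours, which rules out Lambda's 15-minute cap but is fine as a Batch job packaged in a Docker image; the bucket name is hypothetical:

```python
import time
import boto3

s3 = boto3.client("s3")

def process_all(bucket: str) -> None:
    # Iterate over every object in the bucket; a run like this can take
    # hours, which exceeds Lambda's 15-minute limit but is fine in Batch.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            time.sleep(1)  # stand-in for real per-object work
            print("processed", obj["Key"])

if __name__ == "__main__":
    process_all("my-input-bucket")  # hypothetical bucket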
Compute Environments
Managed Compute Environment
- AWS Batch manages the capacity and instance types within the environment
- You can choose On-Demand or Spot Instances
- You can set a maximum price for Spot Instances, as a percentage of the On-Demand price (see the sketch below)
- Launched within your own VPC
- If you launch it in a private subnet, make sure the subnet has access to the ECS service
- Either through a NAT Gateway / NAT instance or through a VPC Endpoint for ECS
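A hedged boto3 sketch of creating such a managed Spot compute environment; every ARN, subnet ID, and security group ID below is a hypothetical placeholder, and bidPercentage is the knob that caps the Spot price as a percentage of On-Demand:

```python
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="managed-spot-env",
    type="MANAGED",                     # AWS Batch manages capacity and instance types
    computeResources={
        "type": "SPOT",
        "bidPercentage": 60,            # pay at most 60% of the On-Demand price
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],   # let Batch pick instance types
        "subnets": ["subnet-xxxxxxxx"],           # private subnet -> needs NAT or ECS VPC endpoint
        "securityGroupIds": ["sg-xxxxxxxx"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```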
Unmanaged Compute Environment
- You control and manage instance configuration, provisioning and scaling
Kinesis
- The CloudWatch agent cannot send data directly to Kinesis Data Firehose or Kinesis Data Streams
- Near real-time delivery: Kinesis Data Firehose
- The Kinesis agent can be configured to send data directly to Kinesis Data Firehose
- Firehose can deliver to S3 (see the sketch after this list)
- Use Lambda to send data to Elasticsearch (e.g., as a consumer of a Kinesis Data Stream)
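For example, an application can push records into a Firehose delivery stream that is configured (elsewhere) to deliver to S3 near real-time; a minimal boto3 sketch with a hypothetical stream name:

```python
import json
import boto3

firehose = boto3.client("firehose")

record = {"user": "alice", "event": "login"}
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",                # hypothetical stream
    Record={"Data": (json.dumps(record) + "\n").encode()},  # newline-delimited for S3
)
```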
Athena
- QuickSight for visualization dashboards
- CloudTrail can stream logs to CloudWatch Logs
- EMR can use Spot Instances (via instance fleets) to control cost
- Athena: the data must stay in S3; queries run directly against it (see the sketch below)
- Redshift Spectrum can also query data directly in S3 (but requires a Redshift cluster)
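A minimal boto3 sketch of a serverless Athena query over data sitting in S3; the database, table, and results bucket are hypothetical:

```python
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",  # hypothetical table
    QueryExecutionContext={"Database": "weblogs"},                           # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},       # hypothetical bucket
)["QueryExecutionId"]

# Poll until the query finishes (simplified; real code should back off and time out)
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```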