[SAA] 32. Data Engineering

AWS Batch Overview

  • Run batch jobs as Docker images
  • Dynamic provisioning of the instances (EC2 & Spot Instances) - in VPC
  • Optimal quantity and type based on volume and requirements
  • No need to manage clusters, fully serverless
  • You just pay for the underlying EC2 instance
  • Example: batch process of images, running thousands of concurrent jobs
  • Schedule Batch Jobs using CloudWatch Events
  • Orchestrate Batch Jobs using AWS Step Functions

 

 

 

Lambda vs Batch

Lambda

  • Time limit: 15 mins
  • Limted runtime
  • Limited temporary disk space
  • Serverless

Batch

  • No time limit
  • Any runtinme as long as it's package as a Docker image
  • Rely on EBS / instance store for disk space
  • Relies on EC2 (can be managed by AWS)

 

Compute Environments

Managed Compute Environment

  • AWS Batch managed the capacity and instance types within the environment
  • You can choose On-Demand or Spot Instance
  • You can set a maximum price for Spot instance
  • Launched within your own VPC
    • If you launch within your own private subnet, make sure it has access to the ECS service
    • Either using a NAT Gateway / instance or using VPC Endpoint for ECS

 

Unmanaged Compute Environment

  • You control and manage instance configuration, provisioning and scaling

 

 

Kinesis

CloudWatch cannot send to Kinesis Data Firehose or Kinesis Data Streams

Near real-time: Kinesis Data Firehose

Kinesis agent can directly configured to send data to Kinesis Data Firehose

Firehose can connect to S3

 

Kinesis Data Firehose is near real-time

Using Lambda to send to ElasticSearch

 

Athena

  • Quicksight for visiulization dashboard
  • CloudTrail can stream logs to CloudWatch

 

 

  • EMR can choose to use Spot Fleet to control the cost
  • Athena: data must stay in S3
  • Redshift Spectrum for serverless queries on S3

 

 

 

 

 

 

posted @ 2021-09-23 13:59  Zhentiw  阅读(276)  评论(0编辑  收藏  举报