[Cloud Architect] 3. Monitor, React, and Recover

Lesson Outline

Monitoring
Alerting
Recovering
Automating

In this lesson, you will learn how to use AWS tools to monitor and alert on the systems that you build. You'll create alerts and think through how to find problems and recover from them.

Overview

Without monitoring, you are blind to what is happening in your systems. Without knowledgeable folks being alerted when things go wrong, you're deaf to system failures. Creating systems that reach out to you and ask for help when they need it, or better yet, let you know that they might need help soon, is critical to meeting your business goals and sleeping easier at night.

Once you have mastered monitoring and alerting, you can begin to think about how your systems can fix themselves. At least for routine problems, automation can be a fantastic tool for keeping your platform running seamlessly.

Recovering all your systems

Monitoring and responding are core to every vital system. When you architect a platform, you should think early in the design process about how you will know if something is wrong with that platform. There are many different kinds of monitoring that can be applied to many different facets of the system, and knowing which types to apply where can be the difference between success and failure.

Always ask yourself how you would diagnose issues with an application, how you would understand its health, what its choke points are, how you would identify them, and what you would do when something breaks. While thinking through these questions is important, it is very difficult to foresee every possible scenario.

This is why advanced organizations employ techniques like "chaos engineering" to intentionally cause breakage in their environments in a controlled manner. If you build a resilient system, it should be resilient, so why not terminate a random server? It may be hard to get accustomed to this idea, but it can provide insight that would otherwise be impossible to gain.
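As a rough illustration of the "terminate a random server" idea, the sketch below picks one instance at random from a candidate list and hands it to a terminate callback. The function names, the instance IDs, and the dry-run default are all assumptions made for this example; the callback stands in for whatever API call actually stops a server in your environment.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(instance_ids: Sequence[str],
                rng: Optional[random.Random] = None) -> str:
    """Choose one instance at random from the candidates."""
    if not instance_ids:
        raise ValueError("no candidate instances to terminate")
    rng = rng or random.Random()
    return rng.choice(list(instance_ids))


def run_experiment(instance_ids: Sequence[str],
                   terminate: Callable[[str], None],
                   dry_run: bool = True,
                   rng: Optional[random.Random] = None) -> str:
    """Pick a victim and terminate it, unless this is a dry run."""
    victim = pick_victim(instance_ids, rng)
    if dry_run:
        # Default to reporting only, so the experiment stays controlled.
        print(f"[dry run] would terminate {victim}")
    else:
        terminate(victim)
    return victim
```

Defaulting to a dry run keeps the breakage controlled: you can review which server would be chosen before letting the experiment act for real.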

Monitoring in AWS

AWS provides robust monitoring capabilities for its services. This is vital for understanding how your systems are performing not just at the moment, but also over time. CloudWatch Metrics tracks metrics on AWS services. Any metric that AWS makes available is presented via CloudWatch. You can also create your own metrics with a "custom metric". Taking a variety of related metrics and putting them on a CloudWatch Dashboard is an effective way to gain visibility into your system without spending a lot of time doing it. Investing in understanding what metrics are available on the services you use, what each metric means, and what your usage looks like is imperative to running a highly available platform.
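A custom metric might be published along the lines of the sketch below. It assumes the boto3 SDK is installed and credentials are configured for the `publish` step; the namespace, metric name, and dimension values are hypothetical. The builder is kept separate from the API call so the request can be inspected without touching AWS.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_custom_metric(namespace: str, name: str, value: float,
                        unit: str = "Count",
                        dimensions: Optional[Dict[str, str]] = None,
                        timestamp: Optional[datetime] = None) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_data call."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val}
            for key, val in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(params: dict) -> None:
    """Send the metric to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it
    boto3.client("cloudwatch").put_metric_data(**params)


# Hypothetical example: report the depth of an application work queue.
params = build_custom_metric("MyApp", "QueueDepth", 42,
                             dimensions={"Service": "worker"})
```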

Refer to the following for links to every AWS service that pushes metrics into CloudWatch, and what those metrics mean: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html

Alerting

Proper alerting will help you keep tabs on your systems and meet your SLAs. Alerting in ways that bring attention to important issues keeps everyone informed and prevents your customers from being the ones to inform you of problems. CloudWatch Alarms integrates with CloudWatch Metrics: any metric in CloudWatch can be used as the basis for an alarm. When an alarm changes state, it can notify an SNS topic, and from there you have a whole variety of options for distributing information, such as email, text message, Lambda invocation, or third-party integration.
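The link between a metric, an alarm, and an SNS topic can be sketched as follows. This assumes boto3 for the final call; the alarm name, instance ID, topic ARN, and the default period and evaluation settings are placeholders chosen for illustration, not recommended values.

```python
from typing import Dict, Optional


def build_alarm(alarm_name: str, namespace: str, metric_name: str,
                threshold: float, sns_topic_arn: str,
                dimensions: Optional[Dict[str, str]] = None,
                period_seconds: int = 300,
                evaluation_periods: int = 2) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
        "Statistic": "Average",
        "Period": period_seconds,                 # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,  # breaching periods needed
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],          # notified on ALARM state
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical example: alert when average CPU stays at or above 80%.
params = build_alarm(
    "web-cpu-high", "AWS/EC2", "CPUUtilization", 80,
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
    dimensions={"InstanceId": "i-0123456789abcdef0"},
)
```

Requiring more than one evaluation period is one way to avoid paging people for a single brief spike.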

Alerting when problems occur is critical, but alerting when problems are about to occur is far better. Understanding the design and architecture of your platform is key to being able to set thresholds correctly. You want to set your thresholds so that your systems are quiet when the load is within their capacity, but start speaking up when they head toward exceeding it. You will need to determine how much advance warning you need to fix issues.
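One simple way to reason about advance warning is to work backward from capacity: if a resource fills at a roughly known worst-case rate, the alarm threshold is the capacity minus the amount consumed during the lead time you need. The sketch below assumes linear growth, which is a simplification, and the numbers are made up for illustration.

```python
def warning_threshold(capacity: float, growth_per_minute: float,
                      lead_time_minutes: float) -> float:
    """Return the level at which to alarm so that, at the given growth
    rate, `lead_time_minutes` remain before capacity is reached."""
    if growth_per_minute < 0 or lead_time_minutes < 0:
        raise ValueError("growth rate and lead time must be non-negative")
    threshold = capacity - growth_per_minute * lead_time_minutes
    if threshold <= 0:
        raise ValueError("lead time cannot be met at this growth rate")
    return threshold


# Hypothetical example: a disk at 100% capacity that fills by 0.5% per
# minute at peak, with 30 minutes needed to respond.
print(warning_threshold(100, 0.5, 30))  # prints 85.0
```

If the function raises, no threshold gives you enough time at that growth rate, which suggests adding capacity or shortening the response time rather than tuning the alarm.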

Recovering From Failure

The key to recovering from failure is to understand how the failure occurred. Once you have this understanding, you can be sure that you've fixed the root cause, and you will know how to prevent a recurrence. Finding a root cause can be straightforward if there is a direct cause and effect (we changed A, and B immediately happened). Some issues are harder to identify, and some can only be identified by asking "what changed?".

CloudTrail is a great tool for determining what changed. It records the API calls made with any AWS credentials associated with your account, so you can audit and review the changes and commands that were run. Once you've discovered what was changed and who or what changed it, you can resolve the issue and ensure that the incident is not repeated.
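A "what changed?" search could look something like the sketch below. It assumes boto3 for the fetch step and uses the `EventTime` and `EventName` fields found on CloudTrail event records; the function names and the one-hour default window are choices made for this example. Only the first page of results is fetched, so a real investigation would also follow pagination.

```python
from datetime import datetime, timedelta
from typing import List


def fetch_write_events(start: datetime, end: datetime) -> List[dict]:
    """Fetch non-read-only events from CloudTrail (needs boto3 and
    AWS credentials). Returns only the first page of results."""
    import boto3  # third-party; imported lazily so the filter works without it
    client = boto3.client("cloudtrail")
    response = client.lookup_events(
        StartTime=start,
        EndTime=end,
        LookupAttributes=[
            {"AttributeKey": "ReadOnly", "AttributeValue": "false"},
        ],
    )
    return response["Events"]


def changes_before(events: List[dict], incident_time: datetime,
                   window: timedelta = timedelta(hours=1)) -> List[dict]:
    """Return events in the window leading up to the incident,
    most recent first, since recent changes are the likeliest suspects."""
    start = incident_time - window
    recent = [e for e in events if start <= e["EventTime"] <= incident_time]
    return sorted(recent, key=lambda e: e["EventTime"], reverse=True)
```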

What Changed?

Your application
A third party
Something expired:
- SSL certificate
- Licenses
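Expirations are easy to catch ahead of time if you check for them. As a minimal sketch, the function below computes the days left on a certificate from its `notAfter` string, in the format Python's `ssl` module reports (for example via `SSLSocket.getpeercert()`). The function name and the 14-day warning window are assumptions for this example.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter time.

    `not_after` uses the certificate time format handled by
    ssl.cert_time_to_seconds, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400  # seconds per day


def needs_renewal(not_after: str, warn_days: float = 14,
                  now: Optional[float] = None) -> bool:
    """True once the certificate is within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

The result could be published as a custom metric, so the same alarm machinery warns you before the certificate lapses.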

Automating Recovery

Automating service recovery and creating "self-healing" systems can take you to the next level of system architecture. Some solutions are quite simple. Using autoscaling within AWS, you can handle single instance/server failures without missing a beat. These solutions will automatically replace a failed server or will create or delete servers based on the demand at any given point in time.
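Scaling on demand is often expressed as a target-tracking policy on an Auto Scaling group. The sketch below builds the request for one such policy, assuming boto3 for the final call; the group name, policy name, and 50% CPU target are placeholders rather than recommendations.

```python
def build_cpu_target_policy(group_name: str,
                            target_cpu_percent: float = 50.0,
                            policy_name: str = "keep-cpu-near-target") -> dict:
    """Build keyword arguments for Auto Scaling's put_scaling_policy call."""
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target CPU must be in the range (0, 100]")
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": policy_name,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                # Average CPU across the instances in the group.
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


def apply_policy(params: dict) -> None:
    """Attach the policy to the group (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it
    boto3.client("autoscaling").put_scaling_policy(**params)


# Hypothetical example: keep a web tier near 50% average CPU.
params = build_cpu_target_policy("web-asg")
```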

Beyond the simple tasks, many types of failure can be automatically recovered from, but this can involve significant work. Many failure events can generate notifications, either directly from the service or via an alarm generated by CloudWatch. These events can have a Lambda function attached to them, and from there, you can do anything you need to in order to recover the system. Do be cautious with this type of automation, where you are, in essence, turning over some control of the platform to the platform itself. Just as with a business application, there can be defects. However, as with any software, proper and thorough testing can help ensure a high-quality product.
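A Lambda function subscribed to an alarm's SNS topic might be structured like the sketch below. It relies on the alarm notification arriving as a JSON string in the SNS message with `AlarmName` and `NewStateValue` fields; the alarm name, the `restart_worker` step, and the action registry are hypothetical. Passing the actions in as a parameter makes the handler straightforward to test before it is trusted with control of the platform.

```python
import json
from typing import Callable, Dict, Optional, Tuple


def parse_alarm(event: dict) -> Tuple[str, str]:
    """Extract the alarm name and new state from an SNS-delivered event."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["AlarmName"], message["NewStateValue"]


def restart_worker() -> None:
    """Hypothetical recovery step; replace with a real, well-tested action."""
    print("restarting worker service")


# Hypothetical mapping from alarm name to the recovery step that handles it.
RECOVERY_ACTIONS: Dict[str, Callable[[], None]] = {
    "worker-queue-stalled": restart_worker,
}


def handler(event: dict, context: object = None,
            actions: Optional[Dict[str, Callable[[], None]]] = None) -> str:
    """Lambda entry point: run the matching recovery step, if any."""
    actions = RECOVERY_ACTIONS if actions is None else actions
    name, state = parse_alarm(event)
    if state != "ALARM":
        # OK and INSUFFICIENT_DATA transitions need no recovery.
        return f"ignored {name}: state is {state}"
    action = actions.get(name)
    if action is None:
        # Unknown alarms are left for a human rather than guessed at.
        return f"no automated recovery for {name}"
    action()
    return f"recovery run for {name}"
```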

Edge Cases

Many applications and services lend themselves to being monitored and maintained. When you run into an application that does not, it is no less important (it is likely more important) to monitor, alert on, and maintain that application. You may find yourself needing to go to extremes in order to pull these systems into your monitoring framework, but if you do not, you are putting yourself at risk of letting faults go undetected. Ensuring coverage of all of the components of your platform, documenting the platform and training staff to understand it, and practicing what to do in the case of outages will help ensure the highest uptime for your company.

Lesson Recap

Monitoring
Alerting
Recovering
Automating

Lesson Objectives

You will be able to:

Monitor AWS applications
Alert on problems in applications
Recover failures in your platform
Understand testing and tradeoffs in automating recovery from failure

In this lesson, you learned how to monitor and maintain systems in AWS. You also looked at how to recover systems that have failed. The larger your application grows, the more parts and services it will have. The more complex it grows, the more things can go wrong. The more things that can go wrong, the more frequently they will go wrong. Expect failures, and plan to address and recover from them.

Glossary

SSL certificate: Cryptographic certificate for encrypting traffic between two computers.
Source of truth: When data is stored in multiple places or ways, the "source of truth" is the one that is used when there is a discrepancy between the multiple sources.
Monitoring: Systems to track and make visible metrics that are useful in identifying system performance.
Alerting: Systems to attract attention when performance thresholds are crossed.
Chaos Engineering: Intentionally causing issues in order to validate that a system can respond appropriately to problems.

Answer1215