AWS Solutions Architect Associate Certification
Designing for Reliability
Turning up Reliability on Compute Services Part 3
In this article, we shift our focus from container-based compute services to serverless computing with AWS Lambda. AWS Lambda is a fully managed serverless platform where you don't need to manage the underlying servers. Instead, Lambda automatically runs your code in response to specific triggers. For example, if your function receives 1,000 requests at the same moment, Lambda can run up to 1,000 independent execution environments concurrently, subject to your account's concurrency limit. Each execution environment is isolated, so a failure in one invocation does not affect the others. This isolation, coupled with built-in features like retries, dead-letter queues, and log monitoring, significantly enhances the overall reliability of your serverless applications.
AWS Lambda Performance and Resiliency
One common challenge with Lambda is managing cold starts. A cold start happens when Lambda has to initialize a new execution environment from scratch before it can run your function, which adds latency to that invocation. To address this, AWS provides a feature called provisioned concurrency. With provisioned concurrency, you keep a set number of execution environments initialized and “hot,” so requests served by them avoid cold-start latency.
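As a minimal sketch of how provisioned concurrency might be configured programmatically, the helper below builds the parameters for Lambda's PutProvisionedConcurrencyConfig API, which boto3 exposes as put_provisioned_concurrency_config. The function name "orders-api" and alias "live" in the usage note are hypothetical, and the client is passed in so the sketch does not assume any particular credentials setup.

```python
def provisioned_concurrency_request(function_name: str, alias: str, instances: int) -> dict:
    """Build keyword arguments for Lambda's PutProvisionedConcurrencyConfig API."""
    if instances < 1:
        raise ValueError("instances must be a positive integer")
    if alias == "$LATEST":
        # Provisioned concurrency applies to a published version or alias, not $LATEST.
        raise ValueError("use a published version or alias, not $LATEST")
    return {
        "FunctionName": function_name,
        "Qualifier": alias,
        "ProvisionedConcurrentExecutions": instances,
    }


def apply_provisioned_concurrency(lambda_client, function_name: str, alias: str, instances: int):
    """Send the request through any client that exposes put_provisioned_concurrency_config."""
    request = provisioned_concurrency_request(function_name, alias, instances)
    return lambda_client.put_provisioned_concurrency_config(**request)
```

With boto3 installed, a call might look like apply_provisioned_concurrency(boto3.client("lambda"), "orders-api", "live", 10). Separating the request builder from the API call keeps the validation testable without AWS access.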
Lambda Resiliency Note
AWS Lambda manages many aspects of scaling, fault tolerance, and performance under the hood. However, you can also customize retry policies and configure dead-letter queues to handle function failures.
Security permissions are equally important. Incorrect IAM permissions can prevent a function from executing as expected. For asynchronous invocations, Lambda retries failed events automatically and can route events that still fail to a dead-letter queue for further analysis or reprocessing. You can adjust the maximum number of retry attempts (0 to 2) and the maximum age of an event; the delay between attempts is managed by Lambda rather than configured per function.
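The sketch below shows one way the retry and dead-letter settings could be expressed for asynchronous invocations. It builds parameters for two Lambda APIs that boto3 exposes as put_function_event_invoke_config and update_function_configuration; the helper names are hypothetical, and the range checks reflect the limits Lambda documents for these fields (0 to 2 retry attempts, 60 to 21,600 seconds of event age).

```python
def async_retry_settings(function_name, max_retries, max_event_age_seconds,
                         failure_destination_arn=None):
    """Build kwargs for Lambda's PutFunctionEventInvokeConfig API."""
    if not 0 <= max_retries <= 2:
        raise ValueError("MaximumRetryAttempts must be 0, 1, or 2")
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    request = {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
    }
    if failure_destination_arn:
        # Optional on-failure destination (for example an SQS queue or SNS topic).
        request["DestinationConfig"] = {
            "OnFailure": {"Destination": failure_destination_arn}
        }
    return request


def dead_letter_settings(function_name, target_arn):
    """Build kwargs for UpdateFunctionConfiguration that attach a dead-letter target."""
    return {
        "FunctionName": function_name,
        "DeadLetterConfig": {"TargetArn": target_arn},
    }
```

Assuming a boto3 Lambda client named client, the results would be passed as client.put_function_event_invoke_config(**async_retry_settings(...)) and client.update_function_configuration(**dead_letter_settings(...)).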
Monitoring and tracing are integrated into AWS Lambda. Standard tools include CloudWatch for logs and metrics, CloudTrail for API activity monitoring, and AWS Config for tracking configuration changes. AWS X-Ray further enhances visibility by providing in-depth analysis of your Lambda functions and their interactions with other services.
When troubleshooting performance issues or resource limitations, CloudWatch metrics can help pinpoint throttling events or resource overuse. By default, an AWS account typically has a limit of 1,000 concurrent Lambda executions per Region, although this limit can be raised through the Service Quotas console.
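To reason about whether a workload is likely to hit the concurrency limit, a common rule of thumb is that required concurrency is roughly the request rate multiplied by the average duration of each invocation. The sketch below applies that estimate; the function names are hypothetical, and the default limit of 1,000 is the typical account default mentioned above, not a value read from your account.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Estimate concurrent executions as request rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def would_throttle(requests_per_second: float, avg_duration_seconds: float,
                   concurrency_limit: int = 1000) -> bool:
    """Return True when the estimate exceeds the assumed concurrency limit."""
    return required_concurrency(requests_per_second, avg_duration_seconds) > concurrency_limit


# 400 requests per second at 0.25 s each needs about 100 concurrent executions.
print(required_concurrency(400, 0.25))  # 100
# 2,500 requests per second at 0.5 s each needs about 1,250, above a 1,000 limit.
print(would_throttle(2500, 0.5))  # True
```

This is only an average-case estimate; bursts above the average can still cause throttling, which is what the CloudWatch throttle metrics would reveal.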
AWS Step Functions: Orchestrating Serverless Workflows
AWS Step Functions offers a robust method to coordinate multiple AWS services into complex, serverless workflows. When processing large datasets in parallel, the failure of a single task should not break the entire workflow. Because a Parallel state fails as soon as any of its branches fails with an unhandled error, the recommended approach is to add catchers to the tasks inside the Parallel state. These catchers capture errors from individual tasks, allowing the state machine to continue execution by rerouting to an error-handling flow.
Transient errors are usually best handled first with a Retry field. For non-transient errors, it is essential to incorporate a Catch field into your state definitions. This practice ensures that even when exceptions occur, the state machine can gracefully transition to an alternate execution path.
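The following sketch builds an Amazon States Language definition as a Python dictionary to illustrate the idea. Each task inside the Parallel state carries its own Retry and Catch fields, so a failing task falls through to a fallback state within its own branch instead of failing the whole Parallel state. The state names and Lambda ARNs are hypothetical placeholders.

```python
import json


def guarded_task(name, resource_arn, fallback):
    """Return a Task state with Retry and Catch, plus the Pass state it falls back to."""
    return {
        name: {
            "Type": "Task",
            "Resource": resource_arn,
            # Retry timeouts a few times with exponential backoff.
            "Retry": [{
                "ErrorEquals": ["States.Timeout"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # Catch anything left over and continue on the fallback path.
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": fallback,
            }],
            "End": True,
        },
        fallback: {"Type": "Pass", "Result": {"status": "skipped"}, "End": True},
    }


def parallel_state(branch_specs, next_state):
    """Build a Parallel state with one guarded task per (name, arn) pair."""
    branches = [
        {"StartAt": name, "States": guarded_task(name, arn, f"{name}Failed")}
        for name, arn in branch_specs
    ]
    return {"Type": "Parallel", "Branches": branches, "Next": next_state}


definition = {
    "StartAt": "ProcessChunks",
    "States": {
        "ProcessChunks": parallel_state(
            [
                ("ChunkA", "arn:aws:lambda:us-east-1:123456789012:function:process-chunk-a"),
                ("ChunkB", "arn:aws:lambda:us-east-1:123456789012:function:process-chunk-b"),
            ],
            "Done",
        ),
        "Done": {"Type": "Succeed"},
    },
}

print(json.dumps(definition, indent=2))
```

The printed JSON could serve as the definition of a state machine. In a real workflow the fallback would more likely record the failure or notify an operator than simply pass through.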
The Serverless Application Model (SAM)
The Serverless Application Model (SAM) is a framework that simplifies the development and deployment of serverless applications. SAM abstracts the underlying infrastructure, with AWS handling many resiliency features. For example, to ensure high availability and quick redeployment across regions, you can define a SAM template and use AWS CloudFormation StackSets.
For enhanced resiliency, include configurations such as dead-letter queues in your SAM template. SAM also supports semantic versioning, which is beneficial for managing deployment rollbacks or upgrades in case of issues with new versions.
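As a rough illustration, the sketch below generates a small SAM template with a dead-letter queue attached to a function. It is written as a Python dictionary serialized to JSON, which CloudFormation accepts alongside YAML. The logical ID, code path, handler, runtime, and queue ARN are all hypothetical values chosen for the example.

```python
import json


def sam_template(function_id, code_uri, handler, dlq_arn):
    """Return a minimal SAM template with one function and an SQS dead-letter queue."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Transform": "AWS::Serverless-2016-10-31",
        "Resources": {
            function_id: {
                "Type": "AWS::Serverless::Function",
                "Properties": {
                    "CodeUri": code_uri,
                    "Handler": handler,
                    "Runtime": "python3.12",
                    # Failed asynchronous events are sent to this queue.
                    "DeadLetterQueue": {"Type": "SQS", "TargetArn": dlq_arn},
                },
            }
        },
    }


template = sam_template(
    "OrdersFunction",
    "src/",
    "app.handler",
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
)
print(json.dumps(template, indent=2))
```

Generating the template from code is only one option; most projects keep a hand-written template file in source control and the same properties apply there.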
AWS Serverless Application Repository (SAR)
AWS Serverless Application Repository (SAR) provides prebuilt Lambda templates and functions, which can be a great starting point for your serverless projects. While SAR itself has limited options for configuring resiliency directly, you can improve reliability by leveraging semantic versioning. This approach enables clear version control and straightforward rollbacks or forward rollouts in case of deployment issues.
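To make the rollback idea concrete, here is a small sketch that picks a rollback target from a list of published semantic versions. The helper names are hypothetical, and the parser assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build suffixes.

```python
def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers for numeric comparison."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def rollback_target(published, current):
    """Return the highest published version below the current one, or None."""
    older = [v for v in published if parse_version(v) < parse_version(current)]
    return max(older, key=parse_version) if older else None


versions = ["1.0.0", "1.2.0", "1.10.0", "2.0.0"]
print(rollback_target(versions, "2.0.0"))  # 1.10.0
```

Comparing versions as integer tuples matters here: a plain string comparison would rank 1.2.0 above 1.10.0 and pick the wrong rollback target.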
AWS Amplify
AWS Amplify streamlines the deployment and hosting of web applications, particularly those built on serverless architectures. Amplify integrates with a variety of AWS services, including API Gateway, Lambda, and database services, and features tools like GraphQL transform to manage API versioning and maintain backward compatibility.
Amplify automatically integrates with CloudWatch to provide detailed performance insights through custom metrics, while also supporting AWS CloudTrail and AWS Config for enhanced security and operational monitoring.
Hybrid Computing – AWS Outposts
Hybrid computing solutions bring the power of AWS services into your on-premises data centers with AWS Outposts. Outposts extends your Amazon VPC into your data center, connecting via AWS Direct Connect for a reliable private link between your on-premises network and AWS cloud services.
A reference architecture for AWS Outposts includes an Amazon VPC, an Outpost subnet, and a service anchor that routes traffic over a Direct Connect line to a customer edge router. Local infrastructure components, such as dedicated VLANs and local gateways, integrate AWS services like RDS and EC2. Although Outposts benefit from traditional data center resiliency—such as redundancy across racks and locations—many resiliency settings are determined by your local infrastructure design.
Another diagram demonstrates an Outposts deployment supporting EKS with dual subnets: one for the management/control plane and another for data (including ALBs, volumes, and EC2 instances). While AWS provides robust connectivity and integration, the ultimate responsibility for ensuring a resilient data center lies with you.
For regulatory compliance or low-latency needs, Outposts can also serve as primary or secondary failover sites, keeping critical data on-premises while maintaining connectivity with AWS cloud services.
ECS and EKS Anywhere & VMware Cloud
AWS offers container orchestration solutions that extend beyond the cloud. ECS and EKS Anywhere enable you to run containerized applications on-premises while leveraging AWS management and control planes. In these cases, resiliency hinges on the robustness of your local data center architecture, including redundancy, security, and performance measures.
Similarly, VMware Cloud on AWS enables customers to integrate their existing on-premises vCenter environments with AWS resources. Here, resiliency strategies focus on ensuring operational consistency between on-premises VMware environments and the AWS cloud.
The Snow Family
Within the Snow family, devices are engineered for specific computing and data transfer use cases. For example, Snowmobile is designed primarily for large-scale storage and transfer, while Snowcone offers a compact, portable solution with built-in compute capabilities. Snowball devices incorporate resiliency through features such as RAID and error correction, with settings preconfigured by AWS to ensure security and durability.
For scenarios that require reliable data collection and transfer in remote or connectivity-constrained environments, Snowcone is an ideal choice—provided your dataset fits within its storage limitations.
Summary
This module explored the resiliency considerations across a diverse range of compute services including:
- AWS Lambda: Achieves reliability through automatic scaling, provisioned concurrency, retries, and dead-letter queues.
- AWS Step Functions: Enhances workflow resiliency by incorporating catchers in parallel states to handle individual task failures.
- Serverless Frameworks (SAM and SAR): Utilize CloudFormation StackSets and semantic versioning to support high availability and straightforward rollbacks.
- AWS Amplify: Simplifies web application deployment and integrates seamlessly with AWS monitoring tools to maintain backend API resiliency.
- Hybrid Solutions (AWS Outposts, ECS/EKS Anywhere, VMware Cloud): Require robust local data center architectures while leveraging AWS connectivity.
- Edge Devices (Snow Family): Are preconfigured for resilient operations in data transfer and remote computing use cases.
Implementing redundancy through load balancing and autoscaling, coupled with effective monitoring, retry mechanisms, and data replication strategies, is key to developing resilient applications that withstand failures and minimize downtime.
Up next, we'll explore database resiliency.
For additional resources, please refer to:
- AWS Lambda Developer Guide
- AWS Step Functions Documentation
- Serverless Application Model (SAM)
- AWS Outposts
- VMware Cloud on AWS
Happy architecting!