Turning up Reliability on Storage Services

Future Solutions Architects,

In this lesson, we explore how to enhance reliability in AWS storage services. Although AWS inherently provides a high level of reliability, implementing best practices and configurations can further improve availability and resiliency. These enhancements are particularly important for the Associate-level exam and for designing robust architectures.

Below is an organized overview of storage services—covering block storage, file storage, and object storage—as well as backup and disaster recovery strategies.

Block Storage

AWS block storage centers on EBS volumes and their snapshots. Related services covered later in this lesson, such as EFS file systems, S3 backups, FSx for Lustre, and FSx for NetApp ONTAP, provide file and object storage rather than block storage. Across these services, AWS typically keeps at least three redundant copies of your data. For instance, EBS snapshots are stored in Amazon S3, and a Regional EFS file system stores data across multiple Availability Zones.
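To build intuition for why multiple copies matter, here is a small illustrative calculation. It assumes every copy fails independently with the same probability over some period. Both the independence assumption and the 1% figure are hypothetical teaching values, not numbers published by AWS.

```python
def probability_all_copies_lost(per_copy_failure: float, copies: int) -> float:
    """Chance that every copy is lost, assuming copies fail independently."""
    if not 0.0 <= per_copy_failure <= 1.0:
        raise ValueError("per_copy_failure must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return per_copy_failure ** copies


p = 0.01  # hypothetical per-copy failure probability, for illustration only
for n in (1, 2, 3):
    print(f"{n} copies: {probability_all_copies_lost(p, n):.6%} chance of total loss")
```

Under these assumptions, each extra copy multiplies the chance of total loss by the per-copy failure probability, which is why redundancy improves durability so quickly.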

The image is a diagram of an AWS cloud architecture, illustrating components like VPC, EBS, S3, and RDS across different regions and availability zones, with connections for backup and restore processes.

EBS Volumes

EBS volumes serve as durable, reliable hard drives attached to EC2 instances. To enhance data protection, schedule regular snapshots. Adjusting performance parameters, such as switching to gp3 or increasing provisioned IOPS, does not change the underlying redundancy, which AWS manages for you within the volume's Availability Zone. Snapshots enable fast recovery in the event of data corruption or accidental deletion, but they do not increase the intrinsic fault tolerance of the volume itself.
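A snapshot schedule usually comes with a retention rule. The sketch below shows one way to decide which snapshots to keep, assuming a simple "keep everything newer than N days, and always keep the newest one" rule. The function name and the rule itself are illustrative assumptions, not an AWS API.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_by_retention(
    snapshot_times: List[datetime], now: datetime, retain_days: int
) -> Tuple[List[datetime], List[datetime]]:
    """Return (keep, expire), newest first. The newest snapshot is always kept."""
    cutoff = now - timedelta(days=retain_days)
    ordered = sorted(snapshot_times, reverse=True)
    keep: List[datetime] = []
    expire: List[datetime] = []
    for index, taken_at in enumerate(ordered):
        # Keep the most recent snapshot even if it is older than the cutoff,
        # so there is always at least one recovery point.
        if index == 0 or taken_at >= cutoff:
            keep.append(taken_at)
        else:
            expire.append(taken_at)
    return keep, expire
```

In practice a managed scheduler would apply a rule like this for you; the point is that retention should never leave a volume without any recovery point.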

Instance Store

In contrast, instance store volumes are local storage on the EC2 instance's host and deliver excellent performance (up to 150,000–185,000 IOPS) via NVMe storage. However, they are ephemeral: data survives a reboot but is lost when the instance is stopped, hibernated, or terminated, or when the underlying hardware fails. Therefore, for scenarios where data durability is essential, always choose EBS over instance store.
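The difference between the two volume types can be summarized as a small lookup table. This is a simplified sketch: the event names are labels chosen for illustration, and the EBS row assumes the volume is configured to persist after instance termination (DeleteOnTermination turned off).

```python
# Whether data on each volume type survives a given EC2 lifecycle event.
# Simplified model: "host_failure" stands for failure of the underlying hardware.
DATA_SURVIVES = {
    "instance_store": {
        "reboot": True,
        "stop": False,
        "hibernate": False,
        "terminate": False,
        "host_failure": False,
    },
    # Assumes DeleteOnTermination is disabled for the EBS volume.
    "ebs": {
        "reboot": True,
        "stop": True,
        "hibernate": True,
        "terminate": True,
        "host_failure": True,
    },
}


def data_survives(volume_type: str, event: str) -> bool:
    """Look up whether data persists for this volume type and event."""
    return DATA_SURVIVES[volume_type][event]


def pick_volume_type(needs_durability: bool) -> str:
    """Choose EBS whenever the data must outlive the instance."""
    return "ebs" if needs_durability else "instance_store"
```

Instance store remains a good fit for caches, buffers, and scratch data that can be rebuilt from another source.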

The image presents a scenario about a gaming company considering EC2 instance store volumes for their application, with four statements evaluating the durability and persistence of data. The focus is on understanding the ephemeral nature of instance store volumes.

File and Network Storage

Amazon EFS

Amazon EFS (Elastic File System) is designed for inherent redundancy. By default it stores data redundantly across multiple Availability Zones (AZs) within a Region, keeping at least three copies. EFS can also be replicated across Regions if required.

The image is a diagram of an AWS cloud architecture showing two regions with VPCs, availability zones, and various services like EBS, S3, EFS, and RDS, illustrating backup and restore processes.

EFS offers configurable performance options and lifecycle management, such as Intelligent-Tiering, which moves infrequently accessed files to a lower-cost storage class, much like the equivalent feature in Amazon S3. These features improve performance and cost-efficiency but do not change the inherent fault tolerance. Choosing EFS One Zone reduces resiliency compared with the default multi-AZ setup, because data is kept in a single AZ.

EFS lifecycle management can automate data movement based on usage patterns:

The image is a flowchart illustrating the Amazon Elastic File System (EFS) lifecycle management process, including setting lifecycle policies, managing data movement, and optimizing storage costs with infrequent access and intelligent tiering.
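As a rough illustration of a lifecycle policy, the helper below builds the list of policies that, to my understanding, the boto3 EFS client accepts as the LifecyclePolicies argument of put_lifecycle_configuration. The helper name and its parameters are hypothetical, and nothing here calls AWS; it only assembles the data structure.

```python
from typing import Dict, List

# Day counts that map to TransitionToIA values such as AFTER_30_DAYS.
ALLOWED_IA_DAYS = (1, 7, 14, 30, 60, 90, 180, 270, 365)


def build_lifecycle_policies(
    days_until_ia: int, return_on_access: bool = True
) -> List[Dict[str, str]]:
    """Build EFS lifecycle policies: move cold files to IA, optionally move them back."""
    if days_until_ia not in ALLOWED_IA_DAYS:
        raise ValueError(f"days_until_ia must be one of {ALLOWED_IA_DAYS}")
    unit = "DAY" if days_until_ia == 1 else "DAYS"
    policies = [{"TransitionToIA": f"AFTER_{days_until_ia}_{unit}"}]
    if return_on_access:
        # Files return to the primary storage class the first time they are read.
        policies.append({"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"})
    return policies


print(build_lifecycle_policies(30))
```

A policy like this affects cost and access latency only. It does not change how many copies of the data EFS keeps.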

FSx Services

FSx for Windows File Server

This service integrates with Active Directory, and AWS manages replication of the underlying storage. The main reliability choice is the deployment type: a Multi-AZ file system adds a standby file server in a second AZ, while a Single-AZ file system stays within one AZ. Beyond that, reliability work is mostly a matter of enforcing strong security practices.

FSx for Lustre

FSx for Lustre is commonly employed in high-performance computing (HPC) simulations where both performance and reliability are critical. For enhanced backup and redundancy, it is frequently linked to Amazon S3. Apart from choosing between scratch and persistent deployment types, there are few redundancy settings to tune, because the service is designed with inherent reliability.

FSx for OpenZFS and FSx for NetApp ONTAP

These services offer advanced features such as snapshot volumes and clone volumes. While they do not add extra built-in reliability controls, proper configuration can significantly improve backup speed and data availability. Reliability is largely determined by how the features are used rather than by enabling an extra reliability option.

FSx for NetApp ONTAP

Similar to FSx for OpenZFS, FSx for NetApp ONTAP is built with resiliency in mind. Although features like deduplication, compression, and encryption improve efficiency and indirectly support reliability, the core reliability mechanisms rely on AWS’s default three-copy storage design.

Object Storage: Amazon S3

Amazon S3 delivers simple and reliable storage with a built-in design that maintains at least three copies of your data (excluding One Zone storage classes such as S3 One Zone-IA, which store data in a single AZ). Standard security measures include private buckets, server-side encryption, IAM controls, bucket policies, and ACLs for access management.

The image presents a scenario where a startup plans to use Amazon S3 for storing user-uploaded images securely, with four options for managing access and security. Each option suggests different configurations for S3 buckets, encryption, and access control.

Understanding how S3 evaluates permissions is crucial. Measures like Block Public Access and logging are essential for security, but they do not increase the service’s inherent reliability. Amazon S3 automatically provides high availability through its redundant storage design.

The image presents a question about the order in which Amazon S3 evaluates permissions, with four multiple-choice options detailing different sequences of IAM policies, bucket policies, ACLs, and public access firewall.
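The core rule behind permission evaluation can be sketched in a few lines: an explicit deny always wins, an allow is needed for access, and the default is an implicit deny. This is a deliberately simplified model. It treats every applicable statement as a plain string and ignores cross-account rules, Block Public Access, and other policy types; the function name is an illustrative assumption.

```python
from typing import Iterable


def evaluate(decisions: Iterable[str]) -> str:
    """Combine 'allow', 'deny', or 'none' results from all applicable policies."""
    found_allow = False
    for decision in decisions:
        if decision == "deny":
            return "DENY"  # an explicit deny overrides any allow
        if decision == "allow":
            found_allow = True
    # Without at least one allow, the request falls through to implicit deny.
    return "ALLOW" if found_allow else "DENY"


print(evaluate(["allow", "none"]))
print(evaluate(["allow", "deny"]))
```

Because a single explicit deny is final, the order in which the individual policies are inspected does not change the outcome in this model.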

When replicating data between buckets, for example from a source bucket to a secondary bucket kept for compliance or disaster recovery, remember that replication requires versioning to be enabled on both buckets. S3’s automatic redundancy within a Region typically removes the need for further reliability adjustments, so replication is mainly a tool for compliance and cross-Region recovery.
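The sketch below assembles a replication configuration in the shape that, as far as I know, boto3's put_bucket_replication expects for its ReplicationConfiguration argument. It first checks that versioning is enabled on both buckets, which S3 requires for replication. The helper name, rule ID, role ARN, and bucket names are hypothetical placeholders, and the code does not call AWS.

```python
from typing import Any, Dict


def build_replication_config(
    role_arn: str,
    destination_bucket: str,
    source_versioning: str,
    destination_versioning: str,
) -> Dict[str, Any]:
    """Build a single-rule replication configuration for an entire bucket."""
    if source_versioning != "Enabled" or destination_versioning != "Enabled":
        raise ValueError("versioning must be Enabled on both buckets")
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-all-objects",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = build_replication_config(
    "arn:aws:iam::111122223333:role/example-replication-role",  # placeholder ARN
    "example-dr-bucket",  # placeholder bucket name
    "Enabled",
    "Enabled",
)
print(config["Rules"][0]["Destination"]["Bucket"])
```

Validating versioning up front turns a confusing API error into a clear message before any request is sent.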

Backup, Disaster Recovery, and Redundancy

Ensuring data reliability extends beyond primary storage configurations. Robust backup and disaster recovery strategies are essential for restoring services in case of data corruption or accidental deletion. Using EBS snapshots, EFS backups, or FSx backups, you can safeguard your data by storing backups in encrypted vaults. Often, these backups are also copied to another Region or account to further mitigate data loss.

The image is a diagram illustrating a cross-account backup process in AWS, showing the transfer of encrypted Amazon EBS snapshots from a source account to a destination account using backup vaults and customer-managed keys.

A typical backup scenario might involve:

Storing critical backups with encryption via a customer-managed key (CMK).
Applying controls such as AWS Backup Vault Lock that prevent inadvertent deletion or alteration.
Using a VPC interface endpoint for secure private data transfers during backup replication and recovery.
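A backup plan for the scenario above could be sketched as follows. The dictionary follows the shape that, to my understanding, the boto3 AWS Backup client accepts as the BackupPlan argument of create_backup_plan. The helper name, rule name, schedule, retention period, and ARN are illustrative assumptions, and the code only builds the structure without calling AWS.

```python
from typing import Any, Dict


def build_backup_plan(
    plan_name: str, vault_name: str, copy_vault_arn: str, retain_days: int = 35
) -> Dict[str, Any]:
    """Build a daily backup rule that also copies each backup to a second vault."""
    if retain_days < 1:
        raise ValueError("retain_days must be at least 1")
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-backup",  # hypothetical rule name
                "TargetBackupVaultName": vault_name,
                "ScheduleExpression": "cron(0 5 ? * * *)",  # 05:00 UTC every day
                "Lifecycle": {"DeleteAfterDays": retain_days},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": copy_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retain_days},
                    }
                ],
            }
        ],
    }


plan = build_backup_plan(
    "critical-data",
    "primary-vault",
    "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",  # placeholder ARN
)
print(plan["Rules"][0]["ScheduleExpression"])
```

Note that the customer-managed key is associated with the backup vault rather than with the plan, so the encryption choice is made when the vault is created.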

The image outlines steps for a financial institution to ensure AWS Backup data is encrypted using a customer-managed key (CMK) in AWS KMS, including enabling default encryption, selecting the CMK during backup creation, modifying IAM policies, and encrypting data at the source.

The image presents a scenario where a media company needs to ensure AWS backups are not deleted or altered due to a legal notice, and it lists four AWS Backup features as potential solutions.

For secure, private data transfers, consider using a VPC interface endpoint when communicating with AWS Backup, so that backup traffic stays on the AWS network instead of crossing the public internet.

Elastic Disaster Recovery (EDR)

Elastic Disaster Recovery (EDR), which AWS documentation abbreviates as AWS DRS, supports rapid application recovery during outages. By replicating on-premises data to a staging area in AWS, EDR enables you to launch EC2 instances quickly in the event of an emergency. The EDR console offers insights into job execution and instance details, which helps you verify your application's resiliency.

The image presents a scenario about a corporation planning to use AWS Elastic Disaster Recovery and asks which statement about the AWS Elastic Disaster Recovery Console is true, offering four options.

The image is a diagram illustrating a data replication and recovery architecture using AWS services, including components like AWS Replication Agent, EC2, S3, and EBS volumes, with data flow and network protocols indicated.

Remember, in failover and failback procedures, “failback” refers to reverting to the primary environment after a recovery event or test. EDR’s capabilities help keep your applications available during outages, provided replication is healthy and the recovery procedure has been tested.
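The failover and failback vocabulary maps onto a very small state machine. The sketch below is purely conceptual: the class and method names are made up for illustration and do not correspond to any Elastic Disaster Recovery API.

```python
from enum import Enum


class Site(Enum):
    PRIMARY = "primary"
    RECOVERY = "recovery"


class RecoveryState:
    """Tracks which environment is currently serving the application."""

    def __init__(self) -> None:
        self.active = Site.PRIMARY

    def fail_over(self) -> None:
        """Switch to the recovery environment after an outage or for a drill."""
        if self.active is Site.RECOVERY:
            raise RuntimeError("already running in the recovery environment")
        self.active = Site.RECOVERY

    def fail_back(self) -> None:
        """Return to the primary environment once it is healthy again."""
        if self.active is Site.PRIMARY:
            raise RuntimeError("already running in the primary environment")
        self.active = Site.PRIMARY


state = RecoveryState()
state.fail_over()
state.fail_back()
print(state.active.value)
```

Modeling the two transitions explicitly is a reminder that failback is its own procedure and deserves the same rehearsal as failover.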

Storage Gateway

Storage Gateway bridges on-premises data centers with AWS storage solutions. It is available as a volume gateway, tape gateway, or file gateway, each with its own considerations:

A storage gateway appliance is deployed in your corporate data center and connects with AWS over the internet or via a private link.
The appliance itself represents a single point of failure. Although you can restart the virtual machine, the attached disks must be properly managed to avoid data loss.
For file and tape gateways, the underlying AWS storage (typically S3 or Glacier) ensures redundancy, even if the local gateway experiences issues.

The image is a network diagram illustrating the integration of a corporate data center with AWS Cloud services, showing data and control flows between an application server, storage gateway, and Amazon S3 via AWS Direct Connect and Site-to-Site VPN.

The image is a diagram illustrating the integration of a corporate data center with AWS Cloud services, showing data and control flow between an application server, storage gateway appliance, and Amazon S3 via AWS Direct Connect and Site-to-Site VPN.

For FSx File Gateway and Tape Gateway deployments, ensure that you have a well-established recovery procedure. Running multiple instances of the gateway does not automatically provide active-active redundancy because these gateways typically attach to a single disk or storage volume at any given time.

The image illustrates a Tape Gateway architecture, showing how data from data centers is backed up to virtual tapes in Amazon S3 and archived in Amazon Glacier. It includes components like Gateway VMs, tape drives, and media changers, with connections to AWS for storage and retrieval.

The image illustrates two AWS Volume Gateway configurations: Stored Mode, where the entire dataset is stored both on-premises and in AWS, and Cache Mode, where frequently accessed data is stored on-premises with the rest in AWS.
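The difference between the two Volume Gateway modes comes down to where a read is served from. The following sketch models only that decision, based on the description above; the function name and the string labels are illustrative assumptions rather than Storage Gateway settings.

```python
def read_location(mode: str, cached_locally: bool) -> str:
    """Return where a block is read from for a given Volume Gateway mode."""
    if mode == "stored":
        # Stored mode keeps the entire dataset on-premises.
        return "on-premises"
    if mode == "cached":
        # Cached mode keeps only frequently accessed data on-premises.
        return "on-premises" if cached_locally else "aws"
    raise ValueError("mode must be 'stored' or 'cached'")


print(read_location("stored", cached_locally=False))
print(read_location("cached", cached_locally=False))
```

The trade-off follows directly: stored mode favors local read latency for all data, while cached mode reduces the on-premises storage footprint at the cost of fetching cold data from AWS.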

Final Thoughts

In summary, AWS storage services are engineered to be highly redundant and reliable by default—with at least three copies of your data in most cases. However, understanding the differences between services such as EBS versus instance store and the significance of automated backups and disaster recovery is crucial for building resilient architectures. Backup strategies and Elastic Disaster Recovery (EDR) are essential for ensuring data restoration and maintaining service availability in the event of failures.

As you design your systems, keep in mind:

AWS employs multiple copies of your data to ensure durability, similar to how scaling your application across multiple EC2 instances maintains performance and availability.
Proper backup and data recovery plans are vital parts of your strategy.

I'm Michael Forrester. Thank you for reviewing this lesson. In upcoming lessons, we will delve into compute services and related concepts.

Note

For more detailed information on AWS storage solutions, consider visiting the AWS Documentation.
