Turning up Reliability on Network Services

Welcome to this detailed guide on designing reliable network services in AWS. This article covers core concepts, best practices, and architectural decisions to enhance the resilience of your network infrastructure. Throughout the guide, you will find technical diagrams that illustrate each concept clearly.

Core Networking and VPC Architecture

When discussing core networking, we primarily refer to Virtual Private Clouds (VPCs). In a traditional VPC setup, you typically see the following components:

The AWS account, region, and the VPC itself.
An internet gateway and, for IPv6, an egress-only internet gateway.
Public and private subnets within a specific Availability Zone.
A load balancer spanning public subnets.
A NAT gateway performing outbound internet traffic translation.
DNS resolvers and route tables assigned to each subnet.
Security groups safeguarding EC2 instances in private subnets.
VPC endpoints for services (gateway endpoints for S3/DynamoDB and interface endpoints for other services).

Additionally, some AWS services, such as Route 53 and Simple Storage Service (S3), run on AWS-managed infrastructure outside your VPC (S3 is a regional service, Route 53 a global one), which keeps them isolated from your EC2 subnets.

The image is a diagram of an AWS Virtual Private Cloud (VPC) setup, showing components like public and private subnets, internet and egress-only gateways, NAT gateways, and VPC endpoints within a cloud account.

Connecting Departmental VPCs

Consider a scenario where departmental VPCs need to be interconnected with a shared VPC while maintaining isolation between departments. The following options are commonly considered:

AWS Transit Gateway: Offers a highly scalable solution for connecting multiple VPCs.
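VPC Peering: Suitable for simpler setups with four or five VPCs, but less scalable because peering is not transitive and every pair of VPCs that must communicate needs its own peering connection.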
Direct Connect and VPN connections: Generally used to connect back to your corporate network rather than interconnecting VPCs directly.

The image presents a scenario where an enterprise is expanding its AWS infrastructure using a multi-VPC architecture, and it lists four solutions for connecting departmental VPCs with a shared service VPC. The options include using AWS Transit Gateway, VPC Peering, AWS Direct Connect, and VPN connections.

For multi-VPC environments, AWS Transit Gateway is the recommended solution. In simpler architectures, VPC Peering might be acceptable.
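The scaling difference can be put into numbers. The sketch below compares the peering connections needed for a full mesh of n VPCs, n(n-1)/2, with the single attachment per VPC that a Transit Gateway needs. The function names are illustrative helpers, not AWS APIs.

```python
def full_mesh_peering_connections(vpc_count: int) -> int:
    """Peering connections needed so that every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count: int) -> int:
    """Attachments needed when every VPC connects to one Transit Gateway."""
    return vpc_count


# Print: VPC count, full-mesh peering connections, Transit Gateway attachments
for n in (3, 5, 10, 50):
    print(n, full_mesh_peering_connections(n), transit_gateway_attachments(n))
```

In the departmental scenario above, only the department-to-shared links are required, so the peering count is lower than a full mesh. The comparison mainly shows how quickly peering grows once more VPCs need to talk to each other.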

Enhancing VPC Reliability and High Availability

To build a highly reliable network, deploy applications across multiple Availability Zones within your VPC. Key considerations include:

Hosting workloads in private subnets.
Positioning load balancers in public subnets for secure, internet-facing access.
Connecting several VPCs to a central transit VPC through VPN gateways, in a hub-and-spoke model, for resiliency.

The image depicts a network architecture with multiple VPCs connected to a central Transit VPC via VPN Gateways. Each VPC is linked to the Transit VPC, illustrating a hub-and-spoke model.

Note that while default VPCs are convenient for testing, they are not recommended for production environments. Creating a custom VPC with configurations that prevent single points of failure is best practice. Spreading resources across multiple subnets ensures effective failover; for example, if a load balancer spans two public subnets and one goes down, the other can continue to handle traffic.
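A simple way to reason about single points of failure is to check how many Availability Zones each tier of subnets covers. The following is a minimal sketch over a hypothetical subnet inventory; the subnet IDs, tier labels, and dictionary layout are assumptions made for illustration and do not come from an AWS API.

```python
from collections import defaultdict


def availability_zones_by_tier(subnets):
    """Group subnets by tier and collect the Availability Zones each tier covers."""
    zones = defaultdict(set)
    for subnet in subnets:
        zones[subnet["tier"]].add(subnet["az"])
    return dict(zones)


def single_az_tiers(subnets, minimum_zones=2):
    """Return the tiers that span fewer than `minimum_zones` Availability Zones."""
    return sorted(
        tier
        for tier, zones in availability_zones_by_tier(subnets).items()
        if len(zones) < minimum_zones
    )


# Hypothetical layout: the public tier spans two zones, the private tier only one.
subnets = [
    {"id": "subnet-public-a", "tier": "public", "az": "us-east-1a"},
    {"id": "subnet-public-b", "tier": "public", "az": "us-east-1b"},
    {"id": "subnet-private-a", "tier": "private", "az": "us-east-1a"},
]
print(single_az_tiers(subnets))  # ['private']
```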

Routing and Traffic Segmentation

Routing is critical for maintaining network reliability. Each subnet is associated with a route table (the VPC's main route table unless you associate a custom one), so giving each tier its own route table lets you control traffic flow effectively. Consider a multi-tier application where:

The web tier requires access only to the Internet Gateway.
The application tier communicates with both the web tier and the database tier.
The database tier remains isolated from direct internet access to enhance security.

The image presents a scenario about a multinational corporation deploying a multi-tier web application on AWS, with a question on managing traffic flow between tiers and the internet. It offers four options for configuring routing tables to ensure secure and functional connectivity.

This segmented approach minimizes vulnerabilities and reduces dependency on a single routing table.
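The per-tier route tables described above can be modeled in a few lines. This is a sketch under stated assumptions: the CIDR ranges, tier names, and target labels are hypothetical, and the lookup simply picks the most specific matching route, which is how a route table chooses between overlapping entries.

```python
import ipaddress

# Hypothetical route tables, one per tier. Targets are plain labels, not resource IDs.
ROUTE_TABLES = {
    "web": {"10.0.0.0/16": "local", "0.0.0.0/0": "internet-gateway"},
    "app": {"10.0.0.0/16": "local"},
    "db": {"10.0.0.0/16": "local"},
}


def resolve_route(tier, destination_ip):
    """Return the target of the most specific matching route, or None if none matches."""
    address = ipaddress.ip_address(destination_ip)
    best_prefix, best_target = -1, None
    for cidr, target in ROUTE_TABLES[tier].items():
        network = ipaddress.ip_network(cidr)
        if address in network and network.prefixlen > best_prefix:
            best_prefix, best_target = network.prefixlen, target
    return best_target


print(resolve_route("web", "203.0.113.10"))  # internet-gateway
print(resolve_route("db", "203.0.113.10"))   # None
print(resolve_route("app", "10.0.2.15"))     # local
```

Because the local route covers the whole VPC, route tables alone do not stop one tier from reaching another. Restricting tier-to-tier traffic is the job of security groups and network ACLs.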

Internet Gateways, NAT Gateways, and Egress-Only Gateways

AWS handles critical network components with built-in redundancy:

Internet Gateway: Although it appears as a single device, it is a horizontally scaled, redundant, and highly available component that AWS manages transparently.
NAT Gateway: Recommended for instances in private subnets to handle outbound traffic. A NAT gateway is redundant only within its own Availability Zone, so deploy one in each Availability Zone and route each private subnet to the gateway in its zone.
Egress-Only Internet Gateway: Enables outbound IPv6 connectivity while blocking connections initiated from the internet.

When exam scenarios require outbound IPv6 access without inbound connectivity, an egress-only Internet gateway is the ideal choice.
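Returning to the NAT gateway recommendation, the sketch below checks whether every private subnet has a NAT gateway in its own Availability Zone. The subnet and gateway records are hypothetical dictionaries invented for this example, not the output of any AWS API.

```python
def nat_route_plan(private_subnets, nat_gateways):
    """Map each private subnet to a NAT gateway in the same Availability Zone.

    Returns (plan, uncovered): plan maps subnet ID to NAT gateway ID, and
    uncovered lists subnets whose zone has no NAT gateway.
    """
    nat_by_az = {nat["az"]: nat["id"] for nat in nat_gateways}
    plan, uncovered = {}, []
    for subnet in private_subnets:
        nat_id = nat_by_az.get(subnet["az"])
        if nat_id is None:
            uncovered.append(subnet["id"])
        else:
            plan[subnet["id"]] = nat_id
    return plan, uncovered


# Hypothetical inventory: two private subnets, but a NAT gateway in only one zone.
private_subnets = [
    {"id": "subnet-private-a", "az": "us-east-1a"},
    {"id": "subnet-private-b", "az": "us-east-1b"},
]
nat_gateways = [{"id": "nat-a", "az": "us-east-1a"}]

plan, uncovered = nat_route_plan(private_subnets, nat_gateways)
print(plan)       # {'subnet-private-a': 'nat-a'}
print(uncovered)  # ['subnet-private-b']
```

A subnet in the uncovered list would have to send its outbound traffic through a gateway in another zone, and would lose internet access if that zone failed.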

Managing IP Addresses and Elastic IPs

Effective IP address management is essential for network reliability. Best practices include:

Allocating a large CIDR block for your VPC.
Assigning non-overlapping CIDR blocks to every subnet, typically smaller ones for public subnets and larger ones for private subnets, where most workloads run.
Supporting both IPv4 and IPv6 addressing.
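As an illustration of these practices, the sketch below uses Python's standard `ipaddress` module to carve a VPC range into small public subnets and larger private subnets, then verifies that none of them overlap. The 10.0.0.0/16 range and the /24 and /20 sizes are assumptions chosen for the example. The usable-address figure subtracts the five addresses that AWS reserves in every subnet.

```python
import ipaddress
from itertools import combinations

# Assumed VPC range for illustration only.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC into /20 blocks; keep the first block for the public tier.
blocks = list(vpc.subnets(new_prefix=20))
public_subnets = list(blocks[0].subnets(new_prefix=24))[:3]  # three small /24 subnets
private_subnets = blocks[1:4]                                # three larger /20 subnets


def has_overlap(subnets):
    """Return True if any two subnets share addresses."""
    return any(a.overlaps(b) for a, b in combinations(subnets, 2))


def usable_addresses(subnet):
    """Addresses left after the five that AWS reserves in each subnet."""
    return subnet.num_addresses - 5


for subnet in public_subnets + private_subnets:
    print(subnet, usable_addresses(subnet))

print("overlap:", has_overlap(public_subnets + private_subnets))
```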

Elastic IP addresses improve network resiliency by allowing reassignment to other network interfaces or EC2 instances in case of failure. This is especially useful for legacy applications that do not support load balancing.

The image presents a question about configuring subnets in an AWS VPC for optimal network reliability and flexibility, with four suggested practices for IP addressing.

For high-throughput workloads, choose advanced network interfaces such as the Elastic Network Adapter (ENA) or Elastic Fabric Adapter (EFA).

The image presents a scenario where a company needs to choose a network interface option for EC2 instances requiring high throughput and low-latency connectivity, with five options listed: Elastic Network Interface (ENI), Elastic Network Adapter (ENA), Elastic Fabric Adapter (EFA), Secondary private IP addresses, and Placement groups.

For instance, ENA can improve throughput and reduce latency, critical for real-time applications such as multiplayer gaming backends or high-performance computing.

The image presents a scenario where a gaming company is deploying a multiplayer game backend on AWS EC2, requiring high-performance networking. It lists four approaches to ensure network performance and reliability.

For ultra-low latency scenarios, consider the Elastic Fabric Adapter, which bypasses parts of the operating system to accelerate performance.

The image presents a scenario where a financial analytics company is deploying a high-performance computing application on AWS, requiring low latency and high throughput. It lists four approaches to ensure network performance and reliability for the application.

Security Devices and Load Balancers

AWS leverages virtual network security devices with built-in redundancy:

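Network ACLs and Security Groups: These are managed, distributed controls rather than physical devices, so they enforce security across multiple Availability Zones without introducing a single point of failure.
Application and Network Load Balancers: These load balancers distribute traffic across Availability Zones. Cross-zone load balancing, which is on by default for Application Load Balancers and off by default for Network Load Balancers, spreads requests evenly across all registered targets and enhances the resiliency of your applications.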
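The effect of cross-zone load balancing is easiest to see with uneven target counts. The sketch below is a simplified model, not an AWS API. It assumes client traffic is split evenly across the load balancer nodes in each Availability Zone and that every zone has at least one registered target.

```python
def traffic_share_per_target(targets_per_az, cross_zone):
    """Return {az: fraction of total traffic received by each target in that zone}."""
    az_count = len(targets_per_az)
    total_targets = sum(targets_per_az.values())
    shares = {}
    for az, count in targets_per_az.items():
        if cross_zone:
            # Every node can send to every target, so all targets share equally.
            shares[az] = 1 / total_targets
        else:
            # Each zone's node receives 1/az_count of the traffic and
            # spreads it only over the targets in its own zone.
            shares[az] = (1 / az_count) / count
    return shares


# Hypothetical fleet: 2 targets in one zone, 8 in the other.
fleet = {"us-east-1a": 2, "us-east-1b": 8}
print(traffic_share_per_target(fleet, cross_zone=False))
print(traffic_share_per_target(fleet, cross_zone=True))
```

Without cross-zone load balancing, each of the two targets in the smaller zone handles 25% of all traffic while each target in the larger zone handles 6.25%. With it enabled, every target handles 10%.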

The image presents a scenario where a company is migrating a mission-critical application to AWS, requiring reliable traffic distribution across EC2 instances in multiple Availability Zones. It lists five potential approaches, highlighting the implementation of an Application Load Balancer with cross-zone load balancing as a solution.

For UDP-based applications, which an Application Load Balancer cannot serve, a Network Load Balancer can be enabled in each Availability Zone with cross-zone load balancing turned on.

The image presents a scenario where a company is migrating a UDP-based application to AWS, requiring high availability, and lists five load balancing options, highlighting the third option: setting up a Network Load Balancer per Availability Zone.

Gateway Load Balancers, used for traffic inspection with firewall appliances, are also engineered for resilience and are often deployed with Auto Scaling groups so that failed appliances are replaced automatically.

Transit Networking: VPN, Direct Connect, and Transit Gateway

Hybrid networking scenarios require redundant connections for high availability. When migrating an application to AWS that must connect to on-premises networks, consider the following:

Transit Gateway with Dual VPN Connections: Enhances resiliency through multiple connectivity paths.
Redundant Direct Connect Circuits: Important for ensuring continuous high-bandwidth connections.

The image presents a scenario where a company is migrating its application to AWS and needs a high-redundancy hybrid connection between the VPC and on-premises network. It lists five implementation options, including a transit gateway with dual VPN connections and a CloudHub router with redundant VPN tunnels.

A single Direct Connect connection, or a single VPN connection terminating on one on-premises device, is still a single path, so adding a second connection is critical for true redundancy.
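The benefit of a second path can be estimated with basic probability. The sketch below assumes that the paths fail independently, which is only approximately true when they terminate at different locations and on different devices. The 99% figure is an illustrative number, not an AWS-published value.

```python
def combined_availability(path_availabilities):
    """Availability of redundant paths, assuming their failures are independent."""
    all_down = 1.0
    for availability in path_availabilities:
        all_down *= 1.0 - availability  # probability that this path is also down
    return 1.0 - all_down


single = combined_availability([0.99])
dual = combined_availability([0.99, 0.99])
print(f"one path:  {single:.4%}")
print(f"two paths: {dual:.4%}")
```

If both connections share a device, a building, or a provider, their failures are correlated and the real gain is smaller than this model suggests.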

The image presents a scenario where a company needs a high-bandwidth, low-latency hybrid connection between their HQ and AWS, requiring redundancy. It lists five deployment options, highlighting "Two Direct Connect connections to different DCs" as a potential solution.

Within AWS, interconnections such as VPC Peering and Transit Gateway benefit from AWS’s highly redundant physical network infrastructure.

The image presents a diagram illustrating a network architecture using a Transit Gateway to connect a corporate data center to various AWS VPCs (Production, Test, Development, and Infrastructure Shared Services) via a site-to-site VPN connection.

Endpoints and PrivateLink

VPC endpoints provide secure and low-latency connectivity without traversing the public Internet. This is achieved by using:

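Gateway Endpoints: For S3 and DynamoDB. These are added as targets in your route tables rather than placed in subnets.
Interface Endpoints via PrivateLink: These endpoints are elastic network interfaces placed in the subnets you select. Creating them in subnets across multiple Availability Zones provides high scalability and redundancy on top of AWS’s internal infrastructure.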

Edge Networking with CloudFront

Amazon CloudFront is AWS’s global content delivery network (CDN), featuring numerous edge locations that provide low latency and high availability worldwide. Key points include:

A single CloudFront distribution leverages multiple edge locations to ensure resilience.
Combining CloudFront with DNS failover and regional origins offers robust global content delivery.
On a cache miss at an edge location, CloudFront forwards the request to a regional edge cache, which retrieves the content from the origin (e.g., an S3 bucket) if needed; the content is then cached at the edge.
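The caching behavior in the last point can be sketched with a toy model. The `EdgeCache` class and the origin function below are hypothetical and leave out expiry, invalidation, and the regional tier. The sketch only shows that repeated requests for the same path reach the origin once.

```python
class EdgeCache:
    """Toy edge cache: fetch from the origin on a miss, serve from memory on a hit."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.origin_requests = 0

    def get(self, path):
        if path not in self._store:
            # Cache miss: go back to the origin and remember the response.
            self.origin_requests += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


def origin(path):
    """Stand-in for an origin such as an S3 bucket."""
    return f"content of {path}"


edge = EdgeCache(origin)
edge.get("/images/logo.png")
edge.get("/images/logo.png")
print(edge.origin_requests)  # 1
```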

The image presents a scenario where a company needs to deliver web content globally with low latency and redundancy. It lists five options, highlighting "A CloudFront distribution with multiple edge locations" as the recommended solution.

The image is a diagram showing a client accessing Amazon CloudFront, which routes requests to two different S3 buckets based on the path.

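For dynamic behavior, AWS offers CloudFront Functions and Lambda@Edge, which run code close to viewers for low-latency processing. CloudFront Functions handle lightweight viewer request and viewer response processing at edge locations, while Lambda@Edge can also act on origin requests and origin responses.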

The image presents a scenario where a company needs to optimize media files delivered through CloudFront, with five potential solutions listed, highlighting "Lambda@Edge functions triggered on CloudFront requests" as the primary option.

The image illustrates the flow of requests and responses in a Lambda@Edge architecture, showing interactions between a client, CloudFront edge caches, regional edge caches, and the origin server. It highlights the roles of CloudFront Functions and Lambda@Edge Functions in processing viewer and origin requests.

DNS and Route 53 for Network Discovery

Amazon Route 53 provides Domain Name System (DNS) services with a 100% availability SLA, and DNS resolution inside a VPC is delivered as a managed, redundant service rather than by a single resolver, which avoids single points of failure. Key features include:

Use of various routing policies (latency-based, weighted, failover, geolocation) to direct traffic across regions.
In a multi-region setup, Route 53 health checks trigger failover routing to a standby site when the primary site is unavailable.
The Route 53 Application Recovery Controller provides readiness checks and routing controls to facilitate failover decisions.
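The failover routing behavior described above can be modeled as a small decision function. This is a simplified sketch, not the Route 53 API: the endpoint names are hypothetical, health is passed in as a plain dictionary, and falling back to the primary when both sites are unhealthy is an assumption of this model.

```python
def resolve_failover(records, health):
    """Return the endpoint a failover policy would answer with.

    records: {"primary": endpoint, "secondary": endpoint}
    health:  {endpoint: True/False}, as reported by health checks
    """
    primary, secondary = records["primary"], records["secondary"]
    if health.get(primary, False):
        return primary
    if health.get(secondary, False):
        return secondary
    # Assumption of this sketch: with nothing healthy, still answer with the primary.
    return primary


# Hypothetical endpoints for an active site and a standby site.
records = {
    "primary": "app.us-east-1.example.com",
    "secondary": "app.us-west-1.example.com",
}
print(resolve_failover(records, {records["primary"]: True, records["secondary"]: True}))
print(resolve_failover(records, {records["primary"]: False, records["secondary"]: True}))
```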

The image presents a scenario for deploying a multi-region web application for disaster recovery, asking which two approaches could implement an automated failover routing solution. Options include using CloudFront, Route 53 health checks, latency-based routing, failover routing, and AWS Shield Advanced.

The image is a diagram illustrating the architecture of a Route 53 Application Recovery Controller setup, featuring two AWS regions (us-east-1 as active and us-west-1 as standby) with TicTacToe game servers and Amazon DynamoDB for data management.

Summary

This guide reviewed several critical aspects of ensuring network service reliability on AWS:

Designing VPCs with multiple subnets across different Availability Zones.
Connecting departmental VPCs using AWS Transit Gateway or, in simpler setups, VPC Peering.
Configuring routing tables to segregate traffic among different application tiers.
Securing outbound connectivity using NAT and egress-only gateways.
Improving network resiliency with Elastic IPs and advanced network interfaces.
Leveraging global AWS services such as CloudFront and Route 53 designed with high availability in mind.

Key Takeaway

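AWS’s robust and inherently redundant infrastructure handles much of the underlying network reliability, but resilience still depends on your design choices: deploying across multiple Availability Zones, placing a NAT gateway in each zone, and provisioning redundant hybrid connections.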

Thank you for reading this guide on enhancing network service reliability. Continue to explore further AWS services to deepen your understanding of best practices in network design and resilience.

For additional resources and in-depth documentation, please refer to the AWS Documentation and the AWS Well-Architected Framework.
