AWS Solutions Architect Associate Certification

Designing for Reliability

Turning up Reliability on Data Services

Welcome back! In this article, we explore how to design for reliability in data, machine learning, and AI services on AWS. We’ll cover a range of AWS services and discuss best practices to ensure that your architecture remains resilient and scalable.

Let's dive right in.


Data Ingestion with Kinesis

Amazon Kinesis Data Streams is a powerful streaming service that handles data in transit with built-in redundancy and automatic scaling. The service achieves scalability by splitting data across multiple shards, each with its own throughput capacity, allowing records to be distributed across shards for independent processing.

When your workloads experience fluctuations, enabling auto scaling on your Kinesis shards (based on utilization) ensures that your data stream scales to meet demand.
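
Kinesis provisioned streams do not resize their shard count on their own; a common pattern is a CloudWatch utilization alarm that triggers a Lambda function to call UpdateShardCount. Below is a minimal sketch of such a handler, assuming a hypothetical stream name and a simple doubling policy:

```python
import boto3

# Hypothetical stream name; in practice this could be derived from
# the CloudWatch alarm event that triggers this Lambda.
STREAM_NAME = "orders-stream"

kinesis = boto3.client("kinesis")

def scale_up(event, context):
    """Double the shard count when a utilization alarm fires."""
    summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
    shards = summary["StreamDescriptionSummary"]["OpenShardCount"]

    # UNIFORM_SCALING splits (or merges) shards evenly across the stream.
    # Note: a single UpdateShardCount call can at most double the count.
    kinesis.update_shard_count(
        StreamName=STREAM_NAME,
        TargetShardCount=shards * 2,
        ScalingType="UNIFORM_SCALING",
    )
```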

The image illustrates the concept of shard management in Kinesis, showing the transition from an open shard to a closed parent shard, which then splits into two open child shards. It highlights Kinesis's features of handling data-in-transit with built-in redundancy and automatic scaling.

When you create a stream, you can choose between provisioned mode, where you set the shard capacity yourself, or on-demand mode, which automatically provides up to roughly 200 MB per second (or 200,000 records per second) of write capacity. If your usage exceeds available capacity, throttling occurs, and you may need to request a service quota increase.

The image is a slide discussing how Kinesis Data Streams can help a company automatically scale their data stream for workload fluctuations, listing four methods: auto-scaling based on utilization, multi-region stream replication, using the Kinesis Producer Library, and enhanced fan-out to consumers.

On-demand mode is ideal for companies experiencing seasonal spikes because AWS manages scaling automatically. Additionally, Kinesis allows you to configure data retention periods ranging from 24 hours to 365 days to meet compliance or processing requirements.
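
As a sketch of both settings, the boto3 calls below create a hypothetical stream in on-demand mode and then extend its retention period (the stream name and retention value are illustrative):

```python
import boto3

kinesis = boto3.client("kinesis")

# Create a stream in on-demand mode; AWS manages shard scaling.
kinesis.create_stream(
    StreamName="seasonal-orders",          # placeholder name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Stream creation is asynchronous; wait until it becomes ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="seasonal-orders")

# Extend retention beyond the 24-hour default (value is in hours;
# the maximum is 8760 hours, i.e. 365 days).
kinesis.increase_stream_retention_period(
    StreamName="seasonal-orders",
    RetentionPeriodHours=168,              # keep records for 7 days
)
```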

The image is a screenshot of a configuration interface for creating a data stream in AWS Kinesis, highlighting two capacity modes: On-demand and Provisioned. It includes details about data stream capacity and pricing information.


Managed Service for Kafka (MSK)

Amazon MSK provides a managed Kafka experience with added insights that simplify cluster management. The typical MSK setup involves:

  1. ZooKeeper nodes coordinating the cluster and tracking broker state.
  2. Producers sending incoming data to dedicated broker nodes in the cluster.
  3. Brokers, each defined by its machine size, handling portions of the data load.

The image is a diagram illustrating the architecture of a Managed Service for Kafka (MSK) setup, highlighting the roles of ZooKeeper and Broker nodes across different availability zones. It emphasizes Kafka reliability and automatic broker recovery.

ZooKeeper manages cluster operations such as rearranging nodes and automatic broker recovery. MSK also features:

  • Auto-detection and replacement of failed brokers (auto-healing).
  • Cluster-to-cluster asynchronous replication for cross-region data synchronization.

The image explains how Amazon MSK provides automatic failure detection and recovery for Kafka brokers, highlighting features like auto healing, Multi-AZ deployments, and encryption in transit.

For high availability, you can deploy Kafka clusters across multiple Availability Zones or regions. Tasks like certificate renewal are automated to maintain secure intra-cluster communications.
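
A hedged sketch of creating such a cluster with boto3 follows; the subnet IDs, security group, and Kafka version are placeholders, with one broker per Availability Zone across three zones and TLS enforced for client traffic:

```python
import boto3

msk = boto3.client("kafka")

msk.create_cluster(
    ClusterName="reliable-events",
    KafkaVersion="3.6.0",                  # illustrative version
    NumberOfBrokerNodes=3,                 # one broker per AZ
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        # One subnet per AZ spreads brokers across three zones.
        "ClientSubnets": [
            "subnet-aaa111", "subnet-bbb222", "subnet-ccc333",
        ],
        "SecurityGroups": ["sg-0123456789abcdef0"],
    },
    EncryptionInfo={
        # Require TLS for traffic between clients and brokers.
        "EncryptionInTransit": {"ClientBroker": "TLS"},
    },
)
```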

The image shows a slide titled "Designing for Reliability – Managed Service for Kafka," featuring a certificate status interface with details like status, type, and key algorithm. It also mentions "Automatic Certificate Renewal" at the bottom.

The architecture that integrates ZooKeeper with broker nodes provides an inherently reliable framework for managing Kafka clusters.

The image is a diagram illustrating the architecture of a managed service for Kafka on AWS, showing the integration of ZooKeeper and brokers within a VPC setup. It highlights the control and data planes, including producers, consumers, and cluster operations.


Data Transformation with Glue and EMR

AWS Glue

AWS Glue is a managed ETL (Extract, Transform, Load) service built on open-source Apache Spark (PySpark). It automatically crawls your data sources, builds a data catalog, and runs ETL jobs. Glue integrates seamlessly with S3, Athena, and QuickSight for processing and visualization.

In terms of reliability, AWS Glue:

  • Supports automated retries on failed jobs.
  • Scales compute with Data Processing Units (DPUs) according to job requirements.

You pay based on the number of DPUs provisioned per job, so it’s important to adjust these resources based on your workload.
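
The sketch below creates a hypothetical Glue job with boto3, showing where the retry and DPU-related settings live (the role, script location, and sizing values are assumptions):

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-etl-bucket/scripts/orders.py",
    },
    GlueVersion="4.0",
    # Reliability and capacity knobs:
    MaxRetries=2,                 # automatically retry failed runs
    WorkerType="G.1X",            # each G.1X worker maps to 1 DPU
    NumberOfWorkers=10,           # scale DPUs to the workload
    Timeout=60,                   # minutes before a run is stopped
)
```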

The image is a slide discussing how AWS Glue provides reliability for ETL jobs, listing options like automated retry of failed jobs, using Glue crawlers, Glue DataBrew, and zonal VPC endpoints.

The image is a flowchart illustrating the process of designing for reliability using AWS Glue, involving components like Amazon RDS, AWS Glue Crawlers, AWS Glue ETL, Amazon S3, and Amazon Athena. It highlights the flexibility of Glue with DPUs.

Amazon EMR

Amazon EMR is a managed cluster service that supports frameworks such as Hadoop MapReduce and Spark. Reliability is enhanced through:

  • Node redundancy by deploying multiple primary nodes.
  • Resource management using YARN.
  • Multi-AZ deployments to provide geographic resiliency (see the launch sketch after this list).
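
As a rough illustration, the boto3 call below launches a hypothetical cluster with three primary nodes for redundancy and logging persisted to S3 (the names, release label, and instance sizes are assumptions):

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="spark-analytics",
    ReleaseLabel="emr-7.1.0",              # illustrative release
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-emr-bucket/logs/",     # persist logs to S3
    Instances={
        "InstanceGroups": [
            # Three primary nodes provide primary-node redundancy
            # (supported on recent EMR releases).
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 3},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 4},
        ],
        "Ec2SubnetId": "subnet-aaa111",    # placeholder subnet
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```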

The image is a diagram illustrating the architecture of Amazon Elastic MapReduce (EMR) clusters, showing master, core, and task nodes, with a focus on reliability.

EMR also integrates EMRFS for direct S3 access and supports CloudWatch, CloudTrail, and logging tools to provide added visibility and resilience.

The image is a diagram illustrating the design for reliability using AWS Elastic MapReduce (EMR) across multiple availability zones to enhance geographic resiliency. It includes components like Amazon EC2, IAM roles, VPCs, and Amazon S3 for Spark resources.

The image illustrates the architecture of Amazon EMR with core and task nodes, showing how HBase and other components interact with Amazon S3 for reliable data access. It highlights the use of EMRFS for accessing data stored in S3 with high reliability.

Monitoring your clusters is simple with EMR’s integration with CloudWatch, enabling you to review logs and track health across core or task nodes.


Data Preparation with Glue DataBrew

AWS Glue DataBrew is a fully managed service for data preprocessing, enrichment, and transformation. It streamlines tasks such as filling missing values and computing averages. While DataBrew supports automatic job retries, it is designed with minimal configuration options to simplify your workflow.
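
A minimal sketch of a DataBrew recipe job with retries enabled might look like the following; the dataset, recipe, role, and bucket names are placeholders:

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="clean-sales-data",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewJobRole",
    DatasetName="raw-sales",
    RecipeReference={"Name": "fill-missing-values"},
    Outputs=[{"Location": {"Bucket": "my-prepared-data"}}],
    MaxRetries=2,                # retry failed runs automatically
)

# Kick off a run of the job.
databrew.start_job_run(Name="clean-sales-data")
```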

The image is a diagram illustrating the data processing workflow using AWS Glue DataBrew, showing the flow from datasets to data storage, processing, and consumption by business analysts. It highlights components like AWS Glue, Amazon S3, Amazon Athena, and Amazon QuickSight.

The image explains how AWS Glue DataBrew can support data preparation at scale for analytics, highlighting features like automatic retries, integration with AWS Lake Formation, use of zonal VPC endpoints, and encryption of job artifacts.


Data Storage and Presentation

Lake Formation

AWS Lake Formation simplifies setting up a secure data lake using S3, Glue, EMR, and other services under one management pane. However, its reliability largely depends on the underlying services like S3.

The image is a diagram illustrating the architecture of Amazon Lake Formation, showing various AWS services like AWS Glue, Amazon EMR, and AWS Lambda, with a focus on data flow and reliability.

Amazon Athena

Amazon Athena is a serverless query service that operates on data stored in S3 and other sources. It provides resilient and scalable query capabilities with minimal configuration—making it ideal for querying large datasets reliably.
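
Because Athena is serverless, reliability work mostly reduces to submitting queries and checking their state. A small sketch, assuming a hypothetical database, table, and results bucket:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={
        "OutputLocation": "s3://my-athena-results/",  # placeholder
    },
)

# Athena runs queries asynchronously; poll for completion.
state = athena.get_query_execution(
    QueryExecutionId=response["QueryExecutionId"]
)["QueryExecution"]["Status"]["State"]
print(state)  # QUEUED, RUNNING, SUCCEEDED, FAILED, or CANCELLED
```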

The image is a diagram explaining Amazon Athena, a serverless analytics service, showing its integration with various data sources and tools like ML and BI applications. It highlights Athena's capabilities in querying data from multiple sources and its serverless nature backed by S3.

Similarly, Amazon QuickSight is an inherently scalable dashboard service that requires minimal setup for robust analytics.

The image is a diagram illustrating the architecture of a Quicksight service, showing components like a client, webserver, Quicksight engine, SPICE, and a direct query database, with notes on scaling.

Amazon Redshift, the managed data warehouse, is another reliable storage system that plays an essential role in reporting and analytics.

The image is a diagram illustrating the design for reliability using Amazon Redshift, showing data flow from various sources to Redshift producers and consumers, with integration through Amazon Route 53 for external and internal users.


Machine Learning and AI Services

SageMaker

Amazon SageMaker supports model training, hosting, and testing with instance-based infrastructure. To enhance reliability, it implements strategies such as canary testing, load balancing, and multi-instance deployment. SageMaker also supports A/B testing for model evaluation and distributed training across multiple instances.
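
A/B testing is implemented through production variants on a single endpoint configuration. The sketch below splits traffic 90/10 between two hypothetical models, each backed by multiple instances:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Model names are placeholders for models already created in SageMaker.
sagemaker.create_endpoint_config(
    EndpointConfigName="recommender-ab-test",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "recommender-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,    # multiple instances per variant
            "InitialVariantWeight": 0.9,  # 90% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "recommender-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.1,  # 10% canary traffic
        },
    ],
)
```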

The image presents a question about how Amazon SageMaker can support testing multiple model variants, with four suggested methods: A/B testing, integration with AWS Glue ETL, hyperparameter tuning, and automatic rollout to production.

The image illustrates Amazon SageMaker's capability to distribute training jobs across multiple P3 instances, enhancing reliability and speed, with data stored in Amazon S3.

Monitoring is crucial; with CloudWatch and CloudTrail, you can track SageMaker’s performance. You can also configure notebook instances in a no-internet mode through a VPC without an Internet Gateway to meet strict security requirements.
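
For the no-internet pattern, the key setting is DirectInternetAccess when creating the notebook instance. A sketch with placeholder network and role values; the subnet belongs to a VPC with no Internet Gateway attached:

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_notebook_instance(
    NotebookInstanceName="genomics-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    SubnetId="subnet-aaa111",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    DirectInternetAccess="Disabled",   # traffic stays inside the VPC
)
```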

The image provides steps for configuring Amazon SageMaker to operate in a "No Internet" mode for secure genomic data analysis, including using a VPC without an Internet Gateway and setting offline modes.

Highly Managed Services: Rekognition, Polly, Lex, and Comprehend

Highly managed AWS services like Amazon Rekognition (image and video analysis), Polly (text-to-speech), Lex (chatbots), and Comprehend (NLP) are designed to be inherently reliable. They handle auto scaling, retries, and failover internally. For monitoring:

  • Use CloudWatch metrics and logs to track performance and errors (see the sketch after this list).
  • Enable AWS X-Ray tracing for deeper insight where needed.
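
For example, the snippet below polls hourly Rekognition request counts from CloudWatch; the namespace, metric, and Operation dimension follow Rekognition's documented metrics (verify for your service), and the same pattern applies to Polly, Lex, and Comprehend with their own namespaces:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Pull the last 24 hours of successful DetectLabels calls.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Rekognition",
    MetricName="SuccessfulRequestCount",
    Dimensions=[{"Name": "Operation", "Value": "DetectLabels"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=3600,           # one datapoint per hour
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```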

The image is a flowchart illustrating the integration of 3xLOGIC cameras with AWS services like Amazon Kinesis, Rekognition, S3, and SNS for managed video monitoring. It highlights the process of video streaming and event handling for reliability.

The image is a diagram illustrating a cloud architecture for "Designing for Reliability" using AWS services like CloudFront, API Gateway, Lambda, and Polly. It highlights a challenge with tracking outputs and reliability management.

The image presents a question about monitoring an Amazon Lex chatbot, offering four approaches: enabling CloudWatch metrics, reviewing Lex logs with CloudWatch Logs Insights, using AWS X-Ray tracing, and pinging the Lex runtime API with a Lambda function.

The image shows a graph and table displaying metrics related to AWS Comprehend, focusing on reliability through metrics like Consumed Inference Units and Inference Utilization.

For real-time monitoring of services like Comprehend and Amazon Forecast, CloudWatch and CloudTrail remain your primary tools.

The image shows a screenshot of the Amazon Forecast interface, displaying metrics and parameters for a predictive model named "taxi_PM_predictor_DAILY." It includes details like predictor metrics, weighted quantile loss values, and backtest parameters.

Augmented AI and Fraud Detection

Amazon Augmented AI (A2I) introduces a human review stage into machine learning outputs, adding an extra layer of reliability. The service integrates human insights with automated predictions for improved outcomes.

The image is a flowchart illustrating the process of designing for reliability using augmented AI on AWS Cloud, involving Amazon Translate, Amazon Lambda, and human-driven post-edits. It highlights the integration of human review in machine learning outputs for improved reliability.

Amazon Fraud Detector reliably identifies anomalous transactions using pre-built models and continuous improvement mechanisms, ensuring constant vigilance against fraud.

The image is a flowchart illustrating the process of a fraud detection system, showing how client transaction data is processed, risk scores are predicted, and results are stored and reviewed. It highlights the steps involved in automated model creation and continuous improvement within a secure environment.

Language Services: Transcribe, Translate, and Textract

  • Amazon Transcribe converts audio to text at scale with built-in reliability features. Using Transcribe for tasks like video captioning leverages its API-driven, resilient design (a transcription-job sketch follows this list).

    The image is a flowchart illustrating the process of designing for reliability using Amazon Transcribe, including steps like input file processing, transcription, human review, and the use of custom vocabulary. It highlights the integration of Amazon Augmented AI and CloudWatch for improved transcription reliability.

    The image presents a scenario where a media company uses Amazon Transcribe for video captioning and asks how to make the architecture resilient to zone outages, offering four options for solutions.

  • Amazon Translate provides robust language translation with automatic recovery, retries, and failover capabilities.

    The image is a slide titled "Designing for Reliability – Translate," featuring a scatter plot graph of character count over time and a note about translation management, including automatic recovery and failover.

    The image presents a scenario where a company uses Amazon Translate for localization and needs real-time monitoring. It suggests four approaches: enabling CloudWatch metrics, reviewing logs with CloudWatch Logs Insights, using X-Ray tracing, and invoking the API periodically from a Lambda function.

  • Amazon Textract extracts text from documents and images with high accuracy and reliability. Integrated monitoring with CloudWatch and AWS X-Ray (if needed) provides further operational visibility.

    The image is a flowchart illustrating the process of designing for reliability using Amazon Textract, involving data capture, extraction, validation, and visualization stages. It shows how documents are processed through Amazon S3, Textract, a rules engine, Amazon A2I, and a dashboard.
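
As referenced above, a minimal Transcribe captioning sketch might look like this (the job name, buckets, and media file are placeholders):

```python
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="webinar-captions-001",
    Media={"MediaFileUri": "s3://my-media-bucket/webinar.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    OutputBucketName="my-captions-bucket",  # write results back to S3
)

# Jobs run asynchronously; poll (or use EventBridge) for completion.
job = transcribe.get_transcription_job(
    TranscriptionJobName="webinar-captions-001"
)["TranscriptionJob"]
print(job["TranscriptionJobStatus"])  # QUEUED, IN_PROGRESS, COMPLETED, FAILED
```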


Summary

This article covered a broad range of AWS services—from data ingestion with Kinesis and MSK, through transformation with Glue and EMR, to advanced machine learning and AI services like SageMaker, Rekognition, Polly, Lex, Comprehend, Transcribe, Translate, and Textract. Key reliability features include:

  • Auto scaling and automatic recovery.
  • Cross-AZ and multi-region deployments.
  • Integrated monitoring using CloudWatch and CloudTrail.

Highly managed services such as QuickSight, Lake Formation, and the language services are designed to be resilient by default, reducing the operational overhead required to maintain continuous availability and reliability.

The image is a summary slide highlighting the reliability features of AI/ML services, mentioning autoscaling, autorecovery, and logging to CloudTrail and CloudWatch. It includes four main points with colorful numbered icons.

In conclusion, AWS services are built to scale and recover automatically, ensuring your data and AI architectures remain robust under diverse workloads.

Thank you for reading, and we hope you now have a better understanding of how to design and maintain reliability across various AWS data and AI services.
