AWS Solutions Architect Associate Certification
Designing for Security
Turning up Security on Data Services Part 1
Welcome to this detailed lesson on designing secure data and machine learning services. In this article, we explore how security considerations evolve as services become more abstracted from the underlying infrastructure. Our focus will shift toward IAM permissions, access control, and monitoring with tools like CloudWatch. While some services, like SageMaker, still maintain many of the traditional configuration options found in EC2 or container-based systems, the primary discussion here centers on managing data inputs and outputs securely.
The Power of Data Ingestion with Amazon Kinesis
Amazon Kinesis is a managed streaming service similar to Apache Kafka. It operates as a secure streaming bus: when data is ingested, it is held temporarily in Kinesis stream storage before being moved to an alternative storage solution. Kinesis supports server-side encryption with AWS KMS to protect data at rest and SSL/TLS to secure data in transit. For Kinesis Data Streams, server-side encryption is a per-stream setting, so confirm it is turned on rather than assuming it.
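As a minimal sketch of the access-control side, assume a stream that uses server-side encryption with a customer managed KMS key. A producer then needs permission to write records and to generate data keys with that key. The ARNs and the function name below are hypothetical placeholders, and the policy is an illustration, not a complete production policy.

```python
import json


def kinesis_producer_policy(stream_arn: str, kms_key_arn: str) -> dict:
    """Build a narrow IAM policy for a producer writing to an SSE-enabled stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow writes to this one stream only.
                "Sid": "WriteToStream",
                "Effect": "Allow",
                "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
                "Resource": stream_arn,
            },
            {
                # Producers need a data key from the stream's KMS key.
                "Sid": "EncryptWithStreamKey",
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            },
        ],
    }


# Hypothetical ARNs, used only for illustration.
STREAM_ARN = "arn:aws:kinesis:us-east-1:111122223333:stream/clickstream"
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

policy = kinesis_producer_policy(STREAM_ARN, KEY_ARN)
print(json.dumps(policy, indent=2))
```

A consumer policy would follow the same shape, with read actions on the stream and kms:Decrypt on the key instead.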
Kinesis offers three service types: Data Streams, Video Streams, and Firehose. All three integrate with CloudTrail and CloudWatch for monitoring and support a range of producers and consumers through the AWS SDKs. Note that the Kinesis agent is a standalone software component designed to send data to Kinesis Data Streams and Firehose (it does not support video streams).
For instance, a global e-commerce platform might use a Java-based Kinesis agent to stream real-time clickstream data from web servers. The agent aggregates and forwards this data to either Kinesis Data Streams or Firehose. It is essential to understand that the agent is simply a software forwarder, not a piece of hardware or a standalone cloud service.
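To make the forwarder role concrete, here is a hedged sketch that generates an agent configuration file. It assumes the agent's JSON format, in which each flow pairs a filePattern with either a kinesisStream or a deliveryStream destination. The log paths and stream names are hypothetical, and the validation helper is our own addition rather than part of the agent.

```python
import json


def build_agent_config(flows: list) -> dict:
    """Return an agent.json-style config after checking each flow's destination."""
    for flow in flows:
        if "filePattern" not in flow:
            raise ValueError(f"flow needs a filePattern: {flow}")
        # A flow targets a data stream OR a Firehose delivery stream, never both.
        destinations = [k for k in ("kinesisStream", "deliveryStream") if k in flow]
        if len(destinations) != 1:
            raise ValueError(f"flow needs exactly one destination: {flow}")
    return {"cloudwatch.emitMetrics": True, "flows": flows}


# Hypothetical log paths and stream names.
config = build_agent_config([
    {"filePattern": "/var/log/web/clicks.log*", "kinesisStream": "clickstream"},
    {"filePattern": "/var/log/web/orders.log*", "deliveryStream": "orders-to-s3"},
])
print(json.dumps(config, indent=2))
```

The printed JSON is the kind of content that would be saved as the agent's configuration file on each web server.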
Amazon Managed Streaming for Apache Kafka (MSK)
Amazon Managed Streaming for Apache Kafka (MSK) delivers a similar streaming solution built on open-source Apache Kafka. Unlike Kinesis with its shard-based model, MSK uses a more complex architecture involving Kafka brokers and ZooKeeper nodes for cluster management. AWS handles much of the heavy lifting (provisioning, configuration, and scaling), allowing you to focus on specifying the appropriate cluster size.
MSK enforces encryption at rest and in transit by default. Data at rest is secured using AWS KMS, while TLS encryption safeguards data as it moves between Kafka brokers and clients.
When TLS is enabled for your MSK cluster, encrypted connections automatically extend to both brokers and ZooKeeper nodes, ensuring secure communication throughout the cluster.
Broker traffic logs can be exported to Amazon S3 for in-depth analysis by services such as AWS Glue, Athena, or Redshift. This capability distinguishes MSK from Kinesis, where detailed broker logging is not available.
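The MSK encryption and logging settings described above can be sketched as the two structures a cluster-creation request would carry. This is a minimal illustration that only builds the dictionaries. The key ARN, bucket name, and helper function names are hypothetical, and the field names follow the MSK CreateCluster API as we understand it.

```python
def msk_encryption_info(kms_key_arn: str, client_broker: str = "TLS") -> dict:
    """Build an EncryptionInfo structure for an MSK cluster."""
    allowed = {"TLS", "TLS_PLAINTEXT", "PLAINTEXT"}
    if client_broker not in allowed:
        raise ValueError(f"client_broker must be one of {sorted(allowed)}")
    return {
        # Data at rest on broker volumes is encrypted with this KMS key.
        "EncryptionAtRest": {"DataVolumeKMSKeyId": kms_key_arn},
        # TLS between clients and brokers, and between brokers in the cluster.
        "EncryptionInTransit": {"ClientBroker": client_broker, "InCluster": True},
    }


def msk_broker_logs_to_s3(bucket: str, prefix: str) -> dict:
    """Build a LoggingInfo structure that delivers broker logs to S3."""
    return {"BrokerLogs": {"S3": {"Enabled": True, "Bucket": bucket, "Prefix": prefix}}}


# Hypothetical identifiers, used only for illustration.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
encryption_info = msk_encryption_info(KEY_ARN)
logging_info = msk_broker_logs_to_s3("example-msk-logs", "broker-logs/")

# With boto3 installed, these would be passed along the lines of:
#   boto3.client("kafka").create_cluster(..., EncryptionInfo=encryption_info,
#                                        LoggingInfo=logging_info)
print(encryption_info)
print(logging_info)
```

Rejecting unknown ClientBroker values locally is a design choice for the sketch; it surfaces a typo before any API call is made.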
Data Extraction, Transformation, and Load with AWS Glue
AWS Glue is a robust service for ETL (Extract, Transform, Load) operations powered by Apache Spark, with jobs typically written in PySpark or Scala. Glue comprises multiple components:
- A data crawler that scans various sources (like S3) to populate a data catalog.
- An ETL job engine to execute data transformation tasks.
- Glue Studio, which offers a graphical interface for creating and managing ETL jobs (with security managed by AWS).
Security Best Practice
When securing AWS Glue, it is advisable to use VPC endpoints for secure, private connectivity, restrict access with specific IAM roles and policies, and avoid storing sensitive data in plaintext within scripts.
Glue job bookmark data can be encrypted with AWS KMS through a Glue security configuration, which keeps job progress tracking confidential.
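One way to apply KMS encryption to job bookmarks, along with job output in S3 and CloudWatch logs, is a Glue security configuration. The sketch below only builds the request parameters. The configuration name, key ARN, and helper function are hypothetical, and the field names follow the Glue CreateSecurityConfiguration API as we understand it.

```python
def glue_security_configuration(name: str, kms_key_arn: str) -> dict:
    """Build parameters for a Glue security configuration that uses one KMS key."""
    return {
        "Name": name,
        "EncryptionConfiguration": {
            # Encrypt ETL output written to S3.
            "S3Encryption": [
                {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
            ],
            # Encrypt job logs sent to CloudWatch Logs.
            "CloudWatchEncryption": {
                "CloudWatchEncryptionMode": "SSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
            # Encrypt job bookmark state (client-side encryption with KMS).
            "JobBookmarksEncryption": {
                "JobBookmarksEncryptionMode": "CSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
        },
    }


# Hypothetical identifiers, used only for illustration.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
params = glue_security_configuration("etl-encrypted", KEY_ARN)

# With boto3 installed, this would be passed along the lines of:
#   boto3.client("glue").create_security_configuration(**params)
print(params)
```

The security configuration is then referenced by name when a job or crawler is defined, so the encryption choice lives in one place.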
Data Transformation with Glue DataBrew
Glue DataBrew allows users to visually transform and prepare data without the need to write code. Its intuitive drag-and-drop interface simplifies data cleaning, normalization, and preparation for analysis. DataBrew runs in a highly managed environment with security enforced by core AWS services, meaning encryption and logging rely on the settings of services such as S3 and VPC endpoints.
DataBrew integrates with both gateway endpoints for S3 and VPC interface endpoints for Glue, enhancing network isolation and security. It also supports encryption for both data in transit and at rest.
When addressing questions about Glue DataBrew’s encryption, the correct explanation is that it leverages AWS KMS for data at rest and adheres to standard logging protocols.
Designing for Security in Data Storage with Lake Formation
Lake Formation provides a robust security and governance layer over AWS data lakes. It integrates closely with IAM (and AWS SSO/Identity Center) to manage access control and permissions. Lake Formation supports fine-grained column-, row-, and cell-level security while leveraging logging tools such as CloudWatch, CloudTrail, and AWS Config to track data access and compliance.
In addition, Lake Formation supports data classification and tag-based access control, which enforce security policies across databases and tables.
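To illustrate the idea behind tag-based access control, here is a toy evaluator in plain Python. It is a conceptual model, not Lake Formation's implementation: a grant is treated as a tag expression that requires every listed key to be present on the resource with one of the allowed values. The catalog, tag names, and function names are all hypothetical.

```python
def tags_match(expression: dict, resource_tags: dict) -> bool:
    """True when every key in the expression is on the resource with an allowed value."""
    # AND across tag keys, OR across the allowed values for each key.
    return all(resource_tags.get(key) in allowed for key, allowed in expression.items())


def allowed_tables(grant_expression: dict, catalog: dict) -> list:
    """Return the sorted names of tables whose tags satisfy the grant expression."""
    return sorted(
        name for name, tags in catalog.items() if tags_match(grant_expression, tags)
    )


# Hypothetical catalog: table name -> tags assigned to the table.
catalog = {
    "sales.orders": {"classification": "internal", "domain": "sales"},
    "sales.customers": {"classification": "pii", "domain": "sales"},
    "web.clicks": {"classification": "public", "domain": "web"},
}

# Hypothetical grant for an analyst role: no access to PII-tagged tables.
analyst_grant = {
    "classification": {"public", "internal"},
    "domain": {"sales", "web"},
}

print(allowed_tables(analyst_grant, catalog))  # ['sales.orders', 'web.clicks']
```

The appeal of this model is that a newly tagged table becomes visible to the right principals without anyone editing per-table grants.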
Data Presentation and Query with Amazon Athena
Amazon Athena is a serverless SQL engine perfect for ad hoc analysis on data stored in locations such as S3, Redshift, or data cataloged by Glue and Lake Formation. Rather than storing data, Athena accesses it externally, leveraging pre-existing encryption settings (for example, server-side encryption with KMS or S3-managed keys). Athena relies on IAM for access control and integrates with CloudWatch and CloudTrail for monitoring and logging activity.
Athena can query data encrypted with S3-managed keys (SSE-S3) or with AWS KMS keys. Because Athena reads data in place rather than storing it, the encryption and access settings of the source continue to govern that data; query results written to S3 carry their own encryption setting.
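As a hedged sketch of how encryption can be requested for Athena query results, the helper below builds a result configuration that uses SSE-KMS when a key is supplied and SSE-S3 otherwise. The bucket, key ARN, and function name are hypothetical, and the field names follow the Athena StartQueryExecution API as we understand it.

```python
from typing import Optional


def athena_result_configuration(output_location: str,
                                kms_key_arn: Optional[str] = None) -> dict:
    """Build a ResultConfiguration that encrypts query results written to S3."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    if kms_key_arn:
        encryption = {"EncryptionOption": "SSE_KMS", "KmsKey": kms_key_arn}
    else:
        encryption = {"EncryptionOption": "SSE_S3"}
    return {"OutputLocation": output_location, "EncryptionConfiguration": encryption}


# Hypothetical identifiers, used only for illustration.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
result_config = athena_result_configuration("s3://example-athena-results/adhoc/", KEY_ARN)

# With boto3 installed, this would be passed along the lines of:
#   boto3.client("athena").start_query_execution(
#       QueryString="SELECT 1", ResultConfiguration=result_config)
print(result_config)
```

Setting the same values on a workgroup, rather than per query, is usually the more maintainable choice because it applies to every user of that workgroup.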
Elastic MapReduce (EMR) and Its Security Considerations
Amazon EMR is a powerful tool for data processing and batch analysis that runs on clusters of EC2 instances. An EMR cluster typically consists of master nodes (for coordination), core nodes (for storage and processing), and task nodes (for additional, ephemeral compute capacity). Although a serverless option is now available, most exam-relevant deployments involve traditional server-based clusters.
For security, EMR clusters are deployed within a VPC, combining security groups with IAM permissions to enforce access control. EMR handles operating system patching passively: new instances launch from patched images, so replacing old instances with new ones brings in the fixes. If you keep long-running servers that you manage yourself, use a tool like AWS Systems Manager for continuous patch management.
EMR also integrates seamlessly with CloudWatch, CloudTrail, and custom logging solutions to provide comprehensive operational insights.
In terms of data encryption, EMR offers HDFS transparent encryption (managed via KMS) and disk-level encryption for EBS volumes. These features, combined with robust logging via CloudWatch Logs and CloudTrail, ensure end-to-end security and monitoring of your data processing environment.
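The at-rest options above are typically expressed as an EMR security configuration, which is a JSON document attached to the cluster at launch. The sketch below builds one that covers S3 data and local disks, including EBS volumes. The key ARN and helper function are hypothetical, and the field names reflect the EMR security configuration format as we understand it, so verify them against the current documentation before use.

```python
import json


def emr_security_configuration(kms_key_arn: str) -> dict:
    """Build an EMR security configuration enabling at-rest encryption with KMS."""
    return {
        "EncryptionConfiguration": {
            # In-transit encryption needs certificate settings, omitted in this sketch.
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                # Data that the cluster reads from and writes to S3.
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                # Instance store and EBS volumes attached to cluster nodes.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                    "EnableEbsEncryption": True,
                },
            },
        }
    }


# Hypothetical key ARN, used only for illustration.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
security_config = emr_security_configuration(KEY_ARN)
security_config_json = json.dumps(security_config, indent=2)

# With boto3 installed, the JSON string would be passed along the lines of:
#   boto3.client("emr").create_security_configuration(
#       Name="emr-at-rest", SecurityConfiguration=security_config_json)
print(security_config_json)
```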
This overview has covered the security designs built into various AWS data services. By understanding the encryption, access control, and monitoring mechanisms of each service, from Kinesis to EMR, you are better equipped to architect secure data pipelines and processing frameworks on AWS.
For more information on best practices and service-specific guidance, be sure to explore the official documentation and related resources.
Happy architecting!