AWS Certified Developer - Associate
Data Analytics
Kinesis Data Streams
This guide explores Amazon Kinesis Data Streams—a powerful streaming data service engineered to capture, process, and store data at scale. Think of it as a continuously flowing pipeline where producers inject data and consumers extract and process it efficiently.
Architecture Overview
At the heart of the system is the Kinesis Data Stream acting as a conduit for data. Applications using the Kinesis Client Library (KCL), IoT devices, or other SDK-enabled producers send records into the stream. Consumers—ranging from AWS Lambda functions and SDK-based applications to ECS tasks—read and process the records. Notably, records can be retained for up to 365 days.
When setting up a Kinesis Data Stream, you can choose one of two capacity modes:
On-Demand Mode:
With no requirement for capacity planning, Kinesis automatically scales to meet demand without needing manual shard management.Provisioned Mode:
Here, you explicitly set the number of shards. The stream’s overall capacity is the sum of its individual shard capacities. You may adjust the shard count based on your changing data volumes.
Data Flow: Producers and Consumers
Data flows into the Kinesis stream in the following sequence:
- Producers send records using the PutRecord API.
- Each record includes a partition key along with up to 1 MB of data. The partition key is hashed to determine which shard receives the record.
- A shard’s throughput is limited to 1 MB per second or 1,000 messages per second.
On the consumer side:
- Consumers retrieve records via the GetRecords API (classic pull-based model) or use the SubscribeToShard API (enhanced push-based model).
- In the shared (classic fan-out) mode, all consumers share a total throughput of 2 MB per shard.
- With enhanced fan-out, every consumer gets an individual throughput of 2 MB per shard, providing improved scalability.
Partition Keys and Shard Distribution
When multiple devices (e.g., cameras) transmit telemetry or video data, each device’s identifier can serve as an effective partition key. For instance, in a data stream with three shards:
- A camera with an ID of 120 might be directed to shard one.
- A camera with an ID of 320 could end up on shard one or two, depending on the hash.
- A camera with an ID of 550 might be routed to shard three.
Selecting a partition key with sufficient entropy is vital. A limited set of keys can lead to "hot partitions"—shards overloaded beyond their throughput limits (1 MB/s or 1,000 messages/s).
Tip
Choose partition keys that evenly distribute data across shards to avoid hot partitions.
If a shard becomes overwhelmed, consider these strategies:
- Use exponential backoff for retries.
- Split an overburdened shard.
- Merge underutilized shards to optimize cost.
Consuming Data: Classic vs. Enhanced Fan-Out
Consumers of Kinesis Data Streams can operate in two distinct modes:
- Classic Fan-Out (Shared Mode):
In this mode, all consumers share the available throughput of 2 MB/s per shard. For example, two consumers on a shard may each get up to 1 MB/s if both are active, and the total bandwidth is divided as more consumers join.
- Enhanced Fan-Out:
Each consumer receives an exclusive throughput of 2 MB/s per shard. This is enabled by the SubscribeToShard API, allowing each consumer to process data at a high rate regardless of how many consumers are attached.
The Kinesis Client Library (KCL)
The Kinesis Client Library (KCL) simplifies the development of consumers by abstracting common tasks such as:
- Automatic recovery from consumer failures.
- Checkpointing to track progress using DynamoDB.
- Supporting resharding events to ensure smooth data consumption.
KCL is flexible and can run on platforms like EC2, on-premises, or Elastic Beanstalk. However, note that only one instance of the KCL can run per shard:
- With 2 shards, you can run a maximum of 2 KCL instances.
- With 7 shards, you are limited to 7 instances.
For example, if a data stream has two shards, two KCL instances can operate concurrently—each reading from a different shard and maintaining checkpoints in DynamoDB. Scaling the number of shards requires a corresponding increase in KCL instances to maintain optimal performance.
Summary
Amazon Kinesis Data Streams enables real-time data ingestion, processing, and analysis, making it an ideal solution for log aggregation, real-time metrics, and IoT telemetry applications. Key takeaways include:
- Producers send records using the PutRecord API, with records containing partition keys and up to 1 MB of data.
- Each shard supports a throughput of 1 MB/s or 1,000 messages/s.
- Proper selection and management of partition keys are vital to avoid hot partitions.
- Hot partitions can be mitigated through exponential backoff, shard splitting, or merging underutilized shards.
- In classic fan-out mode, consumers share a 2 MB/s throughput per shard, whereas enhanced fan-out mode allocates 2 MB/s per consumer per shard.
- The Kinesis Client Library automates consumer management, checkpointing, and resharding, with a one-instance-per-shard limitation.
Understanding these concepts will empower you to design and optimize your Kinesis Data Streams implementations effectively.
Watch Video
Watch video content