- Managed alternative to Apache Kafka
- Use-cases
- "Real-time" big data e.g. application logs, metrics, IoT, clickstreams.
- Stream-processing frameworks (Spark, NiFi, etc.)
- High availability through data replication across 3 AZs.
- Has 4 different products:
- Kinesis Streams (data stream): Ingest & process streaming data
- Kinesis Firehose (delivery stream): load streams into S3, Redshift, Elasticsearch
- Kinesis Analytics: perform real-time analytics on streams using SQL
- Kinesis Video Streams: Process & analyze streaming media in real time.
- 💡 Common integration: Feed Kinesis Streams -> Analyze data in real time using Kinesis Analytics -> Load streams into Kinesis Firehose -> S3/Redshift for long-term retention.
- Security
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Possibility to encrypt / decrypt data client side (harder)
- VPC Endpoints available for Kinesis to access within VPC
Attribute | Data Streams | Firehose | Analytics | Video streams |
---|---|---|---|---|
Description | Blob (max 1 MB) ingestion | Convert & load blobs | Query engine | Video ingestion |
Throughput/Limits | Per-shard limits on read/write rate & message size; max 1 MB blob | None; limited by the source data (max 1 MB blob) | None; limited by the source stream | Stream-level (5 TPS/H), connection-level (1/s), bandwidth (MB/s), fragment-level |
Latency | Real time | Near real time (60 seconds latency) | Real time | Rendering + encryption (if stored) |
Data retention | 1 day (default), up to 7 days | Retries for up to 24 hours, then fails | Can create streams based on queries | Min 1 hour, max 10 years (in S3) |
Scaling | Add shards | Auto-scales | Auto-scales | Auto-scales |
Concepts | Shards, Producers, Consumers | Delivery stream, Record, Destination | Input, Code, Application | Video stream, Fragment, Producer, Consumer, Chunk |
Use-cases | Streaming ingestion | Convert & load streaming data into e.g. Redshift, S3, Elasticsearch, Splunk | Real-time analytics using SQL | Machine learning & analytics |
- Low-latency streaming ingest at scale, like Kafka.
- Streams are divided into ordered shards / partitions
- Data goes to any shard & consumers consume from any shard
- Producers -> Stream (Shard 1 / Shard 2 / Shard 3) -> Consumers
- Pub/sub through streams (topic)
- Multiple applications can consume the same stream
- You get a sequence number (a checkpoint / offset that should be saved) back for each record added (see the `put_record` sketch below)
- To scale: Add shards -> Increased throughput
- Pricing is per shard provisioned; you can have as many shards as you want
- Data retention is 1 day by default, can go up to 7 days
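A minimal producer sketch using boto3 (the stream name and payload are hypothetical); each successful write returns the shard ID and the sequence number mentioned above:

```python
import boto3

kinesis = boto3.client("kinesis")

resp = kinesis.put_record(
    StreamName="my-stream",            # hypothetical stream name
    Data=b'{"temperature": 21.4}',     # blob, max 1 MB before encoding
    PartitionKey="sensor-42",          # hashed to pick the target shard
)

# The sequence number is the per-shard "offset" worth checkpointing.
print(resp["ShardId"], resp["SequenceNumber"])
```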
- Data
- The message is a base64-encoded blob
- ❗ Data (before encoding) cannot exceed 1 MB
- Ability to reprocess / replay data, as data is not removed after a message is handled.
- Immutable: Once data is inserted in Kinesis, it cannot be deleted
- Batching available or per-message calls
  - 💡 Use batching with the `PutRecords` API to reduce costs and increase throughput (sketch below)
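A batching sketch with the `PutRecords` API (boto3; stream name and payloads are made up). Note that `PutRecords` is not atomic: individual entries can fail while the call itself succeeds, so check `FailedRecordCount`:

```python
import boto3

kinesis = boto3.client("kinesis")

# 100 hypothetical events in a single API call instead of 100 PutRecord calls.
records = [
    {"Data": f'{{"event_id": {i}}}'.encode(), "PartitionKey": f"user-{i}"}
    for i in range(100)
]

resp = kinesis.put_records(StreamName="my-stream", Records=records)

# Entries that were throttled carry ErrorCode / ErrorMessage; retry only those.
if resp["FailedRecordCount"] > 0:
    failed = [r for r, out in zip(records, resp["Records"]) if "ErrorCode" in out]
    print(f"{len(failed)} records to retry")
```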
- Security: You can enable server-side encryption with an AWS KMS master key
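A sketch of turning that on for an existing stream via the `StartStreamEncryption` API (stream name and key are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# New data written after this call is encrypted at rest with the given KMS key.
kinesis.start_stream_encryption(
    StreamName="my-stream",        # hypothetical
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",     # AWS-managed key, or your own CMK
)
```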
- Monitoring
  - You can capture shard-level metrics with CloudWatch at additional cost
    - E.g. `IncomingBytes`, `IncomingRecords`, `OutgoingBytes`, `WriteProvisionedThroughputExceeded`, `ReadProvisionedThroughputExceeded`
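A sketch of opting in to those shard-level metrics with the `EnableEnhancedMonitoring` API (stream name is hypothetical):

```python
import boto3

kinesis = boto3.client("kinesis")

# Shard-level metrics are off by default and billed extra once enabled.
kinesis.enable_enhanced_monitoring(
    StreamName="my-stream",
    ShardLevelMetrics=[
        "IncomingBytes",
        "IncomingRecords",
        "OutgoingBytes",
        "WriteProvisionedThroughputExceeded",
        "ReadProvisionedThroughputExceeded",
    ],
)
```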
- Shards
- One stream is made of many different shards
- The number of shards can evolve over time (re-shard, i.e. split / merge)
- Records are ordered per shard
- Partition key
  - Partition key gets hashed (MD5) to determine the shard ID (see the sketch after this list)
  - Ensures ordering within a shard; the same key always goes to the same shard
  - 💡 Choose a partition key that's highly distributed
    - 📝 Helps prevent a hot partition / hot shard
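A rough illustration of that routing: Kinesis MD5-hashes the partition key into a 128-bit integer and sends the record to the shard whose hash-key range contains it (the two-shard layout below is made up):

```python
import hashlib

# Illustrative two-shard stream: each shard owns a contiguous hash-key range.
SHARD_RANGES = [
    (0, 2**127 - 1),          # shard 0
    (2**127, 2**128 - 1),     # shard 1
]

def shard_for_key(partition_key: str) -> int:
    """MD5-hash the key to a 128-bit int and find the owning shard."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for i, (start, end) in enumerate(SHARD_RANGES):
        if start <= h <= end:
            return i

print(shard_for_key("user-42"))  # same key -> same shard, always
```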
- Throughput
- ❗ 1 MB/s or 1000 messages/s at write PER SHARD
- ❗ 2 MB/s at read PER SHARD
- ❗ You get a `ProvisionedThroughputExceeded` exception if you go over the limits
  - 💡 Solutions (see the back-off sketch after this list):
- Use retries with exponential back-off
- Increase shards (scaling)
- Ensure your partition key is well distributed, e.g. you don't have a hot shard
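A sketch combining the first two solutions: retry with exponential back-off, and scale out with `UpdateShardCount` if throttling persists (stream name, payload, and back-off constants are arbitrary):

```python
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, max_attempts: int = 5):
    """Retry throttled writes with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="my-stream", Data=data, PartitionKey=key
            )
        except ClientError as e:
            if e.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(0.1 * 2 ** attempt)   # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after retries")

# If throttling persists, add shards (UNIFORM_SCALING splits them evenly):
# kinesis.update_shard_count(StreamName="my-stream", TargetShardCount=4,
#                            ScalingType="UNIFORM_SCALING")
```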
- SDKs
- Normal consumer (CLI, SDK, etc.); see the sketch after this list
- Kinesis Client Library (in Java, Node, Python, Ruby, .NET)
- Uses DynamoDB to checkpoint offsets
- KCL uses DynamoDB to track other workers and share the work amongst shards
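For contrast with the KCL, a minimal "normal consumer" sketch with the plain SDK; the KCL automates this read loop and checkpoints the sequence numbers into DynamoDB for you (stream and shard IDs are placeholders):

```python
import time

import boto3

kinesis = boto3.client("kinesis")

# Start at the oldest available record of one shard.
it = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while it:
    resp = kinesis.get_records(ShardIterator=it, Limit=100)
    for record in resp["Records"]:
        # In a real consumer you would checkpoint this sequence number.
        print(record["SequenceNumber"], record["Data"])
    it = resp.get("NextShardIterator")
    time.sleep(1)  # open shards keep returning iterators; don't hot-loop
```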
- Fully managed service, no administration, automatic scaling
- Near real time (60 seconds latency)
- Load data into Redshift / Amazon S3 / Elasticsearch / Splunk
- Supports many data formats (pay for conversion)
- Pay for the amount of data going through Firehose
- Concepts
- Delivery stream: Create & send data into it
- Record: Data of interest your data producer sends to a delivery stream
- ❗ Max size, before base64 encoding, is 1000 KB
- Destination: data storage, S3, Amazon Redshift, Amazon Elasticsearch Service, Splunk.
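A producer-side sketch for a delivery stream (boto3; the name is hypothetical and the stream is assumed to be configured with, e.g., an S3 destination):

```python
import boto3

firehose = boto3.client("firehose")

# Firehose buffers records and flushes them to the destination in batches,
# which is where the ~60-second latency comes from.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",   # hypothetical
    Record={"Data": b'{"page": "/home"}\n'},   # max 1000 KB before encoding
)
```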
- Perform real-time analytics on Kinesis Streams using SQL
- Fully managed with auto scaling to match source stream throughput.
- Pay for actual consumption rate
- Can create streams out of the real-time queries
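A rough sketch of what such an application looks like: a SQL query counting page hits over 10-second tumbling windows, registered via `create_application` (names are hypothetical, and the input/output plumbing that binds `SOURCE_SQL_STREAM_001` to a Kinesis stream is configured separately):

```python
import boto3

# In-application SQL: read from the source stream, aggregate, emit to a
# destination stream that can feed e.g. Firehose.
sql = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("page" VARCHAR(64), "hits" INTEGER);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "page", COUNT(*)
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "page",
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
"""

analytics = boto3.client("kinesisanalytics")
analytics.create_application(ApplicationName="page-hits", ApplicationCode=sql)
```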
- Ingest video streams for e.g. analytics and machine learning.
- HTTP Live Streaming (HLS) enables you to play back video for live and on-demand viewing
- Uses Amazon S3 as the underlying data store.
- Encryption at rest using KMS, access control using IAM roles, and encryption in transit using TLS.
- Concepts
- Video stream: AWS resource that encrypts, time-stamps & stores incoming media.
- Fragment: self-contained sequence of frames
- Source: video-generating device, e.g. a security camera or a body-worn camera
- Consumer: consume and process data in Kinesis video streams e.g. EC2 / AWS AI services / 3rd party.
- Chunk: Stored data, consists of the actual media fragment
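A consumer-side sketch: look up the endpoint that serves `GetMedia` for a (hypothetical) stream, then read the live chunk stream:

```python
import boto3

# Each stream's data APIs are served from a stream-specific endpoint.
kv = boto3.client("kinesisvideo")
endpoint = kv.get_data_endpoint(
    StreamName="my-camera", APIName="GET_MEDIA"
)["DataEndpoint"]

media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
resp = media.get_media(
    StreamName="my-camera",
    StartSelector={"StartSelectorType": "NOW"},  # start at the live edge
)

# resp["Payload"] is a streaming MKV byte stream of chunks
# (fragments plus metadata); read it and feed it to a decoder / ML model.
first_bytes = resp["Payload"].read(1024)
```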