DynamoWave Chat - A serverless, real-time chat application

DynamoWave Chat is a modern, scalable, serverless real-time chat application.

It's built on top of AWS services --> Lambda, DynamoDB & API Gateway.

➡️ Our core focus here: enhancing the application for high availability, scalability, security, cost-efficiency, and performance.


System Architecture & Components

[Architecture diagram: WebSocket API Gateway, the four Lambda handlers, and the DynamoDB ConnectionsTable]

What does the workflow look like?

              We'll first establish the WebSocket connection
                    ⬇️
              This triggers the ConnectHandler Lambda
                    ⬇️
              The function inserts the connectionId into the ConnectionsTable
                    ⬇️
              The client is notified once the connection is established
                    ⬇️
              The SendMessageHandler Lambda iterates through the connectionIds + sends messages to the connected clients
                    ⬇️
              Once the session ends --> the DisconnectHandler function removes the connectionId from the registry
                    ⬇️
              The connection is now closed
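
To make the workflow concrete, here's a minimal Python sketch of the two central handlers. The handler names and the single-attribute connectionId schema come from the architecture described here; the scan-based fan-out and module layout are our own illustrative assumptions, not the exact deployed code.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ConnectionsTable")

def connect_handler(event, context):
    # API Gateway passes the WebSocket connection id in the request context
    connection_id = event["requestContext"]["connectionId"]
    table.put_item(Item={"connectionId": connection_id})  # register the connection
    return {"statusCode": 200}

def send_message_handler(event, context):
    # The Management API endpoint is derived from the incoming request context
    domain = event["requestContext"]["domainName"]
    stage = event["requestContext"]["stage"]
    apigw = boto3.client("apigatewaymanagementapi",
                         endpoint_url=f"https://{domain}/{stage}")

    message = json.loads(event["body"]).get("message", "")

    # Iterate through the registered connectionIds + push the message out
    for item in table.scan(ProjectionExpression="connectionId")["Items"]:
        apigw.post_to_connection(ConnectionId=item["connectionId"],
                                 Data=message.encode("utf-8"))
    return {"statusCode": 200}
```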

Services we've used plus their purpose

We'll quickly define the purpose of each component in our architecture:-

Please make sure to check out the CloudFormation template above for the initial configuration plus the deployment.

| Service | Identifier | Purpose --> Why we've used it |
|---|---|---|
| API Gateway | WebSocket API | ➡️ Real-time, two-way communication in our application |
| DynamoDB | ConnectionsTable | Our connection registry, for tracking & managing connections |
| AWS Lambda | ConnectHandler | Records every new connection --> keeps us good from an operational standpoint |
| AWS Lambda | DisconnectHandler | ▶️ Removes entries for inactive connections from our table |
| AWS Lambda | SendMessageHandler | Reliable message delivery across the connected clients |
| AWS Lambda | DefaultHandler | Notifies the client once the connection is established |

Now that we're through with the functionality, let's shift our attention to the NFRs (non-functional requirements).


Design considerations:-

How did we improve the application's availability & reliability?

1 --> We've set reserved concurrency for our critical Lambdas.
Our critical Lambdas should always have access to sufficient compute for service continuity, so we've apportioned a fixed quota of the account's concurrency to each of them. This also enables a fair distribution of compute amongst all the Lambdas.


This helps us prevent critical Lambdas from being throttled during peak times. I'd say we're "allocating" a portion of the total concurrency to a specific Lambda function, so it always has the resources it needs to run smoothly 👍
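
A minimal sketch of how that quota could be set with boto3 - the figure of 50 is a placeholder to tune against real traffic, not our production value:

```python
import boto3

lambda_client = boto3.client("lambda")

# Carve out a dedicated slice of the account's concurrency pool
# for the critical function, so it can't be starved by other Lambdas
lambda_client.put_function_concurrency(
    FunctionName="SendMessageHandler",
    ReservedConcurrentExecutions=50,  # placeholder; tune to observed load
)
```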


2 --> We need to be cognizant of the data-durability aspect as well.


We'll enable PITR (Point-In-Time Recovery) for our DynamoDB table. Reason? It lets us restore the data to any second in the last 35 days. This improves the data's availability and fault tolerance in situations where data is accidentally overwritten or deleted.

In mission-critical scenarios, we shouldn't rely on traditional backups scheduled at a fixed time of the day. Once PITR is enabled, it captures changes to the DynamoDB table continuously, allowing us to recover up-to-date information. Benefit? We've negated potential data loss, without the operational overhead or the costs of capacity planning/over-provisioning. It essentially simplifies reverting to our desired state 👍
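
Enabling it is a one-liner against the table; a minimal sketch:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Turn on continuous backups (PITR) for the connection registry
dynamodb.update_continuous_backups(
    TableName="ConnectionsTable",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)
```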


3 --> Our gateway should be able to sustain backpressure scenarios, so that our backend services aren't overwhelmed. We've limited the rate of incoming requests through both API throttling & rate limiting.


Why exactly did we implement this?

--> We're predefining a threshold on the maximum number of requests that can hit the gateway per second, plus a cap on the total number of requests originating from a particular client within a specific time window. This has a strategic advantage from an availability standpoint: it prevents downstream services from being overwhelmed. ➡️ The API stays responsive to legitimate users, and it helps us avert a potential DDoS that might bring down the entire system.
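
For a WebSocket API, these limits sit on the stage (API Gateway v2). A minimal sketch - the API id, stage name, and both limits are placeholders:

```python
import boto3

apigw = boto3.client("apigatewayv2")  # WebSocket APIs are managed via API Gateway v2

apigw.update_stage(
    ApiId="<api-id>",         # placeholder for our WebSocket API's id
    StageName="production",   # placeholder stage name
    DefaultRouteSettings={
        "ThrottlingRateLimit": 100.0,  # steady-state requests per second
        "ThrottlingBurstLimit": 200,   # short-burst ceiling
    },
)
```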


4 --> Multi-AZ deployments => data redundancy => high availability

--> We're resilient to zonal failures within a region 👍


This is a very use-case-specific pointer. Had we been dealing with production systems needing super-high availability across AWS regions,

I'd suggest opting in for DynamoDB global tables --> for scenarios like:-
i. You've got a geographically distributed user base, and you'd want the data positioned near your users
ii. It's a mission-critical application that needs to stay available even in the event of a regional outage
iii. There are regulatory/data-residency commitments dictating which regions your data may live in

Something important I'd like to highlight:- 👉 it necessitates deploying the other supporting components in multiple regions as well --> for you to have a full secondary failover mechanism up and running in another region.

I'd say this is a pure cost-availability tradeoff. We should carefully weigh whether the expenses actually justify the needs of our application 📍
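
For reference, adding a replica region to an existing table looks roughly like the sketch below (assuming the current 2019.11.21 global tables version; the region choice is illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create a replica of the table in a second region; DynamoDB then
# keeps both replicas in sync automatically
dynamodb.update_table(
    TableName="ConnectionsTable",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)
```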


Code optimisations that help enhance Lambda from a reliability standpoint

╰┈➤ We've incorporated error-handling mechanisms within our function logic. This makes sure potential errors don't cascade downstream --> errors are gracefully handled by our Lambda --> application stability ++ 👍

╰┈➤ We felt it was necessary for our critical Lambda - "SendMessageHandler" - to recover from transient errors, for instance network communication errors or failed DB operations. This way we increase the probability of a successful message delivery.


Because it'll re-attempt the operation multiple times before it's considered failed.


╰┈➤ Incessant retries could lead to a scenario wherein the function enters an infinite loop, and we'd want to prevent this. Hence we had to make sure we've got exponential backoff in place too, along with the retry mechanism. This iteratively increases the time interval between two subsequent retries (see the sketch after this list).


So we're not only giving the error more time to resolve, we're also reducing backpressure on our downstream systems. This is what I call a Graceful Error Retry 👍


╰┈➤ An additional enhancement we'd consider in the second iteration, as part of refining the architecture further, would be setting up a DLQ - Dead Letter Queue - to store failed message deliveries --> this would mean zero data loss, plus a potential opportunity to re-process and analyse these messages further.
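
Here's a minimal sketch of the retry + exponential-backoff idea around a single message delivery - the retry count, base delay, and helper name are illustrative choices, not exact production values:

```python
import time
import botocore.exceptions

MAX_RETRIES = 3   # placeholder; tune to your tolerance for delivery latency
BASE_DELAY = 0.2  # seconds; doubled on every attempt

def send_with_backoff(apigw, connection_id, data):
    """Retry transient delivery failures with exponentially growing waits."""
    for attempt in range(MAX_RETRIES):
        try:
            apigw.post_to_connection(ConnectionId=connection_id, Data=data)
            return
        except apigw.exceptions.GoneException:
            # The client is gone for good -- retrying won't help,
            # let the caller prune this connectionId instead
            raise
        except botocore.exceptions.ClientError:
            if attempt == MAX_RETRIES - 1:
                raise  # retries exhausted; surface it (or push it to the DLQ mentioned above)
            time.sleep(BASE_DELAY * (2 ** attempt))  # 0.2s, 0.4s, 0.8s, ...
```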


If I were to improve API Gateway's availability further:-

Solution:- Even though API Gateway is a managed service, inherently resilient to zonal failures, there might be situations wherein we'd like regional redundancy for the gateway. 📌 This could be done by deploying it in multiple regions, and then utilising Route 53 for a DNS failover.


I mean configuring a DNS health check to automatically fail over to the API Gateway in the secondary region.

(We'd also have to ensure the supporting components are up and running in the other region!)

More of a cost-redundancy tradeoff here. We'll need to weigh the benefits against the costs incurred, and whether it really justifies the current needs of the application 👍
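
The health-check half of that failover could look like the sketch below - the domain name, resource path, and thresholds are placeholders. You'd then reference the health check from PRIMARY/SECONDARY failover record sets pointing at each region's gateway:

```python
import boto3

route53 = boto3.client("route53")

# Probe the primary region's endpoint; Route 53 fails DNS over
# to the secondary record once this check goes unhealthy
route53.create_health_check(
    CallerReference="dynamowave-primary-hc",  # any unique string
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api.example.com",  # placeholder domain
        "Port": 443,
        "ResourcePath": "/production",  # placeholder path
        "RequestInterval": 30,          # seconds between probes
        "FailureThreshold": 3,          # consecutive failures before failover
    },
)
```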


Cost-effective Scalability. How?

1 --> I'd come across adaptive auto-scaling for DynamoDB, and I knew I had to utilise it!

Benefit it brings in:- reduced costs 👍 DynamoDB automatically adjusts provisioned capacity to the fluctuating workload, which makes resource utilisation all the more efficient. 👍


We had to choose between on-demand throughput versus provisioned throughput + auto-scaling. Which one did we opt for while designing this?

i. We had to consider the price point here. The "per-unit cost" turns out to be more expensive for the on-demand mode than for the provisioned counterpart + auto-scaling.


Yes, this was the catch here.

On-demand is absolutely wonderful when you've got unpredictable access patterns and capacity planning looks difficult.

But since it charges you for the actual reads/writes, the cost per unit of capacity turns out to be way higher.

If we were in a scenario with consistent, predictable traffic levels, I'd advise that it's way more cost-effective to go with a baseline of provisioned read/write units, plus auto-scaling based on utilisation thresholds. 💡
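
That baseline + threshold setup is wired through Application Auto Scaling; a minimal sketch for the table's write capacity (the bounds and the 70% target are illustrative):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/ConnectionsTable",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,     # our provisioned baseline (placeholder)
    MaxCapacity=100,   # ceiling for scale-out (placeholder)
)

# Track ~70% utilisation: scale out when busier, back in when idle
autoscaling.put_scaling_policy(
    PolicyName="ConnectionsTableWriteScaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/ConnectionsTable",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```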


--

2 --> We had to eliminate Lambda cold starts, to improve both the performance and the scalability of the application.

The simple equation I often mention:-
Pre-warming a set of Lambda instances = fewer cold starts = lower latency 👍


How could we actually optimise for performance in Lambda (while still taking the costs into consideration)?

We had two options at hand:-

Approach I --> Setting provisioned concurrency

We'd have a certain number of execution environments - or rather, Lambda instances - running at all times.


🚩Potential Red Flag:- We're incurring charges for uptime irrespective of the actual usage.


Scenario where this would work:-

Where absolutely zero cold starts are essential and we need to minimise latency at all costs. Also where we've got predictable, consistent traffic patterns.


This is something I'd definitely recommend when we're dealing with production systems. However, given our usage pattern and the subsequent impact on the price point, we'll go with a custom Lambda warmer for now.


Our approach --> Implementing a custom Lambda Warmer

Why?

1 ➔ Our application has sporadic usage patterns

2 ➔ I could not compromise on the performance-critical aspects - for me, consistent execution matters just as much.


How did we solve this challenge?

Implementing a custom Lambda Warmer. 💡

--> We've added a new Lambda function specifically designed to warm up our critical functions ➤ configured to invoke them in a manner that "mimics typical user interactions", without altering the application state.

--> Configured a CloudWatch Events rule that triggers the warmer function - either every 5 minutes, or at fixed times when peak usage is anticipated.

--> We needed an IAM role and policy granting the warmer function permission to invoke the other Lambda functions and write logs to CloudWatch.
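
A minimal sketch of such a warmer - the function list and the "warmer" marker key are conventions of our own, not a standard API:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# The functions we keep warm (specific to this architecture)
CRITICAL_FUNCTIONS = ["ConnectHandler", "SendMessageHandler"]

def lambda_handler(event, context):
    for function_name in CRITICAL_FUNCTIONS:
        # Fire-and-forget invoke with a marker payload the target
        # recognises, so warm-up pings never touch application state
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",
            Payload=json.dumps({"warmer": True}),
        )
```

Each target handler would then short-circuit on the marker as its very first step - something like `if event.get("warmer"): return {"statusCode": 200}` - so a warm-up ping is answered without running any business logic.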


How exactly is Provisioned Concurrency different from the reserved counterpart?

Point 1 --> When I'm talking about provisioned concurrency, it's all about eliminating cold starts by reducing the initialisation latency. Reserved concurrency, on the other hand, is about ensuring a certain portion of the total concurrency stays dedicated to a given Lambda.

Point 2 --> Provisioned concurrency is geared towards enhancing performance, while the reserved counterpart is about managing resource limits.


It prevents a Lambda function from consuming too many resources 👍 --> fair resource utilisation amongst functions.


Point 3 --> PC means you're incurring the cost of keeping instances ready at all times, while RC simply sanctions limits - no costs per se.
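
For completeness, here's how each knob is set - both figures are placeholders, and note that PC additionally requires a published version or alias to attach to:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserved concurrency: a sanctioned limit, no standing cost
lambda_client.put_function_concurrency(
    FunctionName="SendMessageHandler",
    ReservedConcurrentExecutions=50,
)

# Provisioned concurrency: pre-initialised environments, billed for readiness
lambda_client.put_provisioned_concurrency_config(
    FunctionName="SendMessageHandler",
    Qualifier="live",  # placeholder alias; PC needs a version or alias
    ProvisionedConcurrentExecutions=10,
)
```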


From a security standpoint

Point 1 --> We've defined API Gateway resource policies that enforce HTTPS by denying any request that doesn't use it.


We had a couple of options here. First, resource-based policies (specifying the gateway ARN as the resource) that deny requests upfront if "aws:SecureTransport" is false.

Second, we could have a custom domain, attach an SSL/TLS certificate from AWS Certificate Manager, and update the DNS settings. That also achieves the objective of allowing only HTTPS requests.

But that's too much of a roundabout. We'd rather go in for the first alternative.
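
The policy we're describing has roughly this shape - region, account id, and API id are placeholders:

```python
import json

# Deny any invocation of the API that isn't made over TLS
resource_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "execute-api:Invoke",
            "Resource": "arn:aws:execute-api:<region>:<account-id>:<api-id>/*",
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

print(json.dumps(resource_policy, indent=2))
```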

Point 2 --> Data Encryption at rest through KMS Encryption (DynamoDB)


We decided against implementing FGAC, or fine-grained access control, for DynamoDB. Its main purpose is controlling access to specific attributes or specific items.

--> Our schema is pretty simple and straightforward. A single-attribute schema doesn't at all warrant the kind of complexity FGAC brings in. We went ahead with standard IAM policies with table-level access controls.

--> But yes, it's super helpful when we've got multiple attributes, or when specific teams/roles should access only certain partitions of the data, or when we're seeking absolutely locked-down security at a very granular level (something IAM policies lack - they operate at the table level, whereas FGAC operates at the item/attribute level). Useful when we're working under strict compliance requirements 👍


Point 3 --> We've pruned down the IAM policies for the service roles attached to the Lambdas and DynamoDB, strictly to what each component actually needs to function. Less risk of privilege escalation.
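
As an illustration, the ConnectHandler's role could be scoped down to a single action on a single table - region and account id are placeholders:

```python
import json

# Least-privilege policy: only the one action this Lambda performs,
# only on the one table it touches
connect_handler_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "dynamodb:PutItem",
            "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/ConnectionsTable",
        }
    ],
}

print(json.dumps(connect_handler_policy, indent=2))
```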

Point 4 --> Implementing throttling/rate limiting mitigates a potential DDoS. We've mentioned this above in the availability section too --> we're controlling the rate at which a user/bot can hit the gateway ✔️

If we were looking at amping up the security aspects of our application, Cognito User Pools would be a better bet. It's not just a solution for user creation/management; it also lets you authenticate requests, and brings MFA and password recovery. 👍


What kind of refinements could make my current design even more secure?

I'd consider implementing a WAF on top of the gateway. A significant enhancement this brings along is that it protects application availability --> it prevents excessive consumption of downstream resources + averts potential security compromises 👍


We could utilise AWS Managed Rules here - pre-configured web security rules, designed + maintained by AWS and updated automatically from time to time. This abstracts away the manual maintenance of IP sets and simplifies the deployment of such ACLs, reducing the operational burden of maintaining them.
I feel there might be certain situations wherein we need to explicitly block specific IP addresses. That necessitates a more comprehensive solution with both custom rules plus managed rule groups in a web ACL.
This would then make for a perfect security enhancement, wherein both security and the associated operational overhead have been considered 👍
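
A sketch of what that combined web ACL could look like via boto3 - the ACL name, rule priorities, and the IP set ARN are placeholders:

```python
import boto3

wafv2 = boto3.client("wafv2")

wafv2.create_web_acl(
    Name="DynamoWaveChatACL",
    Scope="REGIONAL",  # regional scope, for attaching to gateway stages
    DefaultAction={"Allow": {}},
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,  # feeds the CW alerting noted below
        "MetricName": "DynamoWaveChatACL",
    },
    Rules=[
        {   # AWS-maintained baseline protections
            "Name": "AWSCommonRules",
            "Priority": 0,
            "OverrideAction": {"None": {}},
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesCommonRuleSet",
                }
            },
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "AWSCommonRules",
            },
        },
        {   # Our own explicit block list
            "Name": "BlockListedIPs",
            "Priority": 1,
            "Action": {"Block": {}},
            "Statement": {"IPSetReferenceStatement": {"ARN": "<ip-set-arn>"}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "BlockListedIPs",
            },
        },
    ],
)
```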

On a side note, I'd recommend logging these metrics to CloudWatch, and incrementing a CW metric whenever a request matches a specific regex/attack pattern. You could also configure a CloudWatch alarm to trigger a notification via an SNS topic/mail/message in the event of the threshold being breached --> real-time alerting 📌


How can I achieve better performance optimisation while maintaining costs?

Consistent high-volume read/write traffic would be a signal for me to explore DynamoDB batch operations - particularly for bulk write or delete operations, such as managing multiple connections in the ConnectHandler and DisconnectHandler Lambda functions. For individual, sporadic requests, batching works against you: the system waits for a certain number of requests to accumulate before processing them together, and with sporadic traffic that delay might be noticeable. So it's not recommended for applications with low, sporadic traffic.
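
A minimal sketch of the bulk-delete case - `prune_connections` is a hypothetical helper name:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ConnectionsTable")

def prune_connections(stale_connection_ids):
    # batch_writer buffers deletes into BatchWriteItem calls of up to
    # 25 items each and retries unprocessed items automatically
    with table.batch_writer() as batch:
        for connection_id in stale_connection_ids:
            batch.delete_item(Key={"connectionId": connection_id})
```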


Contributions

Contributions are most welcome!

--> If you've got suggestions on how I could further improve the architectural/configurational aspects, please feel free to drop a message at [email protected]. I'd love to hear your thoughts!


Credit Attribution

I'm grateful to AWS for providing an excellent blog that served as the foundation for this project. This guide -> https://docs.aws.amazon.com/apigateway/latest/developerguide/websocket-api-chat-app.html was instrumental in guiding the implementation of the base architecture.
