Host Your Generative AI App on AWS Like a Pro

Hosting a generative AI application on AWS offers many opportunities but also presents unique challenges. The primary goal is to leverage serverless event-driven technologies to save costs, but there are technical limitations to consider, such as API Gateway’s integration timeout. This guide will explore various strategies to achieve a serverless architecture that supports real-time updates without being constrained by timeout limits.

Comparison of recommended approaches

Challenges of Hosting Generative AI on AWS

  1. API Gateway Timeout: API Gateway's 29-second integration timeout can be problematic for chat applications whose responses take longer to generate. Although AWS announced the possibility to increase this limit as of June 4, 2024, the challenge of sending results from the server to the UI incrementally (word by word, for instance) still remains.

  2. Complexity of Real-Time Updates: We could use HTTP Server-Sent Events (SSE) or HTTP streaming to deliver the response from the server to the UI incrementally, but API Gateway supports neither of them.

  3. Containerized Solutions: Another option is an Application Load Balancer (ALB) in front of containers hosted in ECS or EKS. While this can still be serverless (if AWS Fargate is used), it deviates from the event-driven architecture paradigm, requiring always-on container instances with the corresponding cost implications.

Below we will list the solutions that maintain a serverless architecture while enabling real-time updates and overcoming the API Gateway integration timeout limitations. Additionally, we will consider a few problematic solutions that are not recommended, along with architectures that won't work for generative AI applications.

Serverless Solutions

Lambda Streaming with Function URLs

Lambda response streaming allows sending real-time updates directly from Lambda functions. Among all the serverless solutions, this one is arguably the simplest and most straightforward. The method still works with AWS Cognito for authentication, but only from inside the Lambda: since function URLs have no native Cognito authorizer, you have to write a small piece of integration code based on the AWS SDK to validate tokens yourself, which is the main downside of this approach.
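
To make this concrete, here is a minimal sketch of what such a handler could look like in the Node.js runtime. The `awslambda.streamifyResponse` wrapper is the runtime-provided API for response streaming; the canned response text and the `chunkResponse` helper are illustrative stand-ins for a real model invocation (e.g. a call to Amazon Bedrock).

```typescript
// Sketch of a Lambda response-streaming handler for the Node.js runtime.
// `awslambda.streamifyResponse` is a global provided by the Lambda runtime;
// the `typeof` guard lets this module load outside Lambda too.

declare const awslambda: any; // provided by the Lambda Node.js runtime

// Pure helper: split a model response into word-sized chunks so the UI
// can render text incrementally instead of waiting for the full answer.
export function* chunkResponse(text: string): Generator<string> {
  for (const word of text.split(" ")) {
    yield word + " ";
  }
}

type WritableChunkStream = { write(chunk: string): void; end(): void };

export async function streamBody(event: unknown, responseStream: WritableChunkStream) {
  // A real handler would first validate the caller's Cognito JWT here
  // (e.g. with the aws-jwt-verify library), since function URLs have no
  // built-in Cognito authorizer.
  for (const chunk of chunkResponse("Hello from the model")) {
    responseStream.write(chunk); // each write is flushed to the client
  }
  responseStream.end();
}

// Wrap only when the Lambda runtime global is present.
export const handler =
  typeof awslambda !== "undefined" ? awslambda.streamifyResponse(streamBody) : streamBody;
```

Invoking the function URL with `RESPONSE_STREAM` invoke mode then delivers each chunk to the browser as it is written, instead of buffering the whole payload.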

API Gateway + Lambda via WebSockets

WebSocket support in API Gateway provides a more interactive user experience, suitable for applications requiring real-time updates. However, it does not support broadcasting to multiple clients, and WebSockets are not the simplest technology to work with either. The advantage of this solution is its flexibility and the native integration with API Gateway, which can accept JWT tokens from AWS Cognito. You would normally choose it if you need maximum flexibility and are ready to invest time and effort into deploying a more complex infrastructure. To simplify this task, AWS has even created the Generative AI Application Builder on AWS implementation guide, which contains a ready-to-use AWS CDK template that you can tailor to your specific use case. Being able to start with an existing template provided and maintained by AWS can be a significant time saver, especially if you are working on a PoC.
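
As an illustration of the server side, the sketch below shows the fan-out loop a backend Lambda could run to push response chunks to connected WebSocket clients. In a real deployment the injected `send` callback would wrap `PostToConnectionCommand` from `@aws-sdk/client-apigatewaymanagementapi`, and the connection IDs would come from a table populated by the `$connect` route; both are abstracted here so the logic stays self-contained.

```typescript
// Sketch: pushing incremental chunks to API Gateway WebSocket connections.
// The sender is injected so the fan-out logic is testable without AWS;
// in Lambda it would call PostToConnectionCommand against the API's
// @connections management endpoint.

type Sender = (connectionId: string, data: string) => Promise<void>;

// Send each generated chunk to every live connection; drop connections
// that fail (API Gateway returns 410 Gone once a client disconnects).
export async function broadcastChunks(
  connectionIds: string[],
  chunks: Iterable<string>,
  send: Sender,
): Promise<Set<string>> {
  const stale = new Set<string>();
  for (const chunk of chunks) {
    for (const id of connectionIds) {
      if (stale.has(id)) continue;
      try {
        await send(id, chunk);
      } catch {
        stale.add(id); // treat any failure as a gone connection
      }
    }
  }
  return stale;
}
```

The returned set of stale connection IDs would then be deleted from the connection table so later broadcasts skip them.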

AppSync Subscriptions

AWS AppSync is AWS's managed GraphQL API service that supports real-time updates pushed from the server to clients via GraphQL subscriptions (which work via WebSockets under the hood and support broadcasting). This approach is a versatile choice for generative AI applications if GraphQL is a requirement. It also integrates well with AWS Amplify Gen 1 and AWS Amplify Gen 2 for simplified deployment, which can give you a huge productivity boost, especially if your project uses JavaScript or TypeScript. Nothing prevents you from deploying AWS AppSync directly via CDK or CloudFormation without AWS Amplify, though.
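
A minimal sketch of what such a schema could look like for chat streaming (all type and field names here are illustrative): the backend calls a mutation for each chunk of model output it receives, and AppSync's real `@aws_subscribe` directive broadcasts it to every client subscribed to that conversation.

```graphql
# Illustrative AppSync schema for streaming chat chunks.

type Chunk {
  conversationId: ID!
  text: String!
}

type Mutation {
  # Called by the backend for every chunk received from the model
  sendChunk(conversationId: ID!, text: String!): Chunk
}

type Subscription {
  # @aws_subscribe wires the subscription to the mutation above,
  # so AppSync fans each chunk out to all subscribed clients
  onChunk(conversationId: ID!): Chunk
    @aws_subscribe(mutations: ["sendChunk"])
}

type Query {
  # GraphQL requires a Query type; minimal placeholder
  ping: String
}
```

Clients simply issue an `onChunk` subscription for their conversation ID and render each payload as it arrives.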

AWS IoT Core

Leveraging MQTT on top of serverless WebSockets, AWS IoT Core supports broadcasting and secure communication via Cognito, albeit with some limitations in direct WebSocket usage. This solution assumes that you do not use WebSockets directly and rely purely on MQTT for bidirectional communication between your server and clients. It also means an extra dependency on the client side - either the AWS SDK or MQTT.js - which can be suboptimal in certain cases. AWS IoT Core integrates natively with AWS Cognito and can easily be used not just for IoT devices but also for building full-featured chat applications, as the AWS IoT Chat Application example shows.
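
To illustrate the topic design this approach implies, the sketch below shows a hypothetical per-conversation topic layout, together with a small matcher that mirrors MQTT's `+`/`#` wildcard semantics so the layout can be sanity-checked without a broker. A real client would subscribe to these topics via MQTT.js or the AWS SDK over a SigV4-signed WebSocket URL obtained from Cognito credentials.

```typescript
// Sketch of an MQTT topic layout for the IoT Core approach.
// AWS IoT topics support `+` (single level) and `#` (multi-level)
// wildcards; IoT Core broadcasts every message published to a topic
// to all clients subscribed to a matching filter.

// Illustrative layout: one topic per conversation.
export function chunkTopic(conversationId: string): string {
  return `chat/${conversationId}/chunks`;
}

// Minimal MQTT topic-filter matcher (subset: `+` and a trailing `#`),
// used here only to check the layout locally.
export function topicMatches(filter: string, topic: string): boolean {
  const f = filter.split("/");
  const t = topic.split("/");
  for (let i = 0; i < f.length; i++) {
    if (f[i] === "#") return true;        // matches the rest of the topic
    if (i >= t.length) return false;      // topic too short
    if (f[i] !== "+" && f[i] !== t[i]) return false;
  }
  return f.length === t.length;
}
```

With MQTT.js, for instance, a client would call `client.subscribe(chunkTopic(id))` after connecting, while the backend publishes each model chunk to the same topic; an IoT policy attached to the Cognito identity restricts which topics each user may subscribe to.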

Non-Serverless Solutions

Containerized AWS Services with ALB

Using ECS, EKS, or App Runner with an ALB allows for HTTP/2 SSE, HTTP/2 streaming, or WebSockets. This approach works around the limitations of API Gateway and can even be based on a serverless technology such as AWS Fargate (when used with ECS or EKS, for instance). However, as an example of non-event-driven architecture, it can lead to higher costs because the containers run 24/7 regardless of whether there is incoming traffic. Moreover, in multi-container setups we have to make sure that user sessions are always served by the same pods, via sticky sessions. The container-based solution has one important advantage, though: its cost is predictable, as you pay a fixed amount for your containers regardless of the traffic (unless you use some sort of auto-scaling, of course). This is not the case with Lambda, where a DDoS attack can trigger a sudden spike of invocations and a huge AWS bill if you are not careful.
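
For contrast with the serverless options, here is a minimal sketch of an SSE endpoint as it could run in such a container: plain Node.js `http` is enough, since the ALB passes the chunked response through to the browser. The canned token list is a stand-in for a real model stream.

```typescript
// Sketch of a containerized SSE endpoint behind an ALB.
import * as http from "http";

// SSE wire format: each event is a `data:` line followed by a blank line.
export function sseChunk(text: string): string {
  return `data: ${text}\n\n`;
}

export function createSseServer(tokens: string[]): http.Server {
  return http.createServer((req, res) => {
    res.writeHead(200, {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    });
    // Each write is flushed incrementally; a real server would write
    // chunks as they arrive from the model instead of from an array.
    for (const t of tokens) res.write(sseChunk(t));
    res.end();
  });
}
```

In the browser, `new EventSource(url)` consumes this stream and fires a `message` event per chunk; sticky sessions on the ALB keep a given client pinned to the container that holds its generation in progress.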

Problematic Solutions

Just increasing API Gateway Integration Timeout without using WebSockets

While increasing the integration timeout allows longer-running requests, it does not support partial responses, forcing the entire response to be sent at once.

EC2 + ALB, App Runner, ECS, or EKS + Short or Long Polling

These solutions involve significant overhead due to multiple requests and the need for sticky sessions in multi-container setups.

API Gateway + Lambda + SQS

Combining these technologies can lead to complex architectures with scalability issues, increased latency, higher costs, and integration challenges.

Solutions That Won’t Work

Short and Long-Polling with Lambda

Lambda’s stateless nature makes it unsuitable for maintaining state between invocations, which is necessary for effective polling.

HTTP/2 SSE with API Gateway or Lambda Function URLs

Neither API Gateway nor Lambda function URLs support SSE, making them unsuitable for streaming updates. AWS recommends Lambda streaming instead (see above).

To help you decide on the best approach for hosting your generative AI application on AWS, the following table provides a comparison of the recommended serverless and non-serverless solutions: ![[Host Your Generative AI App on AWS Like a Pro.png]]

Conclusion

Navigating the complexities of hosting a generative AI app on AWS requires careful consideration of the available serverless and non-serverless solutions. By leveraging technologies like Lambda response streaming, API Gateway WebSockets, AppSync, and AWS IoT Core, you can maintain a cost-effective, scalable, and responsive application. While non-serverless solutions provide alternative approaches, they come with trade-offs: potentially higher costs, more maintenance, and greater complexity. The key is to choose the architecture that balances your needs for real-time updates, cost-efficiency, and scalability.