Understanding and addressing "Too many connections" 500/503 responses from AWS Bedrock
Generative AI applications that call AWS Bedrock without provisioned throughput may occasionally see the following HTTP status code 500 exception logged in observability tools:
500 Too many connections, please wait before trying again. (Service: BedrockRuntime, Status Code: 503, Request ID: <the-id-here>)
You may also see invocation server errors recorded in any AWS CloudWatch dashboards you have for Bedrock.
(Yes, those are two different HTTP status codes in one response)
I encountered this error response in a production generative AI application that I support. At first, based on the "too many connections" wording, I assumed I had exceeded some unknown quota or a service threshold I was unaware of; it "felt" similar to the 429 throttling exception.
I found that although the exception is logged as a 500, HTTP status codes 500 and 503 have distinct entries in the Bedrock docs:
- 500 - The request processing has failed due to server error
- 503 - The service is temporarily unable to handle the request
Neither listed cause actually matches the error message logged for either 500 or 503... and the suggested solution for both situations is to implement retries with exponential backoff.
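That suggestion can be sketched in a few lines. The snippet below is a minimal illustration, not Bedrock-specific: the actual call is abstracted behind a zero-argument callable, and the `BedrockServerError` placeholder stands in for the `botocore.exceptions.ClientError` whose HTTP status you would inspect in a real application.

```python
import random
import time


class BedrockServerError(Exception):
    """Placeholder for the 500/503 server error. With boto3 you would
    catch botocore.exceptions.ClientError and inspect the status code."""


def invoke_with_backoff(invoke, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Call invoke() and retry transient server errors with exponential
    backoff and full jitter. `invoke` is any zero-argument callable,
    e.g. a lambda wrapping a bedrock_runtime.invoke_model(...) call."""
    for attempt in range(max_retries + 1):
        try:
            return invoke()
        except BedrockServerError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            # cap the exponentially growing delay, then sleep a random
            # fraction of it (full jitter) to avoid synchronized retries
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Note that each retried invocation is a fresh request, so (per the cost observation below) failed attempts may not be free.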
Aside from the obvious impact this has on latency-sensitive applications, here are a few other observations that may be helpful to know:
- cross-region inference does not help availability here. If the secondary region is experiencing high utilization, traffic can still be deprioritized, leading to the same error
- this error can also be returned if the generative AI application does not generate sufficient load
- this error still incurs input and output token costs. I am highlighting this point because it is a behavior I do not understand and have asked our AWS support rep about: if CloudWatch recorded an invocation server error event, how can it be that tokens are returned? My suspicion is that streaming was attempted somehow anyway; I will update this post once I have more information
- another way to mitigate this error is to make a provisioned throughput commitment (which is probably hard to justify if this error is thrown because of low utilization)
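A middle ground before committing to provisioned throughput is to delegate the retry logic to the SDK itself. This is a sketch assuming boto3/botocore; the `adaptive` retry mode layers client-side rate limiting on top of the standard exponential backoff with jitter, which should cover transient 500/503 responses like the one described above.

```python
# assumes boto3 is installed and AWS credentials are configured
import boto3
from botocore.config import Config

# "adaptive" adds client-side rate limiting on top of the "standard"
# mode's exponential backoff with jitter
retry_config = Config(retries={"max_attempts": 8, "mode": "adaptive"})

bedrock_runtime = boto3.client("bedrock-runtime", config=retry_config)
# calls such as bedrock_runtime.invoke_model(...) will now retry
# transient server errors automatically
```

Given the token-cost observation above, it is worth choosing `max_attempts` deliberately rather than maximizing it.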
So when deciding how to approach this error, it is crucial to understand the nuance behind it, so that you can choose a solution that still balances your latency and cost requirements.