AWS Flawed: Silent But Deadly; why AWS Errors Stink

September 29, 2022

“VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit…”

The sky is green. Kubernetes is easy. Node.js is web-scale. Isn’t it nice when you know something is wrong, and the correct answer is glaringly obvious? I wish it were always that easy to answer the questions we have, but that’s not the case.

We recently learned this at CtrlStack while investigating an AWS vCPU Limit Exceeded error. We were sure that we hadn’t requested more than our stated limit, but all signs pointed to that being the problem. Here’s how we dug deeper and unmasked an issue with AWS silent retries.

The Simple Answer

As far as AWS-provided error messages go, this is one of the simpler ones to troubleshoot and resolve 99% of the time. Every EC2 instance type and size has a number of vCPUs associated with it. When a request is made to launch or start an instance, AWS (as part of its verification process) ensures that the requested number of vCPUs does not exceed the account limit in the requested region. If it does, you’ll see this error. AWS provides an easy way for users to view their existing limits and request an increase.
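The check itself is simple arithmetic. Here is a minimal sketch of it, assuming a limit of 4 vCPUs and a 4-vCPU instance type; the function name and its parameters are hypothetical and are not part of any AWS API:

```python
def exceeds_vcpu_limit(vcpus_in_use: int, vcpus_requested: int, vcpu_limit: int) -> bool:
    """Return True if the request would push vCPU usage past the account limit."""
    return vcpus_in_use + vcpus_requested > vcpu_limit


# Nothing is running, so a single 4-vCPU instance fits within a limit of 4.
print(exceeds_vcpu_limit(vcpus_in_use=0, vcpus_requested=4, vcpu_limit=4))  # False

# Another 4 vCPUs already count against the limit, so the same request fails.
print(exceeds_vcpu_limit(vcpus_in_use=4, vcpus_requested=4, vcpu_limit=4))  # True
```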

In our case, we requested an increase to our limit, which helped resolve the issue. However, we were left with a question we couldn’t immediately answer: “Why are we seeing this error when we aren’t requesting more vCPUs than our account limit?”

Piecing Together The Clues

Our team had recently started running tests on an Accelerated Computing instance offered by AWS. At the time, the vCPU account limit for the associated region and instance type was set to only 4 vCPUs. This worked fine for us, given that the instance type we used has 4 vCPUs and we only ran 1 instance at a time. Our process for starting up that instance began at 8 a.m. Eastern every weekday. For a few weeks we observed no issues. Then one day we received the VcpuLimitExceeded error.

Naturally, our first guess was that a team member had accidentally left their test instance running from a previous day. However, we were able to verify (using our own solution) that our single instance was in a stopped state.

Therefore we shouldn’t have received that vCPU limit error. We needed to dig further, so we inspected the raw response from the EC2 API. That’s when we discovered the silent retry leading to our vCPU Limit Exceeded red herring.

AWS Silent Retries: The Unmasking

The raw response we inspected showed our attempt to start the EC2 instance, but to our surprise we saw a completely different error message. Instead of the vCPU limit error, we saw an error for InsufficientInstanceCapacity:

Action=StartInstances&InstanceId.1=i-0db18f4d2b19185ea&Version=2016-11-15
-----------------------------------------------------
2022/05/27 14:00:33 DEBUG: Response ec2/StartInstances Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 500 Internal Server Error
Connection: close
Transfer-Encoding: chunked
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Fri, 27 May 2022 14:00:32 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: accept-encoding
X-Amzn-Requestid: 26b03fad-ae5d-45ac-a256-e23ed5d4c105
-----------------------------------------------------
2022/05/27 14:00:33 <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code><Message>Insufficient capacity.</Message></Error></Errors><RequestID>26b03fad-ae5d-45ac-a256-e23ed5d4c105</RequestID></Response>

This didn’t match the error message we originally saw, so we continued to inspect the raw response. We identified that AWS immediately retried the start request due to the initial error. The retry attempt also generated an error, but this time it was the vCPU limit error:

Action=StartInstances&InstanceId.1=i-0db18f4d2b19185ea&Version=2016-11-15
-----------------------------------------------------
2022/05/27 14:00:34 DEBUG: Response ec2/StartInstances Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 400 Bad Request
Connection: close
Transfer-Encoding: chunked
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Fri, 27 May 2022 14:00:33 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: accept-encoding
X-Amzn-Requestid: 205551f0-98db-48e5-a5da-f8f2ab6405d9
-----------------------------------------------------
2022/05/27 14:00:34 <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>VcpuLimitExceeded</Code><Message>You have requested more vCPU capacity than your current vCPU limit of 4 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.</Message></Error></Errors><RequestID>205551f0-98db-48e5-a5da-f8f2ab6405d9</RequestID></Response>
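Both responses above share the same XML error envelope, so the error code of each attempt is easy to pull out once you have the raw body. The following is a minimal sketch using only Python’s standard library. The parse_ec2_error helper is a hypothetical name for illustration, and the sample body is copied from the first response above:

```python
import xml.etree.ElementTree as ET


def parse_ec2_error(body: bytes) -> dict:
    """Extract the error code, message, and request ID from an EC2 error body."""
    root = ET.fromstring(body)
    error = root.find("./Errors/Error")
    if error is None:
        raise ValueError("body does not contain an <Errors><Error> element")
    return {
        "code": error.findtext("Code"),
        "message": error.findtext("Message"),
        "request_id": root.findtext("RequestID"),
    }


# Body of the first (HTTP 500) response shown above.
first_attempt = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code>"
    b"<Message>Insufficient capacity.</Message></Error></Errors>"
    b"<RequestID>26b03fad-ae5d-45ac-a256-e23ed5d4c105</RequestID></Response>"
)

print(parse_ec2_error(first_attempt)["code"])  # InsufficientInstanceCapacity
```

Running this over every logged response, rather than only the last one, is what makes a mismatch like ours visible.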

It turns out that AWS will silently retry the EC2 start or launch request when the first attempt results in an InsufficientInstanceCapacity error. In the majority of cases, AWS silent retries would be appreciated by the end user. However, AWS doesn’t immediately release the vCPU capacity reservation before attempting the retry! Our established limit only allowed for 4 vCPUs, which was the equivalent of 1 test instance at a time. Therefore, any subsequent retry that didn’t release the vCPU capacity reservation first would fail 100% of the time by exceeding our vCPU limit.
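To make the sequence concrete, here is a toy model of the behavior described above. It is a sketch of our reading of the logs, not AWS’s actual implementation: the FakeEC2 class, its reservation bookkeeping, and the start_with_retry helper are all hypothetical.

```python
class FakeEC2:
    """Toy stand-in for the EC2 control plane (hypothetical, for illustration)."""

    def __init__(self, vcpu_limit: int, has_capacity: bool):
        self.vcpu_limit = vcpu_limit
        self.has_capacity = has_capacity
        self.reserved_vcpus = 0  # vCPUs currently counted against the limit

    def start_instance(self, vcpus: int) -> str:
        # Step 1: the limit check counts any reservation that is still held.
        if self.reserved_vcpus + vcpus > self.vcpu_limit:
            return "VcpuLimitExceeded"
        # Step 2: reserve the vCPUs, then look for physical capacity.
        self.reserved_vcpus += vcpus
        if not self.has_capacity:
            # Assumption: the reservation is NOT released right away on failure.
            return "InsufficientInstanceCapacity"
        return "OK"


def start_with_retry(ec2: FakeEC2, vcpus: int, max_attempts: int = 2) -> list:
    """Retry on capacity errors and record the result of every attempt."""
    results = []
    for _ in range(max_attempts):
        result = ec2.start_instance(vcpus)
        results.append(result)
        if result != "InsufficientInstanceCapacity":
            break
    return results


attempts = start_with_retry(FakeEC2(vcpu_limit=4, has_capacity=False), vcpus=4)
print(attempts)      # ['InsufficientInstanceCapacity', 'VcpuLimitExceeded']
print(attempts[-1])  # VcpuLimitExceeded
```

With a limit of 4 and a 4-vCPU instance, the retry can never succeed in this model, and a caller who only sees the final result is left with the misleading error.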

We submitted a request to increase our limit, but that wouldn’t address the original issue of insufficient capacity. We ultimately concluded that launching or starting an accelerated computing instance at exactly 8 a.m. Eastern each weekday made us more likely to hit this issue, because other AWS customers place similar capacity demands at the top of the hour. In addition to increasing our limit, we also scheduled our setup process to begin 7 minutes after the hour. We haven’t seen the issue since.
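The offset schedule is easy to express in code. Here is a minimal sketch that computes the next weekday start at 7 minutes past the hour. The next_start function is a hypothetical helper, and it assumes the time passed in is already in the scheduler’s local time zone (Eastern in our case):

```python
from datetime import datetime, timedelta


def next_start(now: datetime, hour: int = 8, minute: int = 7) -> datetime:
    """Return the next weekday start time strictly after `now`.

    `now` is assumed to be in the scheduler's local time zone already;
    time zone conversion is out of scope for this sketch.
    """
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


# Friday, May 27, 2022 at 14:00 rolls over the weekend to Monday morning.
print(next_start(datetime(2022, 5, 27, 14, 0)))  # 2022-05-30 08:07:00
```

In a cron-based scheduler, the same schedule would be written as 7 8 * * 1-5.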

What Else Is Hidden From You?

To be clear, I don’t believe for a single moment that AWS intentionally does this to mislead their end users. In fact, most of the time the vCPU Limit Exceeded error you receive is going to be accurate, and can be easily resolved by requesting a limit increase from AWS.

However, it’s important to not assume an error message is accurate when your initial investigation puts it into question. So dig in and see if you can discover your own unknown unknowns!

About Author
Jason Goocher
Founding CS Engineer