Every year, Amazon Prime members wait for the day when they can be even more glued to their computers. This year's Prime Day was actually a day and a half long, thanks to the past success of the event and the gazillion dollars Amazon makes on it. Users are anxious to get their hands on the lightning deals, where a product is available only for a short time and only in limited quantity. This past Prime Day was a bit different, though: it had a hiccup. Fifteen minutes into Prime Day, Amazon's website crashed. The website was down for hours during one of the company's biggest sales days. So what happened?
Going at least a few whys into the root cause analysis shows that Amazon wasn't able to handle the traffic surge and did not secure enough servers to meet the demand. Its auto-scaling feature also failed, which led to a cascade of failures in other systems. On top of that, frustrated users, upset with the experience they were getting, kept trying to bridge the gulf of execution by refreshing pages and redoing steps. This combination kept the website down for hours. Amazon also had to cut off all international customers from the site and strip its Prime Day web design down to the basics to get the website back up and running. So why wasn't enough server space secured, especially at a company that specializes in server space (AWS)?
This was a failure at the base of the requirements pyramid. The system requirements needed to ensure project success were not met, which caused the business requirements and user requirements to fail as well, because the foundation wasn't sound. Server space was a critical constraint in this case.
What elicitation techniques should have been used, and should be used before the next Prime Day, to ensure success? I believe one would be a requirements workshop, combined with prototyping with diagrams to see how many actions have to hit the server per user.
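To make the "actions per user" idea concrete, here is a rough back-of-the-envelope sketch of how the counts from such a diagram could feed a server estimate. Everything in it is an assumption for illustration: the function name, the parameters, and all of the numbers are hypothetical and are not Amazon's actual figures or method.

```python
import math


def estimate_servers(concurrent_users, actions_per_user_per_min,
                     requests_per_action, server_capacity_rps, headroom):
    """Rough estimate of how many servers a traffic peak would need.

    All inputs are hypothetical planning numbers that a requirements
    workshop or prototype diagram would have to supply.
    """
    # Total requests per second generated by all users at the peak.
    peak_rps = (concurrent_users * actions_per_user_per_min
                * requests_per_action) / 60
    # Add a safety margin for retries, e.g. frustrated users refreshing.
    padded_rps = peak_rps * (1 + headroom)
    # Round up, since a fraction of a server cannot be provisioned.
    return math.ceil(padded_rps / server_capacity_rps)


if __name__ == "__main__":
    # Made-up example: 600,000 users, 6 actions a minute each,
    # 5 server requests per action, 1,000 requests/second per server,
    # and 50% extra headroom.
    print(estimate_servers(600_000, 6, 5, 1_000, 0.5))
```

The point of the sketch is that the headroom term is a requirement in its own right: if users refresh and retry when pages are slow, the real load is higher than the "happy path" diagram suggests.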
What else could be done to ensure this doesn’t happen again?