February 28 Incident Post-Mortem
Posted on March 01, 2017
By Jason Roelofs
Around 1pm Eastern on February 28, 2017, our monitoring tools notified us that all of Harmony, both the Administration Portal and Live Sites, was erroring and inaccessible. It took us around 20 minutes to track down the underlying issue and another 10 minutes to put together and deploy a partial mitigation. The Administration Portal and Live Sites were partially back up after 30 minutes of downtime, with assets and images disabled across all of Harmony. We were able to re-enable assets and images by 4:30pm Eastern, after which Harmony was completely functional once again.
Specifically, Harmony went offline because Amazon S3, where we store all Images and Files, had an outage in which requests for files never received a response. All of our web servers filled up with requests waiting on S3, which reduced our capacity to handle any traffic to zero. Once we realized where the error lay, we deployed a change to Harmony that let us disable requests to S3 entirely, returning the application to mostly normal operation.
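For readers curious what this kind of safeguard can look like, here is a minimal sketch in Python. It is not Harmony's actual code: the class name `AssetBackend`, its parameters, and the `fetch` callable are all hypothetical, and we assume the storage client accepts a timeout and raises `TimeoutError` when no response arrives. The sketch combines a manual switch like the one we deployed with an automatic cutoff after repeated timeouts, so a hung storage service costs a few slow requests rather than every web worker.

```python
import time
from typing import Callable, Optional


class AssetBackend:
    """Hypothetical wrapper around a storage fetch function.

    Combines a manual kill switch with a simple circuit breaker so that
    an unresponsive storage service cannot tie up every request.
    """

    def __init__(
        self,
        fetch: Callable[[str, float], bytes],  # assumed: fetch(key, timeout_s)
        timeout_s: float = 2.0,
        max_failures: int = 3,
        cooldown_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fetch = fetch
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.enabled = True      # manual kill switch, flipped by an operator
        self.failures = 0        # consecutive timeouts seen so far
        self.open_until = 0.0    # breaker stays open until this clock time

    def get(self, key: str) -> Optional[bytes]:
        """Return the asset bytes, or None if storage is disabled or failing."""
        if not self.enabled or self.clock() < self.open_until:
            return None  # skip storage entirely; caller renders without the asset
        try:
            data = self.fetch(key, self.timeout_s)
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many consecutive timeouts: stop calling storage for a while.
                self.open_until = self.clock() + self.cooldown_s
                self.failures = 0
            return None
        self.failures = 0
        return data
```

A caller would treat a `None` result as "render the page without this image", and an operator could set `enabled = False` to turn storage requests off by hand, which is roughly the effect of the change described above.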
Because the Amazon S3 outage was so widespread, we quickly realized that our normal tool for customer communication, Intercom, was also affected, and we found ourselves unable to notify our customers of the outage or explain why websites were missing images. Some customers also noticed that our status page, which was hosted by Pingdom, erroneously said that Harmony was fully live instead of partially available. We then realized that we could not add notices or other information to this status page to keep users informed.
The outage of Amazon S3 was an access-only outage and there was no threat to any of your site’s content. However, seeing your site disappear with an error page, and then function without images, with no communication from us, is not a good customer experience. We apologize for not having adequate communication channels in place for this outage.
While there is little we can do if Amazon S3 has another similar outage, we should be able to communicate the current state of Harmony more effectively. To that end, we are going to do the following:
- Move our status page to StatusPage.io, which will let us add incident information and communicate better with customers during an outage.
- Review our other communication and informational tools and make changes that will let us get the word out more quickly.
- Set up further monitoring so that we can catch a full outage of this nature more quickly and keep Harmony and customer live sites running.
Again, we apologize for the lack of communication during this incident. We understand how frustrating it can be when your sites are not functional and you don’t know why, and we are working to make sure this doesn’t happen again.