Post-mortem report of the EU services degradation on August 10, 2021
On Tuesday, August 10, 2021, the GoBright platform experienced a service degradation that caused errors to be shown to users of the platform.
This report describes the timeline, the root cause, and the steps taken to prevent a recurrence.
Timeline:
August 10, 2021 - 10:32 AM CEST:
While performing routine maintenance with regular zero-downtime upgrades, we became aware that the platform was experiencing problems. Our monitoring tooling showed the issues as they arose, and soon afterwards customers began to experience them.
August 10, 2021 - 10:50 AM CEST:
Shortly after the issue arose, it became clear that connection issues inside the platform were causing the problems.
August 10, 2021 - 11:10 AM CEST:
While investigating the connection issues, we found that the caching service, which is an essential part of the platform, was not accepting new connections.
We started adding extra server resources to the caching services.
August 10, 2021 - 11:37 AM CEST:
Scaling of the caching services was finished and the platform returned to its normal state.
We continued monitoring the services to confirm that they remained stable.
August 10, 2021 - 11:52 AM CEST:
The services remained stable and the issue was fully resolved.
Root causes & lessons learned
The root cause of this event was that, during the routine zero-downtime maintenance, the server capacity of the updated servers was scaled up.
This led to more connections to the caching services, which hit their connection limits while there was no pressure on other resources such as CPU and memory.
Because only the number of connections was hitting its limit, the caching services were not identified as having problems and were therefore not scaled.
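The failure mode described above can be illustrated with a minimal sketch. All numbers, names, and thresholds below are hypothetical and chosen only for illustration; they are not the platform's actual values or tooling. The sketch assumes each application server holds a fixed number of connections to the cache, and that scaling is decided on CPU and memory alone.

```python
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    """Snapshot of a cache node (hypothetical structure for illustration)."""
    cpu_utilization: float      # fraction between 0 and 1
    memory_utilization: float   # fraction between 0 and 1
    open_connections: int
    max_connections: int


def total_connections(app_servers: int, connections_per_server: int) -> int:
    """Connections requested from the cache, assuming a fixed number per server."""
    return app_servers * connections_per_server


def needs_scaling_cpu_memory_only(m: CacheMetrics, threshold: float = 0.8) -> bool:
    """Scaling check that looks at CPU and memory only and ignores connections."""
    return m.cpu_utilization >= threshold or m.memory_utilization >= threshold


# Hypothetical numbers: scaling application servers from 8 to 12.
limit = 1000
before = total_connections(app_servers=8, connections_per_server=100)
after = total_connections(app_servers=12, connections_per_server=100)

# The cache cannot accept more than its limit, so it sits at the limit
# while CPU and memory stay low.
metrics = CacheMetrics(
    cpu_utilization=0.25,
    memory_utilization=0.40,
    open_connections=min(after, limit),
    max_connections=limit,
)

print(before, after)                           # requested connections before and after
print(after > limit)                           # demand exceeds the connection limit
print(needs_scaling_cpu_memory_only(metrics))  # the check does not trigger scaling
```

In this sketch the demand for connections exceeds the limit, yet the CPU and memory check reports that no scaling is needed, which matches the situation described in this section.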
We have now added the number of connections as an extra monitoring metric, which is also evaluated in scaling scenarios.
This is intended to prevent this issue from happening again.
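As a rough sketch of what evaluating the number of connections in a scaling decision could look like, the example below treats connection utilization as a metric alongside CPU and memory. The function names, the 0.8 threshold, and the sample values are assumptions made for illustration and do not describe the actual monitoring or scaling configuration.

```python
def connection_utilization(open_connections: int, max_connections: int) -> float:
    """Fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return open_connections / max_connections


def should_scale_up(
    cpu_utilization: float,
    memory_utilization: float,
    open_connections: int,
    max_connections: int,
    threshold: float = 0.8,  # hypothetical threshold
) -> bool:
    """Scale up when any metric, including connections, reaches the threshold."""
    conn = connection_utilization(open_connections, max_connections)
    return max(cpu_utilization, memory_utilization, conn) >= threshold


# Low CPU and memory, but connections close to the limit: scaling is triggered.
print(should_scale_up(0.25, 0.40, open_connections=950, max_connections=1000))

# All metrics comfortably below the threshold: no scaling.
print(should_scale_up(0.25, 0.40, open_connections=300, max_connections=1000))
```

Taking the maximum over all metrics means that any single saturated resource is enough to trigger scaling, so a connection limit can no longer go unnoticed while CPU and memory are low.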