Platform portal / API showing errors

Update

10 August 2021 at 09:52

We just resolved the issue!

Resolved

10 August 2021 at 09:52

Post-mortem report of EU services degradation on August 10, 2021

On Tuesday, August 10, 2021, the GoBright platform experienced service degradation, which caused users to see errors when using the platform.

This report covers the timeline, the root cause, and the steps taken to prevent a recurrence.

Timeline:

August 10, 2021 - 10:32 AM CEST:

While performing routine maintenance and regular zero-downtime upgrades, we became aware that the platform was experiencing problems. Our monitoring tooling showed the issues arising, and soon afterwards customers began experiencing them.

August 10, 2021 - 10:50 AM CEST:

Shortly after the issue arose, it became clear that internal connection issues within the platform were causing the problems.

August 10, 2021 - 11:10 AM CEST:

Investigation of the connection issues showed that the caching service, which is an essential part of the platform, was not accepting new connections.

We started adding extra server resources to the caching services.

August 10, 2021 - 11:37 AM CEST:

Scaling of the caching services was finished and the platform returned to its normal state.

We started monitoring the services to confirm they remained stable.

August 10, 2021 - 11:52 AM CEST:

The services remained stable and the issue was fully resolved.

Root causes & lessons learned

The root cause of this event was that, during the routine zero-downtime maintenance, the capacity of the updated servers was scaled up. This led to more connections to the caching services, which reached their connection limits while showing no pressure on other metrics such as CPU and memory.

Because only the connection count reached its limit, the caching services were not identified as having problems and were therefore not scaled.

We have now added the number of connections as an extra monitoring metric, which is also evaluated in scaling scenarios.

This will prevent this issue from recurring.
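
To illustrate the change described above, here is a minimal, hypothetical sketch of a scaling decision that evaluates connection usage alongside CPU and memory. It is not the platform's actual tooling: the names `CacheMetrics`, `should_scale_cpu_memory_only`, and `should_scale`, as well as the 0.8 threshold and the sample numbers, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    """Hypothetical snapshot of one caching service's load."""
    cpu_utilization: float      # fraction between 0.0 and 1.0
    memory_utilization: float   # fraction between 0.0 and 1.0
    open_connections: int
    max_connections: int

    @property
    def connection_utilization(self) -> float:
        # Fraction of the connection limit currently in use.
        if self.max_connections <= 0:
            raise ValueError("max_connections must be positive")
        return self.open_connections / self.max_connections


def should_scale_cpu_memory_only(m: CacheMetrics, threshold: float = 0.8) -> bool:
    """Scaling check that looks only at CPU and memory."""
    return max(m.cpu_utilization, m.memory_utilization) >= threshold


def should_scale(m: CacheMetrics, threshold: float = 0.8) -> bool:
    """Scaling check that also evaluates connection usage."""
    return max(
        m.cpu_utilization,
        m.memory_utilization,
        m.connection_utilization,
    ) >= threshold


# Illustrative numbers only: low CPU and memory, connections near the limit.
sample = CacheMetrics(
    cpu_utilization=0.20,
    memory_utilization=0.30,
    open_connections=980,
    max_connections=1000,
)
print(should_scale_cpu_memory_only(sample))  # False: the pressure goes unnoticed
print(should_scale(sample))                  # True: the connection metric triggers scaling
```

With a check that only considers CPU and memory, a service in this state looks healthy; adding the connection metric makes the same state trigger scaling.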

Monitoring

10 August 2021 at 09:37

We implemented a fix and are currently monitoring the result.

Identified

10 August 2021 at 09:10

We are currently applying a fix.

Investigating

10 August 2021 at 08:32

The platform is currently experiencing issues, which users see as errors in the portal and apps. We are currently investigating this incident.
