GoBright - Platform portal / api showing errors – Incident details

Platform portal / api showing errors

Resolved
Partial outage
Started over 3 years ago
Lasted about 1 hour

Affected

Portal

Partial outage from 8:32 AM to 9:52 AM

T1B - Meet-Work-Visit (Europe)

Partial outage from 8:32 AM to 9:52 AM

T2B - Meet-Work-Visit (Europe)

Partial outage from 8:32 AM to 9:52 AM

API

Partial outage from 8:32 AM to 9:52 AM

T1B - Meet-Work-Visit (Europe)

Partial outage from 8:32 AM to 9:52 AM

T2B - Meet-Work-Visit (Europe)

Partial outage from 8:32 AM to 9:52 AM

Updates
  • Update

    We just resolved the issue!

  • Resolved

    Post-mortem report of the EU services degradation on August 10, 2021

    On Tuesday, August 10, 2021, the GoBright platform experienced a service degradation that caused errors to be shown to users of the platform.

    This report describes the timeline and the follow-up steps.

    Timeline:

    August 10, 2021 - 10:32 AM CEST:

    During routine maintenance, while performing regular zero-downtime upgrades, we became aware that the platform was experiencing problems. Our monitoring tooling showed the issues arising, and soon afterwards customers began to experience them.

    August 10, 2021 - 10:50 AM CEST:

    Soon after the issue arose, it became clear that connection issues inside the platform were causing the problems.

    August 10, 2021 - 11:10 AM CEST:

    Investigating the connection issues, we found that the caching service, an essential part of the platform, was not accepting new connections.

    We started adding extra server resources to the caching services.

    August 10, 2021 - 11:37 AM CEST:

    Scaling of the caching services was finished and the platform returned to its normal state.

    We started monitoring the services to make sure they remained stable.

    August 10, 2021 - 11:52 AM CEST:

    Services remained stable and the issue was fully resolved.

    Root causes & lessons learned

    The root cause of this event was that, during the routine zero-downtime maintenance, the server capacity of the updated servers was scaled up. This led to more connections to the caching services, which hit their connection limits while showing no pressure on other metrics such as CPU and memory.

    Because only the connections were hitting the limits, the caching services were not identified as having problems, and were therefore not scaled.

    We have now added the number of connections as an extra monitoring metric, which is also evaluated in scaling scenarios.

    This will prevent the issue from recurring.
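
    The effect of adding the connection count to the scaling evaluation can be sketched as follows. This is a minimal illustration in Python, not GoBright's actual tooling: the `CacheMetrics` type, both function names, the 0.8 threshold, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    """Utilisation snapshot of a caching service; each field is a fraction in [0, 1]."""
    cpu: float
    memory: float
    connections: float  # open connections divided by the configured connection limit


def should_scale_cpu_memory_only(metrics: CacheMetrics, threshold: float = 0.8) -> bool:
    """Scaling check that looks only at CPU and memory (hypothetical 'before' behaviour)."""
    return max(metrics.cpu, metrics.memory) >= threshold


def should_scale(metrics: CacheMetrics, threshold: float = 0.8) -> bool:
    """Scaling check that also evaluates connection usage (hypothetical 'after' behaviour)."""
    return max(metrics.cpu, metrics.memory, metrics.connections) >= threshold


# Hypothetical snapshot: low CPU and memory pressure, connections close to the limit.
incident = CacheMetrics(cpu=0.25, memory=0.40, connections=0.98)

print(should_scale_cpu_memory_only(incident))  # False
print(should_scale(incident))                  # True
```

    With a snapshot like this one, a check based only on CPU and memory does not trigger scaling, while the check that includes connection usage does.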

  • Monitoring

    We have implemented a fix and are currently monitoring the result.

  • Identified

    We are currently applying a fix.

  • Investigating

    The platform is currently experiencing issues, which users see as errors in the portal and apps. We are currently investigating this incident.