Database connection issues across services


Reason for Outage Report

Incident summary

On April 8th, 2024, we experienced a service outage due to a mismatch between our Root Certificate Authority (CA) and the certificates deployed on internal hosts. The mismatch was caused by a sequence error in the certificate invalidation and rollout process.

The disruption began at 13:26 UTC and most customer-facing services were back online at 14:29 UTC, for a total service downtime of approximately 1 hour and 3 minutes. Our Anti-Phishing service experienced a longer downtime, as the applied mitigation did not restore it successfully.

The rollout of the root cause fix caused another 18 minutes of downtime between 18:18 UTC and 18:36 UTC.

Root cause details

The affected certificates are used to ensure the confidentiality, integrity, and authenticity of cluster-internal connections, for example connections between workload and database servers. Because we enforce strict TLS verification for internal connections as well, all connection attempts failed.
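To illustrate the strict-verification policy described above, here is a minimal sketch of a client-side TLS context as it could look in Python. The paths and settings are illustrative assumptions, not our actual configuration; the point is that there is no fallback, so any certificate chain that does not anchor in the expected Root CA fails the handshake outright.

```python
import ssl

# Sketch of a strict client-side TLS policy for cluster-internal
# connections (illustrative only, not our actual configuration).
ctx = ssl.create_default_context()
ctx.check_hostname = True            # reject certificates for the wrong host
ctx.verify_mode = ssl.CERT_REQUIRED  # reject any chain not anchored in a trusted Root CA

# In production, the context would instead be anchored to the internal root, e.g.:
# ctx = ssl.create_default_context(cafile="/etc/pki/internal-root-ca.pem")
```

With `CERT_REQUIRED` and no alternative trust anchor, a Root CA mismatch means every internal connection attempt is refused, which matches the behavior seen during this incident.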

The incident was immediately detected by our monitoring, allowing the development team to start working on temporary mitigations and a full resolution of the root cause in parallel.

This incident exclusively affected the availability of our services and did not affect the confidentiality or integrity of the data.


Timeline

13:26 UTC: A mismatch between the expected Root Certificate Authority and the deployed certificates causes cluster-internal connections to fail.

13:32 UTC: The first status update is published and all subscribed users are notified.

13:43 UTC: The root cause is identified; work on a mitigation and a root cause fix starts in parallel.

14:30 UTC: The mitigation is in effect for all services but the Anti-Phishing service. All other services are back online.

18:18 UTC: We are announcing another downtime of up to 20 minutes while rolling out the root cause fix.

18:36 UTC: All services are back online, including Anti-Phishing.

18:58 UTC: The incident is marked as resolved.

Corrective Action Items

During the post-incident analysis, we identified the following action items, which we are now implementing to avoid similar incidents and to improve time to recovery in future incidents:

  • Update internal code to improve the speed and reliability of the certificate rollout process and avoid human errors. [DONE]

  • Update internal documentation to reflect the improved process. [DONE]

  • Add more proactive monitoring to alert about certificate mismatches earlier in the process. [DONE]

  • Improve internal development tooling to decrease the time to recovery in similar incidents. [IN PROGRESS]
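The proactive-monitoring item above can be sketched as a simple periodic check that compares the root certificate actually deployed on a host against the Root CA we expect. This is an illustrative sketch using stand-in data; a real check would operate on the DER bytes of the root certificate served by each internal host.

```python
import hashlib

def root_matches(deployed_root_der: bytes, expected_fingerprint: str) -> bool:
    """Compare the SHA-256 fingerprint of a deployed root certificate
    against the fingerprint of the Root CA we expect to be in use."""
    actual = hashlib.sha256(deployed_root_der).hexdigest()
    return actual == expected_fingerprint

# Illustrative check with stand-in byte strings in place of real DER data:
expected = hashlib.sha256(b"internal-root-ca-v2").hexdigest()
print(root_matches(b"internal-root-ca-v2", expected))  # True  -> roots agree
print(root_matches(b"internal-root-ca-v1", expected))  # False -> raise an alert
```

Running such a comparison as part of the rollout pipeline would surface a Root CA mismatch before certificates reach production hosts, rather than at connection time.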

Christo Karafermanof

We've now resolved the incident. Thanks for your patience. We will publish a Reason for Outage report as soon as we have analyzed the incident in-depth.

Christo Karafermanof

All customer-facing services are back online, but we continue to monitor the situation before marking the incident as resolved.

Christo Karafermanof

While we are rolling out the fix of the root cause, services will be unavailable for up to 20 minutes. We will inform you when the rollout is complete and all services are restored.

Christo Karafermanof

The mitigation is applied and access to all services is restored. We are still working on fixing the root issue.

Christo Karafermanof

We've confirmed the root cause and are working on an intermediate mitigation and a root cause fix in parallel.

Christo Karafermanof

We are experiencing database connection issues across systems and are investigating the root cause.

Christo Karafermanof

Affected components
  • New Threat Intelligence Portal
  • eCrime Threat Intelligence Portal
    • AntiPhishing
    • CIRK
    • CrimeWare
    • Insight Articles
    • Statistics
  • APIs
    • RIRK API
    • Brand Abuse / AntiPhishing API
    • Threat Feeds API