Database connection issues across services

Reason for Outage Report

Incident summary

On April 8th, 2024, we experienced a service outage caused by a mismatch between our Root Certificate Authority (CA) and the certificates deployed on internal hosts. The mismatch resulted from a sequence error in the certificate invalidation and rollout process.

The disruption began at 13:26 UTC, and most customer-facing services were back online at 14:29 UTC. The total service downtime was approximately 1 hour and 3 minutes. Our Anti-Phishing service experienced a longer downtime because the mitigation did not restore it.

The rollout of the root cause fix caused another 18 minutes of downtime between 18:18 UTC and 18:36 UTC.

Root cause details

The affected certificates ensure the confidentiality, integrity, and authenticity of cluster-internal connections, for example connections between workload servers and database servers. Because we require strict TLS verification for internal connections as well, all connection attempts failed while the deployed certificates did not match the Root CA.
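To illustrate why strict verification turns a CA mismatch into a hard connection failure, here is a minimal sketch using Python's standard `ssl` module. It is not our production code; the function name `build_strict_client_context` and the `ca_file` parameter are assumptions made for this example.

```python
import ssl
from typing import Optional


def build_strict_client_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context that rejects unverifiable peers.

    `ca_file` is a hypothetical path to the root CA bundle the client trusts.
    """
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Set both explicitly so the strict behavior is visible in the code.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        # Trust only the roots in this bundle; nothing else is loaded.
        context.load_verify_locations(cafile=ca_file)
    return context
```

A handshake made with such a context raises `ssl.SSLCertVerificationError` when the peer's certificate does not chain to a root in the trusted bundle. With strict verification, a client that trusts one root and a server that presents a certificate derived from another cannot connect at all.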

The incident was immediately detected by our monitoring, allowing the development team to start working on temporary mitigations and a full resolution of the root cause in parallel.

This incident exclusively affected the availability of our services and did not affect the confidentiality or integrity of the data.

Timeline

13:26 UTC: A mismatch arises between the expected Root Certificate Authority and the derived certificates. The disruption begins.

13:32 UTC: First status update on status.csis.dk and notification of all subscribed users.

13:43 UTC: The root cause is identified and work on a mitigation and root cause fix has started.

14:30 UTC: The mitigation is in effect for all services but the Anti-Phishing service. All other services are back online.

18:18 UTC: We announce another downtime of up to 20 minutes for the rollout of the root cause fix.

18:36 UTC: All services are back online, including Anti-Phishing.

18:58 UTC: The incident is marked as resolved.

Corrective Action-Items

During the post-incident analysis, we identified the following action items. We are implementing them to avoid similar incidents and to reduce the time to recovery in future incidents:

Update internal code to improve the speed and reliability of the certificate rollout process and avoid human errors. [DONE]
Update internal documentation to reflect the improved process. [DONE]
Add more proactive monitoring to alert about certificate mismatches earlier in the process. [DONE]
Improve internal development tooling to decrease the time to recovery in similar incidents. [IN PROGRESS]
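
A pre-flight check of the kind these action items aim at could look roughly like the sketch below. It is an illustration only, not our actual tooling. The inventory format (host name mapped to an identifier of the root that issued its deployed certificate) and all function names are hypothetical.

```python
from typing import Dict, List, Set


def hosts_that_would_break(
    deployed_issuers: Dict[str, str],
    trusted_roots_after_change: Set[str],
) -> List[str]:
    """Return hosts whose deployed certificate would not chain to a trusted root.

    deployed_issuers: hypothetical inventory, host name -> issuing root identifier.
    trusted_roots_after_change: root identifiers still trusted after the planned step.
    """
    return sorted(
        host
        for host, issuer in deployed_issuers.items()
        if issuer not in trusted_roots_after_change
    )


def safe_to_invalidate(
    deployed_issuers: Dict[str, str],
    trusted_roots_after_change: Set[str],
) -> bool:
    """Allow the invalidation step only if no host would be left mismatched."""
    return not hosts_that_would_break(deployed_issuers, trusted_roots_after_change)


if __name__ == "__main__":
    # Example inventory with made-up host names and root identifiers.
    inventory = {"db-1": "root-old", "worker-1": "root-new"}
    print(hosts_that_would_break(inventory, {"root-new"}))  # → ['db-1']
```

Running such a check before a root is invalidated, and alerting on a non-empty result, would surface a mismatch before it reaches live connections.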
