Incident summary
On April 8th, 2024, we experienced a service outage due to a mismatch between our Root Certificate Authority (CA) and the certificates deployed on internal hosts. The mismatch was caused by a sequencing error in the certificate invalidation and rollout process.
The disruption began at 13:26 UTC, and most customer-facing services were back online at 14:29 UTC, a total downtime of approximately 1 hour and 3 minutes. Our Anti-Phishing service was down for longer because the mitigation we applied did not restore it.
The rollout of the root cause fix caused another 18 minutes of downtime between 18:18 UTC and 18:36 UTC.
Root cause details
The affected certificates ensure the confidentiality, integrity, and authenticity of cluster-internal connections, for example connections between workload servers and database servers. Because we require strict TLS verification for internal connections as well, all connection attempts failed.
The incident was immediately detected by our monitoring, allowing the development team to start working on temporary mitigations and a full resolution of the root cause in parallel.
This incident exclusively affected the availability of our services and did not affect the confidentiality or integrity of the data.
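To illustrate why strict TLS verification turns a CA mismatch into a hard connection failure, here is a minimal Python sketch of a client-side TLS context. It is not our production code: the function name `build_strict_context` and the `ca_file` parameter are hypothetical and chosen for illustration only.

```python
import ssl
from typing import Optional


def build_strict_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context that refuses unverified peers."""
    # PROTOCOL_TLS_CLIENT enables hostname checking and CERT_REQUIRED by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file is not None:
        # Trust only the root CA bundle at this (hypothetical) path.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

With a context like this, a handshake with a server whose certificate chains to a different root than the trusted one raises `ssl.SSLCertVerificationError` instead of silently falling back to an unverified connection.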
Timeline
13:26 UTC: Mismatch between the expected Root Certificate Authority and the derived certificates.
13:32 UTC: First status update on status.csis.dk and notification of all subscribed users.
13:43 UTC: The root cause is identified, and work on a mitigation and a root cause fix begins.
14:30 UTC: The mitigation is in effect for all services except the Anti-Phishing service. All other services are back online.
18:18 UTC: We announce further downtime of up to 20 minutes for the rollout of the root cause fix.
18:36 UTC: All services are back online, including Anti-Phishing.
18:58 UTC: The incident is marked as resolved.
Corrective action items
During the post-incident analysis, we identified the following action items, which we are implementing to avoid similar incidents and to reduce the time to recovery in future incidents:
Update internal code to improve the speed and reliability of the certificate rollout process and avoid human errors. [DONE]
Update internal documentation to reflect the improved process. [DONE]
Add more proactive monitoring to alert about certificate mismatches earlier in the process. [DONE]
Improve internal development tooling to decrease the time to recovery in similar incidents. [IN PROGRESS]
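The kind of check behind the monitoring and rollout items above can be sketched as follows. This is a simplified model under stated assumptions, not our actual tooling: the function names are hypothetical, and certificates are reduced to plain issuer identifiers (for example, a root fingerprint string) collected per host.

```python
from typing import Dict, List


def find_mismatched_hosts(trusted_root: str, host_issuers: Dict[str, str]) -> List[str]:
    """Return the hosts whose certificate was not issued under the trusted root."""
    return sorted(
        host for host, issuer in host_issuers.items() if issuer != trusted_root
    )


def safe_to_retire_old_root(new_root: str, host_issuers: Dict[str, str]) -> bool:
    """The old root may only be invalidated once every host uses the new root."""
    return not find_mismatched_hosts(new_root, host_issuers)


# Usage with hypothetical host names and issuer identifiers.
hosts = {"db-1": "root-new", "app-1": "root-old"}
print(find_mismatched_hosts("root-new", hosts))
print(safe_to_retire_old_root("root-new", hosts))
```

Running a check like this before invalidating the old root enforces the ordering whose violation caused this incident: certificates are rolled out first, and the old root is retired only afterwards.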
We've now resolved the incident. Thanks for your patience. We will publish a Reason for Outage report as soon as we have analyzed the incident in-depth.
All customer-facing services are back online, but we continue to monitor the situation before marking the incident as resolved.
While we are rolling out the fix of the root cause, services will be unavailable for up to 20 minutes. We will inform you when the rollout is complete and all services are restored.
The mitigation is applied and access to all services is restored. We are still working on fixing the root issue.
We've confirmed the root cause and are working on an intermediate mitigation and a root cause fix in parallel.
We are experiencing database connection issues across systems and are investigating the root cause.