Unscheduled Downtime

RFO 23.02.2024

Incident Summary

On the 23rd of February 2024, between 03:12:33 UTC and 06:06:33 UTC, the CSIS Threat Intelligence Portal and Threat Intelligence APIs were unavailable. The cause was an exhaustion of available database connections. The issue was detected both by our automatic monitoring and by CSIS MDR-A staff, who escalated the incident to the development team. The responding development team identified the root cause at 06:01 UTC and canceled the offending database transactions, which were blocking other transactions from completing. As a result, all affected services became available again at 06:06:33 UTC. In total, the affected services were unavailable for 2 hours and 54 minutes. This brought our Threat Intelligence API availability in February to 99.48%.
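
The duration stated above can be checked directly from the two timestamps. The following is a minimal sketch of that arithmetic; the variable names are illustrative only. It also computes the share of a 29-day February that this single incident represents. The 99.48% monthly figure is the one reported above and may reflect other downtime or a different measurement basis that this sketch does not model.

```python
from datetime import datetime, timezone
import calendar

# Start and end of the outage as given in the incident summary.
start = datetime(2024, 2, 23, 3, 12, 33, tzinfo=timezone.utc)
end = datetime(2024, 2, 23, 6, 6, 33, tzinfo=timezone.utc)

outage = end - start
outage_minutes = outage.total_seconds() / 60  # 2 h 54 min = 174 minutes

# February 2024 is a leap-year February with 29 days.
days_in_month = calendar.monthrange(2024, 2)[1]
month_minutes = days_in_month * 24 * 60

# Share of the month taken up by this incident alone, in percent.
incident_share = outage_minutes / month_minutes * 100

print(f"Outage duration: {outage} ({outage_minutes:.0f} minutes)")
print(f"Share of the month for this incident alone: {incident_share:.2f}%")
```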

Root cause details

A long-running schema change (DDL) transaction on the database storing compromised credentials coincided with a surge in compromised credentials being detected. The DDL transaction blocked the processing of the incoming compromised credentials (DML transactions). The blocked DML transactions kept their database connections open while they waited, which eventually exhausted all available database connections, including those needed by other services.
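
The failure mode described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the actual CSIS setup: the pool size, the function names, and the use of a thread lock to stand in for the lock held by the DDL transaction are all hypothetical, and the connection used by the DDL transaction itself is ignored.

```python
import threading

POOL_SIZE = 3  # hypothetical; the real pool size is not stated in the report


def simulate(pool_size=POOL_SIZE):
    pool = threading.BoundedSemaphore(pool_size)  # shared connection pool
    table_lock = threading.Lock()   # stands in for the lock held by the DDL transaction
    holding = threading.Semaphore(0)  # counts workers that hold a connection

    def dml_worker():
        pool.acquire()              # take a connection from the shared pool
        holding.release()           # signal that a connection is now held
        try:
            with table_lock:        # blocks for as long as the DDL holds the lock
                pass                # the actual write would happen here
        finally:
            pool.release()          # the connection is only returned afterwards

    def other_service_request(timeout=0.1):
        # Another service tries to get a connection from the same pool.
        got = pool.acquire(timeout=timeout)
        if got:
            pool.release()
        return got

    table_lock.acquire()            # the long-running DDL transaction starts
    workers = [threading.Thread(target=dml_worker) for _ in range(pool_size)]
    for w in workers:
        w.start()
    for _ in range(pool_size):
        holding.acquire()           # wait until every worker holds a connection

    during = other_service_request()  # pool is exhausted, so this times out

    table_lock.release()            # the DDL transaction is canceled
    for w in workers:
        w.join()

    after = other_service_request()   # connections are available again
    return during, after


during, after = simulate()
print("other service served during DDL:", during)
print("other service served after cancel:", after)
```

The point of the sketch is that the blocked writers do not fail; they wait while holding connections, so an unrelated service sharing the pool is starved until the blocking transaction is canceled.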

Timeline:

2024-02-23 03:12:33 UTC: Our monitoring detects the unavailability of TIP APIs.
2024-02-23 03:18:32 UTC: Our status page at status.csis.dk is updated to reflect the outage.
2024-02-23 03:47:33 UTC: The incident is escalated to the CSIS development team.
2024-02-23 06:01:39 UTC: The root cause is identified.
2024-02-23 06:06:33 UTC: All services are restored.

Corrective Action Items

During the post-incident analysis we identified the following action items, which we are implementing to avoid similar incidents and to decrease our time-to-recovery in future incidents:

Refactoring of our compromised credential processing pipeline to prevent similar database connection exhaustion from occurring again. [DONE]
Proactive alerting of the development team in case of unusual database connection patterns. [IN PROGRESS]
Further compartmentalization of services to limit the fallout of similar database connection exhaustion incidents to the offending service only. [IN PROGRESS]
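
As an illustration of the last two items, the following sketch shows one possible way to give each service its own connection budget and to flag unusually high connection usage. It is not the implementation described above; the class name, service names, limits, and the 80% threshold are all hypothetical.

```python
import threading


class ServicePools:
    """Hypothetical per-service connection budgets (a 'bulkhead' pattern)."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._in_use = {name: 0 for name in limits}
        self._pools = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}
        self._guard = threading.Lock()  # protects the in-use counters

    def try_acquire(self, service, timeout=0.05):
        # Only the calling service's own budget is consulted.
        got = self._pools[service].acquire(timeout=timeout)
        if got:
            with self._guard:
                self._in_use[service] += 1
        return got

    def release(self, service):
        with self._guard:
            self._in_use[service] -= 1
        self._pools[service].release()

    def should_alert(self, service, threshold=0.8):
        # Flag a service whose connection usage reaches the given fraction of its budget.
        with self._guard:
            return self._in_use[service] / self._limits[service] >= threshold


# Illustrative service names and limits.
pools = ServicePools({"credentials": 2, "portal": 2})

# The credential pipeline uses up its whole budget...
first = pools.try_acquire("credentials")
second = pools.try_acquire("credentials")
third = pools.try_acquire("credentials")    # over budget, times out

# ...while the portal can still obtain a connection.
portal_ok = pools.try_acquire("portal")

print("credentials over budget refused:", not third)
print("portal still served:", portal_ok)
print("alert for credentials:", pools.should_alert("credentials"))
print("alert for portal:", pools.should_alert("portal"))
```

With separate budgets, a stalled pipeline can only exhaust its own connections, and the usage check gives an early signal before that limit is reached.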
