Increased error rates and latencies for Cassandra-dependent APIs
Incident Report for Tink
Postmortem

Business Impact:

  • Degraded Auth, Statistics, and Transaction Services due to high Cassandra latencies in the Rugby environment
  • Likely affected endpoints:

/api/v1/transactions/...

/connector/users/{id}/transactions

/api/v1/statistics/query

/api/v1/budgets

/api/v1/insights

/api/v1/insights/action

What did we learn?

  • Because of the way Cassandra handles situations like this, it is hard to be proactive about them.

Duration:

  • The issue started at around 04:10 CET on Jan 5th and ended at 08:25 CET on the same day.
Posted Jan 14, 2022 - 10:53 CET

Resolved
We're back to normal.
The culprit was a spike in delete/update operations from some Cassandra-dependent services. The issue started at around 4:00 CET and caused one node to go down. Restoring the node resolved the situation.

The Transaction Service was also affected by the issue from 04:10 to 08:25 CET, with partial error rates on the affected APIs; degradation peaked (up to a 10% failure rate) between 07:50 and 08:15 CET.
Posted Jan 05, 2022 - 08:31 CET
Investigating
The production environment for the Royal Bank of Scotland is the most affected. We have identified the culprit of this issue and have taken action to fix it.
Posted Jan 05, 2022 - 08:24 CET
This incident affected: Aggregation services.