Increased error rates and latencies for Cassandra-dependent APIs
Incident Report for Tink
Postmortem

Business Impact:

  • Degraded Auth, Statistics, and Transaction Services due to high Cassandra latencies in the Rugby environment
  • Likely affected endpoints:

/api/v1/transactions/...

/connector/users/{id}/transactions

/api/v1/statistics/query

/api/v1/budgets

/api/v1/insights

/api/v1/insights/action

What did we learn?

  • Because of the way Cassandra handles situations like this, it is hard to be proactive about them.

Duration:

  • The issue started at around 04:10 CET on Jan 5th and ended at 08:25 CET on the same day.
Posted Jan 14, 2022 - 10:53 CET

Resolved
We're back to normal.
The culprit was a spike in delete/update operations from some Cassandra-dependent services. The issue started at around 4:00 CET and caused one node to go down. Restoring the node resolved the situation.

The Transaction Service was also affected by the issue from 04:10 to 08:25 CET, with partial error rates on the affected APIs; degradation peaked (up to a 10% failure rate) between 07:50 and 08:15 CET.
Posted Jan 05, 2022 - 08:31 CET
Investigating
The production environment for the Royal Bank of Scotland is the most affected. We have identified the culprit of this issue and have taken action to fix it.
Posted Jan 05, 2022 - 08:24 CET
This incident affected: Aggregation services.