Account service causing high latency
Incident Report for Tink
Postmortem

Business Impact:

The accounts, transactions, and payments endpoints suffered degraded performance, but were not completely down.

Timeline:

Part 1:

  • 00:39: Transactions error alert fired.
  • 01:24: Access error alert fired; investigation began.
  • 02:08: Incident declared.
  • 02:20: Began rolling the pods.
  • 02:28: Found that account-service had been running out of memory and CPU, correlating with the escalating latencies.
  • 02:32: All pods rolled and latencies returned to normal.
  • 02:46: Incident resolved.

Part 2:

  • 07:00: Alerts fired for Kirkby production.
  • 07:19: Incident declared.
  • 07:38: Status message sent.
  • 08:32: Scaled up Kirkby and Oxford.
  • 08:49: Account service appeared to be back to normal.
  • 08:53: Transactions service back to normal.
  • 08:59: Incident resolved.

What did we learn?

  • We need on-call alignment training.
  • We also need to evaluate the accuracy of our alerts on the Main API.
  • Monitoring memory usage would be helpful.

Posted Apr 13, 2021 - 12:24 CEST

Resolved
We're back to normal; the incident is resolved.
Posted Apr 08, 2021 - 09:15 CEST
Monitoring
The account service latencies are currently recovering.
Posted Apr 08, 2021 - 08:58 CEST
Investigating
The high latency is causing failures for all functionality that depends on the accounts endpoint, mainly ingestion of transactions and PFM.
Posted Apr 08, 2021 - 07:51 CEST
This incident affected: PFM services and Other services.