Increased response time on aggregation and payments

Incident Report for Tink

Postmortem

Timeline

2021-04-12

18:39: Incident was declared by developers
18:40: Rollback of PRs was initiated.
19:03: Instances were rebooted in Oxford.
19:23: Environment was stabilizing; monitoring continued.
19:39: Incident was resolved.

Incident description

A PR included in a release train increased memory consumption until it reached the maximum limit. This triggered heavy garbage collection, which made all endpoints in Oxford-production unresponsive.

How did we resolve the incident?

We stopped the release train and reverted to the previous version. We then restarted the instances because of the high memory usage and the resulting heavy garbage collection.

What did we learn?

To avoid similar situations in the future, we will implement alerting on prolonged or heavy garbage collection.
We will also implement a better way of getting alerted about failed PRs in releases.
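
As an illustration of what alerting on heavy garbage collection could look like, the sketch below samples the JVM's accumulated GC time over an interval and flags the interval when GC takes up too large a share of wall-clock time. It is only a minimal sketch under stated assumptions, not the alerting that was actually implemented: it assumes the affected services run on the JVM, and the class name GcPressureCheck, the 20% threshold, and the one-second sampling interval are all hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPressureCheck {
    // Hypothetical threshold: flag when more than 20% of wall time is spent in GC.
    public static final double GC_FRACTION_THRESHOLD = 0.20;

    /** Accumulated GC time in milliseconds, summed over all collectors in this JVM. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Share of the interval spent in GC; 0.0 for an empty or invalid interval. */
    public static double gcTimeFraction(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        return (double) gcMillisDelta / wallMillisDelta;
    }

    /** True when the GC share of the interval exceeds the (assumed) threshold. */
    public static boolean shouldAlert(long gcMillisDelta, long wallMillisDelta) {
        return gcTimeFraction(gcMillisDelta, wallMillisDelta) > GC_FRACTION_THRESHOLD;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcBefore = totalGcMillis();
        long wallBefore = System.nanoTime();

        Thread.sleep(1000); // hypothetical sampling interval

        long gcDelta = totalGcMillis() - gcBefore;
        long wallDelta = (System.nanoTime() - wallBefore) / 1_000_000;

        System.out.printf("GC time in interval: %d ms of %d ms%n", gcDelta, wallDelta);
        if (shouldAlert(gcDelta, wallDelta)) {
            // A real setup would notify an on-call or monitoring system here.
            System.out.println("ALERT: heavy garbage collection detected");
        }
    }
}
```

In practice a check like this would run periodically and feed an existing monitoring system rather than print to standard output. The threshold would need tuning against normal GC behaviour for each service.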

Posted Apr 19, 2021 - 09:39 CEST

Resolved

This incident has been resolved.

Posted Apr 12, 2021 - 19:51 CEST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 12, 2021 - 19:23 CEST

Identified

We are currently experiencing increased response times on aggregation and payments endpoints.
We have identified the issue and are currently working on a fix.

Posted Apr 12, 2021 - 19:06 CEST

This incident affected: Payments.