2021-04-12
18:39: Incident was declared by developers.
18:40: Rollback of PRs was initiated.
19:03: Instances in Oxford were rebooted.
19:23: Environment was stabilizing; monitoring continued.
19:39: Incident was resolved.
Incident description
A PR in a release train increased memory consumption until it reached the maximum limit. This triggered heavy garbage collection, which made all endpoints in Oxford-production unresponsive.
The release train was stopped and the PRs were reverted to the previous version. The instances were then restarted because of the high memory usage and the resulting garbage collection.
To avoid similar situations in the future, we will implement alerting on prolonged or heavy garbage collection.
We will also implement a better way of getting alerted about failed PRs in releases.
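
The garbage collection alert could work roughly as sketched below. This is an illustration only, not the implementation we have chosen. It assumes the runtime exposes a cumulative "time spent in GC" counter that is sampled periodically. The names GcSample, gc_overhead and should_alert, the 50% threshold and the 120-second window are all hypothetical and would need tuning against real production data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GcSample:
    timestamp_s: float      # wall-clock time of the sample, in seconds
    gc_time_total_s: float  # cumulative GC time since process start (assumed metric)


def gc_overhead(prev: GcSample, curr: GcSample) -> float:
    """Fraction of wall-clock time spent in GC between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        return 0.0
    return (curr.gc_time_total_s - prev.gc_time_total_s) / elapsed


def should_alert(samples: List[GcSample],
                 threshold: float = 0.5,       # hypothetical: 50% of time in GC
                 sustained_s: float = 120.0    # hypothetical: for two minutes
                 ) -> bool:
    """Return True if GC overhead stays at or above threshold for sustained_s seconds."""
    breach_start = None
    for prev, curr in zip(samples, samples[1:]):
        if gc_overhead(prev, curr) >= threshold:
            if breach_start is None:
                breach_start = prev.timestamp_s
            if curr.timestamp_s - breach_start >= sustained_s:
                return True
        else:
            # Overhead dropped below the threshold, so the breach is not sustained.
            breach_start = None
    return False
```

Requiring the overhead to stay high for a sustained window, rather than alerting on a single sample, is meant to avoid paging on short collection spikes while still catching the kind of prolonged collection seen in this incident.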