Limited rate of errors on a few webservices


From 14:00 to 19:30 UTC+1 on 2020/12/09, around 10% of queries to renew access tokens were failing.

Timeline

  •   2020/12/09 - 2:00PM UTC+1: Two servers were added following the internal runbook. Missing firewall rules caused errors on these two servers only;
  •   2020/12/09 - 6:23PM UTC+1: First notification of the issue (external testing);
  •   2020/12/09 - 7:30PM UTC+1: iptables rules added. Issue resolved.

Severity

The severity of the incident has been classified as MINOR, which implies a low impact for our users.

Root cause analysis 

  • Two new web servers were added without being listed in the iptables (firewall) rules on the memcached servers.
  • All queries to memcached failed on these two servers. All other servers remained up and running, which mitigated much of the issue and avoided any major impact.
  • Concerning OAuth2 access tokens: there was no lasting impact, since a retry would succeed; no token was lost.
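
A missing firewall rule of this kind can be detected from a new server before it receives traffic by checking that it can open a TCP connection to each backend. The following is a minimal sketch and not the procedure used during this incident: the function names `can_reach` and `unreachable_backends` are hypothetical, and any host name or port passed in (for example memcached's default port 11211) is an assumption.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. dropped packets)
        # and name resolution errors.
        return False


def unreachable_backends(backends, timeout: float = 2.0):
    """Return the (host, port) pairs that could not be reached."""
    return [(h, p) for h, p in backends if not can_reach(h, p, timeout)]
```

Run on a newly provisioned web server, a call such as `unreachable_backends([("memcache-1.example.internal", 11211)])` (hypothetical host name) returns an empty list only when every backend accepts the connection, so a non-empty result can block the server from being added to the pool.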

Remediation plans

To prevent similar incidents going forward, the following actions will be put in place:

| Preventative Action | Owner | Due date |
| --- | --- | --- |
| Improve Ansible procedure | Head of Cloud Engineering | Already done on 2020/12/10 |
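
The details of the Ansible change are not described in this report. As an illustration of the general idea only, the sketch below derives the expected firewall rules from a single list of web servers, so that a newly added server cannot be omitted. The IP addresses, the port 11211 (memcached's default), and the names `allow_rule` and `missing_sources` are assumptions made for the example.

```python
MEMCACHED_PORT = 11211  # memcached default port; assumed, not stated in the report


def allow_rule(source_ip: str, port: int = MEMCACHED_PORT) -> str:
    """Build the iptables command that accepts TCP traffic from one web server."""
    return f"iptables -A INPUT -p tcp -s {source_ip} --dport {port} -j ACCEPT"


def missing_sources(web_servers, allowed_sources):
    """Return web server IPs absent from the allow list, in inventory order."""
    allowed = set(allowed_sources)
    return [ip for ip in web_servers if ip not in allowed]


# Hypothetical inventory: two existing web servers plus two newly added ones.
web_servers = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
allowed_sources = ["192.0.2.10", "192.0.2.11"]

# Print the rule that would need to be added for each server not yet allowed.
for ip in missing_sources(web_servers, allowed_sources):
    print(allow_rule(ip))
```

Keeping the server list in one place and generating the rules from it means that adding a web server and opening the firewall for it become the same step.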