Never fail the same way twice


At IG we write a lot of software. We have thirty or so teams in multiple geographic locations developing and supporting around two hundred disparate applications and services, utilising a diverse range of technologies old and new.

Supporting such an environment can be challenging, and one of these challenges is that applications are often affected by recurring issues, which are usually painful and time-consuming to find and resolve.

Historically, these were diagnosed and resolved at the individual application level, and the lessons learned were communicated to development teams via wikis or email, but this proved ineffective:

  • teams were too busy developing 'urgent' features to make the required changes;
  • new joiners would not know of the lessons learned;
  • over time, developers would simply forget and repeat the same mistakes.

The solution was automation.


We developed a system called the Automated Non-Functional Requirements scoring service (ANFR), which assesses all of our applications daily to check compliance with a specified set of non-functional requirements.

Unlike Sonar and other build-time code assessment tools, the ANFR service assesses an application not only at build time but also at runtime, and in all environments.

We leverage third-party runtime systems such as Splunk, AppDynamics, and Puppet, as well as our in-house Application Configuration System, to assess each application against an ever-growing list of assessments.
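
The post does not describe the service's internals, but conceptually each assessment is a small check that pulls data from one of these systems and returns a pass or fail. Below is a minimal sketch in Java, taking the HTTP 500 error-rate requirement from the list below as an example; all of the names here (Assessment, AssessmentResult, MetricsClient) are hypothetical, not IG's actual API:

```java
// Hypothetical model of a single ANFR check; names are illustrative only.
interface Assessment {
    String name();
    AssessmentResult assess(String application, String environment);
}

record AssessmentResult(String assessment, boolean passed, String detail) {}

// Facade over whichever monitoring system holds the request counts
// (e.g. a Splunk query behind the scenes); the methods are assumptions.
interface MetricsClient {
    long requestCount(String application, String environment);
    long responseCount(String application, String environment, int status);
}

// Example check: HTTP 500 responses must stay below 0.01% of all requests.
class ErrorRateAssessment implements Assessment {
    private static final double MAX_ERROR_RATIO = 0.0001; // 0.01%
    private final MetricsClient metrics;

    ErrorRateAssessment(MetricsClient metrics) { this.metrics = metrics; }

    @Override public String name() { return "http-500-rate"; }

    @Override
    public AssessmentResult assess(String application, String environment) {
        long total = metrics.requestCount(application, environment);
        long errors = metrics.responseCount(application, environment, 500);
        double ratio = total == 0 ? 0.0 : (double) errors / total;
        return new AssessmentResult(name(), ratio < MAX_ERROR_RATIO,
                "%d of %d requests returned 500".formatted(errors, total));
    }
}
```

Modelling each requirement as an independent check like this is what makes the list easy to grow: a new assessment is a new implementation, not a change to the service.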

For example:

  • the number of HTTP 500 responses must (at least initially) be below 0.01% of all requests in all environments;
  • every request to a dependent system must go through a Circuit Breaker (a mechanism that stops an application failing because a system it depends on is unavailable; see the sketch after this list);
  • the application must not show degraded performance;
  • the application must not write sensitive information to the logs;
  • the application must not log at DEBUG level in Production;
  • the application must run on a supported JVM version;
  • the library versions used by a project must be higher than a defined minimum;
  • the application must have a health monitor endpoint;
  • the application must have architectural diagrams checked in with the source;
  • the application must have a load test plan defined.
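
The post does not say which Circuit Breaker implementation IG mandates; the hand-rolled sketch below merely illustrates the mechanism the second bullet describes. After a run of consecutive failures the breaker "opens" and callers fail fast, instead of piling up on a dependency that is already broken:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal illustrative circuit breaker; production code would normally
// use a hardened library rather than this sketch.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    synchronized <T> T call(Supplier<T> dependency) {
        // While open, fail fast instead of calling the broken dependency.
        if (openedAt != null) {
            if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            openedAt = null; // half-open: let one trial call through
        }
        try {
            T result = dependency.get();
            consecutiveFailures = 0; // a success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now(); // trip the breaker
            }
            throw e;
        }
    }
}
```

Wrapping every outbound call in something like breaker.call(() -> client.fetchPrices()) means that when a dependency goes down the application degrades quickly and predictably, which is exactly the behaviour the assessment checks for.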

If an application fails an assessment, we block it from Test and UAT deployments; the team must raise an expiring 'lifeline' request before the deployment is allowed.
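
The blocking logic amounts to a simple gate: a deployment proceeds only if every assessment passes, or the team holds an unexpired lifeline covering the failures. A hypothetical sketch, reusing the AssessmentResult record from the earlier example (the Lifeline type and its fields are assumptions, not IG's actual model):

```java
import java.time.Instant;
import java.util.List;
import java.util.Set;

// Hypothetical lifeline: an expiring waiver for named assessments.
record Lifeline(String application, Set<String> waivedAssessments, Instant expiresAt) {
    boolean covers(String assessment) {
        return waivedAssessments.contains(assessment)
                && Instant.now().isBefore(expiresAt);
    }
}

class DeploymentGate {
    /** Allow the deployment only if every failed assessment is covered
     *  by a still-valid lifeline granted to the team. */
    static boolean mayDeploy(List<AssessmentResult> results, List<Lifeline> lifelines) {
        return results.stream()
                .filter(r -> !r.passed())
                .allMatch(failure -> lifelines.stream()
                        .anyMatch(l -> l.covers(failure.assessment())));
    }
}
```

Making the lifeline expire, rather than granting a permanent exemption, is what keeps the pressure on: the team buys time, but the failed assessment eventually blocks them again.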

Issues faced

While everyone agreed with the assessments, and some teams were quick to resolve their issues, we also had a lot of pushback. For example, some teams were adamant that their application should not be held to the same standard in TEST as in PROD, while others were simply too busy working on important business features to fix 'minor' issues.

To mitigate this, we decided to introduce assessments one at a time, prioritising them by ease of remediation and operational criticality.

Conclusions

We are still in the early stages of running the ANFR system. Writing the software and the assessments was straightforward; getting the teams to put in the effort to pass them, however, has proved harder.

The journey towards "never failing the same way twice" continues ... slowly.

About the author:

Henock Zewde works as a Technical Team Lead of the Performance and Reliability Team (PRET) at IG. Find him on LinkedIn.