Breaking Up The Monolith


How we gradually, safely and confidently migrated IG’s most critical dealing flows from a monolithic, single-point-of-failure booking engine to microservices.



Context

At IG, we have two flows that are considered critical to our business of allowing retail customers to trade in financial derivatives such as contracts for difference, financial spread betting and stockbroking: login and trade.

Login

The part of login that pertains to our Single Point of Failure (SPOF) challenge was the post-login display of currently open trading positions.

Trade

The trade flow consists of two sub-flows: ‘open a deal ticket’ and ‘place a trade’.

These flows take place over thirty-one instances of our RestfulGateway (running on Tomcat), each calling thirty instances of our BookingEngine (running Coherence on six physical hosts) over RMI.

On an average day, we get around seven million requests to the BookingEngine for getOpenPositions, and around nine million requests for getMarketDetails.

RMI is an old technology with a key drawback: if a network issue on one of the six BookingEngine hosts causes a connection timeout, we cannot dynamically exclude the failing host. All 31 RestfulGateway clients would continue to attempt to connect to it, each waiting the full four-minute timeout before abandoning the connection and trying again. If this happened while the system was under load, our RestfulGateway instances would exhaust the 1,000 threads allocated by Tomcat in under ten minutes. That is far too little time for the operations team to diagnose and resolve any recoverable issue, or to apply a configuration change removing the failing host and restart all 31 instances.
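To make the failure mode concrete, here is a minimal sketch of the shape of such a blocking RMI call. The BookingEngine interface, host list and registry naming are illustrative assumptions rather than our actual code; the point is that the calling Tomcat thread is pinned to an unreachable host until the RMI layer gives up.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical remote interface; the real BookingEngine API differs.
interface BookingEngine extends Remote {
    List<OpenPosition> getOpenPositions(String accountId) throws RemoteException;
}

class BookingEngineRmiClient {
    // Static host list: a failing host cannot be excluded without a
    // configuration change and a restart of every RestfulGateway instance.
    private static final String[] HOSTS = {
            "booking-host-1", "booking-host-2", "booking-host-3",
            "booking-host-4", "booking-host-5", "booking-host-6"};

    List<OpenPosition> getOpenPositions(String accountId) throws Exception {
        String host = HOSTS[ThreadLocalRandom.current().nextInt(HOSTS.length)];
        // If this host is unreachable, the calling Tomcat thread blocks here
        // until the RMI/TCP timeout expires before the call can be retried.
        BookingEngine engine =
                (BookingEngine) Naming.lookup("rmi://" + host + "/BookingEngine");
        return engine.getOpenPositions(accountId);
    }
}

// Placeholder for the real (serializable) position type.
class OpenPosition { }
```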

This would result in a complete outage from our customers’ point of view: not only would they be unable to trade (in and of itself a terrible thing), they would be unable to use the platform for anything else either.

The Work

As a result, a two-year project was initiated. We took the opportunity not only to remove the SPOF of the RMI calls from these two critical flows, but also to move non-booking functionality out of the BookingEngine into smaller, more focused services, each with a single responsibility, and to migrate part of our asynchronous flow onto our strategic messaging platform, built on Kafka and Ignite. With this in mind, a new architecture was designed that would make use of three existing services, create two completely new ones, and introduce two new technologies to our trading systems.

The work of moving the logic required for the three flows into the new services and stitching them together was an interesting challenge, but I would like to concentrate on the work done for the ‘open a deal ticket’ flow, the challenges we faced, and how we made sure the migration posed no risk to the platform.

Challenges

  1. How could we guarantee the validity of the data we were providing from the combination of these new services?
  2. How could we seamlessly roll back to using our BookingEngine over RMI in case we discovered some unforeseen issue in production?
  3. Would we be introducing unacceptable latency to the trade flow because of the extra network hops?
  4. How could we make sure that we did not allow this kind of single point of failure to be introduced again in the future?

Mitigations


The Shadow Call

IG offers around 300,000 markets that our clients can trade on, and each market can have a large number of dynamically configurable rules applicable to it (open/close times, minimum trade size, special behavior in high or low volatility times, etc.). This means that the number of possible permutations was large by any measure, and the only true source of ‘correct’ data was production.

We needed a mechanism that would allow us to test the results from our new service in production, in real time. Our strategy was that once we had a basic version of the MarketRulesService application deployed to production (with no live traffic calling it), we would introduce a ‘shadow calling’ wrapper around the RMI client that calls into the BookingEngine (sketched after the list below). The wrapper would let the RMI call take place and return its result to the customer, so our customers saw no change in their responses, but would then, on a new thread, make a call to our new MarketRulesService with exactly the same parameters, compare the results of the RMI and HTTP calls, and log any discrepancies.

This had three benefits:

  1. It gave us hard numbers on how correct our service was in real time.
  2. It subjected our new service to the same load the BookingEngine was receiving, without affecting our customers.
  3. It increased confidence that our new solution would be a valid replacement.
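
For illustration, here is a minimal sketch of what such a shadow-calling wrapper might look like. The MarketDetailsClient interface, thread pool size and payload types are assumptions for the sketch; the real wrapper sat around the RMI client and compared the full responses.

```java
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical client interface; real signatures differ.
interface MarketDetailsClient {
    MarketDetails getMarketDetails(String marketId);
}

class ShadowCallingMarketDetailsClient implements MarketDetailsClient {
    private static final Logger log =
            LoggerFactory.getLogger(ShadowCallingMarketDetailsClient.class);

    private final MarketDetailsClient rmiClient;   // legacy BookingEngine path
    private final MarketDetailsClient httpClient;  // new MarketRulesService path
    private final ExecutorService shadowPool = Executors.newFixedThreadPool(8);

    ShadowCallingMarketDetailsClient(MarketDetailsClient rmiClient,
                                     MarketDetailsClient httpClient) {
        this.rmiClient = rmiClient;
        this.httpClient = httpClient;
    }

    @Override
    public MarketDetails getMarketDetails(String marketId) {
        // The customer-facing response always comes from the legacy RMI call.
        MarketDetails rmiResult = rmiClient.getMarketDetails(marketId);

        // The shadow call runs on a separate thread so it can never slow
        // down or fail the customer's request.
        shadowPool.submit(() -> {
            try {
                MarketDetails httpResult = httpClient.getMarketDetails(marketId);
                if (!Objects.equals(rmiResult, httpResult)) {
                    log.warn("Shadow discrepancy for {}: rmi={} http={}",
                            marketId, rmiResult, httpResult);
                }
            } catch (Exception e) {
                log.warn("Shadow call failed for {}", marketId, e);
            }
        });
        return rmiResult;
    }
}

// Placeholder; the real payload type would implement equals for the comparison.
class MarketDetails { }
```

In this sketch, because the comparison runs off the request thread, a slow or failing MarketRulesService can at worst produce noisy logs rather than affect the customer response.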


After a series of load tests in our User Acceptance Testing (UAT) environment, we were comfortable that the RestfulGateway and the BookingEngine were not adversely affected by the new shadow calling proxy, and we pushed our changes out to production, with no noticeable client impact.

Using this shadow comparison and its stand-alone equivalent, we tracked down and resolved all discrepancies, adding an acceptance test for each issue found before pushing our changes out to production. This meant not only that our entire regression pack was automated, but also that the discrepancies seen in production gradually declined with each release.


The Quick Rollback and the Gradual Release

Our shadow-calling library also let us route a percentage of traffic to our new MarketRulesService while the remainder continued down the legacy RMI path. The percentage was held in Zookeeper and could be changed dynamically, which let us release our changes into production gradually over a four-week period, pushing the percentage up regularly and monitoring the results on our systems. It also meant we could roll back to the legacy RMI path instantly if we discovered any issues.
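
A sketch of how that dynamic percentage might be wired up is below. The znode path, the use of the plain Zookeeper client and the random split are assumptions for illustration; the essential idea is that every RestfulGateway instance watches a single value in Zookeeper and consults it on each request.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class TrafficSplitter implements Watcher {
    // Hypothetical znode holding an integer 0..100; the real path differs.
    private static final String PERCENT_PATH = "/config/market-rules/rollout-percent";

    private final ZooKeeper zookeeper;
    private final AtomicInteger percentToNewService = new AtomicInteger(0);

    TrafficSplitter(ZooKeeper zookeeper) throws Exception {
        this.zookeeper = zookeeper;
        refresh();
    }

    // Re-read the percentage whenever the node changes.
    @Override
    public void process(WatchedEvent event) {
        try {
            refresh();
        } catch (Exception e) {
            // Keep the last known percentage if Zookeeper is briefly unavailable.
        }
    }

    // Reads the current value and re-registers this object as the watcher.
    private void refresh() throws Exception {
        byte[] data = zookeeper.getData(PERCENT_PATH, this, null);
        percentToNewService.set(
                Integer.parseInt(new String(data, StandardCharsets.UTF_8).trim()));
    }

    // Per-request decision: route the configured percentage to MarketRulesService,
    // the remainder down the legacy RMI path.
    boolean useMarketRulesService() {
        return ThreadLocalRandom.current().nextInt(100) < percentToNewService.get();
    }
}
```

With something like this in place, rolling back is simply a matter of setting the value back to 0, which every instance picks up without a redeploy or restart.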


Issues Faced

The target SLA for our new service was to respond in under 10 milliseconds at the 99.9th percentile. One of our main pain points in even approaching this was that MarginService, and the downstream services it called (all of which pre-dated our project and had not been built with this traffic in mind), were not always able to cope with the load we were placing on them. Fortunately, the data coming back from this service was not mandatory for a trade. Rather than re-architecting MarginService, we opted to give our customers a gracefully degrading service: we would call MarginService for every request, but if it took too long to answer or returned an error, the MarketRulesService would simply not return any margin data. This meant we would not show clients the indicative margin required for a trade, but they could still place it, and the trade could then potentially be rejected by the BookingEngine if it exceeded their margin requirements.
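
A minimal sketch of that graceful degradation is below, assuming a hypothetical MarginServiceClient and an illustrative time budget; the real budget and contract will differ.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical types; the real MarginService contract differs.
interface MarginServiceClient {
    MarginData getMargin(String marketId, String accountId);
}

class MarketRulesAssembler {
    // Illustrative budget, chosen to stay well inside a 10 ms overall SLA.
    private static final Duration MARGIN_BUDGET = Duration.ofMillis(5);

    private final MarginServiceClient marginService;

    MarketRulesAssembler(MarginServiceClient marginService) {
        this.marginService = marginService;
    }

    // Margin data is "nice to have": if MarginService is slow or failing we
    // return the market rules without it rather than failing the deal ticket.
    Optional<MarginData> fetchIndicativeMargin(String marketId, String accountId) {
        try {
            return Optional.ofNullable(
                    CompletableFuture
                            .supplyAsync(() -> marginService.getMargin(marketId, accountId))
                            .orTimeout(MARGIN_BUDGET.toMillis(), TimeUnit.MILLISECONDS)
                            .join());
        } catch (Exception slowOrFailed) {
            // Degrade gracefully: no indicative margin shown, the trade is still
            // allowed, and the BookingEngine can still reject it if margin
            // requirements are exceeded.
            return Optional.empty();
        }
    }
}

// Placeholder for the real margin payload.
class MarginData { }
```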


Future Testing

As part of the testing done to sign off this work, we wanted to make sure that, having removed this one SPOF, we had not introduced another. We put our new architecture under load and started manually knocking out single instances of all the services, including the BookingEngine, to find the minimum combination of instances we could tolerate before we had an outage.

We also wanted to make sure that, even though we were confident we no longer had a SPOF for these two flows, we did not introduce a new one at some point in the future. So we are now introducing a monthly resiliency test that will continue to prove we are as resilient as we think we are. Initially this will be a manual set of tests run by the regression team, but once we are comfortable we are looking to automate them and expand them to other flows and systems. At some point in the future we could devise our own version of Netflix’s famous Simian Army, but for now this is a small step in the right direction.
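
As a flavour of what such a check might look like once automated, here is a heavily simplified sketch. The InstanceController hook for stopping and starting instances is entirely hypothetical, as is the single HTTP probe; the real tests are currently run manually by the regression team under load.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical hook into whatever stops/starts service instances in the test environment.
interface InstanceController {
    void stopInstance(String serviceName, int instanceIndex) throws Exception;
    void startInstance(String serviceName, int instanceIndex) throws Exception;
}

class ResiliencyCheck {
    private final HttpClient http = HttpClient.newHttpClient();
    private final InstanceController instances;

    ResiliencyCheck(InstanceController instances) {
        this.instances = instances;
    }

    // Knock out a single instance and verify the critical flow still answers in time.
    boolean survivesLossOf(String serviceName, int instanceIndex, URI criticalFlowUrl)
            throws Exception {
        instances.stopInstance(serviceName, instanceIndex);
        try {
            HttpRequest request = HttpRequest.newBuilder(criticalFlowUrl)
                    .timeout(Duration.ofSeconds(2))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } finally {
            // Always restore the instance so the next check starts from a healthy state.
            instances.startInstance(serviceName, instanceIndex);
        }
    }
}
```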


Conclusions

  1. The SPOFs in your systems are probably known, as they will have caused issues/outages in the past.
  2. Make sure that the fix replacing a SPOF does not introduce a newer, shinier one.
  3. Regularly test your infrastructure to prove that your understanding of its resiliency matches the reality (automate once confident).
  4. Allow for graceful degradation where you have services with different SLAs.
  5. Introducing a major change into your primary flow should be gradual and carefully organised, and it should be possible to roll it back instantly.
  6. Use shadow calling as a strategy to prove that a new system can cope with real load and is at least as good as, if not better than, the system it is replacing.


About the author:

Henock Zewde works as Technical Team Lead of the Performance and Reliability Team (PRET) at IG. Find him on LinkedIn.