Thursday, August 17, 2023

UK's core payments settlement system fails... again. Some thoughts

As they increasingly forsake cash, regular folks are making dozens of digital payments every month. What they don't realize is how this growing reliance on digital payments increasingly yokes their commercial lives to the fate of a single piece of infrastructure: their central bank's large-value settlement system. When that system experiences a glitch, everyone's financial life gets put on hold.

In the United Kingdom's case, it is the Bank of England's RTGS settlement system that lies at the core of the economy. RTGS's centrality is highlighted by the fact that all the arrows in the chart below converge on it: every payment in the UK, big or small (except for cash), ultimately gets finalized using RTGS.

Alas, RTGS failed this Monday for six hours. No reasons were given, although I can't help but wonder whether it was due to a software glitch stemming from Bank of England staff having recently upgraded RTGS to the ISO 20022 payments language, rather than something like a cyberattack.

RTGS's centrality illustrated. Source: Bank of England

This isn't RTGS's first long failure. Back in 2014, a poorly managed software update caused RTGS to shut down for 9 hours, leading to a revealing independent review.

The failure of the nation's key piece of payments infrastructure, even for just a few hours, is not a good thing. During those hours of unavailability, costly delays are imposed on day-to-day commerce as well as financial markets. Even when a buggy system is up and running, the uncertainty of another potential long failure acts as a pervasive cost on commercial society. 

To reduce these costs, central bank large-value payments systems are typically built with multiple layers of redundancy. In RTGS's case, the hardware is hosted at two different sites, so that if the primary site goes down, the other one can quickly kick in. Presumably whatever knocked RTGS down on Monday was fierce enough to incapacitate both sites.

A third layer of redundancy comes in the form of the Bank of England's Market Infrastructure Resiliency Service, or MIRS. With RTGS's two sites incapacitated, the Bank can "fail over" to MIRS, allowing payments to recommence. MIRS uses different software, programming, and hardware, and is hosted in a geographically remote location with a separate group of staff. This is achieved by an outsourcing arrangement with SWIFT, the same folks who run the global SWIFT messaging system.

There's no indication that the Bank of England failed over to MIRS earlier this week, staff preferring to focus on fixing RTGS instead. Alas, this choice subjected the UK economy to a long settlement delay. Why no fail-over to MIRS? Why choose such a long period of settlement deprivation?

A reading of the inquiry into the 2014 failure gives some clues into what may have happened two days ago. When RTGS failed on Monday, October 20, 2014, the Bank of England likewise chose not to fail over to MIRS. Why? The inquiry pointed to the fact that it would have taken 2-2.5 hours to get MIRS up and running. Given this length of time, it made sense to try to fix RTGS instead, an inherently preferable system because of features like the ability to save on liquidity, which the back-up system MIRS lacked.

Management was also reluctant to switch on MIRS because they weren't sure if, after having activated it on Monday, they could turn it off on Tuesday night and manually return to a now-repaired RTGS without making a mistake. Bank officials only felt comfortable doing this manual switch back to RTGS on a weekend, because it afforded them much more time than a weeknight.

And thus trepidation about switching on the backup system led to it never being activated in 2014, which forced 9 hours of settlement deprivation on the UK economy.
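
To make the arithmetic concrete, here is a rough sketch in Python of the trade-off, using only the two figures from the 2014 episode: about 9 hours to repair RTGS and up to 2.5 hours to bring MIRS up. The function `outage_hours` and its rule are my own simplification, not anything taken from the inquiry. It assumes the fail-over decision is made the moment RTGS goes down and that repair work continues in parallel, and it ignores the costs described above (the loss of liquidity-saving features and the risk of a botched mid-week return to RTGS). It also has the benefit of hindsight: on the day, nobody knew the fix would take 9 hours.

```python
def outage_hours(fix_hours: float, failover_hours: float, fail_over: bool) -> float:
    """Hours without settlement under a deliberately simple, hypothetical model.

    Assumption: if the operator fails over at the start of the outage while
    repairs continue in parallel, settlement resumes as soon as either the
    backup is live or the primary is repaired. Without fail-over, settlement
    resumes only once the primary is repaired.
    """
    if fail_over:
        return min(fix_hours, failover_hours)
    return fix_hours


# Figures from the 2014 episode as described in this post.
fix_only = outage_hours(fix_hours=9.0, failover_hours=2.5, fail_over=False)
switch_and_fix = outage_hours(fix_hours=9.0, failover_hours=2.5, fail_over=True)

print(f"fix only:       {fix_only} hours without settlement")
print(f"switch and fix: {switch_and_fix} hours without settlement")
```

Under these assumptions, failing over only helps when the repair takes longer than the time needed to activate MIRS, which is why a quick fix can look like the better bet at the start of an outage.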

Among its suggestions, the 2014 inquiry called for an upgrade to the MIRS backup option to make it less anxiety-inducing to turn to. The passage is worth reading in full:

Work should be undertaken to remove or reduce the barriers to invocation of MIRS so that the Bank can "switch and fix" in parallel and in confidence. This should focus on testing the process to fail-back to RTGS intraweek (which is the primary barrier to invocation). If it is not possible to reduce this barrier, consideration should be given to enhancing the resilience and functionality within MIRS. In addition the Bank may wish to consider other back-up options for RTGS.

These were all good ideas. They would have reduced the hassle of resorting to the backup option by either improving the switching experience, or by upgrading MIRS's features so that being stuck on it for a few days posed less of a nuisance.

Which brings us back to 2023. If there is an inquiry into Monday's RTGS outage, investigators will need to explore why a multi-hour delay was once again imposed on UK citizens. Was it because, once again, the costs of using the backup system were deemed too high relative to the benefits? If so, were the costs deemed too high because none of the improvements suggested back in 2014 were adopted?

Failure to learn from the past would be unfortunate. These issues are especially salient because the Bank of England will introduce the next version of RTGS in 2024. Given that the updated RTGS will be built with more modern technology, it will (hopefully) fail less often than the older version. But it will still fail. What will the updated backup scheme look like? Will RTGS quickly switch over to a tertiary site, or will the economy be forced to endure a multi-hour settlement failure as a fix is pursued?

These are not just questions for the UK, but for every nation, since we all have large-value payments systems on which commercial society is entirely dependent. It seems to me that if you have designed and built a backup system, that backup system should be, ya know, used. Those who operate them, usually central banks, should not be afraid to switch over. In the UK's case, that means that the decision to turn on MIRS (or whatever backup system the updated RTGS will use after 2024) should always be an easy decision for the Bank of England to make, not a gut-wrenching one.

4 comments:

  1. You can't have realtime, consistency and redundancy at the same time. This is a variant of an engineering triangle and requires tradeoffs. If you don't choose your priorities reasonably, a third site may make things even worse.

    1. I agree that there probably is a tradeoff along the lines you describe, but I'm not convinced that the product of that tradeoff is a 6 or 9 hour closure of RTGS every time a problem is encountered. Why not 1 to 2 hours? The 2014 inquiry made some simple suggestions. Did the Bank of England follow up on these?

  2. FYI from 2019: Why is the UK so bad at infrastructure projects?
    https://www.independent.co.uk/voices/letters/hs2-infrastructure-projects-uk-europe-a9074636.html

    1. In the areas I follow, like payments and monetary infrastructure, I actually find the UK to be way ahead of most western nations. The Faster Payments system is a good example.
