Planning for server reliability

Server reliability is of primary importance in our networked world. Billions of dollars of e-commerce depend on reliable servers and on reliable networks connecting those servers to users.

The Australian Securities Exchange (ASX) recently had a bad day when its trading system failed. Market opening was delayed by 90 minutes, and after experiencing problems during trading, the exchange was forced to close two hours early. Subsequently, some trades had to be cancelled.

For a stock exchange, this performance is disastrous, and the ASX has been subject to heavy criticism. ASIC, Australia's financial services regulator, has announced that it will investigate. Chi-X, a recent alternative Australian trading platform, will no doubt win some customers over the debacle.

What caused the ASX outage? The ASX has issued a fairly detailed technical report. The original issue was a hardware failure, which in itself should not cause an outage - high-availability systems are designed with failover provisions in place. The important question is why the failover systems failed.

The problem was the database server which holds all the ASX data, including trades and financial instruments (e.g. shares) that are traded. All ASX data is replicated onto a database server in a backup datacentre, and standard failover processes mean that when the database server fails, there should be a seamless failover to the backup database server.
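The failover behaviour described above can be illustrated with a minimal sketch. It is not the ASX's actual mechanism, which the report does not describe at this level. The function name connect_with_failover, the host names and the connect callable are all hypothetical, chosen only for illustration.

```python
def connect_with_failover(hosts, connect):
    """Try each database host in order; return (host, connection) for the first success.

    `hosts` is an ordered list such as [primary, backup].
    `connect` is a caller-supplied function (an assumption of this sketch)
    that returns a connection or raises ConnectionError.
    """
    failed = []
    for host in hosts:
        try:
            return host, connect(host)
        except ConnectionError:
            # Remember the failed host and move on to the next candidate.
            failed.append(host)
    raise ConnectionError(f"all database hosts failed: {failed}")
```

Every subsystem needs to go through logic of this kind. A subsystem with a hard-coded or cached address for the primary server will keep trying the failed machine.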

Unfortunately, this did not happen completely. Most of the ASX trading system correctly failed over to the backup database server, but some subsystems were still connecting to the failed server. This was not detected until it became apparent there were inconsistencies in the reference data for some securities. At this point the market had to be closed until the inconsistencies were resolved.
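A partial failover of this kind can be caught early if each subsystem reports which database server it is currently using. The sketch below assumes such a report is available as a simple mapping. The function name find_stale_subsystems and the subsystem and host names are hypothetical.

```python
def find_stale_subsystems(active_hosts, expected_host):
    """Return the sorted names of subsystems not connected to the expected host.

    `active_hosts` maps subsystem name -> database host it reports using.
    How that report is collected is outside the scope of this sketch.
    """
    return sorted(
        name for name, host in active_hosts.items() if host != expected_host
    )


# Illustrative data only: one subsystem is still pointing at the failed server.
reported = {
    "order-matching": "db-backup",
    "reference-data": "db-primary",
    "settlement": "db-backup",
}
stale = find_stale_subsystems(reported, expected_host="db-backup")
```

Running a check like this immediately after a failover turns a silent inconsistency into an explicit alert that names the subsystems at fault.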

The primary lesson to be learned from the ASX failure is that failover scenarios must be tested regularly - at least weekly, and possibly nightly. The ASX report states the last failover test was on 27 July, about 7 weeks prior to this incident! For such a mission-critical system this seems inadequate. Given that the market closes around 4 pm, there seems to be no reason why overnight failover tests could not be run.
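A testing policy like this is easy to enforce automatically. As a rough sketch, a monitoring job could raise an alert whenever the last successful failover test is older than the allowed interval. The function name failover_test_overdue and the weekly default are assumptions for illustration, and the year in the example is arbitrary.

```python
from datetime import date, timedelta


def failover_test_overdue(last_test, today, max_age=timedelta(days=7)):
    """Return True if more than `max_age` has passed since the last failover test."""
    return today - last_test > max_age


# Illustration: a test on 27 July, checked seven weeks later.
# The year is arbitrary and not taken from the ASX report.
last_test = date(2020, 7, 27)
today = last_test + timedelta(weeks=7)
overdue = failover_test_overdue(last_test, today)
```

Under a weekly policy, a seven-week gap would have been flagged as overdue six weeks before the incident.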
