At some point, the system-wide computer outage that took all of British Airways out of action starting on May 27 will provide a valuable lesson in maintaining critical systems. But for now, British Airways’ IT staff is investigating why the systems failed so it can decide how to prevent it from happening again.
“It was not an IT issue, it was a power issue,” a British Airways spokesperson told eWEEK in an email. “There was a total loss of power. The power then returned in an uncontrolled way causing physical damage to the IT servers.” “We know what happened,” the spokesperson added, “we are investigating why it happened.”
The shutdown stranded as many as 300,000 passengers and forced the cancellation of hundreds of British Airways flights during a long holiday weekend in both the U.S. and the United Kingdom.
Previous statements from British Airways indicated that the power event damaged both the main data system and the backup system, which suggests they were likely co-located.
Statements from the power companies serving Heathrow Airport and the surrounding offices, which include the BA data centers, indicate that there was no power surge from their end, which suggests that power conditioning equipment inside BA’s Heathrow data centers was probably involved.
Electric power is typically sent to a major data center from the power distribution grid to an automatic transfer switch, then to switchgear and then to uninterruptible power supplies. The automatic transfer switch is designed to instantly switch from commercial power to local power sources, which are usually diesel generators in an N+1 configuration.
The job of the UPS is to condition the power and to provide backup power while the transfer switch moves between commercial power and locally generated power. The UPS delivers conditioned power to the IT center’s power distribution unit (PDU), which then sends that current to individual servers and other related equipment.
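The power path described above can be sketched as a simple chain model, in which a failure at any one stage cuts power to everything downstream. This is an illustrative sketch only; the stage names are assumptions for illustration, not BA’s actual topology.

```python
# Illustrative sketch only (stage names are assumptions, not BA's actual
# topology): the power path as a chain, where a failure at any stage cuts
# power to everything downstream of it.
POWER_PATH = ["grid", "transfer_switch", "switchgear", "ups", "pdu", "servers"]

def powered_stages(failed):
    """Return the stages that still receive power, given a set of failed stages."""
    powered = []
    for stage in POWER_PATH:
        if stage in failed:
            break  # everything downstream loses power too
        powered.append(stage)
    return powered

print(powered_stages(set()))    # full chain is live
print(powered_stages({"pdu"}))  # a single PDU failure cuts power to the servers
```

The point of the sketch is that, without a redundant path, every component late in the chain, such as the PDU, is a single point of failure for the whole server room.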
While BA is still investigating how a power surge managed to take out its entire data system, it seems that the main culprit is likely the PDU, which has the job of reducing the voltage to the level used by the servers.
This process involves some big transformers, and when a transformer fails, it can be pretty destructive. But we don’t know for sure that it was transformers at this point. It could also be anything from a squirrel in the UPS (which has happened to me) to a terrorist attack.
What was unusual about the BA outage is that it affected all of BA’s data systems, including its websites, the reservation system and all of the flight planning and internal communications systems. This meant that BA was unable to communicate with passengers, with its other offices and with the public by any means other than the telephone.
According to a passenger who was traveling on BA when the data system crash happened, the aircraft crew and the staff at the airport had no idea what was happening until considerably later. This passenger’s plane was actually taxiing for takeoff at Germany’s Frankfurt Airport when the flight was brought to a halt. “We were told there was a delay,” explained Esther Schindler of Phoenix, Ariz. “They were waiting to space out planes.”
Schindler, a veteran freelance technology writer, said that there was confusion among the crew because there was no information available to the crew or at the gate. The plane eventually returned to its original boarding gate. She said she noticed that the information board inside the boarding area was still listing the flight as boarding, indicating that it hadn’t been updated by the central system.
“They didn’t know what the IT crash was about,” Schindler said, “all the systems that would be communicating with them were down.”
“It took a little while but they confirmed that the entire IT system had crashed, and they told us it was worldwide,” she said.
The passenger experiences and the information from the airline make a few things clear. First, BA was putting every function into a single computer system, regardless of whether that made sense, which ensured that if anything went wrong with that system, then everything would shut down.
Second, the BA backup system that should have provided redundancy wasn’t truly redundant, or the event that took out the main system would not have been able to take out the backup system at the same time. Normal practice for backup data systems would have required at the very least that the backup system be physically separated from the main system by enough distance that the same catastrophe couldn’t affect them both.
But statements by the airline say that the event that damaged the data system also damaged the backup systems, so clearly its redundancy was compromised. Normally, the power handling and conditioning equipment is also built with redundancy so that even if something major, such as the PDU, goes out, the computers can continue to get power.
Normally, the best practices for such a data system call for the power to enter the data center from two separate commercial grids, usually at two ends of the building, to prevent damage such as what befell BA. Likewise, the standby generators can feed the power from either service entrance, and the N+1 configuration of the generators means that power can be supplied even if one generator goes out.
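The value of the N+1 generator configuration can be illustrated with a back-of-the-envelope availability calculation. This is a hedged sketch, not data from the article: it assumes independent generator failures and a hypothetical 98 percent per-generator availability.

```python
from math import comb

# Back-of-the-envelope sketch (assumed figures, not from the article):
# probability that a generator bank can carry the load, assuming independent
# failures and that n_required working generators suffice.
def bank_availability(n_required, n_installed, p_up=0.98):
    """P(at least n_required of n_installed generators are up)."""
    return sum(
        comb(n_installed, k) * p_up**k * (1 - p_up)**(n_installed - k)
        for k in range(n_required, n_installed + 1)
    )

# Needing 3 generators: no spare vs. an N+1 spare.
print(round(bank_availability(3, 3), 6))  # 0.941192 -- any single failure drops the load
print(round(bank_availability(3, 4), 6))  # 0.997664 -- tolerates one generator failure
```

Even under these rough assumptions, the single spare generator cuts the probability of losing standby power by more than an order of magnitude, which is why N+1 is the baseline for data center design.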
What will we actually find? We may eventually learn there was poor design in the data center and that there was no effective redundancy. Furthermore, there was apparently over-reliance on a single data system, which suggests an organization that was being penny-wise and pound-foolish.
BA needs to answer to its customers and shareholders about how an IT outage that was probably avoidable brought the company to a standstill over the Memorial Day weekend.