When Software Goes Bad
Paul Marks raises an interesting point in his article “Crashing Software Poses Flight Danger” (New Scientist, Feb 11, 2008).
“Why do software bugs arise and why can’t they be removed? Bugs are sections of code that start doing something different to what the programmer intended, usually when the code has to deal with circumstances the programmer didn’t anticipate. All software is susceptible to bugs, so it must be tested under as many different circumstances as possible. Ideally, the bugs get discovered at this time and are removed before the software is actually used. This is very difficult in complex systems like aircraft because the number of possible scenarios – such as different combinations of air densities, engine temperatures and specific aerodynamics – is huge.”
The problem occurs when you write software that must be bug-free, for example software used in the flight control systems of aircraft. To reduce weight and improve fuel efficiency, aircraft makers have increasingly turned to computers to control planes. But bugs that produce the Blue Screen of Death on your Windows PC can be far more serious when they happen on a plane at 30,000 feet.
Paul Marks writes that the US National Academy of Sciences has documented several instances of trouble caused by faulty software design. In August 2005 a computer in a Boeing 777 gave contradictory reports of airspeed, swinging between a speed too fast for the plane to fly without breaking apart and one too slow for it to stay in the air. That same year an Airbus 319 lost its computerized flight and navigation displays, autopilot, autothrottle and radio for two minutes. The NAS also reports that “faulty logic in the software” was behind the failure of a computer that controlled fuel flow in an Airbus 340.
According to Marks, to test for bugs aircraft manufacturers adhere to the DO-178B standard, which rates software based on the severity of the consequences should it fail. The most rigorous level, “Level A,” covers software whose failure would be catastrophic; that software is tested using the “modified condition/decision coverage” (MCDC) method, which places it in as many dangerous situations as possible and checks how it responds.
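To make the coverage idea concrete, here is a minimal sketch of what an MCDC-style test set looks like for a single decision. The function and its rule – deploy reverse thrust only on the ground, with the throttle at idle or the wheels spinning – are invented for this example, not taken from any real aircraft system; the point is that each condition is toggled on its own, with the others held fixed, and shown to flip the outcome.

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical rule, invented for illustration: reverse thrust may deploy
       only when the aircraft is on the ground AND the throttle is at idle OR
       the wheels are spinning. */
    static bool deploy_reverse_thrust(bool on_ground, bool throttle_idle,
                                      bool wheels_spinning)
    {
        return on_ground && (throttle_idle || wheels_spinning);
    }

    int main(void)
    {
        /* MCDC: each condition must be shown to independently flip the decision
           while the other conditions stay fixed. Four cases cover three conditions. */
        assert(deploy_reverse_thrust(true,  true,  false) == true);  /* baseline: deploys          */
        assert(deploy_reverse_thrust(false, true,  false) == false); /* only on_ground changed     */
        assert(deploy_reverse_thrust(true,  false, false) == false); /* only throttle_idle changed */
        assert(deploy_reverse_thrust(true,  false, true)  == true);  /* only wheels_spinning changed */
        return 0;
    }

Even with every condition exercised this way, the tests only confirm behavior for the inputs someone thought to try – which is exactly the limitation Thomas points to.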
Marks notes that Martyn Thomas, a systems engineering consultant in the UK, doesn’t believe that MCDC works. “MCDC testing is not removing any significant numbers of bugs,” he says. “It highlights the fact that testing is a completely hopeless way of showing that software does not contain errors.”
Thomas’s solution is to change the way software is written, starting with the languages used. He suggests using languages that force the programmer to write unambiguous code, such as B-Method and SPARK, instead of the more commonly used C and its variants, which allow programmers to write vague code that can lead to bugs. These languages and their compilers make it difficult for programmers to write ambiguous code by mathematically verifying the code as it is written. Languages like these are considered “strongly typed,” which Wikipedia defines as a language that “places severe restrictions on the intermixing that is permitted to occur, preventing the compiling or running of source code which uses data in what is considered to be an invalid way.”
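A small C sketch shows the kind of intermixing that definition is talking about. The variable names are mine, invented purely for illustration; the point is that C accepts both statements without complaint, while a strongly typed language such as Ada or SPARK would reject them at compile time and force an explicit conversion.

    #include <stdio.h>

    int main(void)
    {
        int target_altitude = 10000.9;   /* fractional part silently discarded */
        unsigned int descent_rate = 500;
        int climb_rate = -200;

        /* climb_rate is implicitly converted to a huge unsigned value, so the
           comparison is true even though -200 is obviously less than 500. */
        if (climb_rate > descent_rate)
            printf("negative climb rate treated as larger than %u\n", descent_rate);

        printf("target altitude stored as %d\n", target_altitude);
        return 0;
    }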
One of the more commonly used “strongly typed” languages is Ada, of which SPARK is a variant. Ada, named after Augusta Ada Lovelace, Lord Byron’s daughter and the author of what is widely regarded as the world’s first computer program, is an object-oriented language developed in the late 1970s for the US Department of Defense to reduce the number of high-level languages in use there, which at the time numbered as many as 450. Since its adoption in December 1980, Ada has become a standard in the defense industry and has moved into other areas where safety is paramount. For example, Wikipedia notes that the software that controls the Boeing 777 is written in Ada – but makes no mention of the airspeed incident noted in Marks’s article.
So how safe is Ada? According to the article “Ada 2005 Strengthens Ada’s Safety-Critical Muscles” in the November 2005 issue of COTS Journal (http://www.cotsjournalonline.com/home/article.php?id=100424), Ada was conceived with safety in mind. It uses strong typing and well-defined semantics, and it has undergone continual international inspection and formal review – a process called “validating an implementation,” now an ISO standard – thereby avoiding many of the pitfalls and traps that cause run-time errors in C and its variants.
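One of those pitfalls is worth illustrating. In the C sketch below – the fuel-flow table and the off-by-one index are invented for the example – writing one element past the end of an array compiles cleanly and silently corrupts adjacent memory; Ada’s checked semantics would instead raise a Constraint_Error at run time for the equivalent out-of-range index.

    #include <stdio.h>

    int main(void)
    {
        double fuel_flow[4] = {0.0, 0.0, 0.0, 0.0};
        int engine = 4;              /* off-by-one: valid indices are 0..3 */

        fuel_flow[engine] = 1250.0;  /* out of bounds: undefined behavior in C,
                                        no error is reported */

        printf("engine %d flow set\n", engine);
        return 0;
    }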
The COTS Journal article also raises something I had been wondering about myself: why should safety-critical software techniques be used only in safety-critical situations?
“…in a society increasingly dependent on sophisticated computer software, there are more and more applications where correctness is essential, even if they are not formally considered safety-critical. For example, the commercial banking structure relies on complex computer controls. Even a minor failure can cause waves that can have huge economic consequences. There are several decades of experience in building safety-critical systems, and the success has been remarkable—no fatalities can be attributed to failure of certified safety-critical software. It is both practical and essential to extend these techniques to improve the reliability of our entire computer infrastructure.”
In 2002 a bug in the Knight Trading Group’s own software generated “sell orders” on the firm’s own stock, driving the price down in pre-market trading and forcing NASDAQ to halt trading in the shares. In 2005 the Tokyo Stock Exchange missed its morning trading session because of a software glitch. All this occurred years after the granddaddy of software bugs, Y2K, cost firms billions to fix. Buggy software may not have killed anyone in these situations, but it sure cost firms a lot of money.
So why isn’t Ada used more often today in corporate environments, where C variants including Java – and the bugs that come with them – have become the de facto standard? First, the existing base of code written in C variants represents an investment of hundreds of billions of dollars. While this code may be prone to bugs, it is much cheaper to patch it than to replace it with something more robust.
Second, companies already have an existing infrastructure, including hardware and software tools designed to work specifically with C-variant languages. Part of that infrastructure is a huge talent base of C programmers that can be augmented relatively easily with qualified professionals. As an informal test I compared job listings on the technical job site Dice using “Ada” and “C” as keywords. The Ada search returned 140 jobs nationwide; the C search returned over 18,000 – roughly one in five positions listed.
Finally, there is the perception that C code is “good enough” – that much of the programming done in business today doesn’t justify the extra money better code would cost. No one will die if a payroll application crashes during a check run, or if a customer’s call is cut off prematurely by a bad line of code in voice recognition software.
Problems occur, however, when the business processes used to justify programs change and the programs themselves do not – for example, when a low-priority process becomes embedded in a critical one, or the process itself becomes critical. Once software is in place and working, it is difficult to justify replacing it until something goes wrong. And when something does, the emergency usually produces a quick fix that is cheaper in the short term than a more expensive, time-consuming long-term solution.
It would be nice for companies to embrace “strongly typed” programming, but given the constraints above I simply do not see it happening any time soon. What I do see happening, at the very least, is better design tools that monitor code as it is written and catch bugs before it is compiled, along with more automated testing – perhaps including full simulations of complex systems on virtual machines. With today’s technology it is possible to run such simulations using historical data and time compression to see whether a given application works as intended or merely as designed – bugs and all. Many firms already maintain fully duplicated production environments for testing and disaster recovery; it would not take much to adapt them into full-blown simulations.
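As a rough sketch of what replaying historical data might look like, the fragment below feeds recorded inputs to a piece of application logic and compares the results with what the production system actually did. The wire_fee function, its threshold, and the sample records are all invented for illustration; a real replay harness would read a day’s transactions from a log, which is where the time compression comes in – hours of production traffic re-run in seconds.

    #include <stdio.h>
    #include <math.h>

    /* Stand-in for the application logic under test; the fee rule is made up. */
    static double wire_fee(double amount)
    {
        return amount > 10000.0 ? 25.0 : 15.0;
    }

    /* One recorded input and the output the production system produced for it. */
    struct replay_record {
        double amount;
        double expected_fee;
    };

    int main(void)
    {
        const struct replay_record history[] = {
            {   9500.00, 15.0 },
            {  10000.00, 15.0 },
            {  10000.01, 25.0 },
            { 250000.00, 25.0 },
        };
        const int count = sizeof history / sizeof history[0];
        int failures = 0;

        /* Replay every historical input and flag any divergence. */
        for (int i = 0; i < count; i++) {
            double got = wire_fee(history[i].amount);
            if (fabs(got - history[i].expected_fee) > 0.005) {
                printf("record %d: expected %.2f, got %.2f\n",
                       i, history[i].expected_fee, got);
                failures++;
            }
        }

        printf("%d mismatch(es) against historical data\n", failures);
        return failures ? 1 : 0;
    }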