Humans love arguing about disasters. Whether it’s meteorologists debating the most devastating hurricane or military historians discussing the world’s most significant battle, people always have their own opinions. And it’s no different with computer scientists, who like to compare notes on the worst software bug ever.

What appear to be relatively minor bugs have caused millions of dollars of damage. Why, then, do we seem to take pleasure in talking about them? Is it from a sense of “There but for the grace of a good debugger go I”? Or is it like witnessing an accident, where you know what’s coming but you can’t turn away? Entire books have even been devoted to the subject.

A number of these catastrophic failures involve the space program. It’s not that space agencies do anything wrong, in particular; it’s just that their systems tend to be the most complex, which also means they tend to fail the most spectacularly. We’ve discussed “normal accident theory” before, which states that as a system gets more complex, the chances of failure increase no matter how careful you are with the individual components. In other words, no matter how rigorously you test each component, when you put them together, they’re more likely to fail because of unexpected interactions between them. Plus, when rockets fail, they tend to blow up.

Devastating software bugs tend to fall into one of several categories. Here are three of them, as well as suggestions for how to avoid them.

  1. Numeric overflow.

Many devastating software bugs fall into the category of overflow, where a number becomes too big for the space set aside for it and all sorts of bad things happen. For example, in 1996, the European Space Agency rocket Ariane 5 blew up just 39 seconds into flight because of what James Gleick described in The New York Times Magazine as “trying to stuff a 64-bit number into a 16-bit space.”

“Ordinarily, when a program converts data from one form to another, the conversions are protected by extra lines of code that watch for errors and recover gracefully,” Gleick wrote. “Indeed, many of the data conversions in the guidance system’s programming included such protection. But in this case, the programmers had decided that this particular velocity figure would never be large enough to cause trouble. After all, it never had been before.”

This rocket, however, was faster than previous rockets that used the software, and the number did indeed overflow the space set aside for it. Moreover, a backup unit failed from the same error, because both systems were running the same software, Gleick wrote.

The lesson? As we’ve discussed before, it’s important to ensure that the space you set aside is much bigger than the biggest value you can imagine, and to check for values that don’t fit rather than assuming they will never occur. Remember that the Y2K problem arose because people couldn’t imagine that the programs they wrote in the 1980s and 1990s would still be in use in the year 2000.
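To make the idea of a “protected” conversion concrete, here is a minimal Python sketch. It is not the Ariane code; the function names are made up for illustration, and the only assumption is the one in Gleick’s description: a large number being squeezed into a 16-bit signed integer. The unchecked version silently wraps around, while the checked version refuses a value that does not fit.

```python
import ctypes

# Range of a 16-bit signed integer.
INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1


def to_int16_unchecked(value: float) -> int:
    """Truncate to an integer and keep only the low 16 bits (hypothetical helper).

    ctypes does no overflow checking, so out-of-range values wrap silently.
    """
    return ctypes.c_int16(int(value)).value


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


if __name__ == "__main__":
    print(to_int16_unchecked(40000.0))  # wraps around to a negative number
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        # The caller gets a chance to recover instead of using garbage.
        print("rejected:", err)
```

Whether to raise, clamp to the nearest valid value, or switch to a fallback is a design decision for each system; the point is that the out-of-range case is handled on purpose rather than assumed away.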

  2. Units mismatch.

In 1999, NASA lost a $125 million Mars orbiter when a Lockheed Martin engineering team used English units of measurement while the agency’s team used the metric system for a key spacecraft operation, reported Robin Lloyd for CNN. “The units mismatch prevented navigation information from transferring between the Mars Climate Orbiter spacecraft team at Lockheed Martin in Denver and the flight team at NASA’s Jet Propulsion Laboratory [JPL] in Pasadena, California. The spacecraft came within 60 km (36 miles) of the planet, about 100 km closer than planned and about 25 km (15 miles) beneath the level at which it could function properly.”

As with the Ariane incident, JPL magnanimously refused to blame Lockheed for the problem. “People make errors,” JPL administrator Tom Gavin told CNN. “The problem here was not the error. It was the failure of us to look at it end-to-end and find it.”

The lesson? Whether you use the English system or the metric system doesn’t matter. What really matters is that, whichever system you use, everyone’s using the same one. Specify everything, and check.
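One way to “specify everything, and check” in code is to never pass bare numbers between teams. The Python sketch below is only an illustration: the `Impulse` class, the `update_trajectory` function, and the choice of impulse as the quantity are assumptions made for this example, not a description of the actual Mars Climate Orbiter software. Every value is converted to one internal unit at the boundary, and the consumer rejects anything that arrives without a unit attached.

```python
from dataclasses import dataclass

# Exact conversion factor: 1 pound-force = 4.4482216152605 newtons.
NEWTONS_PER_POUND_FORCE = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored in newton-seconds, whatever unit it arrived in."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary, so downstream code sees one unit only.
        return cls(value * NEWTONS_PER_POUND_FORCE)


def update_trajectory(impulse: Impulse) -> float:
    """Hypothetical consumer: accepts an Impulse, never a bare number."""
    if not isinstance(impulse, Impulse):
        raise TypeError("expected an Impulse with explicit units, got a bare value")
    return impulse.newton_seconds


if __name__ == "__main__":
    print(update_trajectory(Impulse.from_pound_force_seconds(1.0)))
    try:
        update_trajectory(1.0)  # a bare number: which unit was this?
    except TypeError as err:
        print("rejected:", err)
```

The team producing the data has to say which unit it is using in order to construct the value at all, so the mismatch shows up when the code is written rather than when the spacecraft arrives.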

  3. Missing punctuation.

Whether it’s a missing semicolon or a 0 that looks like an O, stories abound of programs laid waste by a typo. But none was more spectacular than the explosion of the Mariner I in 1962.

“Less than 5 minutes into flight, Mariner I exploded, setting back the U.S. government $80 million ($630 million in 2014 dollars),” or 7 percent of NASA’s budget that year, wrote Zachary Crockett in Priceonomics. “The root cause for this disaster? A lone omitted hyphen, somewhere deep in hand-transcribed mathematical code.” As a result, the spacecraft went off course, risking a crash, and engineers chose to destroy it instead.

While the hyphen had also been missing in previous versions, that part of the software hadn’t been called upon before, because it was only needed in the event of a radio guidance failure, reported one NASA eyewitness. As it happens, Mariner I experienced such a failure, which brought the bug into play.

The lesson? Make sure you test all the software, including the parts that run only when something else has already failed.
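In practice, that means simulating the failure on purpose so the rarely used branch actually runs before launch day. The Python sketch below is a toy, with a made-up `steering_command` function standing in for any code that has a normal path and a fallback path; it is not a reconstruction of the Mariner guidance software.

```python
def steering_command(radio_signal, onboard_estimate):
    """Return (source, value): prefer radio guidance, else use the onboard estimate."""
    if radio_signal is not None:
        return ("radio", radio_signal)
    # Fallback branch: runs only when radio guidance is lost, which makes it
    # exactly the kind of code that can sit unexercised for years.
    return ("onboard", onboard_estimate)


def test_normal_path():
    assert steering_command(1.5, 0.0) == ("radio", 1.5)


def test_fallback_path():
    # Simulate the radio failure deliberately instead of waiting for a real one.
    assert steering_command(None, 0.7) == ("onboard", 0.7)


if __name__ == "__main__":
    test_normal_path()
    test_fallback_path()
    print("both paths exercised")
```

A coverage tool can then confirm that no branch was left out, but the essential step is writing a test for the failure case at all.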

As long as we’re working with people, computer bugs are going to happen. The important thing is being able to find them and building in ways to recover gracefully from them. And that’s not rocket science.
