Have you rebooted your plane lately?

The Federal Aviation Administration (FAA) issued a worrisome warning recently: “Boeing’s laboratory tests discovered that under certain circumstances, all of the 787’s power systems can suddenly shut down entirely during a flight,” reports the Wall Street Journal.

“Each generator is linked to a control unit. Boeing found that if the four engine generators were left on continuously for about eight months, a software internal counter would overflow and cause the control units to enter a fail-safe mode,” explains Jad Mouawad in the New York Times.

Airlines with Boeing 787 Dreamliners were told to make sure to turn them off and on again at least every 248 days—about every eight months. Otherwise, the electrical systems would shut off, and the plane could fall out of the sky.

This is what’s known, in programming parlance, as a “non-graceful exit.”
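The 248-day figure lines up with a particular kind of counter. The reports quoted above do not give the counter's width or tick rate, so the sketch below assumes a signed 32-bit integer incremented 100 times per second and shows how long such a counter would last.

```python
# Assumption (not stated in the reports quoted above): the counter is a
# signed 32-bit integer incremented 100 times per second.
TICKS_PER_SECOND = 100
MAX_TICKS = 2**31 - 1  # largest value a signed 32-bit integer can hold
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(max_ticks: int, ticks_per_second: int) -> float:
    """Return how many days of continuous uptime it takes to fill the counter."""
    return max_ticks / ticks_per_second / SECONDS_PER_DAY


days = days_until_overflow(MAX_TICKS, TICKS_PER_SECOND)
print(f"{days:.2f} days")  # a little over 248 days
```

Under those assumptions the counter fills after roughly 248 and a half days of uptime, which matches the reboot interval the airlines were given.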

So why are the planes sitting around turned on all the time anyway? “During the early stages of the plane’s introduction, Boeing drafted an internal report concluding that Dreamliners experienced most of their reliability problems just after being powered up. The company recommended adding additional time before flights to deal with erroneous ‘nuisance’ messages,” writes the WSJ.

“As a result, many airlines made efforts to keep aircraft powered over unusually long stretches to avoid some nagging technical headaches and keep their Dreamliners flying on schedule.”

And if you’re worried, Boeing reassures us that “all jets in the fleet have been powered off and turned back on as part of routine maintenance, so there is no imminent danger of a plane losing power,” reports Kwame Opam in The Verge. “In the meantime, Boeing is working on a fix for the bug.”

This isn’t the first time that too-small fields have caused problems in programming. Surely many of us recall the Y2K problem: All the computers in the world were going to stop when the calendar turned to January 1, 2000, because many programs had been written (in COBOL, no doubt) to store only the last two digits of the year and assume that the first two were always 19.

“Some doomsayers warned that the Y2K bug was going to end civilization as we know it,” writes historian Jennifer Rosenberg. “Other people worried more specifically about banks, traffic lights, the power grid, and airports—all of which were run by computers. Even microwaves and televisions were predicted to be affected by the Y2K bug. As computer programmers madly dashed to update computers with new information, many in the public prepared themselves by storing extra cash and food supplies.”

But with several years of warning, most of the programs apparently got fixed: the world didn’t end on January 1, 2000, and only a smattering of issues were reported.

Similarly, on many 32-bit systems, we’ve got the “Year 2038 problem.”

“The problem springs from the use of a 32-bit signed integer to store a time value, as a number of seconds since 00:00:00 UTC on Thursday, 1 January 1970, a practice begun in early UNIX systems with the standard C library data structure time_t,” wrote Larry Seltzer in InformationWeek in 2013. “On January 19, 2038, at 03:14:08 UTC that integer will overflow.”
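The date in that quote can be checked with a few lines of Python. This is only a sketch of the arithmetic: it adds the largest signed 32-bit value to the Unix epoch, then shows where a wrapped counter would land. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
INT32_MIN = -(2**31)    # what the counter wraps around to

# Last second a signed 32-bit time value can represent.
last_ok = EPOCH + timedelta(seconds=INT32_MAX)

# One second later the stored value no longer fits.
overflow_at = EPOCH + timedelta(seconds=INT32_MAX + 1)

# If the value wraps to the most negative number, it decodes as a date in 1901.
wrapped_to = EPOCH + timedelta(seconds=INT32_MIN)

print(last_ok.isoformat())      # 2038-01-19T03:14:07+00:00
print(overflow_at.isoformat())  # 2038-01-19T03:14:08+00:00
print(wrapped_to.isoformat())   # 1901-12-13T20:45:52+00:00
```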

You might think, 2038? Almost a quarter century from now? Who cares? But it’s more likely to cause problems than you think.

Seltzer continues, “It’s not difficult to come up with cases where the problem could be real today. Imagine a mortgage amortization program projecting payments out into the future for a 30 year mortgage. Or imagine those phony programs politicians use to project government expenditures, or demographic software, and so on.” Some smartphones already have problems creating calendar entries past that date, Seltzer adds.
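Seltzer's mortgage example is easy to reproduce. The sketch below uses a hypothetical helper, `fits_in_int32_time`, and made-up payment dates to check whether a date's Unix timestamp still fits in a signed 32-bit field. It relies on the fact that `struct.pack` refuses values outside the format's range.

```python
import struct
from datetime import datetime, timezone


def fits_in_int32_time(moment: datetime) -> bool:
    """Return True if the moment's Unix timestamp fits a signed 32-bit int."""
    seconds = int(moment.timestamp())
    try:
        struct.pack("<i", seconds)  # "<i" is a 4-byte signed integer
    except struct.error:
        return False
    return True


# Hypothetical 30-year mortgage: the dates are illustrative only.
first_payment = datetime(2015, 6, 1, tzinfo=timezone.utc)
final_payment = datetime(2045, 6, 1, tzinfo=timezone.utc)

print(fits_in_int32_time(first_payment))  # True
print(fits_in_int32_time(final_payment))  # False
```

A program that stored payment dates this way would have run into trouble as soon as it projected past January 2038, decades before the date itself arrived.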

In a similar 32-bit issue, YouTube had to upgrade its view counter when “Gangnam Style” pushed past 2,147,483,647 views, the largest value a signed 32-bit integer can hold. Google switched to a 64-bit number instead, giving us the capability of watching “Gangnam Style” 9,223,372,036,854,775,807 times before YouTube breaks.
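To see why that limit matters, here is a minimal sketch of what a naive signed 32-bit counter does on its next increment. It says nothing about how YouTube's code actually worked; `add_one_int32` is a hypothetical helper that uses `ctypes.c_int32` to imitate fixed-width storage.

```python
import ctypes

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def add_one_int32(count: int) -> int:
    """Increment a counter as if it were stored in a signed 32-bit field."""
    # ctypes does no overflow checking, so the value silently wraps.
    return ctypes.c_int32(count + 1).value


print(f"{INT32_MAX:,}")                 # 2,147,483,647
print(f"{add_one_int32(INT32_MAX):,}")  # -2,147,483,648
print(f"{INT64_MAX:,}")                 # 9,223,372,036,854,775,807
```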

But fixing the 2038 issue is going to be more complicated, because it involves far more than a single system, Seltzer reports. “Designing a new standard is easy, but adapting all the existing software which relies on the old standard is very, very hard.” As an example, think of how much effort people are expending converting from IPv4 to IPv6.

More generally, how do we keep this type of issue from happening?

  • Always assume a program is going to be around longer than you think. The stupid little routine you run every morning to handle a quick-and-dirty problem is likely still going to be used long after you’re dead and buried. If you’ve got some sort of counter or field in your program, make it gigantic. For example, if programmers had made time_t an unsigned 32-bit integer, the life of the data type would have extended to the year 2106, Seltzer notes.
  • Include boundary conditions in your testing. What happens if a field gets too big? How long will it take each of the fields to fill up? (Notably, the 787’s problem was caught during laboratory testing.)
  • Document, document, document. Make sure that the documentation for the program clearly notes the boundary limits for each of the fields, in terms that programmers years from now can understand. That way, at least when the people 40 years from now are looking at the program, they know the problem is coming.
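The "how long will it take to fill up?" question from the list above can be answered before a line of production code ships. The sketch below uses two hypothetical helpers, `max_value` and `last_representable`, and assumes a field that counts whole seconds since the Unix epoch. It reproduces both the 2038 limit and the 2106 limit for the unsigned case.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def max_value(bits: int, signed: bool) -> int:
    """Largest value an integer field of the given width can hold."""
    return 2 ** (bits - 1) - 1 if signed else 2**bits - 1


def last_representable(bits: int, signed: bool) -> datetime:
    """Last moment a seconds-since-epoch field of this width can store.

    Only meant for 32-bit fields; wider ones exceed datetime's range.
    """
    return EPOCH + timedelta(seconds=max_value(bits, signed))


print(last_representable(32, signed=True).isoformat())   # 2038-01-19T03:14:07+00:00
print(last_representable(32, signed=False).isoformat())  # 2106-02-07T06:28:15+00:00

# A signed 64-bit field is too big for datetime, so report years instead.
years_64 = max_value(64, signed=True) / SECONDS_PER_YEAR
print(f"about {years_64:.1e} years")
```

Numbers like these are also exactly what belongs in the documentation: the limit, the unit, and the date on which the field runs out.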

In the meantime, maybe go turn everything in your office off and back on again. Just to be safe.
