Cover Story (sidebar) / April 1998

Embedded Reliability: Bet Your Life

Tom R. Halfhill

Your life literally depends on millions of invisible computers that control everything from commercial airliners and antilock braking systems to traffic lights and medical equipment. It's a good thing those computers don't crash as often as PCs, because real life does not let you undo.

Embedded control systems far outnumber PCs, and they're multiplying faster than AOL disks. Occasionally they do fail, sometimes with catastrophic results. In 1996, an Ariane 5 rocket exploded after a program tried to convert a 64-bit floating-point value into a 16-bit signed integer that was too small to hold it. In 1991, an Iraqi Scud missile killed 28 Americans when a computer's clock drift prevented a Patriot missile battery from tracking the target accurately. In 1986 and 1987, three cancer patients died when a pair of Therac-25 radiation-therapy machines accidentally blasted them with lethal doses of radiation.
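
To see why a narrowing conversion is dangerous, consider the short Python sketch below. It is an illustration only, not the Ariane 5 flight code (which was written in Ada and failed through an unhandled exception rather than a silent wraparound). The function names are hypothetical. The first function shows what happens when only the low 16 bits of a value are kept; the second refuses to convert a value that does not fit.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Truncate toward zero, then keep only the low 16 bits.

    Mimics a two's-complement wraparound: out-of-range values
    silently turn into unrelated numbers.
    """
    low16 = int(value) & 0xFFFF
    return low16 - 0x10000 if low16 >= 0x8000 else low16


def to_int16_checked(value: float) -> int:
    """Convert only if the truncated value fits in a signed 16-bit integer."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in 16 bits")
    return n


if __name__ == "__main__":
    print(to_int16_unchecked(100.0))    # in range: unchanged
    print(to_int16_unchecked(40000.0))  # out of range: wraps to a negative number
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        print("rejected:", err)
```

A checked conversion does not make the problem go away; the caller still has to decide what to do when the value is rejected. It does, however, turn a silent corruption into an explicit event that the program can handle.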

But those kinds of failures make news precisely because they're rare. Millions of vehicles and other devices work flawlessly, day after day. What makes embedded systems so reliable?

Experts cite three factors: Reliability is a high priority; developers try to keep embedded systems as simple as possible; and developers and customers alike resist making extensive changes to smoothly running systems.

IBM was the prime contractor for many of the software systems on the Space Shuttle. It took eight years to write the first programs, says Dr. Barry Feigenbaum, senior software engineer for IBM network-computing solutions. Neither IBM nor NASA is eager to change the code. "Old vintage code tends to be more reliable than new, fresh code that hasn't aged yet," he points out.

The microkernel in QNX Software Systems' embedded OS has not changed at all since 1991, notes Greg Bergsma, corporate communications manager for QNX. The QNX OS is found in the monitoring equipment at nuclear power plants, medical-imaging devices, chemical-processing systems, the Space Shuttle's "Canadarm," and the Shuttle's new payload bay vision system. Some QNX systems have been running without a reboot for three years.

QNX keeps the microkernel small — just 10 KB — and it contains only 14 calls. Just the kernel and the interrupt-service routines run in ring 0 (x86 terminology for a supervisor, kernel, or executive mode). Everything else — the file system, device managers, network services, the optional GUI, and other pieces of system software — runs as independent processes in separate partitions. One process is a "software watchdog," dedicated to handling memory violations.
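
The idea of isolating services in separate processes and letting a watchdog deal with the ones that fail can be sketched in a few lines of Python. This is not the QNX API; the `supervise` function and its `max_restarts` parameter are hypothetical, and an ordinary operating-system process stands in for a QNX service. The point is only that a fault in the child cannot corrupt the supervisor, which sees nothing more than an abnormal exit code and can start a fresh copy.

```python
import subprocess
import sys
from typing import Sequence


def supervise(command: Sequence[str], max_restarts: int = 3) -> int:
    """Run `command` as a separate process and restart it when it fails.

    Returns the number of restarts that were performed. A nonzero
    return code (including termination by a signal) counts as a failure.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(list(command)).returncode
        if exit_code == 0:
            return restarts          # clean exit: nothing more to do
        if restarts == max_restarts:
            return restarts          # give up after the allowed restarts
        restarts += 1                # otherwise start a fresh copy


if __name__ == "__main__":
    # Hypothetical "service" that always fails, to exercise the restart path.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print("restarts:", supervise(failing, max_restarts=2))
```

A real watchdog would also log the fault, limit how quickly it restarts a service, and decide when a failure is serious enough to stop the whole system.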

To minimize complexity, some embedded systems shun multithreaded code, which is thorny to debug. NASA almost lost control of the Mars Pathfinder last year when a thread-priority conflict, known as a priority inversion, caused the lander's computer to repeatedly reboot itself. Engineers at the Jet Propulsion Laboratory traced the problem to a wrongly initialized Boolean parameter in Wind River's VxWorks OS. Luckily, they were able to upload a patch; on-site service wasn't an option.
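
The Pathfinder problem is widely described as a priority inversion: a low-priority thread holds a lock that a high-priority thread needs, and a medium-priority thread keeps the low-priority one from running long enough to release it. The toy simulation below illustrates the effect and the usual remedy, priority inheritance. It is a sketch under invented assumptions, not a model of the Pathfinder software or of VxWorks; the three tasks, their arrival times, and their workloads are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int       # larger number = more urgent
    arrival: int        # tick at which the task becomes ready
    work: int           # ticks of CPU time the task needs
    needs_lock: bool    # True if all its work is done holding the shared lock
    remaining: int = 0
    finished_at: int = -1


def simulate(inherit: bool) -> dict:
    """Run a one-CPU, fixed-priority schedule; return finish tick per task."""
    tasks = [
        Task("low", 1, arrival=0, work=3, needs_lock=True),
        Task("high", 3, arrival=1, work=1, needs_lock=True),
        Task("medium", 2, arrival=2, work=5, needs_lock=False),
    ]
    for t in tasks:
        t.remaining = t.work
    holder = None       # task currently holding the shared lock
    tick = 0
    while any(t.remaining > 0 for t in tasks):
        ready = [t for t in tasks if t.arrival <= tick and t.remaining > 0]
        # A task is blocked if it needs the lock and another task holds it.
        blocked = [t for t in ready
                   if t.needs_lock and holder is not None and holder is not t]
        runnable = [t for t in ready if all(t is not b for b in blocked)]
        if not runnable:
            tick += 1
            continue

        def effective(t: Task) -> int:
            # With inheritance, the lock holder borrows the priority of
            # the most urgent task waiting on the lock.
            if inherit and t is holder and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        current = max(runnable, key=effective)
        if current.needs_lock:
            holder = current
        current.remaining -= 1
        tick += 1
        if current.remaining == 0:
            current.finished_at = tick
            if holder is current:
                holder = None   # lock released when the work is done
    return {t.name: t.finished_at for t in tasks}


if __name__ == "__main__":
    print("without inheritance:", simulate(inherit=False))
    print("with inheritance:   ", simulate(inherit=True))
```

In this made-up schedule the high-priority task finishes at tick 9 without inheritance, after the medium-priority task has used up all its time, and at tick 4 with inheritance. On a real system with a deadline monitor, the first case is the sort of delay that can trigger a reset.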

That tale and other famous failures should raise a red flag for PC developers, who hurry larger programs to market with less testing. Unfortunately, the cold, hard realities of the marketplace make it almost impossible for PC developers to borrow much wisdom from their embedded-systems brethren.

Copyright 1994-1998 BYTE
