Cover Story / April 1998

Crash-Proof Computing

Here's why today's PCs are the most crash-prone computers
ever built — and how you can make yours more reliable.

Tom R. Halfhill

Men are from Mars. Women are from Venus. Computers are from hell.

At least that's how it seems when your system suddenly crashes, wiping out an hour of unsaved work. But it doesn't have to be that way. Some computers can and do run for years between reboots. Unfortunately, few of those computers are PCs.

If mainframes, high-end servers, and embedded control systems can chug along for years without crashing, freezing, faulting, or otherwise refusing to function, then why can't PCs? Surprisingly, the answer has only partly to do with technology. The biggest reason why PCs are the most crash-prone computers ever built is that reliability has never been a high priority — either for the industry or for users. Like a patient seeking treatment from a therapist, PCs must want to change.

"When a 2000-user mainframe crashes, you don't just reboot it and go on working," says Stephen Rochford, an experienced consultant in Colorado Springs, Colorado, who develops custom financial applications. "The customer demands to know why the system went down and wants the problem fixed. Most customers with PCs don't have that much clout."

Fortunately, there are signs that everyone is paying slightly more attention to the problem. Users are getting fed up with time-consuming crashes — not to mention the complicated fixes that consume even more time — but that's only one factor. For the PC industry, the prime motives seem to be self-defense and future aspirations.

With regard to self-defense: Vendors are struggling to control technical-support costs, while alternatives such as network computers (NCs) are making IT professionals more aware of the hidden expenses of PCs. With regard to future aspirations: The PC industry covets the prestige and lush profit margins of high-end servers and mainframes. But processing power alone does not a mainframe make. When the chips are down, high availability must be more than just a promise.

That's why the PC industry is working on solutions that should make crashes a little less frequent. We're starting to see OSes that upgrade themselves, applications that repair themselves, sensors that detect impending hardware failures, development tools that help programmers write cleaner code, and renewed interest in the time-tested technologies found in mainframes and mission-critical embedded systems. As a bonus, some of those improvements will make PCs easier to manage, too.

But don't celebrate yet — it's hardly a revolution. Change is coming slowly, and PCs will remain the least reliable computers for years to come.

Why PCs Crash

Before examining the technical reasons why PCs crash, it's useful to analyze the psychology of PCs — by far the biggest reason for their misbehavior. The fact is, PCs were born to be bad.

"The fundamental concept of the personal computer was to make trade-offs that guaranteed PCs would crash more often," declares Brian Croll, director of Solaris product marketing at Sun Microsystems. "The first PCs cut corners in ways that horrified computer scientists at the time, but the idea was to make a computer that was more affordable and more compact. Engineering is all about making trade-offs."

It's not that PC pioneers weren't interested in reliability. It's just that they were more interested in chopping computers down to size so that everybody could own one. They scrounged the cheapest possible parts to build the hardware, and they took dangerous shortcuts when writing the software.

For instance, to wring the most performance out of slow CPUs and a few kilobytes of RAM, early PCs ran the application program, the OS, and the device drivers in a common address space in main memory. A nasty bug in any of those components would usually bring down the whole system. But OS developers didn't have much choice, because early CPUs had no concept of protected memory or a kernel mode to insulate the OS from programs running in user mode. All the software ran in a shared, unprotected address space, where anything could clobber anything else, bringing the system down.

Ironically, though, the first PCs were fairly reliable, thanks to their utter simplicity. In the 1970s and early 1980s, system crashes generally weren't as common as they are today. (This is difficult to document, but almost everyone swears it's true.) The real trouble started when PCs grew more complex.

Consider the phenomenal growth in code size of a modern OS for PCs: Windows NT. The original version in 1992 contained 4 million lines of source code — considered quite a lot at the time. NT 4.0, released in 1996, expanded to 16.5 million lines. NT 5.0, due this year, will balloon to an estimated 27 million to 30 million lines. That's about a 700 percent growth in only six years.

"People who build reliable systems don't radically change the system very often," says Sun's Croll. (Solaris is holding fairly steady at 7 million to 8 million lines of code.) "PCs tend to have boatloads of fresh, virgin, untested code. The sheer number of lines of code makes bugs more likely. The code you never write has no bugs."

Engineers who work with mainframes and critical embedded systems agree. "Having 15 million lines of code isn't as bad as having 15 million lines of new code," notes Wayman Thomas, director of mainframe solutions for Candle, which makes performance monitors and other software for large-scale servers and mainframes. (See the sidebars "Why Mainframes Rarely Crash" and "Embedded Reliability: Bet Your Life".)

However, Russ Madlener, Microsoft's desktop OS product manager, says that code expansion is manageable if developers expand their testing, too. He says the NT product group now has two testers for every programmer. "I wouldn't necessarily say that bugs grow at the same rate as code," he adds.

It's true that NT is more crash-resistant than Windows 95, a smaller OS with much older roots. And both crash less often than the Mac OS, which is older still. In this case, new technology compensates for NT's youth and girth. NT has more robust memory protection and rests on a modern kernel, while Windows 95 has more limited memory protection and totters on the remnants of MS-DOS and Windows 3.1. The Mac OS has virtually no memory protection and allows applications to multitask cooperatively in a shared address space — a legacy of its origins in the early 1980s.

Still, it will be interesting to see how stable NT remains as it grows fatter. And grow fatter it will, because nearly everybody wants more features. Software vendors want more features because they need reasons to sell new products and upgrades. Chip makers and system vendors need reasons to sell bigger, faster computers. Computer magazines need new things to write about. Users seem to have an insatiable demand for more bells and whistles, whether they use them or not.

"The whole PC industry has come to resemble a beta-testing park," moans Pavle Bojkavski, a law student at the University of Amsterdam who's frustrated by the endless cycle of crashes, bug fixes, upgrades, and more crashes. "How about developing stable computers using older technology? Or am I missing a massive rise in the number of masochists globally who just love being punished?"

Although there are dozens of technical reasons why PCs crash, it all comes down to two basic traits: the growth spurt of complexity, which has no end in sight, and the low emphasis on reliability. Attempts to sell simplified computers (such as NCs) or scaled-down applications (such as Microsoft Write) typically meet with resistance in the marketplace. For many users, it seems the stakes aren't high enough yet.

"If you're using [Microsoft] Word and the system crashes, you lose a little work, but you don't lose a lot of money, and no one dies," explains Sun's Croll. "It's a worthwhile trade-off."

Causes Behind Crashes

You can sort the technical reasons for crashes into two broad categories: hardware problems and software problems.

Genuine hardware problems are much less common, but you can't ignore the possibility. One downside to the recent sharp drop in system prices (see "Disposable PCs," February) is that manufacturers are cutting corners more closely than ever before. Inexpensive PCs aren't necessarily shoddy PCs, but sometimes they are. (See the sidebar "It's a Hardware Problem!".)

Another cause of mysterious crashes, outright sabotage, is beyond the scope of this article. The dangers of viruses, worms, and Trojan horse programs are well documented, and it's really a security issue. And, of course, nefarious behavior isn't limited to software. In a study of 10,000 help-desk calls, analysts at Workgroup Technologies discovered that 10 calls in one month at one company came from users whose SIMMs had been stolen. A former CIO at a publishing company told BYTE that his employees frequently upgraded their systems by pilfering SIMMs from other employees' machines. (Robin Hood strikes again.)

Generally, though, when a computer crashes, it's the software that's failed. If it's an application, you stand to lose your unsaved work in that program, but a good OS should protect the memory partitions that other programs occupy. Sometimes, however, the crashed program triggers a cascade of software failures that brings down the entire system.

Then the only recourse is to reboot, sacrificing unsaved work in all open applications. And because neither the OS nor the applications get a chance to clean up after themselves — by closing open files, deleting temporary files, flushing I/O channels, and so forth — an abrupt reboot can leave debris on the hard disk or even scramble the disk. This leads to more instability, more crashes, and lost data.

So why do programs crash? Chiefly, there are two reasons: A condition arises that the program's designer didn't anticipate, so the program doesn't handle the condition; or the program anticipates the condition but then fails to handle it in an adequate manner.

In a perfect world, every program would handle every possible condition, or at least it would defer to another program that can handle it, such as the OS. But in the real world, programmers don't anticipate everything. Sometimes they deliberately ignore conditions that are less likely to happen — perhaps in trade for smaller code, faster code, or meeting a deadline. In those cases, the OS is the court of last resort, the arbiter of disturbances that other programs can't resolve. "At the OS level, you've got to anticipate the unanticipated, as silly as that sounds," says Guru Rao, chief engineer for IBM's System/390 mainframes.

To deal with these dangers, programmers must wrap all critical operations in code that traps an error within a special subroutine. The subroutine tries to determine what caused the error and what should be done about it. Sometimes the program can quietly recover without the user's knowing that anything happened. In other cases, the program must display an error message asking the user what to do. If the error-handling code fails, or is missing altogether, the program crashes.
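To make the pattern concrete, here is a minimal C sketch, not drawn from any shipping product, of a critical operation wrapped in error-trapping code. The file name is hypothetical; the point is that the failure is anticipated and reported instead of being left to crash the program.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    /* Error-trapping wrapper around a critical operation: opening a data file.
       If the operation fails, the wrapper reports why instead of letting the
       program blunder ahead with a bad file handle. */
    static FILE *open_data_file(const char *path)
    {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL) {
            fprintf(stderr, "Can't open %s: %s\n", path, strerror(errno));
            return NULL;                /* the caller must check for NULL, too */
        }
        return fp;
    }

    int main(void)
    {
        FILE *fp = open_data_file("settings.dat");   /* hypothetical file name */
        if (fp == NULL)
            return 1;                   /* fail gracefully, not fatally */
        /* ... read the file here ... */
        fclose(fp);
        return 0;
    }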

Autopsy of a Crash

Crash is a vague term used to describe a number of misfortunes. Typically, a program that crashes is surprised by an exception, caught in an infinite loop, confused by a race condition, starved for resources, or corrupted by a memory violation.

Exceptions are run-time errors or interrupts that force a CPU to suspend normal program execution. (Java is a special case: The Java virtual machine [VM] checks for some run-time errors in software and can throw an exception without involving the hardware CPU.) For example, if a program tries to open a nonexistent data file, the attempt fails with an error that means "File not found." If the program's error-trapping code is poor or absent, the program gets confused.

That's when a good OS should intervene. It probably can't correct the problem behind the scenes, but it can at least display an error message: "File not found: Are you sure you inserted the right disk?" However, if the OS's error-handling code is deficient, more dominoes fall, and eventually the whole system crashes.

Sometimes a program gets stuck in an infinite loop. Due to an unexpected condition, the program executes the same block of code over and over again. (Imagine a person so stupid that he or she literally follows the instructions on a shampoo bottle: "Lather. Rinse. Repeat.") To the user, a program stuck in an infinite loop appears to freeze or lock up. Actually, the program is running furiously.
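One classic way a small oversight produces an endless loop appears in the C fragment below; it is purely illustrative. The programmer meant to count down to zero, but an unsigned variable can never drop below zero, so the exit test never fails.

    #include <stdio.h>

    int main(void)
    {
        unsigned int i;

        /* Intended to count down from 10 to 0 and stop.  But when i reaches 0
           and is decremented, it wraps around to a huge positive value, so the
           condition i >= 0 is always true: an infinite loop. */
        for (i = 10; i >= 0; i--)
            printf("%u\n", i);

        return 0;   /* never reached */
    }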

Again, a good OS will intervene by allowing the user to safely stop the process. But the process schedulers in some OSes have trouble coping with this problem. In Windows 3.1 and the Mac OS, the schedulers are cooperative: they depend on each program to voluntarily yield the CPU, so a runaway process can hog the machine. Windows 95 and NT, OS/2, Unix, Linux, and most other modern OSes schedule preemptively, so the OS itself can interrupt a process that refuses to give up the CPU.

Race conditions are similar to infinite loops, except they're usually caused by something external to the program. Maybe the program is talking to an external device that isn't responding as quickly as the program expects — or the program isn't responsive to the device. Either way, there's a failure to communicate. The software on each end is supposed to have time-out code to handle this condition, but sometimes the code isn't there or doesn't work properly.
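A defensive version in C might look like the sketch below. Here device_ready is a hypothetical stand-in for polling real hardware; the essential part is the time-out that turns a hang into an error the program can report.

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for polling a device's status register. */
    static int device_ready(void)
    {
        return 0;   /* simulate a device that never responds */
    }

    /* Wait for the device, but give up after a few seconds instead of
       spinning forever. */
    static int wait_for_device(int timeout_seconds)
    {
        time_t start = time(NULL);

        while (!device_ready()) {
            if (time(NULL) - start >= timeout_seconds)
                return -1;      /* timed out; let the caller report an error */
        }
        return 0;               /* device answered in time */
    }

    int main(void)
    {
        if (wait_for_device(3) != 0)
            fprintf(stderr, "Device not responding; giving up.\n");
        return 0;
    }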

Resource starvation is another way to crash. Usually, the scarce resource is memory. A program asks the OS for some free memory; if the OS can't find enough memory at that moment, it denies the request.

Again, the program should anticipate this condition instead of going off and sulking, but sometimes it doesn't. If the program can't function without the expected resources, it may stop dead in its tracks without explaining why. To the user, the program appears to be frozen.

Even worse, the program may assume it got the memory it asked for. This typically leads to a memory violation. When a program tries to use memory it doesn't legitimately own, it either corrupts a piece of its own memory or attempts to access memory outside its partition.
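In C, that whole chain of events can start with a single unchecked call to malloc. The sketch below is illustrative, not drawn from any real application.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t bytes = 64 * 1024 * 1024;    /* a large memory request */
        char *buffer = malloc(bytes);

        /* If the OS can't spare the memory, malloc returns NULL.  A program
           that assumes success and writes through the pointer anyway commits
           a memory violation; on a protective OS, that means an instant crash. */
        if (buffer == NULL) {
            fprintf(stderr, "Not enough memory; skipping this operation.\n");
            return 1;
        }

        memset(buffer, 0, bytes);           /* safe: the program really owns it */
        free(buffer);
        return 0;
    }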

What happens next largely depends on the strength of the OS's memory protection. A vigilant OS won't let a program misuse memory. When the program tries to access an illegal memory address, the CPU throws an exception. The OS catches the exception, notifies the user with an error message ("This program has attempted an illegal operation: invalid page fault"), and attempts to recover. If it can't, it either shuts down the program or lets the user put the program out of its misery.

Not every OS is so protective. When the OS doesn't block an illegal memory access, the errant program overwrites memory that it's using for something else, or it steals memory from another program. The resulting memory corruption usually sparks another round of exceptions that eventually leads to a crash.

Corruption also occurs when a program miscalculates how much memory it already has. For instance, a program might try to store some data in the nonexistent 101st element of a 100-element array. When the program overruns the array bounds, it overwrites another data structure. The next time the program reads the corrupted data structure, the CPU throws an exception. Wham! Another crash.
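The off-by-one bug behind such an overrun looks deceptively harmless in C. This fragment is purely illustrative; where the stray write lands depends on how the compiler laid out memory.

    #include <stdio.h>

    #define ELEMENTS 100

    int main(void)
    {
        int scores[ELEMENTS];   /* valid indices run from 0 through 99 */
        int other_data = 42;
        int i;

        /* Off-by-one bug: the loop runs 101 times, so the final store lands
           outside the array and may corrupt whatever sits next to it. */
        for (i = 0; i <= ELEMENTS; i++)
            scores[i] = 0;

        printf("other_data is now %d\n", other_data);   /* it started as 42 */
        return 0;
    }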

Altered States

Modern PCs suffer from a whole other class of problems related to their state — the sum total of all the information that defines the machine's status or condition. State information includes all the software installed on the hard disk, the configuration files, the control panel settings, the configurable data in the BIOS, and the user's preferences settings. It's everything that makes one system different from another system that has identical hardware.

Before PCs had hard drives, they were essentially stateless. They stored everything on floppy disks and tapes. Users and administrators never had to install, uninstall, or manage any software on the system. Because the state information was independent of the machine, it was almost impervious to any disaster that befell the machine. If a meteor destroyed your PC, you could replace it with another PC and get back to work immediately. There was nothing to reinstall or reconstruct. (Today, NCs attempt to recreate this pure statelessness by storing everything on a server.)

By contrast, modern PCs hoard an immense amount of state information that's constantly changing. Even when you're staring blankly at the screen, a brief flurry of disk activity might signal that your OS is modifying its registry settings in the background. Problems arise when a change of state knocks the system off balance. Usually this happens after the installation of some new software — a new version of the OS, a new application, an updated device driver, or just about anything. Suddenly the system doesn't work like it used to. You are the victim of a software conflict that's often incredibly difficult to fix because you're not sure what changed or how to change it back.

Two of the biggest culprits are DLLs on Windows PCs and extensions on Macs. DLLs are code libraries that different programs can share. Extensions are programs that hook into the Mac OS during boot-up to modify the system's behavior or augment the capabilities of an application. Both types of components inflict ridiculous amounts of aggravation.

One common problem occurs when a software installer dumbly replaces a newer version of a component with an older version. The newly installed application works fine, but an existing application might start crashing. Users aren't sure whom to blame. Result: a series of frustrating tech-support calls.

Shouldn't the installer merely check a component's date stamp before replacing it? Alas, it's not always that simple. Sometimes the date stamp isn't definitive, or maybe it has changed. Windows allows an installer to query a DLL to discover its actual version number, which is safer. But even if every installer were this careful, version management is only one problem. "Some companies tend to change functions in a common DLL without telling everyone right away, and those changes can cause problems for existing programs," says Dave Galligher, product-development manager at Cougar Mountain Software, an accounting software vendor.
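On Windows, reading that version number is a matter of calling the Win32 version-information API instead of trusting the file's date stamp. The C sketch below shows the idea; the DLL path is hypothetical, and the program must be linked with version.lib.

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *dll = "C:\\WINDOWS\\SYSTEM\\EXAMPLE.DLL";  /* hypothetical */
        DWORD handle = 0;
        DWORD size = GetFileVersionInfoSizeA(dll, &handle);
        void *data;
        VS_FIXEDFILEINFO *info;
        UINT len;

        if (size == 0) {
            fprintf(stderr, "No version resource found in %s\n", dll);
            return 1;
        }
        data = malloc(size);
        if (data == NULL || !GetFileVersionInfoA(dll, 0, size, data)) {
            fprintf(stderr, "Couldn't read version information.\n");
            free(data);
            return 1;
        }
        if (VerQueryValueA(data, "\\", (void **)&info, &len)) {
            printf("%s is version %u.%u.%u.%u\n", dll,
                   (unsigned)HIWORD(info->dwFileVersionMS),
                   (unsigned)LOWORD(info->dwFileVersionMS),
                   (unsigned)HIWORD(info->dwFileVersionLS),
                   (unsigned)LOWORD(info->dwFileVersionLS));
        }
        free(data);
        return 0;
    }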

Programs expect their DLLs to contain functions that have a particular name, a particular list of calling parameters, and particular return values. But Windows has no standard mechanism for querying a DLL to confirm this information. A program that relies on a DLL function to return a 32-bit integer value could easily crash if a different version of the DLL returns a 64-bit integer instead.
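The root of the problem is easy to see in C. A program locates a DLL function by name and then asserts, by way of a cast, what that function's parameters and return type are; nothing ever checks the assertion. In this sketch, ACCOUNT.DLL and GetBalance are invented names.

    #include <windows.h>
    #include <stdio.h>

    /* The caller's assumption about the function's signature.  Windows has no
       way to verify it against what the DLL actually exports. */
    typedef long (__stdcall *GETBALANCE)(int account_id);

    int main(void)
    {
        HMODULE lib = LoadLibraryA("ACCOUNT.DLL");      /* hypothetical DLL */
        GETBALANCE GetBalance;

        if (lib == NULL) {
            fprintf(stderr, "Couldn't load ACCOUNT.DLL\n");
            return 1;
        }

        GetBalance = (GETBALANCE)GetProcAddress(lib, "GetBalance");
        if (GetBalance == NULL) {
            fprintf(stderr, "GetBalance isn't in this version of the DLL.\n");
        } else {
            /* If a newer DLL changed the parameters or the return type,
               this call still compiles and still runs; it just returns
               garbage or crashes at run time. */
            printf("Balance: %ld\n", GetBalance(1001));
        }

        FreeLibrary(lib);
        return 0;
    }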

The problem of managing a system's state has spawned a whole subindustry of utility programs and management tools: CleanSweep, Conflict Catcher, Extensions Manager, First Aid Deluxe, Norton Utilities, Oil Change, RealHelp, TuneUp, Uninstaller, and dozens more. OS vendors are rapidly adding new management features to their system software, too. It's all because today's PCs require more care and feeding than a barrel full of Tamagotchi Giga Pets.

It's also a classic example of accelerating complexity. Components such as DLLs were invented to reduce complexity; programs wouldn't grow so fast if they shared common code. But installers began splattering so many DLLs all over the hard disk that they created a new problem. That, in turn, spurs the industry to produce new management tools, utilities, and OS features — still more complexity. It starkly demonstrates how difficult it will be to transform PCs into truly reliable systems.

"The highest management cost in an IT environment comes from managing PCs," says Steve Mann, vice president of product strategy for Computer Associates. "They're not very manageable, and they're not very standardized in terms of configurations."

The chore of managing PCs is directly related to reliability. In a survey of 1800 IT professionals at the Computer Associates world user conference in 1997, 70 percent of the respondents agreed that mainframes are more reliable than PC-based client/server systems. "It's only recently that administrators have begun demanding the same levels of manageability and reliability that they're used to with mainframes and large servers," says Mann.

Searching for Solutions

Any solution must start with the way developers write, test, and debug their source code. Beyond that, installers must do a better job of loading finished programs onto systems. Finally, the OSes and applications must work together to make PCs easier to manage.

At the risk of igniting a flame war, it's only logical to place a large portion of the blame where it belongs: on C and C++. "Writing in C or C++ is like running a chain saw with all the safety guards removed," says Bob Gray, senior director of consulting services for Virtual Solutions, a developer of custom industrial applications. "It's powerful, but it's easy to cut off your fingers."

Few, if any, languages make it so easy to write bad code. Of course, anyone can write bad code in any language, but C and C++ are famously unforgiving. The computer industry standardized on C/C++ for commercial software development over a decade ago, creating a mountain of buggy software that will haunt us for decades to come.

Diehards protest that the leanness of C/C++ is what makes it so fast. But PC hardware is getting so fast anyway that it's time to refocus instead on reliability. In the years ahead, as old-but-indispensable C/C++ programs continue to crash, the excuse that C/C++ conserves every CPU cycle will seem quaint — as quaint as coding the year in two digits instead of four, thus conserving 2 bytes of storage.

What's the alternative? Take your pick. All fourth-generation-language (4GL) tools are safer, including Delphi, PowerBuilder, TopSpeed, Smalltalk, and Visual Basic. Perhaps the best example of a modern language is Java. It contains numerous safeguards that stop many bugs before they happen (see the sidebar "Better Tools for Better Code").

Rushing development cycles to match "Internet years" is another source of trouble. "If you look at the industry today, we see six- or nine-month development cycles instead of 18-month cycles," says Gary Ulaner, group product manager for Quarterdeck's RealHelp. "There are also more programmers doing software development, and not all of them have the same level of discipline for quality assurance. The requirements of time to market and revenue often cause products to be shipped before they're ready."

One dubious solution is public beta testing. Time was, you had to be someone special to be a beta tester. Now anybody who has a computer, a modem, and a reckless disregard for system stability can test beta software. The novelty of being an insider who runs prerelease products (even if a million other people are doing the same thing) has made public betas a huge hit. But public betas are also responsible for spreading buggy code, leaving a wake of system crashes and trashed hard drives.

"Some people might not realize what beta means," says Virtual Solutions' Gray. "It's not just a trick way to get an early copy of a new product."

True, public betas expose fresh code to mass testing. But how many casual beta testers report unique bugs — or any at all? How many of them bother to remove the buggy software (including all its hidden components) from their system after the final product ships? How many realize what they're doing to their systems?

Microsoft's Madlener defends the practice of public betas but acknowledges that developers and users should be more careful. "Of late, we've been reviewing the disclaimer messages that come with these beta products," he says. "They call for some responsibility on the part of the beta testers, too, so they don't install the beta on a system that's mission-critical."

The next step is software installation — and installers need to get smarter. OS/2 Warp 4 has an integrated Feature Installer that makes sure the right files get saved in the right places without stepping on other components. It's not just for installing OS software, either; third-party developers can use it for applications. Unix package installers, which have been around a lot longer, do the same thing. There are also some good third-party installers, such as InstallShield for Windows and MindVision's Installer VISE for the Mac.

Madlener says Windows NT 5.0 will have a new Application Installer Service, which sounds a lot like OS/2's Feature Installer. It means that developers will no longer have to write their own setup code. Instead, NT 5.0 will execute a script that tells where each file goes. NT will arbitrate any DLL conflicts and keep a log of all new files and registry changes. According to Madlener, this will make it easier to cleanly uninstall the software or reinstall individual components.

Madlener says he doesn't know yet if other versions of Windows will get the installer, but he says Windows 98 will have a management tool called the System File Checker. This is a diagnostic program that checks system components and can reinstall missing or broken pieces. It also keeps a log that's a snapshot of the system's state, making it easier to reverse changes.

Automated Maintenance

An interesting but potentially hazardous solution to system maintenance is automatic updating. Few users or administrators have time to scour the Internet for the latest upgrades and patches. That has opened the door for utilities such as CyberMedia's Oil Change and Quarterdeck's TuneUp and RealHelp. They compare your system configuration to a database on the Web. Then they help you download and install any relevant updates. It's such a good idea that Microsoft is thinking about adding similar features to Windows.

But there's a danger: Every change of your system's state, no matter how minor, can potentially break some existing software. An older program might crash with a newer DLL or device driver, forcing you to upgrade that program as well. Sometimes this triggers a cascade of failures and fixes before the system returns to a stable state. Sometimes you reach a dead end in which no update for a broken program is available. And inevitably, the upgrades consume more memory, disk space, and CPU resources, accelerating the day when your PC becomes obsolete.

The phenomenon of new software breaking old software is well known to software engineers. Alan Wood, senior engineer at Tandem Computer, says fixes to Tandem's NonStop Kernel typically break something else in the OS about 5 percent to 10 percent of the time. Tandem catches those problems with thorough regression testing. But it's hard to perform that kind of formal testing on PCs: Every PC is slightly different.

Utilities such as Oil Change and TuneUp recognize this hazard. They log every alteration and save replaced components in a compressed archive, so you can undo an installation. But there's still a chance you'll wade deep into a series of changes and won't be able to roll back the system.

Applications can take some responsibility for system management, too. When a user launches Office 98 for the Mac, it performs a self-diagnostic. If it can't find any of its shared libraries — perhaps the user mistakenly disabled a library with the Extensions Manager — Office 98 installs a fresh copy from a compressed archive on the hard disk. It all happens invisibly, so the user won't even notice. Microsoft says future versions of Office for Windows will also be self-repairing.

The Essence of PCs

Of course, every new feature, management tool, OS upgrade, and utility program adds still more code and complexity to a system. Some experts think PCs won't stop crashing until everyone accepts the futility of "feature shock." In other words, the shortest path to stability is simplicity: simpler hardware, simpler software, simpler user interfaces. But this demands a whole new way of thinking, says Michael L. Dertouzos, director of the MIT Laboratory for Computer Science: "It's more difficult, a little bit like birth control."

He says the change, if it ever comes, could begin as a grass-roots rebellion. Someone will use the Web to distribute a leaner, meaner OS that circumvents the entrenched platforms. It'll be more stable, easier to use, and better understood.

It sounds a lot like what's happening today with Linux, or the early days of Mosaic. But Linux flunks the simplicity test, and Mosaic begat Navigator, which begat Communicator. Simple software doesn't stay simple for long.

At the other extreme is the NC concept: a stateless, simplified client designed for a wired world. But NCs sacrifice the crucial essence of a PC — unlimited local control. Mainframes and critical embedded systems achieve their high reliability by sacrificing local control, too. For better or for worse, many users and IT professionals would rather crash than switch.

That's why the ultimate solution is a long way off. Realistically, developers will continue to write bigger programs that ship before they're ready. OSes will continue to grow more complicated. Users will continue to vote with their dollars for feature-laden software. Established platforms and applications will continue to overshadow radical alternatives. And PCs will continue to crash.

Where to Find

Candle
Santa Monica, CA
Phone: 310-829-5800
Internet: http://www.candle.com/

Computer Associates
Islandia, NY
Phone: 516-342-5224
Internet: http://www.cai.com/

Merit Project
Internet: http://www.meritproject.com/

Cougar Mountain Software
Boise, ID
Phone: 208-375-4455
Internet: http://www.cougarmtn.com/

IBM Server Group
Poughkeepsie, NY
Phone: 770-863-1234
Internet: http://www.s390.ibm.com/

MIT Laboratory for Computer Science
Cambridge, MA
Phone: 617-253-5851
Internet: http://www.lcs.mit.edu/

QNX Software Systems
Kanata, Ontario, Canada
Phone: 613-591-0931
Internet: http://www.qnx.com/

Quarterdeck
Marina del Rey, CA
Phone: 310-309-3700
Internet: http://www.quarterdeck.com/

Sun Microsystems (Solaris)
Mountain View, CA
Phone: 650-786-7737
Internet: http://www.sun.com/solaris/

Tandem Computer
Cupertino, CA
Phone: 408-285-6000
Internet: http://www.tandem.com/

Upgrading and Repairing PCs
Que
Phone: 317-581-3500
Internet: http://www.mcp.com/info/0-7897/0-7897-1295-4/

Virtual Solutions
Irving, TX
Phone: 972-550-7900
Internet: http://www.vsol.com/

Why PCs Crash, and Mainframes Don't
[Illustration: PCs versus crash-resistant mainframe computers.]

Anatomy of a Crash
[Illustration: Anatomy of a crash.]

DLL Disasters
[Illustration: Dynamic-link library disasters. DLL conflicts are a common cause of crashes.]

Tom R. Halfhill is a BYTE senior editor based in San Mateo, California. You can reach him at tom.halfhill@byte.com.


Letters / June 1998

Amen!

My hat is off — again — to BYTE, this time for facing quality issues head-on. "Crash-Proof Computing" (April cover story) and "Reliability Counts" (May editorial) express my sentiments exactly. It is painful to find customers already angry at the whole computer industry when I show up for a meeting with them. While they vent their frustrations, I find myself apologizing for sins I didn't commit, before I can even try to help them deal with their disasters. OS weaknesses are the most frustrating. And yes, cheap, poorly made, and virtually unsupported hardware is a big problem, too. Apparently many vendors are forgetting to tell customers that good PCs actually cost less in the long run.

George Rogers Clark
grclark@usaconnect.com

Embedded Risk

"Crash-Proof Computing" is right on target: Reliable software is simply not a high priority for software vendors or for their customers. You also correctly say that embedded systems are generally more reliable than PC software. But embedded systems threaten to become more like PCs, as pressure rises for reduced development costs, shorter time-to-market, and shorter product life cycles. Most disturbing, the exploding complexity of embedded systems is comparable to what happened in the PC industry 10 years ago, and embedded systems developers have started to emulate the PC industry even where this is irresponsible. For example, many developers switch from languages like Ada, Pascal, or Modula-2 to C, rather than to safe languages, like Java or Component Pascal.

One final note: A single-address-space operating system is not a bad thing per se; it's even necessary for efficient component software. But we need better protection mechanisms, which can either be done in software, by using safe languages, or in hardware, by separating address mapping and memory protection.

Dr. Cuno Pfister
Managing director
Oberon Microsystems
http://www.oberon.ch/

Letters / July 1998

Reliability Does Count

Regarding your statement in "Reliability Counts" (May Editorial), "I'm hard-pressed to think of any other piece of hardware you can buy for $3000 that's as failure-prone as a PC"; well, I'm hard-pressed to think of any piece of hardware that costs $3000 that can do anywhere near the number of things that a PC can, and with incredibly cheap added cost for the software!

Pete Stoppani
pstoppani@msn.com

That depends on your perspective. But my bottom line is that it's time to re-examine the whole idea that PCs are unreliable because they're versatile. And apparently our readers agree. Our April cover story "Crash-Proof Computing" generated more mail than any story in recent memory. Read on. — Mark Schlack, editor-in-chief.

See, We Told You

I have answered many of the questions dealt with in "Crash-Proof Computing" so many times that my clients are beginning to think that I am just giving them a load of BS. It's great to be able to point to an authoritative source — BYTE — and say, "See, I told you!"

Robert Schuett
President, CMT Systems
Calgary, Alberta, Canada
schuett@cmt.net

No Reliable Criteria

I applaud your recently declared enthusiasm for PC reliability. As the manager of information systems in a medium-size municipal government, I have always emphasized reliability in my purchasing decisions. The problem is how to differentiate the good from the bad. What objective criteria exist? The trade press almost exclusively emphasizes raw performance. I am starting to see more emphasis on technical support, but maybe if the products were more solid to start with, we wouldn't need to rely so heavily on that. Couldn't we all use those hours spent hanging on the phone in a much better way?

Henry Kalb
Kalb@Allentowncity.org

"Crash-Proof Computing" was worth a whole year's subscription. Great article, with great visuals to support it. I've passed it on to several colleagues, most of them very knowledgeable about computers. Uniform response: a really good overview of a serious problem.

Ace Allen, M.D.
Editor, Telemedicine Today
http://www.telemedtoday.com

Mission Critical

I love PCs. I can live with some of the drawbacks because I recognize the PC for what it is. But in my area — industrial control — mission critical means if the computer fails, you have a disaster. Over the years I have seen the PC creep into that environment, put in with only one thought — saving money. The software has gotten buggier, downtime has increased, and unnecessary risks are taken to keep things going.

I could not agree with you more that shortcuts have been taken in PC design to achieve its price level. I also cannot believe that we are telling people to install as little software as possible. Isn't that what a computer is designed to run?

Andrew L. Winter
St. John's, Newfoundland, Canada

I agree that things are getting out of hand. We're already talking about what kinds of follow-up articles we can do. We're even kicking around the idea of a new benchmark program that will deliberately try to crash a system, so we can obtain hard data about reliability. I'm dubbing this program the "CrashMark." We don't know if we can do it in a way that's fair, but we're looking into it. — Tom R. Halfhill, senior editor.

No Screamers

Fortunately for me and other readers, BYTE doesn't lead with cover pages touting the latest 333-MHz screamers. Instead we got "Crash-Proof Computing." As the leader of a small team managing the implementation and support of more than 130 Intel, RS/6000, and Sun servers, I hold the concept of crash-proof computing dear to my heart. I am painfully aware of many of the issues you described, their underlying causes, and remedies. Your work in putting these issues together with direct reference to the technology, design purpose, and directions was excellent.

J. Dennis King
jdennisking@ibm.net

The Mainframe Perspective

As a software engineer who has worked on both sides of the fence, I take exception to the statements in "Crash-Proof Computing" that "anyone can write software for PCs" and "not just anybody can program a mainframe." Mainframe programs are just as likely to suffer from code bloat as their PC cousins, because on mainframes and midrange systems, system resources are usually not at a premium. I have seen many mainframe programs that had blocks of code endlessly repeated in the main routine, rather than put into a subroutine. It is very true that on the PC side a huge amount of code has been written by amateurs. The many tools for nonprogrammers encourage companies to use fledgling programmers rather than hire a professional. Things are really no better on the mainframe side. Many companies, faced with programmer shortages, are hiring people to do mainframe programming and conducting extensive "on the job" training. The myth persists that mainframe programmers are better trained and more skilled than their PC counterparts. The reality is that programming on any platform requires a high degree of skill and intelligence.

John Cahill
jcahill@scc911.com

I'd still argue that it's a lot easier to start hacking away with Visual Basic on a PC than it is to write code for an IBM S/390. Also, anybody who writes a program for a PC can easily distribute it as freeware or shareware on the Internet, which isn't how mainframe software is typically distributed.

Yes, mainframes, too, can suffer from code bloat. After my story went to press, I found out the number of lines of code in IBM's OS/390: 25 million. That's a lot of code, but it's still less than NT 5.0, and about the same as Windows 98, from what we hear. However, the programmers who wrote that repeating code you mention might have been highly skilled! "Inlining" code instead of using a subroutine makes the code larger, but it executes faster because the CPU doesn't have to branch as often. In fact, some optimizing compilers will automatically inline your code, even if you don't write it that way. Of course, it's also possible that the code you saw was simply written by a bad programmer. We definitely agree on your final point: Programming requires skill and intelligence. — Tom R. Halfhill, senior editor.

More Code Bloat

In "Crash-Proof Computing," author Tom R. Halfhill say that Windows NT 5.0, which will have an estimated 27 million to 30 million lines of code, represents about a 700 percent growth in code size in six years. What exactly do these millions of lines of code represent? Do they include user-space code, such as the user interface, system commands, etc.?

Andy Kahn
kahn@zk3.dec.com

I asked that question, too, and the general answer was, "It's everything we consider to be part of the OS." In other words, more than the kernel, but nailing it down in more detail is almost impossible. In fact, most OS vendors I contacted couldn't even quote me a number. Since Microsoft is arguing to the Justice Department that Internet Explorer 4.0 is an integral part of Windows 98, I guess that would be included, too. Gets fuzzy, doesn't it? But the actual number of lines is perhaps not as significant as the overall trend. — Tom R. Halfhill, senior editor.

Linux = Robust OS

"Crash-Proof Computing" avoided comparisons of the robustness of different operating systems. My main workhorse machine runs Linux. It never crashes. And from what I've heard, my experiences are typical of Linux users. Your readers need to know that there are choices in OSes that are virtually crash-proof. Switching to Linux may not solve everyone's problems, but it clearly excels in robustness. Isn't that what your article was about?

Rob Scala
New London, CT
rob@scalas.com

I had planned a chart that showed how frequently different OSes crash, but I soon discovered that reliable data is not available. There's a lot of anecdotal evidence, but that's not the same thing. I did mention Linux (and Unix in general) as an OS that has more modern features, such as preemptive task scheduling. But Linux is not a simple OS to install or configure. That is why I said it "flunks the simplicity test." It may not always flunk. I'd like to see more transparent installation and configuration, and there is progress in that direction. But anyone who thinks Linux is suitable for the average person obviously doesn't spend much time around the average person. The bottom line is that we need OSes for PCs that not only are reliable, but also are easy enough for the average PC user to manage. No current OS passes that test, I'm afraid. — Tom R. Halfhill, senior editor.

Don't Trust Anyone Under 30

"Crash-Proof Computing" was too good! It is this sort of professional, no-nonsense reporting that makes BYTE so worthwhile.

As much as I enjoy my PCs, I often have the impression that they and their operating systems were designed by young people who somehow missed out on the entire history of computing technology. One example: Under MVS we had a system catalog, a central repository of every file name in the complex and where it resided physically. Any application that wanted a file asked the OS; the OS looked in the catalog and gave the application a pointer to the file. If you changed the physical location of a file by any method, the catalog was automatically updated by the OS unless you explicitly specified otherwise. If you deleted or created a file, the catalog was updated. Reinstalling applications, moving them or even putting in a whole new OS made no difference.

Now move a file on the PC. You get a little flashlight on the screen and that sinking feeling. Move an application and the same thing will probably happen because the working directory has changed. Incredibly, physical volume pointers are stuck in the properties table, and if they are not there, you must type in the path at an application prompt. If the "catalog" isn't in your head or written on a piece of paper, you have a problem. PCs are useful, but you have to be under 30 years of age to be really impressed.

Garth Klatt
Softek Research
73642.1620@compuserve.com

The Last Nail

Mark Schlack's "Reliability Counts" hit the nail on the head. My current work project is the migration of a 100-client, 120-server Token ring LAN to a new facility. I can't even make any progress until I've waded through the endless repairs, rebuilds, reconfigures, and reconnects caused by bad hardware and software. Now that you know which nail to hit, please get as big a hammer as you and keep hitting it!

Crawford Leitch
itch@earthlink.net

Copyright 1994-1998 BYTE
