Cover Story / February 1994

How Safe Is Data Compression?

MS-DOS 6.0 with DoubleSpace raised real-time
data compression to a new level of visibility. As
controversy raged over its reliability, some concerned
users retreated from the technology. But the real
issue isn't data compression at all; it's how
compression is integrated into the operating
environment. New approaches promise to make this
technology much more foolproof.

Tom R. Halfhill

To paraphrase the old proverb, necessity is the mother of compression. If it weren't for the explosive growth in the size of operating systems, applications software, and data files, today's hard disks would be cavernous warehouses with acres of megabytes to spare. In just a few short years, desktop PCs with 40-MB hard drives have given way to systems that rival network servers: 200-MB, 400-MB, and even 1-GB hard drives aren't uncommon.

Yet despite the rapidly rising capacities and falling prices of hard disks, not everyone can afford the latest hardware, and some systems can't be upgraded. So it's no surprise that millions of PC users who bought MS-DOS 6.0 last year quickly embraced a new feature called DoubleSpace. Merely by installing some free compression software that ran invisibly in the background, they could virtually double the size of their existing hard disks.

But it wasn't that simple. Almost immediately, Microsoft was besieged by complaints about a myriad of problems, and some users reported catastrophic losses of data. Other people had no trouble at all and enthusiastically endorsed DoubleSpace. The controversy raged on BBSes and on-line services for months, while Microsoft steadfastly denied that DoubleSpace was buggy. Finally, last November, Microsoft released DOS 6.2 with several new safeguards.

Nevertheless, many users remain spooked about real-time data compression. Their fear is fed by persistent horror stories of users who have trashed megabytes of valuable files while using DoubleSpace and other on-the-fly disk compressors. Although in many cases the compression software is an incidental player, the association has been made: Compression is unreliable.

The real problem isn't data compression, though; it's how well the technology is implemented in the operating environment. And no environment is more hostile than that of DOS-based PCs.

PCs are plagued by dozens of "standards" covering everything from video cards to I/O buses. The memory layout resembles a map of the Balkans. TSRs and device drivers fight territorial battles over disputed memory blocks and interrupts. Applications think nothing of bypassing the ROM BIOS to save a few microseconds, and there are a number of versions of the BIOS from different vendors. Software installation programs automatically rewrite critical configuration files such as AUTOEXEC.BAT and CONFIG.SYS, often without notifying the oblivious user. Windows 3.1 layers a multitasking GUI atop a single-tasking, character-based operating system and contributes additional configuration files: SYSTEM.INI, WIN.INI, CONTROL.INI, and more. Different versions of DOS are available from three major companies. Most important, the DOS file system was simply not designed with real-time compression in mind.

Nearly all the troubles that users experienced can be traced to confusion or to odd interactions between the compression software and other parts of the system. That doesn't make the complaints any less serious, of course, but it does mean that the future of data compression is tightly bound to the continuing evolution of PCs. As PCs mature, become easier to use, and consolidate around better standards, data compression will steadily gain in popularity.

Transparency is the key: Data compression works best when it's completely transparent to both users and software. Today's compression software tries to keep a low profile, but it is frequently shoved into the open by forces beyond its control.

One lasting effect of the DoubleSpace debate is that all makers of compression software are paying even closer attention to safety issues. Current products will continue to improve, and new approaches are being explored.

For instance, some companies are working to move compression into hardware. By integrating the technology with the CPU's I/O bus — and, perhaps, even in the CPU itself — data compression could become as transparent as floating-point math. The goal is not only to conserve hard disk space but also to significantly improve system performance by keeping the data compressed while it moves over the bus to peripherals and networks, and maybe even to main memory.

Compression Goes Prime-Time

The basic concept of data compression is at least as ancient as the Romans, who figured out that the Roman numeral V required less space on a stone tablet than did IIIII. Modern compression techniques are widely used to shrink huge graphics, video, and sound files down to manageable size.

But those types of compression — JPEG, MPEG, Indeo, and the compressors included with QuickTime and Video for Windows — are so-called lossy methods; some data is irretrievably discarded when the files are compressed. Lossy compression is unacceptable for critical data, such as spreadsheets, databases, and text. For those types of files, only lossless compression will do: Not a single bit of valuable information can be lost during compression or decompression.

Before DoubleSpace, the most popular lossless compression products were the file-level utilities for archiving data on floppy disks and saving time during downloads. One of the leading file compressors for PCs is Phil Katz's PKZip, a shareware program so effective, it can squeeze the complete text of NAFTA (North American Free Trade Agreement) from its normal bureaucratic bulk of 3.3 MB down to a mere 568 KB — an impressive compression ratio of nearly 6 to 1. The resulting file is small enough to fit on a floppy disk or to download from a BBS. Similar utilities are available for Unix and the Macintosh (see the text box "Data Compression on the Macintosh").

But file-level utilities require users to run a program to compress and decompress the file. Some utilities can make self-extracting archives — a single executable file that encapsulates both the compressed data and the decompression program — but it's still not simple enough for casual users. For them, real-time, on-the-fly data compression is a better solution.

Real-time compressors run in the background, automatically shrinking files when they're saved on disk and expanding them when they're loaded. Most real-time compressors set up a compressed virtual drive on the uncompressed host drive, so compressing a file is as easy as saving or copying a file onto the new virtual drive.

For PCs, examples include DoubleSpace from Microsoft; Stacker from Stac Electronics (Carlsbad, CA); XtraDrive from Integrated Information Technology (IIT) (Santa Clara, CA); SuperStor Pro from AddStor (Menlo Park, CA); and DoubleDisk Gold from Vertisoft Systems (San Francisco, CA), which supplied Microsoft with compression technology for DoubleSpace.

In 1991, DR-DOS 6.0 from Digital Research was bundled with SuperStor, thus becoming the first version of DOS to include real-time data compression. Until DoubleSpace came along, however, real-time compressors were mainly confined to a relatively small market of power users. With the release of MS-DOS 6.0, millions of casual users who barely knew the difference between a physical drive and a logical drive were suddenly creating compressed volumes on their hard disks with nary a second thought — some with disastrous results.

Trouble with DoubleSpace?

Data-loss problems are always difficult to trace, and the natural tendency is to blame the last thing installed. With DoubleSpace, most problems seemed to fall into three categories: (1) file corruption caused by bad sectors on the hard disk; (2) puzzling disk-full errors caused by badly fragmented compressed drives or lower-than-expected compression ratios; and (3) subtle interactions with other software, including the SmartDrive disk caching in DOS 6.0.

Microsoft responded by adding several safeguards to DOS 6.2 and DoubleSpace. ScanDisk, a new diagnostic/repair utility, fixes damaged files and automatically scans the hard disk for surface errors before DoubleSpace is installed. DoubleGuard, a new protection option, alerts users if another program or TSR corrupts the RAM-resident portions of DoubleSpace. SmartDrive no longer turns on writeback caching by default. And Microsoft also made it easier to remove DoubleSpace altogether — a paradoxical but popular feature in other compression products.

Despite the safeguarding efforts of Microsoft and others, some users have simply reached the end of their ropes. Steven Polinsky, a lawyer in Ridgefield, New Jersey, says he removed Stacker from the 40-MB hard drive of his desktop PC after experiencing mysterious errors, even though he's not sure Stacker was to blame. Next, he installed DoubleSpace on his laptop computer's 60-MB hard drive but immediately ran into video initialization trouble with Windows. Now, he's reluctant to put either Stacker or DoubleSpace on his desktop PC.

"Overall, I'm a believer in DoubleSpace and in data compression in general," says Polinsky. "But this is my mission-critical computer. It's my billing, my accounting, my research tool, my everything. I can't live without it."

However, most users are so hungry for hard disk space that they're willing to take a little misfortune in stride. Chris Cooper, a software engineer working in Pforzheim, Germany, was not deterred even after CHKDSK failed to fix an error on his DoubleSpace drive: "I backed up, reformatted, and reinstalled everything, and that certainly fixed the problem."

The irony is that data compression is a rock-solid technology. There's no magic or voodoo; it's as straightforward and reliable as 2 + 2 = 4. The comparison is apt because virtually all lossless data-compression products are derived from principles of information theory that were formulated in the 1940s and later refined in the 1970s and 1980s. At its roots, the basic technology of data compression may be the most level playing field in the computer industry. It is how various companies use compression that makes the difference.

Not Rocket Science

Typically, lossless data compression is based on some variation of the LZ (Lempel-Ziv) or LZW (Lempel-Ziv-Welch) methods, named after Abraham Lempel, Jacob Ziv, and Terry Welch. When adapted for real-time compression, LZ/LZW strikes a reasonable compromise between efficiency and speed. On average, it achieves a compression ratio of about 2 to 1. Compare that to lossy methods such as JPEG, which deliver compression ratios as high as 100 to 1, if you aren't too picky about quality.

Compression works better on some types of files than on others, and some files cannot be compressed at all. Compression algorithms depend on repeating patterns of data, so they don't work on files consisting of random data. Typical examples include encrypted files (the better the encryption, the more random the data) and files that have already been compressed (because randomization is a byproduct of compression; otherwise, you could repeatedly compress a file until it was squeezed down to a single byte).

Compression algorithms can be optimized for different data types. One method, RLE (run length encoding), works well on files with long strings of repeating bytes. For example, if a portion of a graphics file has 100 white pixels in a row, an RLE compressor might save 1 byte that indicates "white" and another byte that indicates "100." The decompressor knows that the first byte represents the color and that the second byte tells how many pixels of that color will follow. Even though RLE is a good choice for compressing a graphics file, it would be a poor choice for a text file.
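The technique is simple enough to sketch in a few lines of C. The fragment below is purely illustrative (it is not the encoding any shipping product uses); it writes every run as a (value, count) pair, so a file with no repeated bytes would actually grow.

#include <stdio.h>

/* A bare-bones run-length encoder: each run of identical bytes becomes a
   (value, count) pair, as in the "white, 100" example above. This is an
   illustration only, not any product's actual format. Runs longer than
   255 bytes are split, and a run of 1 still costs 2 bytes, so data with
   no repetition grows instead of shrinking. */
static size_t rle_encode(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < len) {
        unsigned char value = in[i];
        size_t run = 1;

        while (i + run < len && in[i + run] == value && run < 255)
            run++;
        out[o++] = value;                /* the byte that repeats      */
        out[o++] = (unsigned char)run;   /* how many times it repeats  */
        i += run;
    }
    return o;                            /* encoded size in bytes      */
}

int main(void)
{
    unsigned char pixels[100];           /* 100 "white" pixels in a row */
    unsigned char packed[200];
    size_t i, n;

    for (i = 0; i < 100; i++)
        pixels[i] = 0xFF;
    n = rle_encode(pixels, sizeof pixels, packed);
    printf("%u bytes in, %u bytes out\n", (unsigned)sizeof pixels, (unsigned)n);
    return 0;
}

Run on the 100 white pixels described above, the encoder turns 100 bytes into 2. Run on ordinary text, where long runs of identical bytes are rare, this naive scheme would roughly double the file, which is exactly why RLE is a poor match for text.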

For that reason, some compressors analyze the uncompressed data to choose the optimal compression method. However, none of the real-time disk compressors, such as DoubleSpace, does this. Analysis takes time, and the extra compression isn't worth the performance hit.

Instead, DoubleSpace and other real-time disk compressors use a sliding dictionary form of LZ compression, no matter what kind of data the file contains. To shrink a file, the compressor looks for repeating patterns. It then replaces each pattern with a pointer that refers back to an earlier occurrence of the same pattern, as well as a token that specifies the length of the pattern. Later, when the file is decompressed, the pointers and tokens are replaced with the original patterns.

Microsoft cites this example: "the rain in Spain falls mainly on the plain." Counting spaces and the period, this phrase normally requires 44 bytes. But it contains several repeating patterns, including ain and the. DoubleSpace would encode the phrase as follows:

the rain [3,3]Sp[9,4]falls m[11,3]ly on [34,4]pl[15,3].

Bracketed numbers represent pointers and tokens, so [9,4] tells DoubleSpace to replace the pointer (9) and token (4) with the four-character pattern that begins nine characters before the pointer.

The result: The compressed version requires 37 bytes instead of 44. That's not an enormous saving, but the method works much better on database files, whose fields are padded with lots of spaces, and on graphics files that have large areas of solid color. (The algorithm does not care whether the patterns of bytes represent ASCII characters or any other kind of data.)
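Decompression needs nothing more than the ability to copy bytes from earlier in the output, a point a short C sketch makes clear. The toy decoder below understands only the human-readable bracket notation used in this example; DoubleSpace and its rivals pack pointers and length tokens into bit fields, so treat this as a sketch of the principle rather than anyone's actual format.

#include <stdio.h>
#include <string.h>

/* Toy decoder for the bracketed notation used in the example above.
   "[d,l]" means: copy l bytes starting d bytes back in the output so far.
   Real compressors pack these pointers and lengths into bit fields; the
   readable brackets are only for illustration. */
static void lz_decode(const char *in, char *out)
{
    size_t o = 0;

    while (*in) {
        if (*in == '[') {
            int dist, len, i;

            sscanf(in, "[%d,%d]", &dist, &len);
            for (i = 0; i < len; i++) {        /* copy byte by byte so the  */
                out[o] = out[o - dist];        /* copied region may overlap */
                o++;
            }
            in = strchr(in, ']') + 1;          /* skip past the token */
        } else {
            out[o++] = *in++;                  /* literal byte */
        }
    }
    out[o] = '\0';
}

int main(void)
{
    char text[128];

    lz_decode("the rain [3,3]Sp[9,4]falls m[11,3]ly on [34,4]pl[15,3].", text);
    printf("%s\n", text);   /* the rain in Spain falls mainly on the plain. */
    return 0;
}

Fed the encoded phrase, the program prints the original 44-character sentence.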

This method is known as sliding dictionary because the compressed data itself contains the "dictionary" of patterns that's later used to reconstruct the file and because the compressor works its way through the file using a fixed-size sliding window. In other words, the compressor will not scan backward through the entire file to locate a matching pattern; it searches only a window of bytes that slides through the file during compression. The size of that window usually ranges from 2 to 8 KB. (DoubleSpace's sliding window is about 4 KB.)

These and other variables allow for some product differentiation, but in truth, no lossless disk compressor enjoys a knockout advantage in terms of compression efficiency or speed. Much more important is how the compression software interfaces with the operating system, how the compressed volumes are structured, and the quality of their diagnostic and repair utilities.

Implementing Compression

To keep data compression as transparent as possible to the user (and to applications), it's best implemented as a background process that hooks into the normal file system and automatically compresses and decompresses files as they're saved on disk. If the compression software installs itself as a virtual drive on the system (similar to a logical partition), it can recede even further into the background. On PCs, however, that requires finding room in memory for yet another device driver and then protecting it from other drivers, rogue programs, and territorial TSRs.

File-level compressors (e.g., PKZip) don't test the fragility of the DOS environment because they don't run in the background, and their compressed files don't appear any different to the system than do ordinary files. But real-time compressors must rely on a device driver to reroute all file I/O through their compression routines.

Before MS-DOS 6.0, most third-party compressors used the same method employed for years by RAM disks, which are also virtual drives: They loaded the device driver from the CONFIG.SYS file during bootup. However, this approach has a few disadvantages. To swap drive letters so that the compressed drive appears as drive C, both the virtual drive and the physical drive need duplicate copies of CONFIG.SYS, AUTOEXEC.BAT, and all files they reference. This, in turn, leads to synchronization problems when you make any changes to the files.

Another potential problem is the competition for memory. If too many device drivers and TSRs load into conventional memory (i.e., the first 640 KB of RAM), some MS-DOS programs — particularly games — will not have enough memory to run. If you modify the CONFIG.SYS file to load the compressed drive's device driver into upper memory (i.e., the area above 640 KB and below 1024 KB), it may conflict with other device drivers or TSRs competing for the same territory.

Preloaded Drivers

Digital Research's DR-DOS 6.0 offered a novel solution: A new system file called DCONFIG.SYS that booted before CONFIG.SYS. SuperStor's device driver could load from DCONFIG.SYS immediately after the memory manager, mount the compressed drive, and then chain to CONFIG.SYS on the compressed drive before any other TSRs or drivers tried to grab memory. In other words, the compression software got a head start.

With the release of MS-DOS 6.0, Microsoft achieved basically the same result with a somewhat different approach. Previous versions of MS-DOS couldn't load a device driver before CONFIG.SYS, but MS-DOS 6 has a modified IO.SYS boot file that automatically preloads a device driver called DBLSPACE.BIN before CONFIG.SYS executes.

DBLSPACE.BIN reads a new configuration file called DBLSPACE.INI, mounts any compressed drives it finds listed there, assigns the appropriate drive letters, and only then passes control to CONFIG.SYS. This happens before any other device drivers or TSRs get a chance to load from CONFIG.SYS or AUTOEXEC.BAT. The CONFIG.SYS file still needs to run a program called DBLSPACE.SYS that relocates DBLSPACE.BIN from conventional memory to upper memory. Even if DBLSPACE.SYS doesn't run, or if the entire CONFIG.SYS file is trashed, DBLSPACE.BIN still preloads and mounts the DoubleSpace drive.

Once the compressed drive is mounted, it appears to the system as a virtual drive. All file I/O happens normally, except that the device driver intercepts it to compress and decompress files as they're saved to and loaded from the new drive.

Besides DoubleSpace, Stacker 3.1 is the only other compressed-drive product for MS-DOS that preloads its driver before CONFIG.SYS. Earlier versions of Stacker used the older method of loading the device driver within CONFIG.SYS. SuperStor and Vertisoft Systems' DoubleDisk Gold still load their drivers from CONFIG.SYS when running under DOS.

However, SuperStor/DS — a new DoubleSpace-compatible version of SuperStor included with IBM's PC-DOS 6.1 (and soon to be the only version of SuperStor available) — preloads its driver before CONFIG.SYS, because PC-DOS 6.1 has the same preload capability as MS-DOS 6.0. And the latest version of DR-DOS — now called Novell DOS 7 after Novell's acquisition of Digital Research — comes with Stacker 3.1 instead of SuperStor and also adopts the preload technology.

IIT's XtraDrive adds still another twist. Although XtraDrive loads a device driver from CONFIG.SYS like SuperStor and DoubleDisk Gold, it handles file I/O a bit differently. When you install XtraDrive on a hard disk, it relocates the DOS boot files elsewhere on the drive and substitutes its own custom boot files in the boot sector. As a result, XtraDrive boots first when you switch on the machine, and DOS boots immediately afterward. That allows XtraDrive to intercept calls to BIOS INT 13 (disk I/O) and redirect the I/O to its own compression routines.

Because XtraDrive still relies on CONFIG.SYS to load its device driver, it is as vulnerable as SuperStor and DoubleDisk Gold to CONFIG.SYS problems. If the critical command in CONFIG.SYS is accidentally deleted, if the file itself is trashed, or if the device driver is somehow corrupted, the compressed drive won't mount. Users have not lost any data at that point, but they will probably be alarmed that their compressed drive seems to have vanished.

The compressed drive is still there, of course, but it's not recognized by DOS until the problem is corrected. In a worst-case scenario, an unsuspecting user might panic and do something that actually destroys the data (e.g., assume the data is already lost, reformat the hard disk, and reinstall the compression software). For these reasons, the ability to preload a device driver independently of CONFIG.SYS is considered an important safety feature of real-time disk compressors.

Data Integrity

Other safety factors come into play after the device driver loads into memory and DOS mounts the compressed drive. Compression products take significantly different approaches in the way they simulate a virtual drive and organize their internal structures.

For example, DoubleSpace, Stacker, SuperStor, and DoubleDisk Gold all simulate a virtual drive by creating a single, large file on the uncompressed host drive. (Microsoft calls it a compressed volume file, or CVF.) In other words, the hundreds of files stored on your compressed drive actually appear on the physical drive as a single file.

It is not just an enormous jumble of data, of course — the file mapping is handled internally by the compression software. (XtraDrive, again, is the exception; it stores compressed files in the normal fashion.)

Some people fear that storing everything in one massive file compromises data integrity. However, a number of safeguards and cross-checks are built into the compression architectures to prevent you from losing information even if the CVF is corrupted. What's most important is not whether compressed data is stored in a CVF or in discrete files, but rather the integrity of the compression architecture and how readily you can diagnose and repair common problems with disk utilities.

This is the main battleground on which compression vendors are waging war. It also accounts for much of the controversy over DoubleSpace.

Cluster Bombs

For instance, a significant difference among DoubleSpace, SuperStor, DoubleDisk Gold, and Stacker 3.1 is how they store compressed data. All handle data in 8-KB chunks that are compressed to fit variable-size clusters. A cluster may contain 1 to 16 sectors, each 512 bytes long. But only Stacker can subdivide a cluster and store the pieces in scattered locations on the disk. The others must store a cluster in sectors that are contiguous.

This can lead to problems if the compressed drive becomes badly fragmented. Fragmentation inevitably happens over time as you save, delete, and resave files on a disk. It happens faster under certain conditions, but gradually all disks become fragmented, especially if they're nearly full. Eventually, there's not enough contiguous free space to save an entire file, so DOS has to split up the file and store the clusters in various places around the disk.

Other than slowing down disk I/O, fragmentation isn't a serious problem on an uncompressed drive, because the clusters are always a fixed size. As long as DOS can find enough free clusters, no matter where they're located, it can save the file. If there aren't enough free clusters, DOS returns a disk-full error.

On a compressed drive, however, things are a little more complicated. (Well, a lot more complicated.) To begin with, the actual size of a cluster varies in direct proportion to the compression ratio. The goal is to more efficiently use the disk space that DOS often wastes.

Because uncompressed DOS disks have fixed-size clusters (usually 8 KB), a tiny five-line batch file would still occupy a whole cluster. On a compressed drive, that file could be stored in a one-sector cluster (512 bytes), thus saving 7.5 KB of disk space. If you save an 8-KB file on a compressed drive and if the compressor achieves a 2-to-1 compression ratio, the resulting file needs only 4 KB and occupies a cluster of eight sectors (8 times 512 bytes equals 4 KB). The best possible case is a 16-to-1 compression ratio (yielding a one-sector cluster, 512 bytes). The worst case is a 1-to-1 ratio — no compression (yielding a 16-sector cluster, 8 KB).
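Expressed as code, the sizing rule is nothing more than rounding up to the next whole sector. The sketch below assumes the 512-byte sectors and 16-sector ceiling described above; it's meant only to make the arithmetic concrete, not to mirror any product's internals.

#include <stdio.h>

#define SECTOR_SIZE   512
#define MAX_SECTORS   16    /* an uncompressible 8-KB chunk fills 16 sectors */

/* How many 512-byte sectors does one 8-KB chunk need after compression?
   Round up to a whole sector; never more than 16 (no compression at all). */
static int sectors_needed(int compressed_bytes)
{
    int sectors = (compressed_bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;

    if (sectors < 1)
        sectors = 1;
    if (sectors > MAX_SECTORS)
        sectors = MAX_SECTORS;
    return sectors;
}

int main(void)
{
    printf("%d\n", sectors_needed(512));    /* 16-to-1 ratio: 1 sector   */
    printf("%d\n", sectors_needed(4096));   /* 2-to-1 ratio:  8 sectors  */
    printf("%d\n", sectors_needed(8192));   /* 1-to-1 ratio:  16 sectors */
    return 0;
}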

OK, so far. But what if the compressed drive is badly fragmented and DOS can't find enough contiguous sectors to store a cluster? Stacker 3.1 will break up the cluster into smaller pieces (known as extents) and fill in the holes. DoubleSpace, SuperStor, and DoubleDisk Gold can't do this. Instead, they return a disk-full error — even if there's actually enough free space on disk to save the file.

The problem gets worse if you're trying to save data that can't be compressed. Perhaps the file is encrypted or has already been compressed with PKZip or is being downloaded as a GIF (CompuServe's compressed file format for graphics). The 8-KB cluster can't be compressed any further, so it needs 16 contiguous sectors (16 times 512 bytes equals 8 KB). Even if the compressed drive has megabytes to spare, DoubleSpace, SuperStor, and DoubleDisk Gold can't save the file if they can't find 8 KB of contiguous free sectors.
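A small sketch makes the difference in allocation policy concrete. The free-sector map below is hypothetical, as are both routines; the point is simply that a DoubleSpace-style allocator needs one unbroken run of 16 free sectors, while a Stacker-style allocator only needs 16 free sectors somewhere on the disk.

#include <stdio.h>

/* Hypothetical free-sector map: 1 = free, 0 = in use. There are 20 free
   sectors in total, but no run of 16 contiguous free sectors. */
static const int free_map[] = {
    1,1,1,1,1, 0, 1,1,1,1,1, 0, 1,1,1,1,1, 0, 1,1,1,1,1, 0
};
#define MAP_LEN (sizeof(free_map) / sizeof(free_map[0]))

/* DoubleSpace-style rule: the cluster fits only if one contiguous run
   of 'needed' free sectors exists. */
static int fits_contiguously(int needed)
{
    int run = 0;
    size_t i;

    for (i = 0; i < MAP_LEN; i++) {
        run = free_map[i] ? run + 1 : 0;
        if (run >= needed)
            return 1;
    }
    return 0;
}

/* Stacker-style rule: the cluster fits if enough free sectors exist
   anywhere, because it can be split into extents. */
static int fits_in_pieces(int needed)
{
    int total = 0;
    size_t i;

    for (i = 0; i < MAP_LEN; i++)
        total += free_map[i];
    return total >= needed;
}

int main(void)
{
    int needed = 16;   /* an uncompressible 8-KB chunk */

    printf("contiguous allocation: %s\n", fits_contiguously(needed) ? "fits" : "disk full");
    printf("split into extents:    %s\n", fits_in_pieces(needed) ? "fits" : "disk full");
    return 0;
}

With 20 free sectors scattered in runs of five, the first routine reports a full disk and the second happily finds room.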

In theory, the compressed drive could have hundreds of megabytes free and still return a disk-full error because of a single cluster that won't fit in the holes. In reality, could a drive actually become that severely fragmented? Not likely, in normal use. But Blossom Software (Cambridge, MA), which sells a diagnostic utility called DoubleCheck, gives away a small program called Bust that demonstrates the problem.

Bust deliberately fragments a DoubleSpace drive and then attempts to save a file that won't fit within the clusters. (Don't try this on a drive with important data.) According to Alan Feuer, director of software development at Blossom, DOS 6 sometimes won't return a disk-full error but, instead, reports that the file was successfully saved. Result: a trashed file. Feuer says Microsoft fixed the problem in DOS 6.2.

Microsoft denies such a bug exists, but those who are curious can find Bust in the IBM Forum on CompuServe. (Microsoft removed it from the MSDOS Forum.)

Of course, you can avoid all these problems by defragmenting the compressed drive (e.g., using either DOS's DEFRAG or a third-party utility) on a regular basis; however, some users are not very attentive to system maintenance. What they need is some kind of background defragging that functions as transparently as background compression. AddStor sells a product called DoubleTools for DoubleSpace that — among other things — provides this important function.

FAT Structures

Stacker drives get fragmented just as easily as other compressed drives, but since they can subdivide clusters, there's less chance you'll encounter a mysterious disk-full error. The PC version of Stacker (but not the Macintosh version) can do this because it has an additional mapping table that keeps track of the scattered extents. The extra table is an extension of the FAT (file allocation table), which DOS uses to allocate clusters.

Here's yet another area where compression products differ. They each take a slightly different approach to how they organize and verify the integrity of the FAT and related mapping structures. Naturally, each vendor claims its approach is the most reliable.

If a disk's FAT gets corrupted, DOS won't know which clusters of data belong to which files. If you can't repair the damage, the result could be lost data. For safety, therefore, DOS normally keeps two copies of the FAT on an uncompressed disk. Stacker and XtraDrive also keep two FATs on their compressed disks. DoubleSpace, SuperStor, and DoubleDisk Gold keep only one FAT.

The argument for keeping two FATs is redundancy: If one FAT gets trashed, a repair utility can try to restore it with information from the second FAT. The argument for keeping one FAT is simplicity: If two FATs somehow get out of synchronization, which one is correct?

This could happen if your computer crashes or the power fails while a file is being saved on a compressed drive. The disk I/O might be interrupted after DOS has updated only one copy of the FAT. It's even more likely if you're using writeback disk caching, because the FAT update could be delayed a few seconds. (One of the changes between MS-DOS 6.0 and 6.2 is that SmartDrive's writeback caching is now turned off by default.)

Microsoft contends not only that two FATs are unnecessary but also that the extra mapping table Stacker uses to subdivide clusters adds yet another layer of complexity to an already complex scheme. In fact, compressed drives from all vendors have internal mapping structures that are much more complex than those of ordinary drives, because they have to keep track of such things as variable-size clusters and compression ratios. DoubleSpace, for example, supplements the normal FAT with a BitFAT and an MDFAT. Stacker's mapping table for extents adds a third level of indirection beyond the FAT and the variable-size cluster mapping.
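To get a feel for the extra bookkeeping, consider what a compressed drive has to remember about every cluster that an ordinary FAT drive does not. The structure below is strictly hypothetical; it is not the layout of DoubleSpace's MDFAT or Stacker's tables, only an indication of the kind of information every such scheme must track in some form.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one cluster on a compressed drive. A plain
   FAT entry only chains fixed-size clusters together; a compressed drive
   must also record where the cluster's sectors begin, how many sectors it
   occupies, how much original data it holds, and whether that data is
   stored compressed at all. */
struct cluster_entry {
    uint32_t first_sector;        /* where the stored data begins        */
    uint8_t  sectors_used;        /* 1 to 16 sectors of 512 bytes each   */
    uint16_t original_bytes;      /* up to 8 KB of uncompressed data     */
    uint8_t  stored_compressed;   /* 0 if the data wouldn't compress     */
};

int main(void)
{
    /* An 8-KB chunk that compressed 2 to 1: 4 KB stored in 8 sectors. */
    struct cluster_entry e = { 2048, 8, 8192, 1 };

    printf("cluster at sector %lu: %u bytes stored in %u sectors\n",
           (unsigned long)e.first_sector,
           (unsigned)(e.sectors_used * 512u), (unsigned)e.sectors_used);
    return 0;
}

Every additional table of this kind is one more structure that has to be kept consistent with all the others.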

That's too complicated, says Benjamin W. Slivka, development leader for MS-DOS. Slivka says third-party tool vendors complain that Stacker's architecture is more difficult to support. While true, that hasn't stopped the tool vendors. Most of the major diagnostic and repair utilities support both DoubleSpace and Stacker, although there's less support for SuperStor, DoubleDisk Gold, and XtraDrive, which don't command as much market share.

All disk compressors also come with their own utilities, and these tools are tailored for their unique compression architectures. Often, they try to turn complexity into an advantage by performing extensive cross-checks between the various mapping structures. XtraDrive, for example, compares both copies of the compressed drive's FAT during bootup. If they don't match, the user is advised to run a program called VMU (Volume Maintenance Utility). VMU tries to figure out which FAT is correct by checking file links, mapping tables, and free clusters.
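Comparing the two copies is the easy half of that job, easy enough to sketch in C. The fragment below is a hypothetical illustration, not code from VMU or anyone else's utility; spotting a mismatch is trivial, while deciding which copy to trust is where a repair tool earns its keep.

#include <stdio.h>
#include <stdint.h>

/* Compare two copies of a file allocation table entry by entry and
   report how many entries disagree. Detecting a mismatch is easy;
   deciding which copy is correct requires walking file chains and
   mapping tables, which is where repair utilities do their real work. */
static unsigned fat_mismatches(const uint16_t *fat1, const uint16_t *fat2,
                               unsigned entries)
{
    unsigned i, bad = 0;

    for (i = 0; i < entries; i++) {
        if (fat1[i] != fat2[i]) {
            printf("cluster %u: copy 1 says %u, copy 2 says %u\n",
                   i, (unsigned)fat1[i], (unsigned)fat2[i]);
            bad++;
        }
    }
    return bad;
}

int main(void)
{
    /* Two tiny, hypothetical FAT copies that fell out of sync. */
    uint16_t fat1[] = { 3, 4, 0, 5, 0xFFFF, 0xFFFF };
    uint16_t fat2[] = { 3, 4, 0, 5, 6,      0xFFFF };

    unsigned bad = fat_mismatches(fat1, fat2, 6);
    printf("%u mismatched entr%s\n", bad, bad == 1 ? "y" : "ies");
    return 0;
}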

Strange Interactions

Anytime something as complex as real-time disk compression is introduced into an environment as unruly as DOS, there are bound to be unforeseen consequences. When a mysterious problem can be traced at all, often it's not directly caused by the compression software itself, but rather by interactions among various elements of the system (see the text box "Data Loss: A Cautionary Tale" on page 64).

Microsoft has compiled a list of software that may not work on a DoubleSpace drive, including protected copies of Lotus 1-2-3 release 2.01, Informix relational database, MultiMate 3.3/4.0, the DOS version of Quicken, Movie Master 4.0, Tony La Russa Baseball II, Empire Deluxe, Links, Ultima, and others. Some of these programs won't run on any compressed drive, and the reasons vary widely, ranging from tricky copy-protection schemes to their handling of temporary files.

Different versions of the ROM BIOS are known to cause problems, too. Some BIOS chip sets don't properly handle an interrupt call made by DBLSPACE.BIN during bootup, resulting in stack corruption. The DOS 6.2 version of DBLSPACE.BIN doesn't call this interrupt.

Writeback disk caching has also been singled out for blame. Some users are in the habit of switching off their computers immediately after quitting an application or even without quitting. If the disk cache isn't flushed before the power goes down, open files may not be closed properly, and the FAT may not be updated. It's a small problem that can snowball, eventually corrupting multiple files. DOS 6.2 now makes sure the cache is flushed before redisplaying the DOS prompt on the screen, but what DOS really needs is a controlled shutdown procedure like that of Windows NT, Unix, and the Mac.

Another interaction is possible with the MS-DOS FORMAT command. Many users scan their hard disks with utilities that check the media for surface errors and then mark those bad sectors so that they'll never be allocated to files. The bad sectors are deallocated in the FAT. But what most people don't know is that FORMAT rewrites the FAT and may reset the bad-sector flags, thus freeing those sectors for allocation to files.

This little detail stumped some users who backed up and reformatted their hard drives before installing MS-DOS 6 and DoubleSpace. Their idea was to clean off the disk and reduce the considerable amount of time it can take to compress a crowded drive. Ironically, it's the kind of thing only a power user would think of; it probably wouldn't occur to a casual user.

But if they didn't immediately follow the reformat with another disk scan, there's a chance at least one file would end up in a sector previously marked as bad. Sometimes that file happened to be the DoubleSpace CVF. And what happened next depended on the kind of data stored in that sector. If it was part of an executable file, the program would probably crash. If it was part of a data file, information could be lost. Either way, the hapless user was in the dark.

Why wasn't this problem discovered before MS-DOS 6 and DoubleSpace? After all, the FORMAT command has worked the same for years. But apparently, it wasn't noticed or considered important until data compression made the environment more precarious.

Fortunately, Microsoft says it has changed the FORMAT command in DOS 6.2 so it doesn't reallocate bad sectors marked in the FAT. Also, a new utility in DOS 6.2 (ScanDisk) automatically checks the drive for bad sectors before installing DoubleSpace.

Only a tiny minority of users would be affected by something like this, but that's potentially a lot of people when multiplied across the huge installed base of MS-DOS. In fact, MS-DOS 6.0 and DoubleSpace have inspired a whole cottage industry of diagnostic programs, fix-it tools, and free advice on public networks.

Touchstone Software (Huntington Beach, CA), which sells a disk utility called CheckIt Pro, inadvertently upset Microsoft by posting a free program on CompuServe last summer that scans a hard disk for bad sectors. Touchstone was among the first to identify the FORMAT problem. Company president Shannon Jenkins says her small company got in hot water with Microsoft, but she added, "I think the release of DOS 6.2 has borne us out...the things we talked about back in June have now shown up in DOS 6.2."

Microsoft was the first company to encourage users to install data compression without taking precautions. "Just press Return, and you'll get data compression and writeback caching and lots of other stuff," says Jenkins. "Microsoft should have encouraged users to take a more cautious approach."

Hardware Compression: Full Circle?

Microsoft has a golden opportunity to clean house with the upcoming release of Windows 4.0 (code-named Chicago). As a major revamping of the PC environment, Windows 4.0 could sweep away years of old code and build a new foundation that's designed from the ground up to accommodate such features as data compression.

However, there's another possibility: By submerging compression even deeper than the operating system, it could be made even more transparent and foolproof. What's deeper than the operating system? The hardware.

Once again, hardware-based data compression is an old idea. Back in the days when 8086- and 286-based PCs were the norm, there was a market for plug-in ISA boards that sat on the I/O bus, compressing and decompressing data on its way to and from the hard drive. The compression algorithms were hard-wired into high-speed chips. Real-time compression couldn't be done in software back then because CPUs weren't fast enough. Not until speedy 386 microprocessors became available could software-based compressors work in real time without noticeably affecting system performance. And the plug-in boards became obsolete because they were constrained by the slow speed of the ISA bus.

Hardware-based compression still survives in tape backup units, where it's so reliable and transparent that users scarcely know it's there. In fact, the widespread QIC (quarter-inch cartridge) compression standard for tape backup originated at Stac in the mid-1980s.

Now Stac and other companies are taking another look at hardware compression. The potential advantages are many: Better integration with the system; more transparency to users; greater compression ratios; improved system performance; and, perhaps, faster networking.

Speedy local buses such as VL-Bus and PCI (Peripheral Component Interconnect) are appearing in more new PCs, thus solving the ISA constraint. Hardware compression would require little or no installation or intervention by the user. Greater compression ratios are possible because high-speed compression chips can use more complex algorithms. They also free the CPU for other tasks and don't occupy memory, as software-based compressors do. Finally, by keeping the data compressed as it moves through the computer and over networks, hardware-based solutions can dramatically improve overall system performance.

The Future of Compression

What kinds of performance gains are possible? Stac says it already has a prototype VL-Bus card that compresses data 20 percent to 50 percent faster than software-based compressors and uncompresses 10 percent to 30 percent faster. But that's just a start.

Software compressors currently work at about 1 MBps on a 66-MHz 486, and about 2 MBps on a Pentium. By next year, Stac says it will have compression chips capable of 10 to 20 MBps; in two or three years, 50 to 60 MBps. Of course, CPUs will get faster, too, but not anywhere near that pace.

The compression chips could be built right onto the motherboard and would add about $100 to the street price of a computer, according to Stac. They'll probably show up first in high-end systems.

Last year, Stac made a deal with Novell to license the Stacker compression technology for use in Novell DOS 7 and in its networking software. "Stac's vision is that data should need to be compressed only once," says John Bromhead, Stac's vice president for marketing. "After it's compressed, it should stay compressed whether it's transferred to disk or tape or across a network or through the system or whatever."

That's also IBM's vision. IBM Microelectronics is introducing a series of compression chips that couple directly to the CPU. According to Ted Lattrell, development manager, the first chip already compresses data at 40 MBps, and future versions could hit 100 MBps. Lattrell says the chips have already attracted interest from PC and workstation vendors, who are planning to introduce systems later this year or in early 1995.

"Once it happens — once compression is hard-wired into the first system — there's no going back," says Lattrell. "A system without built-in compression would be at a disadvantage in the marketplace. I think hardware compression will alter the way data is represented inside computers for years to come."

Indeed, IBM is researching the possibility of putting ultraspeed compression chips on the CPU's memory bus. That would do for RAM what disk compressors do for hard drives — effectively double the computer's main memory.

Beyond that, it's possible that compression chips will eventually be integrated within the CPU itself, just as math coprocessors migrated to CPUs on the 486DX and 68040. Today's 0.8-micron process technology makes about 625,000 gates available on a chip, and IBM's compression engine requires only 75,000 gates. Soon, 0.5-micron processes will enable about 1.3 million gates, and IBM hopes to shrink the compression engine down to only 40,000 gates. That makes built-in compression a real possibility for high-end microprocessors, such as the PowerPC 620 under development by IBM, Motorola, and Apple.

One drawback to hardware compression is that it creates a problem when you transfer files over a network or on removable disks to systems that don't have the same compression hardware. A similar problem exists with software compression, and it's usually solved by including a decompression driver with the file or the media.

A more important consideration is that hardware compression, although an old idea, will nevertheless be as new to most of today's users as software compression was before DoubleSpace.

"When you stick this hardware compression into the system, people are going to wonder how it affects their software-based compression," says Phil Devin, vice president of storage technologies for Dataquest (San Jose, CA). "Do their compressed files get compressed again? Is there a conflict? Here comes one more level of uncertainty that's going to be an inhibiting factor at first."

Devin recently wrote a newsletter debunking the idea that data compression of any kind is a free lunch — an idea that particularly caught hold when Microsoft started bundling DoubleSpace with DOS 6. "It's not something for nothing," he says. "It's something that takes work."

PC Disk-Compression Software Comparison

Product                     Preloaded    Compressed        Divisible      Compressed
                            driver [1]   volume file [2]   clusters [3]   FATs [4]
Vertisoft DoubleDisk Gold   No           Yes               No             One
Microsoft DoubleSpace       Yes          Yes               No             One
Stac Electronics Stacker    Yes          Yes               Yes            Two
AddStor SuperStor/DS        Yes [5]      Yes               No             One
AddStor SuperStor Pro       No           Yes               No             One
IIT XtraDrive               No           No                No             Two
[1] The device driver for the compressed drive loads during bootup before CONFIG.SYS

[2] The compressed virtual drive appears on uncompressed host drive as a large, single file

[3] Compression software can subdivide a cluster to store data in noncontiguous sectors

[4] Number of file allocation tables on the compressed drive

[5] PC-DOS 6.1 only



Safety Tips for Disk Compression

— If you're using MS-DOS 6.0, you should upgrade to 6.2.

— Before installing compression, prepare your hard disk by cleaning off all unwanted files and then running a defragmenter and a thorough surface test. (This may take a few hours.)

— Back up the hard disk, just in case.

— Read the installation instructions carefully, especially any warnings about disabling incompatible TSRs or other programs.

— After installing compression, be conscientious about system maintenance. Regularly run a defragger and any diagnostic utilities that came with the product.

— Consider buying a good third-party diagnostic/repair utility to supplement the standard utilities.

— Try to avoid completely filling the compressed drive.

— Avoid saving encrypted files or files that are already compressed (e.g., ZIP, ARC, and GIF) on the compressed drive. They can't be compressed any further, and they'll load faster from an uncompressed drive.

— Hard disk space currently costs about a dollar per megabyte; if you can afford to upgrade and your system has room to expand, it's still the best solution.


Photograph: With lossy compression techniques such as JPEG, image quality declines as the compression ratio goes up. This is because files are compressed by throwing out data. The first image is uncompressed. Maximum compression was used for the last image. The remaining image represents the middle ground between the two.

Illustration: How MS-DOS Preloads DoubleSpace. Before the release of version 6.0, MS-DOS loaded all device drivers from CONFIG.SYS. But DOS 6.0 has a modified IO.SYS file that automatically loads a device driver called DBLSPACE.BIN, which, in turn, calls a new configuration file named DBLSPACE.INI. Only then does CONFIG.SYS execute, loading any other device drivers, as well as a short program called DBLSPACE.SYS that relocates the DoubleSpace driver into upper memory. Stacker 3.1 is the only other real-time disk compressor that preloads in this fashion under MS-DOS.

Illustration: Disk Fragmentation. When a compressed drive becomes badly fragmented, it may cause puzzling disk-full errors, even when there is enough free space to store the file. This happens when a piece of data won't fit into any of the free but noncontiguous variable-size clusters. In this case, an 8-KB chunk of uncompressible data can't be saved on the drive, even though there's 16 KB of space available. The PC version of Stacker can avoid this dilemma by subdividing the cluster and storing the pieces in noncontiguous locations.

Illustration: Compressed Clusters. Uncompressed DOS drives normally store data in fixed-size clusters of 8 KB. (Macintosh clusters are also fixed size, but the size varies depending on the drive capacity.) Compressed drives store data in clusters that vary in size according to the compression ratio. A cluster may be as small as 512 bytes (assuming a 16-to-1 compression ratio) or as large as 8 KB (if the data is uncompressed). The average is about 4 KB (2-to-1 compression).



Tom R. Halfhill is a BYTE senior news editor based in San Mateo, California.
You can reach him on the Internet or BIX at thalfhill@bix.com.


Letters / April 1994

Compression Woes

Having just read "How Safe Is Data Compression?" (February), I feel somewhat validated in my concerns. I have used all but the latest releases of Stacker and both DOS releases of DoubleSpace, and I have suffered catastrophic data loss with all of them — on five different computer systems.

I, too, had long assumed, as author Tom R. Halfhill claims, that "the real problem isn't data compression, though; it's how well the technology is implemented in the operating environment." Accordingly, I have always striven to limit the number of software vendors I've bought products from and always paid the premium for brand-name, U.S.-made hardware. Yet I continue to suffer catastrophic data loss. The common factors among all my difficulties have been the use of compression and the MS-DOS operating system.

My experiences with compression are wide and varied. Yet I am far from being an expert in the field. How could I hope to become one when even the technical-support staffs and engineering personnel are stumped?

Gregory D. Miller
Stanford, CA

Copyright 1994-1997 BYTE
