Core Technologies / October 1997

Keeping It Simple

A Pentium-class processor rebels against current design trends
with a vastly simplified microarchitecture.

Tom R. Halfhill

Can simplicity and elegance surpass complexity at the processor level? That's what Centaur Technology is betting as it prepares to ship a new Pentium-class microprocessor, the IDT-C6. It's a stripped-down CPU that radically departs from modern trends in CISC and RISC design.

At first glance, the IDT-C6 is a simple design; one might almost say old-fashioned. It flunks almost every buzzword benchmark: no superscalar pipelines, no superpipelining, no out-of-order execution, no speculative execution, no rename registers, no reorder buffers. It doesn't even do branch prediction, which makes it the first x86 chip without that feature since 1993. In outline, it resembles a 1980s-vintage 486.

Stranger still, the IDT-C6 is the debut product from an unknown start-up company. Centaur is a new subsidiary of Integrated Device Technology (IDT), which is a well-known manufacturer of static RAM (SRAM) chips and Rx000-series RISC processors under license from Silicon Graphics/Mips. However, IDT has not had any previous experience with the x86 architecture.

Internally, the IDT-C6 has little in common with other fifth- and sixth-generation x86 processors. Yet according to Centaur, it closely matches the performance of a Pentium with multimedia extensions (MMX) when running the Winstone 97 business benchmark (37.7 versus 37.5 Winstones at 200 MHz). And as the table "Processors Compared" indicates, it has a much smaller die size than a Pentium, which means it should cost significantly less.

However, at this writing, Centaur had not yet announced prices, and BYTE was unable to verify the performance claims by running the BYTEmark suite or Bapco's Sysmarks. Although Centaur was showing samples of the IDT-C6 in May and June, final-production silicon wasn't expected until mid-August — too late to benchmark for this issue.

When BYTE does test a production chip, the IDT-C6 will likely finish behind an identically clocked Pentium on the BYTEmarks. Although BYTEmark programs use real-world algorithms, they are still CPU-intensive synthetic benchmarks. Centaur agrees that its chip will do better with application-level benchmarks, such as the Winstone or Sysmark suites.

The reason for this is the processor's ascetic design. The IDT-C6 sacrifices raw core throughput to gain other advantages: large internal caches (32 KB each for instructions and data), high clock speeds (150, 180, and 200 MHz to start, with 225 and 240 MHz likely this fall), low power consumption (14 W maximum at 200 MHz for the desktop chip, and 7.1 to 10.6 W for the mobile chips), a tiny die size (88 square millimeters), and rapid upgrades (Centaur hopes to deliver improved versions every six to 12 months).

One at a Time

The idea of a streamlined x86 processor has been cooking for years in the mind of Glenn Henry, Centaur's president. He is a former IBM Fellow and RISC pioneer who came to IDT by way of Dell and Mips. At his last job, Henry worked on a hybrid RISC/CISC processor that could execute both the Rx000 and x86 instruction sets.

That project fizzled, but Henry took his ideas to IDT. In April 1995, Henry and his first three engineers sat down at his kitchen table in Austin, Texas, to sketch out the IDT-C6. They conceived a chip that had a single six-stage instruction pipeline. That alone was heresy. Virtually all of today's processors — both CISC and RISC — are superscalar devices. This means they have multiple pipelines that execute two or more instructions at once. The exceptions are low-cost embedded processors.

The decision to have only a single pipeline immediately saved millions of transistors (and the associated complexity). Superscalar processors need complex logic to control the flow of instructions through their parallel pipes. The latest CPUs — such as Intel's Pentium II and Pentium Pro, AMD's K6, and Cyrix's 6x86MX — can also execute multiple instructions out of order before retiring the results in original program order.

Centaur's chip is obviously a strict in-order machine, because it executes only one instruction at a time. That saves even more transistors, because it doesn't need a reorder buffer, rename registers, or the extra control logic to manage all that instruction shuffling.

Because of these design decisions, the IDT-C6 requires significantly less testing than a more complex CPU. "Trying to design and verify an out-of-order superscalar processor is a real problem for everybody, especially for an x86," notes Henry. "Only two years later, we're sampling our Pentium-class processor."

That's about half the time it takes to design and verify most other CPUs. NexGen labored for eight years on its first x86 chip. Intel is spending about five years on Merced.

The Branch Not Taken

Raising even more eyebrows among the digerati, Henry decided to omit branch prediction, too. Although this decision eliminates a branch target buffer and other related circuitry, it appears to be an odd trade-off. Branches are so common in modern code (about one for every five instructions) that it seems as if a little extra complexity could significantly boost throughput.

To understand why the company made this decision, take a closer look at the chip's pipeline, as shown in the figure "A Straightforward Pipeline". It's similar to a 486 pipeline (fetch, decode, address calculation, execute, writeback) except for an additional translate stage (stage 2). During that stage, the IDT-C6 translates x86 instructions into simpler, 33-bit-long microinstructions or retrieves microcode from its internal ROM, much as other x86 chips do. In stage 3, the chip fully decodes the instruction and accesses the registers. In stage 4, it evaluates branches.
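
The six stages described above can be laid out as a simple timing chart. The following Python sketch is illustrative only: it assumes an ideal, stall-free flow, and the stage labels (in particular, placing address calculation in stage 4) are inferred from the 486 comparison rather than taken from Centaur documentation. The names `STAGES`, `pipeline_schedule`, and `total_cycles` are hypothetical.

```python
# Stage names follow the description in the text; mapping "address" to
# stage 4 is an assumption based on the comparison with the 486 pipeline.
STAGES = ["fetch", "translate", "decode", "address", "execute", "writeback"]


def pipeline_schedule(num_instructions):
    """Map each clock cycle to the (instruction, stage) pairs active in it.

    Models an ideal single-issue, in-order pipeline with no stalls:
    instruction i enters stage s on cycle i + s + 1 (cycles are 1-based).
    """
    schedule = {}
    for instr in range(num_instructions):
        for stage_index, stage in enumerate(STAGES):
            cycle = instr + stage_index + 1
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule


def total_cycles(num_instructions):
    """Cycles needed to retire all instructions in the ideal case."""
    if num_instructions == 0:
        return 0
    return len(STAGES) + num_instructions - 1


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(3).items()):
        print(cycle, work)
```

With a single pipeline, N instructions finish in N + 5 clock cycles at best, so the peak rate is one instruction per clock.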

If the program doesn't branch at this point, stage 4 takes only 1 clock cycle, so instructions keep flowing and life is beautiful. However, if the program does branch, the CPU must fetch the target instruction from the cache and herd it through the pipeline, which consumes 4 clock cycles. Most branches aren't taken, so the IDT-C6 averages about 2.5 clock cycles per branch.
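
The average above is a weighted sum of the two branch costs. As a rough sketch (the function names are hypothetical, and the two-outcome model ignores cache misses and other stalls), the arithmetic looks like this:

```python
def average_branch_cycles(taken_fraction, taken_cost=4, not_taken_cost=1):
    """Weighted average cost of a branch, in clock cycles."""
    return taken_fraction * taken_cost + (1 - taken_fraction) * not_taken_cost


def implied_taken_fraction(average, taken_cost=4, not_taken_cost=1):
    """Work backward from an average cost to the taken rate it implies."""
    return (average - not_taken_cost) / (taken_cost - not_taken_cost)


if __name__ == "__main__":
    print(average_branch_cycles(0.5))
    print(implied_taken_fraction(2.5))
```

Under this simplified model, a 2.5-cycle average corresponds to a taken rate of about one-half; the exact rate behind Centaur's figure isn't given here.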

By comparison, a Pentium needs only 1 clock cycle per branch if it correctly predicts the outcome. However, if a Pentium guesses wrong, it needs 4 or 5 clock cycles to recover. Henry calculates that a Pentium averages about 1.8 clock cycles per branch. In his judgment, the Pentium's extra complexity buys only a little more efficiency.
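
Henry's comparison can be checked with back-of-the-envelope arithmetic. The sketch below (hypothetical function names; it assumes a branch is either predicted correctly at 1 cycle or mispredicted at a fixed penalty) works backward from the 1.8-cycle average to the misprediction rate it implies, then spreads the difference between the two chips over the roughly one-in-five instructions that are branches.

```python
def implied_mispredict_rate(average, hit_cost=1, miss_cost=4):
    """Misprediction rate implied by an average branch cost in cycles."""
    return (average - hit_cost) / (miss_cost - hit_cost)


def extra_cycles_per_instruction(avg_a, avg_b, instructions_per_branch=5):
    """Per-instruction cost of the gap between two average branch costs."""
    return (avg_a - avg_b) / instructions_per_branch


if __name__ == "__main__":
    # The text quotes a recovery cost of 4 or 5 cycles, so try both.
    for penalty in (4, 5):
        rate = implied_mispredict_rate(1.8, miss_cost=penalty)
        print(f"penalty {penalty}: mispredict rate {rate:.2f}")
    gap = extra_cycles_per_instruction(2.5, 1.8)
    print(f"IDT-C6 gives up about {gap:.2f} cycles per instruction")
```

By this estimate, the gap between 2.5 and 1.8 cycles per branch amounts to about 0.14 clock cycles per instruction.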

For all its simplicity, the IDT-C6 still has a few tricks to speed execution. One is an eight-entry call-return stack. When a program calls a subroutine, the CPU pushes the return address onto this internal stack. Most other CPUs would store and retrieve the address from memory. Centaur predicts that the IDT-C6 will save a slow memory access by pulling the address off the return stack about 90 percent of the time.
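
To see why a small stack can catch most returns, consider this minimal Python model of an eight-entry return stack. It is a sketch, not a description of Centaur's hardware: the class name `ReturnStack` is hypothetical, and the overflow policy (discarding the oldest entry) is an assumption.

```python
class ReturnStack:
    """Toy model of a fixed-depth on-chip call-return stack."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def push(self, return_address):
        """Record a return address when a call executes."""
        if len(self.entries) == self.depth:
            # Assumed policy: when full, the oldest entry is lost.
            self.entries.pop(0)
        self.entries.append(return_address)

    def pop(self):
        """Return the most recent address, or None if it must come from memory."""
        return self.entries.pop() if self.entries else None


if __name__ == "__main__":
    stack = ReturnStack()
    for address in range(10):      # ten nested calls
        stack.push(address)
    returns = [stack.pop() for _ in range(10)]
    on_chip = sum(r is not None for r in returns)
    print(f"{on_chip} of 10 returns served from the on-chip stack")
```

In this model, only call chains nested more than eight deep fall back to memory, which is consistent with a high hit rate for typical code.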

Another special feature is a cache that holds eight entries from the page-directory table, a lookup table that x86 processors use to access memory. About 90 percent of the time, the IDT-C6 finds the pointer it needs in the cache instead of looking in the table, which saves yet another memory access. And to keep complex instructions from paralyzing the chip's lone pipeline, the IDT-C6 also has a special queue incorporated into stage 2 that lets it fetch and translate up to three instructions while executing another instruction.
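
The page-directory cache can be pictured as a tiny lookup table sitting in front of memory. The Python sketch below is an assumption-laden illustration: the class and function names are hypothetical, the least-recently-used replacement policy is a guess (the policy isn't described here), and the index calculation assumes classic 32-bit x86 paging with 4-KB pages, where the top 10 bits of a linear address select the page-directory entry.

```python
from collections import OrderedDict


def directory_index(linear_address):
    """Top 10 bits of a 32-bit linear address (classic 4-KB paging)."""
    return (linear_address >> 22) & 0x3FF


class PageDirectoryCache:
    """Toy eight-entry cache of page-directory entries (assumed LRU)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, index, read_from_memory):
        """Return the cached entry, reading memory only on a miss."""
        if index in self.entries:
            self.entries.move_to_end(index)   # mark as most recently used
            self.hits += 1
            return self.entries[index]
        self.misses += 1
        entry = read_from_memory(index)
        self.entries[index] = entry
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return entry


if __name__ == "__main__":
    cache = PageDirectoryCache()
    # Hypothetical access pattern touching three 4-MB regions repeatedly.
    addresses = [0x00001000, 0x00402000, 0x00003000, 0x00804000, 0x00405000]
    for address in addresses * 4:
        cache.lookup(directory_index(address), lambda i: i * 0x1000)
    print(cache.hits, cache.misses)
```

Because each page-directory entry covers 4 MB of address space in this scheme, a program with good locality touches only a few entries, so even eight slots can satisfy most lookups.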

In other words, the IDT-C6 isn't as primitive as it first appears. It's not just a recycled 486 chip with MMX tacked on. Rather, it's a bold attempt to quickly produce an x86 processor that offers competitive performance at an affordable price.

"We're going to get hit by all the technical journals because we don't have superscalar pipelines and out-of-order execution and all that other stuff," says Henry. "But microprocessors ought to be commodities. Our theme was to develop a chip for the common masses. This project was my labor of love."

Processors Compared

                                Centaur IDT-C6    Intel Pentium (P55C)    Intel 486DX4*
Top clock speed                 200 MHz**         233 MHz                 100 MHz
MMX instruction set             Yes               Yes                     No
MMX instruction issue           One per cycle     Two per cycle           Unknown
Number of integer pipelines     One               Two                     One
L1 cache (instruction + data)   32 KB + 32 KB     16 KB + 16 KB           16 KB unified
Number of transistors           5.4 million       4.5 million             1.6 million
Fabrication process             0.35-micron CMOS  0.35-micron CMOS        0.6-micron CMOS
Die size                        88 sq. mm.        140 sq. mm.             345 sq. mm.
Pin-out                         Socket 7          Socket 7                486 socket
Introduction date               September 1997    June 1997               March 1994
*The 486DX4 was Intel's most powerful 486. Earlier 486 chips (first introduced in 1989) ran at 66 MHz or slower, had an 8-KB unified L1 cache, and included only 1.2 million transistors.
**The 225- and 240-MHz versions are likely this fall.



A Straightforward Pipeline

[Figure: IDT-C6 pipeline diagram]
The IDT-C6's pipeline resembles a 486 pipeline.

Tom R. Halfhill is a BYTE senior editor who is based in San Mateo, California. You can contact him at thalfhill@bix.com. Additional information about the Centaur Technology IDT-C6 can be found on its Web site at http://www.centtech.com.


Inbox / February 1998

The Chip That Ate My Batteries

No sooner do battery developers give us a better power-to-weight ratio than CPU makers chew up the extra power to give us a constant 1-hour notebook-battery life. Why not give us a simpler, less powerful processor that gobbles less power?

In "Keeping It Simple" (October 1997 Core), Tom R. Halfhill says the C6 is simpler. It is neither superscalar nor superpipelined, which means fewer transistors. No branch prediction means even fewer transistors.

But for all that, the C6 has more transistors than the Pentium P55C. And, although it has more transistors, its die size is 37 percent smaller. One would think this is because caches are more regular and more dense. But 32KB of cache accounts for maybe 800,000 transistors — not enough to account for the greater number of transistors, nor the reduction in chip size.

This article makes it look simple, but it begs many questions. And, by the time I got to the third column, it was already clear that I still do not have my wish for a suitable notebook chip.

K. C. Toh
Petaling Jaya, Malaysia

The reason why the C6 has more transistors than a P55C-series Pentium (5.4 million versus 4.5 million) is indeed the larger caches. The C6 has 32KB each of instruction cache and data cache. It's possible to estimate how many transistors that 64KB accounts for: 64KB = 524,288 bits. Static RAM (SRAM) cells typically have six transistors per bit when used this way. That totals 3.1 million transistors, without considering the control logic associated with the caches.
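
The estimate is easy to reproduce. This short Python sketch assumes six transistors per SRAM bit, as above, and ignores tags and control logic; the function name is hypothetical.

```python
def sram_transistors(kilobytes, transistors_per_bit=6):
    """Estimate the transistor count of an SRAM array of the given size."""
    bits = kilobytes * 1024 * 8
    return bits * transistors_per_bit


if __name__ == "__main__":
    cache = sram_transistors(64)      # 32 KB instruction + 32 KB data
    share = cache / 5_400_000         # C6 total from the comparison table
    print(f"{cache:,} transistors, about {share:.0%} of the chip")
```

By the same rule of thumb, 32KB alone comes to roughly 1.6 million transistors, about twice the 800,000 suggested in the letter.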

The large proportion of transistors in the caches (more than 50 percent of the chip's total) indeed does account for the C6's tiny die size. As you correctly point out, memory occupies less space than logic because it's more dense, but you've underestimated the amount of memory.

Despite the C6's simpler design, the new Tillamook-class Pentiums still consume less power at the same clock frequency. That's because the Tillamooks are manufactured on a 0.25-micron process, while the C6 is still at 0.35 micron. The Tillamook can run at a lower voltage.

You make a good point about simple designs rarely remaining simple. The C6, for instance, will add branch prediction, a better FPU, and better MMX capability this year, as well as an integrated L2 cache. However, its die size and price will remain small, because Centaur is aiming the C6 at the low-end PC market.

Note the new address for Centaur's Web site: http://www.winchip.com/. — Tom R. Halfhill, senior editor

Copyright 1994-1998 BYTE
