Cover Story / November 1994

AMD vs. Superman

The quad-issue K5 series is AMD's long-awaited answer
to Intel's Pentiums. Its RISC-like core and innovative approach
to x86 decoding may propel it past the Pentiums of today,
but Intel isn't standing still.

Tom R. Halfhill

Originally it stood for Kryptonite, the mythical substance that could destroy Superman. But to avoid legal hassles with D.C. Comics, AMD changed the code name of its fifth-generation x86 chip to Krypton-5, or K5. That compromise is symbolic of AMD's long quest to challenge Intel's supremacy in microprocessors — a quest that has been dogged by legal obstructions and technical hurdles at every turn.

Since 1990, lawyers for AMD and Intel have been fighting tedious battles over contractual language and microcode in federal court. By 1994, it seemed that AMD was gaining the upper hand. But lawsuits were only part of AMD's troubles. For years, AMD has lagged at least a full generation behind Intel in x86 evolution, relying heavily on licensed technology and minor variations of Intel chips to carve out a pittance of market share. Intel's 1993 introduction of the Pentium — the first superscalar x86 processor — threatened to widen the gap and knock AMD out of the contest altogether (see "80x86 Wars," June BYTE).

But at AMD's labs in Austin, Texas, engineers had begun work on a new x86-compatible microprocessor family. Their goal was to create a series of chips that would leapfrog Intel's Pentiums and put an end to AMD's perennial follow-the-leader status. To reach that goal, they had to start working on the K5 before they knew any significant details about the Pentium. This ensured that the K5 would not be a derivative design, but it also put enormous pressure on the engineers to devise a superior microarchitecture without sacrificing software compatibility.

There was honor at stake, too, after Intel CEO Andy Grove denounced AMD as "the Milli Vanilli of semiconductors," a reference to the pop singers who were stripped of a Grammy award for lip-synching their songs. Grove dismissed AMD as a clone company: "Their last original idea was to copy Intel."

AMD CEO Jerry Sanders, visiting Austin for a quarterly strategy meeting in late 1993, declared that an independently designed fifth-generation x86 chip was AMD's top priority. Work had already started in July 1992, but incredibly, the team consisted of only two engineers: Mike Johnson, director of advanced processor engineering, who also fathered AMD's 29000 embedded RISC chip, and Dave Christie, who began modeling various design alternatives on a software simulator written by Johnson. Later, they were joined by Dave Witt, who became the design manager. But several months passed before the number of K5 engineers in the lab exceeded the number of lawyers in the courtroom.

After two years of effort, the result is a truly creative design that makes AMD's promise to catch Intel at least credible, though by no means assured. AMD says the K5 is its first x86-compatible CPU family that discards all vestiges of Intel's intellectual property (including microcode) while delivering better-than-Intel performance. According to AMD's simulations, the first K5-series chip will run real-world applications (e.g., Microsoft Word, Excel, and CorelDraw) about 30 percent faster than a Pentium at the same clock speed. AMD says it could do even better with artificial benchmarks such as SPEC and Dhrystone — and without optimized compilation.

Johnson credits the K5's performance to larger primary caches and a more aggressive superscalar design. Instead of the Pentium's twin integer pipelines, with their many restrictions on parallel execution, the K5 has a five-unit, four-issue superscalar architecture that unites a RISC-like core with a unique x86 instruction decoder.

The decoder, by far the most complex and fascinating part of the chip, strives to minimize the liabilities of the x86 instruction set by splitting the long CISC instructions into smaller RISC-like components called R-ops (RISC operations). These R-ops, in turn, are dispatched four at a time to a core that borrows heavily from RISC. Dynamic register renaming, branch prediction, speculative execution, and out-of-order execution — it's all there. The K5 implements a hybrid CISC/RISC technology that is also evident in the Nx586 chip from NexGen and will almost certainly be a larger feature of Intel's next-generation x86, code-named P6.

The K5's design had better be forward-thinking, because the competition isn't standing still. The P6 is expected to debut next year, when AMD will just be ramping up production of the K5. At the same time, NexGen will be shipping the Nx586 and Cyrix will introduce yet another fifth-generation x86, the M1 (see the text box "x86 Wars Update").

AMD claims the K5's microarchitecture has enough performance headroom to compete with all these processors, and that faster versions will soon follow. There's already talk about a K6 that could debut late next year or early in 1996. The K6 might depart even further from its Intel ancestry by abandoning pin-compatibility with Intel's x86 chips. (In contrast, the K5 is pin-compatible with the P54C-series Pentiums.) As with most wars of independence, AMD's struggle with Intel will not be won easily or cheaply.

CISC/RISC Fusion

Intel's public commitment to CISC notwithstanding, it is generally recognized that future gains in x86 performance will be achieved by working around the inherited limitations of the x86 instruction set. CISC was a good idea when Intel conceived the original 8086 in the 1970s and was trying to cram a rich instruction set onto a 29,000-transistor chip. But a processor like the K5, which incorporates 4.1 million transistors, has different priorities. It is bound by how fast it can fetch, decode, and execute instructions, not by the computational wealth of its instruction set.

The pure RISC approach would be to dump the x86 instructions altogether and replace them with modern, streamlined instructions. However, the x86's greatest liability is also its greatest asset: an instruction set that runs more software than any other architecture in the world.

The approach now being explored by all x86 engineers — including those at Intel, AMD, Cyrix, and NexGen — is to integrate CISC and RISC technologies without abandoning backward compatibility. Special attention is being focused on the x86 instructions themselves, which are troublesome because of their complexity and variable lengths. After evaluating several different solutions, AMD finally settled on a sophisticated decoder that turns complex x86 instructions into relatively simple and fast-executing R-ops.

R-ops bear a strong resemblance to the microcode instructions that are inherent in all x86 processors. Every x86 chip executes its most complex instructions as a sequence of microinstructions fetched from an internal microcode ROM, though the most recent x86 chips minimize the use of microcode by hard-wiring the simpler instructions. But the K5's R-ops are subtly different: The vast majority of them are generated on the fly by the decoder, not from microcode.

But microcode still handles the most complex and infrequently encountered x86 instructions, such as string operations and transcendentals. Even in those cases, however, the result is a stream of R-ops identical to those generated by the decoder. The R-ops have so much in common with RISC instructions that AMD used an assembler for its 29000 RISC processor during early phases of the K5's development.

The transition from x86 instructions to R-ops begins even before the K5 fetches from its primary I-cache (instruction cache). During I-cache loading, instructions are predecoded — every byte is tagged with additional bits of information. These tags mark the instruction boundaries, identify the various fields within each instruction, and (in the case of branch instructions) predict where the branch will go.

The purpose of this predecoding is to reduce the amount of work required later when the instructions enter the execution pipeline for final decoding. Just marking the instruction boundaries saves time, because x86 instructions can vary in length from 8 to 120 bits, so the processor has to figure out where one instruction ends before fetching the next. (RISC chips avoid this problem by using instructions that are always 32 bits long.) By marking the instruction boundaries during predecode, the K5 resolves these serial dependencies before the instructions even reach the cache.

Identifying the fields in each instruction helps, too. Later, when the processor fully decodes the instructions, it can quickly distinguish between op codes and their various operands.

All this predecoding happens in the same cycle as the cache prefetch, before the instructions enter the execution pipeline, so it doesn't add a stage to the pipeline or delay execution. It is also supervised by a coherency mechanism that watches for self-modifying code, another bane of x86 design.
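
To make the boundary-marking idea concrete, here is a minimal sketch in C of the kind of tagging a predecoder performs as bytes stream toward the cache. The three-entry length table and the single "starts here" bit are simplified assumptions for illustration, not AMD's actual predecode logic, which tags every byte with several bits of information.

/* Toy predecoder: tags each byte with a "starts an instruction" bit.
   The three-entry length table is an invented stand-in for the full
   variable-length x86 encoding. */
#include <stdio.h>

static int insn_length(unsigned char opcode)
{
    switch (opcode) {
    case 0x40: return 1;   /* INC EAX (1 byte)              */
    case 0x05: return 5;   /* ADD EAX, imm32 (5 bytes)      */
    case 0x74: return 2;   /* JE rel8, a short branch       */
    default:   return 1;   /* treat anything else as 1 byte */
    }
}

int main(void)
{
    unsigned char bytes[] = { 0x40, 0x05, 0x11, 0x22, 0x33, 0x44, 0x74, 0x02 };
    int n = sizeof bytes;
    int start_tag[16] = { 0 };

    /* Walk the byte stream once, marking where each instruction begins.
       Doing this at cache-fill time means the decoder never has to
       rediscover the boundaries serially later on. */
    for (int i = 0; i < n; i += insn_length(bytes[i]))
        start_tag[i] = 1;

    for (int i = 0; i < n; i++)
        printf("byte %d: 0x%02X %s\n", i, bytes[i],
               start_tag[i] ? "<-- instruction starts here" : "");
    return 0;
}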

One drawback of predecoding is that it makes the long x86 instructions even longer. To compensate, the K5's primary I-cache is twice as large as the Pentium's: 16 KB versus 8 KB. (Actually, the K5's I-cache has about 24 KB of array space, but AMD quotes it as having 16 KB because it's equivalent to a conventional 16-KB cache filled with untagged instructions.) Both the I-cache and the separate 8-KB data cache are linearly addressed and four-way set-associative, which is more efficient than the Pentium's two-way set-associative caches.

Instruction Fission

Final decoding is a two-cycle process that begins by fetching the tagged instructions from the I-cache into the decoder. "Decoder" is something of an understatement. This is really the heart of the chip, where x86 instructions are converted into R-ops and dispatched to the five functional units (two integer, one FPU, one branch, and one load/store). AMD refers to the decoder as the R-op mux (R-op multiplexer).

If you picture the CISC instructions as heavy atoms, the R-op mux is like a nuclear reactor that splits them into elemental RISC particles. Among other things, this fission allows the K5 more flexibility in arranging out-of-order execution. A single x86 instruction might break down into multiple R-ops that are dispatched to different functional units, executed separately, and then completed out of order. (Eventually, of course, to preserve software compatibility, the results are restored to their original program order.)

To start the fission process, up to 16 bytes are fetched from the I-cache in a single cycle. The bytes enter a special FIFO (first-in/first-out) queue called the byte queue. This queue is scanned for enough bytes of predecoded instructions to generate four R-ops. It is possible for a single byte in the queue to contain enough information to generate four R-ops, but more likely several bytes will be consumed. As bytes exit the queue, more are fetched from the I-cache. These bytes don't have to originate from the same cache block; indeed, they often represent different speculative threads of execution that are scattered throughout the cache.
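
Here's a toy model of that per-cycle budget. It collapses the queue's decoupled refill into a simple byte limit and assumes invented instruction lengths and R-op counts; the real byte queue also carries predecode tags and merges speculative cache lines, which this sketch ignores.

/* Toy model of the byte queue's per-cycle budget: consume whole
   predecoded instructions from the head of the queue until four R-ops'
   worth have been handed to the decoder (or 16 bytes are used up).
   Lengths and R-op counts are invented; instructions needing more than
   three R-ops would trap to microcode and are not modeled here. */
#include <stdio.h>

struct insn { int length; int rops; };  /* predecoded summary of one instruction */

int main(void)
{
    struct insn stream[] = {
        {1,1}, {5,2}, {3,2}, {2,1}, {6,3}, {1,1}, {4,2}, {2,1}
    };
    int n = sizeof stream / sizeof stream[0];
    int next = 0, cycle = 0;

    while (next < n) {
        int rops = 0, bytes = 0;
        cycle++;
        while (next < n &&
               rops + stream[next].rops <= 4 &&
               bytes + stream[next].length <= 16) {
            rops  += stream[next].rops;
            bytes += stream[next].length;
            next++;
        }
        printf("cycle %d: dispatched %d R-ops from %d instruction bytes\n",
               cycle, rops, bytes);
    }
    return 0;
}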

Next come the two decode stages. From the byte queue, the four R-ops' worth of instruction information is transferred to four identical decode positions and then converted into R-ops. The four R-ops don't necessarily bear a direct relationship to their antecedent x86 instructions. They may represent a single instruction, multiple instructions, or fragments of an x86 instruction. It all depends on the original instruction's complexity.

The simplest x86 instructions (e.g., a register-to-register add) map directly to single R-ops, but most instructions yield two or three R-ops. For example, a stack-relative memory-to-register add is broken up into a pair of R-ops: one to load the register and another to add the registers. A stack-relative register-to-memory add would yield three R-ops: one to load the register, one to add the registers, and one to store the result. Instructions that divide into three or fewer R-ops are called fast instructions.

If a complex x86 instruction (e.g., a string operation) requires more than three R-ops, it traps into the microcode ROM. This can produce hundreds of R-ops. However, the microcode sequencer generates these R-ops in clusters of four per cycle, so they're issued to the functional units in parallel, just like the fast-instruction R-ops. The sequencing continues until the complex instruction is finished. Then the byte queue resumes processing fast instructions. In practice, this microcode detour rarely happens, because today's smart programmers and compilers avoid the worst x86 instructions.
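
As a rough illustration of the decomposition, the following sketch maps a few x86 adds onto invented R-op mnemonics. The names, the temporary register t1, and the table itself are stand-ins for illustration; AMD has not published the actual R-op encoding.

/* Sketch of how fast x86 instructions split into R-ops.  The mnemonics
   and the temporary register t1 are invented for illustration. */
#include <stdio.h>

int main(void)
{
    /* Each row pairs an x86 instruction with the R-op sequence it yields. */
    static const char *decode[][2] = {
        { "ADD EBX, ECX     ", "ADDREG ebx, ebx, ecx" },
        { "ADD EAX, [ESP+8] ", "LOAD t1, [esp+8] ; ADDREG eax, eax, t1" },
        { "ADD [ESP+8], EAX ", "LOAD t1, [esp+8] ; ADDREG t1, t1, eax ; "
                               "STORE [esp+8], t1" },
    };
    int n = sizeof decode / sizeof decode[0];

    for (int i = 0; i < n; i++)
        printf("%s  =>  %s\n", decode[i][0], decode[i][1]);

    /* Anything that would need more than three R-ops (string moves,
       transcendentals) bypasses this path and traps into microcode,
       which emits its R-ops in clusters of four per cycle. */
    return 0;
}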

At this point, the fission is complete: The CISC instructions have been converted into quartets of fully decoded R-ops that, on average, are 59 bits long. That's longer than a true RISC instruction, but the extra bits carry decode information and operand fields that get separated later. What's important is that R-ops are much more manageable than x86 instructions, and the vast majority will execute in a single cycle.

Under optimal circumstances, the K5 can convert four x86 instructions into four corresponding R-ops in a single cycle, making it four-issue superscalar on both the x86 and RISC sides. In practice, it will attain optimal issue rates far more often on the RISC side than on the CISC side.

Point of Order

During the next stage, the R-ops are dispatched in parallel to a set of reservation stations, which act as queues for the functional units. Because the load/store unit is deemed the most vital, it has six stations. Other units have as few as two. Each integer unit (an ALU and an ALU/shifter) has two stations, making a total of four stations that can handle integer-type R-ops.

The functional units execute the R-ops at a peak rate of five per cycle, although the K5 can retire only four per cycle. The units are completely independent, so they are free to complete their instructions out of order if there are no dependencies (i.e., as long as the completion of one instruction does not depend on the result of a previous instruction).

Without some kind of reorder mechanism, of course, the K5 would play havoc with existing software. To ensure that results are retired in program order, each R-op gets an entry in a 16-slot reorder buffer, which keeps track of the original instruction sequence. When the reorder buffer gives the green light, the R-ops are retired in program order by writing their results to the architectural registers and the 8-KB dual-ported data cache. The K5 completion mechanism is analogous to the one used in the PowerPC 604.
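
Here is a minimal sketch of that retirement discipline, assuming a circular 16-slot buffer and a simple finished flag per entry. The real reorder buffer also forwards intermediate results and recovers from mispredicted branches, which this sketch omits.

/* Toy reorder buffer: R-ops may finish executing in any order, but their
   results update architectural state strictly in program order.  Entry
   fields and sizes are illustrative. */
#include <stdio.h>

#define ROB_SLOTS 16

struct rob_entry {
    int valid;      /* slot holds a dispatched R-op            */
    int finished;   /* a functional unit has produced a result */
    int result;     /* the value waiting to be written back    */
};

static struct rob_entry rob[ROB_SLOTS];
static int head = 0;    /* the oldest R-op not yet retired */

/* Retire up to four R-ops per cycle, oldest first; a younger R-op can
   never retire ahead of an older one, which restores program order. */
static void retire_cycle(void)
{
    for (int i = 0; i < 4; i++) {
        struct rob_entry *e = &rob[head];
        if (!e->valid || !e->finished)
            break;                      /* oldest not done: everyone waits */
        printf("retire slot %d, result %d\n", head, e->result);
        e->valid = 0;
        head = (head + 1) % ROB_SLOTS;
    }
}

int main(void)
{
    for (int i = 0; i < 3; i++)      /* dispatch three R-ops in program order */
        rob[i].valid = 1;

    rob[0].finished = 1; rob[0].result = 10;
    rob[2].finished = 1; rob[2].result = 30;   /* finishes out of order */

    retire_cycle();    /* retires only slot 0; slot 2 must wait for slot 1 */
    rob[1].finished = 1; rob[1].result = 20;   /* the slow R-op finally finishes */
    retire_cycle();    /* now slots 1 and 2 retire, still in program order */
    return 0;
}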

In addition, the reorder buffer makes sure instructions are completed in their entirety before yielding to an exception. Otherwise, an instruction might be left partially completed if it was split into multiple R-ops that were executing out of order when the exception occurred. (In cases where the x86 lets some complex instructions be interrupted by an exception, the K5 allows this, too.)

The reorder buffer is also responsible for register renaming, another RISC retrofit for the x86. A well-known limitation of the x86 architecture is its eight GPRs (general-purpose registers), a rather sparse register file by today's standards. The K5 has 16 physical GPRs, any of which can be renamed to represent the eight logical registers that x86 software expects to see.
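
A rough picture of the renaming, assuming a simple mapping table and free list (the K5 ties its renaming to the reorder buffer, and the details differ):

/* Toy register renamer: eight logical x86 GPRs mapped onto 16 physical
   registers.  Each new result gets a fresh physical register, so a
   speculative or out-of-order write never clobbers committed state.
   The free "list" here is a trivial counter; real hardware recycles
   physical registers as R-ops retire. */
#include <stdio.h>

#define NUM_LOGICAL   8     /* EAX..EDI, what x86 software sees  */
#define NUM_PHYSICAL 16     /* what the chip actually implements */

static int map[NUM_LOGICAL];          /* logical register -> current physical */
static int next_free = NUM_LOGICAL;

static int rename_dest(int logical)   /* allocate a new physical destination */
{
    int phys = next_free;
    next_free = (next_free + 1) % NUM_PHYSICAL;
    map[logical] = phys;
    return phys;
}

int main(void)
{
    for (int i = 0; i < NUM_LOGICAL; i++)
        map[i] = i;                   /* start out identity-mapped */

    /* Two back-to-back writes to EAX (logical register 0): each gets its
       own physical register, so the second can proceed speculatively
       without destroying the first result before it retires. */
    printf("EAX write #1 -> physical r%d\n", rename_dest(0));
    printf("EAX write #2 -> physical r%d\n", rename_dest(0));
    printf("later read of EAX uses physical r%d\n", map[0]);
    return 0;
}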

Branch prediction is handled a little differently than in other advanced microprocessors. Instead of maintaining a separate branch target buffer to hold the addresses of predicted branches, the K5 appends the predicted address to the branch instruction during predecode. This 10-bit tag, called a successor index, points to a target within the I-cache.

At first, all predecoded branch instructions are predicted not taken. Later, if speculative execution reveals that the prediction was wrong, the prediction is reversed by writing a new successor index that points to the correct cache block. That prediction remains in effect until it proves wrong again; in effect, the predictor simply flips its guess after every misprediction.
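
Here is a minimal sketch of that flip-on-mispredict behavior, with the successor index reduced to a single taken/not-taken bit per cache block; the real tag also encodes where in the I-cache the predicted target lives.

/* Sketch of flip-on-mispredict branch prediction: one prediction per
   16-byte cache block, stored with the block's predecode tags.  The
   successor index is reduced here to a taken/not-taken bit. */
#include <stdio.h>

#define BLOCKS 1024                /* 16-KB I-cache / 16-byte blocks */

static int predict_taken[BLOCKS];  /* every block starts out "not taken" */

static void resolve_branch(int block, int actually_taken)
{
    int predicted = predict_taken[block];
    printf("block %d: predicted %-9s actual %-9s %s\n", block,
           predicted      ? "taken" : "not taken",
           actually_taken ? "taken" : "not taken",
           predicted == actually_taken ? "(hit)" : "(mispredict -- flip)");
    predict_taken[block] = actually_taken;   /* the prediction simply tracks
                                                the last outcome */
}

int main(void)
{
    /* A loop-closing branch in block 7: taken three times, then falls
       through on loop exit.  The predictor misses once at each change. */
    resolve_branch(7, 1);
    resolve_branch(7, 1);
    resolve_branch(7, 1);
    resolve_branch(7, 0);
    return 0;
}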

This is one reason why the cache blocks are only 16 bytes in size. The K5 can predict only one taken branch per block, so a smaller block reduces the chance that an instruction will branch to another branch in the same block. A 32-byte cache block would reduce performance, according to AMD's simulations.

Although the branch prediction is "dynamic" in the sense that it adapts to wrong predictions at run time, it does so merely by reversing its predictions in a binary flip-flop. In contrast, some of the latest RISC processors use algorithms that dynamically predict the outcome of branches by keeping track of how often a particular branch is actually taken. But RISC chips don't have to bother with complicated x86 decoding. By adopting a somewhat simpler form of branch prediction, the K5 keeps an already complicated decoder from becoming even more labyrinthine.

There is another advantage to the K5's approach: In effect, it predicts branches over a larger sample of the program than other methods. Branch target buffers have a limited number of entries, usually a few dozen. However, the K5 can theoretically predict a branch in every cache block. Since the block size is 16 bytes and the I-cache is 16 KB, that's potentially 1024 branches. This larger sample — coupled with the K5's flexible cache fetching — partly offsets its less sophisticated predictions. Of course, when the cache is flushed, all the prediction states are lost, too, because they're tagged to the instructions instead of being held in a branch target buffer.

To make this whole mechanism complete, the K5's byte queue can trigger a special signal called BQ confused. It waves this flag when the predecoded instructions don't appear to make sense because of a mispredicted branch or some other anomaly. The signal wipes out the incoherent cache blocks and reloads them with freshly predecoded instructions. Johnson says this rarely happens, but it is so reliable that it once masked a bug in the K5's critical logic path during the chip's early development. Not even AMD would claim the K5 is a fault-tolerant processor, but it's comforting to know there's a mechanism of last resort robust enough to handle a logic glitch and confused code.

RISC to the Core

We've paid relatively little attention to the K5's core because once the x86 instructions are fully decoded into R-ops and dispatched to the functional units, the K5 core is basically a conventional RISC chip. The K5 core was closely patterned after an upcoming superscalar version of the 29000. Indeed, both the K5 and the new 29000 implement the theories expounded in Johnson's book, Superscalar Microprocessor Design (Prentice Hall, 1990). Johnson and Christie modeled early designs of the K5 with a simulator called T-Sim that Johnson wrote for his book. It's interesting to note the resulting architectural differences between the K5 and the Pentium.

The biggest difference is that the K5 has five parallel functional units instead of two parallel integer pipelines. Like the Pentium, it can execute two integer operations simultaneously; but unlike the Pentium, it can also execute a floating-point instruction, a load/store, or a branch at the same time. The larger register file and a load/store unit that can perform two operations per cycle keep memory fetches to a minimum.

Another key difference is that the K5 allows out-of-order execution, while the Pentium does not. Overall, the K5 takes a broader approach to superscalar issue than Intel's fifth-generation chip.

One place where AMD cut corners was the FPU. Although the K5's FPU is adequate by x86 standards, it's not as fast as the Pentium's, which has more dedicated logic to make it more competitive with RISC chips. But even Intel says floating point is not particularly important for real-world PC software, so AMD's trade-off is reasonable.

AMD says the K5 was designed to deliver more performance headroom than the Pentium. According to AMD's simulations, adding cache to the K5 yields relatively more performance than adding cache to the Pentium, because the K5 isn't as close to its limit of core saturation. Adding cache is much easier than designing a faster core, so AMD hopes the K5 will remain competitive even when Intel debuts the P6 next year. Of course, this assumes the P6 won't introduce a significantly better architecture. If it does, AMD might not be any closer to catching Intel than it is now.

Nevertheless, the K5 proves that AMD can design a competitive x86-compatible CPU that isn't merely an Intel clone. From its unique R-op mux to its quad-issue superscalar pipeline, the K5 boasts a clearly innovative microarchitecture that inherits only what it must to remain compatible. Indeed, it's possible that Intel's P6 will more closely resemble the K5 than the K5 resembles the Pentium. If nothing else, the K5 will stand as AMD's declaration of independence.

AMD's K5: What's New

— Fifth-generation x86-compatible microprocessor family
— Four-way issue superscalar architecture
— Five-stage pipeline
— Five parallel functional units: ALU, ALU/shifter, FPU, branch,
and load/store
— 16-KB instruction cache, 8-KB dual-ported data cache, linear
addressing, four-way set-associative
— Out-of-order execution, branch prediction, and speculative execution
— AMD's first completely non-Intel microcode
— Claimed performance: 30 percent faster than the Pentium at the same
clock speed. SPEC ratings not yet available
— Initial clock speed: 100 to 120 MHz — up to 150 MHz in 1996
— Internal/external clock ratios: 1x, 1.5x, 2x, and 3x
— 4.1 million transistors
— 3.3-V, three-layer metal, fully static CMOS
— Initial version on 0.5-micron process, moving to 0.35 micron in 1996
— Pin-compatible with Intel P54C Pentium
— Samples this year, volume production in 1995
— Price: not announced but expected to be the same as P54C Pentium


Illustration: K5 Pipeline: Melding CISC and RISC. As the K5 prefetches variable-length x86 instructions into its primary cache, it tags them with predecode information that will assist full decoding later. Every cycle, the chip fetches up to 16 bytes' worth of tagged instructions from the instruction cache into a byte queue. The byte queue stores the x86 bytes and predecode information from a stream of speculative cache lines and merges them with the currently accessed cache line. From there, a two-stage decoder converts the x86 instructions into relatively simple, fast-executing R-ops and then dispatches them to five independent functional units. These units can execute as many as five instructions per cycle in any order. Before retiring the results, a reorder buffer holds the speculative instructions in original program order to forward intermediate results and recover from mispredicted branches and exceptions.

Illustration: Stage Timing in the K5 Pipeline. The stages of the K5 pipeline provide more evidence of its RISC-like nature. Unlike traditional 486/Pentium pipelines, which have a separate address-generate stage, the K5 uses an execution unit — the load/store unit — to handle external memory accesses. Also, the K5 adds a retire stage in which results are finally allowed to update the architectural state of the processor. As with the 486 and Pentium, instructions are said to complete after the fifth stage, with the retirement stage needed to reorder instructions that execute out of order.

Tom R. Halfhill is a BYTE senior news editor based in San Mateo, California. You can reach him on the Internet or BIX at thalfhill@bix.com.


Letters / February 1995

RISC Registers

I read "AMD vs. Superman" (November 1994) and was wondering how the registers are selected for the r-ops (RISC operations). Are they predefined? If so, how do they interact with the compiler-generated ones? Wouldn't you need to introduce some additional registers to hold the temporary results?

Nader Bagherzadeh
nader@ece.uci.edu

AMD's K5 has twice as many physical GPRs (general-purpose registers) as a conventional x86 processor, and it dynamically renames those 16 GPRs to represent the architectural set of eight logical GPRs. Temporary/intermediate results are held in a physical register until validated, and then the physical register is renamed as a logical register. This is becoming a common technique in modern processors. It's also used in the Cyrix M1 and the Mips T5 (R10000). Basically, it's just a way of getting around the limited register files of existing CPU architectures. — Tom Halfhill

Copyright © 1994-1998 BYTE
