Microbytes / October 1992

Intel Reveals Tantalizing P5 Details

Tom R. Halfhill

When Intel's P5 microprocessor is released early next year, the new chip will maintain full compatibility with the 486 and eventually deliver four to 10 times the performance, claim Intel engineers. Thanks to its superscalar architecture, parallel-integer pipelines, intelligent branch predictions, and improved floating-point math, the first version of the P5 will crunch more than 100 MIPS at its clock speed of 66 MHz. That's about twice as fast as a 66-MHz 486DX2.
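As a rough check on those figures, the sketch below converts the quoted numbers into average instructions per clock cycle. It assumes that "MIPS" here means native instructions per second and reads "about twice as fast" as exactly a factor of two; neither assumption is confirmed by Intel's presentation, and the function name is purely illustrative.

```python
def instructions_per_cycle(mips: float, clock_mhz: float) -> float:
    """Average instructions completed per clock cycle."""
    # MIPS is millions of instructions per second and MHz is millions of
    # cycles per second, so the ratio is instructions per cycle.
    return mips / clock_mhz


p5_ipc = instructions_per_cycle(100, 66)        # figures quoted in the article
dx2_ipc = instructions_per_cycle(100 / 2, 66)   # assumption: exactly half the P5 rate

print(f"P5: {p5_ipc:.2f} instructions/cycle, 486DX2: {dx2_ipc:.2f}")
```

A result above 1.0 for the P5 is possible only because two instructions can complete in the same cycle; a single-pipe chip cannot average more than one.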

These and other details of the P5 were revealed at Stanford University's Hot Chips Symposium this summer. Cautious and cagey, two Intel engineers prefaced their talks with legal statements mandated by corporate attorneys and were sometimes hissed at for dodging technical questions. But the eager audience of engineering students and observers from rival companies got a tantalizing glimpse at the inner workings of the new CPU.

While the 486 uses 1-micron technology to pack 1.2 million transistors on a chip, the P5 uses 0.8-micron technology and has 3 million transistors. Although the P5 is the successor to the 486, it won't be called the 586. Instead, Intel is running a contest for employees to suggest a new name.

At Stanford, one of Intel's presentations revealed that the P5's integer pipeline resembles that of the 486, but it is split into two parallel pipes. After prefetching and partially decoding an instruction, the P5 decides whether it can be executed in parallel with the instruction that follows. If the P5 doesn't detect any dependencies, the two instructions are dispatched along parallel pipes for execution.

An instruction-issue algorithm in the P5 dispatches consecutive instructions along separate pipes only if the instructions meet the following conditions: both are considered simple instructions (mostly ALU, JUMP, and MOV operations), the first instruction is not a JUMP, and the destination of the first instruction is neither the source nor the destination of the second instruction. Intel says that more than 30 percent of all instructions meet these conditions and execute in parallel. Although that may not sound impressive, the theoretical limit is 50 percent, which would be reached only if half of all instructions were dispatched along each of the two pipes.
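The three pairing conditions can be sketched in a few lines of code. This is only an illustration of the rule as described at Stanford, not Intel's actual issue logic: the `Instr` record, the `SIMPLE_OPS` set, and the register names are hypothetical, and "mostly ALU, JUMP, and MOV" is simplified to exactly those three kinds. The `paired_fraction` helper reflects one reading of the 50 percent limit, counting the instructions that issue as the second half of a pair.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Simplification: the article says "mostly" these three kinds.
SIMPLE_OPS = {"ALU", "JUMP", "MOV"}


@dataclass(frozen=True)
class Instr:
    kind: str                      # "ALU", "MOV", "JUMP"; anything else is complex
    dest: Optional[str] = None     # destination register, if any
    srcs: Tuple[str, ...] = ()     # source registers


def can_pair(first: Instr, second: Instr) -> bool:
    """Apply the three conditions described in the article."""
    # 1. Both must be simple instructions.
    if first.kind not in SIMPLE_OPS or second.kind not in SIMPLE_OPS:
        return False
    # 2. The first instruction must not be a JUMP.
    if first.kind == "JUMP":
        return False
    # 3. The first destination is neither source nor destination of the second.
    if first.dest is not None and (
        first.dest in second.srcs or first.dest == second.dest
    ):
        return False
    return True


def paired_fraction(program: Sequence[Instr]) -> float:
    """Fraction of instructions issued as the second half of a pair (max 0.5)."""
    i = paired = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            paired += 1
            i += 2          # both instructions issue in the same cycle
        else:
            i += 1          # instruction issues alone
    return paired / len(program) if program else 0.0


program = [
    Instr("MOV", "eax", ("ebx",)),
    Instr("ALU", "ecx", ("edx",)),   # independent of the MOV: pairs with it
    Instr("ALU", "eax", ("ecx",)),
    Instr("MOV", "esi", ("eax",)),   # reads eax written just before: no pairing
]
print(paired_fraction(program))
```

In this four-instruction example only the first two instructions pair, so one instruction in four travels down the second pipe.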

Two other key parts of the integer unit are a branch target buffer and a dual-access data cache. The branch target buffer predicts the outcome of branches; if the prediction is correct, the branch executes without delay. Intel says the penalty for incorrect predictions is more than offset by the hits. The dual-access cache handles both data and addresses from the twin pipes and contains logic for resolving address dependencies.
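Intel's claim that hits more than offset the misprediction penalty can be framed as a simple expected-value calculation. Intel did not disclose hit rates or cycle counts, so every number below is hypothetical, as are the function and parameter names; the sketch shows only the shape of the trade-off.

```python
def net_cycles_saved_per_branch(
    hit_rate: float, cycles_saved_on_hit: float, penalty_on_miss: float
) -> float:
    """Expected cycles saved per branch; positive means prediction pays off."""
    return hit_rate * cycles_saved_on_hit - (1.0 - hit_rate) * penalty_on_miss


def break_even_hit_rate(cycles_saved_on_hit: float, penalty_on_miss: float) -> float:
    """Hit rate at which the gains from hits exactly cancel the miss penalties."""
    return penalty_on_miss / (cycles_saved_on_hit + penalty_on_miss)


# Hypothetical figures, not from Intel: a hit saves 2 cycles, a miss costs 3.
threshold = break_even_hit_rate(2, 3)
gain = net_cycles_saved_per_branch(0.8, 2, 3)
print(f"break-even hit rate: {threshold:.0%}, net gain at 80% hits: {gain:.1f} cycles")
```

Under these made-up numbers, prediction is a net win whenever the buffer guesses correctly more often than the break-even rate, which is the condition Intel says the P5 meets.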

The FPU has three dedicated arithmetic units - a multiplier, a divider, and an adder - plus an eight-stage pipeline that's integrated with the integer pipeline but includes two more execution stages. Although all floating-point computations execute in a single pipe, they work concurrently with the dual-access cache. As a result, the pipeline achieves one-cycle throughput. Under certain conditions, the FPU may stall for three cycles. However, Intel says these exceptions are so rare that none occurred in the SPECmark benchmark.
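The difference between an eight-stage pipeline and one-cycle throughput is easy to blur, so here is a small worked model. It assumes an idealized pipeline fed with back-to-back independent operations and treats each stall as adding exactly three cycles; the function and its parameters are illustrative and not taken from Intel's presentation.

```python
def fpu_cycles(n_ops: int, depth: int = 8, stalls: int = 0, stall_cycles: int = 3) -> int:
    """Cycles to finish n_ops operations in an idealized pipeline."""
    if n_ops <= 0:
        return 0
    # The first result appears after `depth` cycles; with one-cycle throughput
    # each later operation completes one cycle after the previous one.
    return depth + (n_ops - 1) + stalls * stall_cycles


single = fpu_cycles(1)              # latency of one operation
stream = fpu_cycles(100)            # 100 operations with no stalls
stalled = fpu_cycles(100, stalls=2) # same stream with two three-cycle stalls
print(single, stream, stalled)
```

For a long stream the cost per operation approaches one cycle, which is why an occasional three-cycle stall matters little if it is as rare as Intel says.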

Although the FPU is tuned for double-precision memory-to-register operations (the most common type expected), Intel says that single-precision and register-to-register operations are just as fast. New algorithms for transcendental computations are said to yield more accurate results than those of previous Intel FPUs.

For backward compatibility, the P5 maintains the 80x86 eight-register stack and uses the top register as an accumulator. To avoid logjams, the registers are shuffled by executing the FXCH (F-exchange) instruction in parallel with other operations. This ensures that the next value to be manipulated always resides at the top of the stack.
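The role of FXCH is easier to see with a toy model of the register stack. The sketch below models only the architectural behavior of the 80x87 stack (FXCH swaps the top register with another one, and arithmetic works on the top register); it says nothing about how the P5 overlaps FXCH with other operations internally. The `FpuStack` class and its method names are hypothetical.

```python
class FpuStack:
    """Toy model of the eight-register floating-point stack."""

    SIZE = 8

    def __init__(self) -> None:
        self.regs: list = []           # regs[0] is the top of stack, ST(0)

    def fld(self, value: float) -> None:
        """Push a value; it becomes the new ST(0)."""
        if len(self.regs) >= self.SIZE:
            raise OverflowError("FPU stack overflow")
        self.regs.insert(0, value)

    def fxch(self, i: int = 1) -> None:
        """Swap ST(0) with ST(i)."""
        self.regs[0], self.regs[i] = self.regs[i], self.regs[0]

    def fadd(self, i: int) -> None:
        """ST(0) = ST(0) + ST(i)."""
        self.regs[0] += self.regs[i]

    def fmul(self, i: int) -> None:
        """ST(0) = ST(0) * ST(i)."""
        self.regs[0] *= self.regs[i]


# Compute (a + b) * c after loading a, b, c in that order.
s = FpuStack()
for v in (1.0, 2.0, 4.0):   # a, b, c; after the loads the stack is [c, b, a]
    s.fld(v)
s.fxch(2)                   # bring a to the top: [a, b, c]
s.fadd(1)                   # ST(0) = a + b
s.fmul(2)                   # ST(0) = (a + b) * c
print(s.regs[0])
```

Because every arithmetic result lands in the top register, code like this needs frequent exchanges, which is why making FXCH effectively free matters for a stack-based FPU.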

Copyright 1994-1997 BYTE
