State of the Art / November 1994

Brute Force: Mips T5

Mips Technologies' T5 chip takes an aggressive approach
to superscalar dispatch that shows how far today's engineers
must go to deliver cutting-edge performance

Tom R. Halfhill

Microprocessor design is a never-ending cycle of eliminating bottlenecks and thereby creating new ones. When CPUs outran the ability of memory chips and I/O buses to keep them fed with instructions, the solution was to widen the bus, add high-speed caches, and simplify the instructions so that they took less time to process. When the resulting instruction stream surged beyond the capacity of the core, the answer was to deepen the execution pipeline and add multiple functional units with parallel pipes. Then I/O became a problem again, leading to even wider buses and larger caches. And so it goes.

This tug-of-war between bandwidth and horsepower won't end until all known techniques are exhausted or the cost of diminishing returns becomes prohibitive. Even though that wall isn't yet in sight, today's CPU architects are turning to increasingly radical architectures in their quest for leading-edge performance. Witness the new fifth-generation Rx000-series processor from Silicon Graphics, Inc./Mips Technologies (Mountain View, CA).

The chip's code name is T5, or the Terminator — referring to the 1991 movie Terminator 2: Judgment Day, with its Oscar-winning special effects, which were created with Mips-powered SGI workstations. The T5 is based on the same fifth-generation superscalar RISC technology introduced earlier this year in the Mips R8000, a supercomputer processor. But while the R8000 is a multichip module optimized for high-end scientific calculations, the T5 is a general-purpose single-chip processor for desktop PCs, workstations, and servers. It offers a better balance between integer and floating-point performance than the R8000, making it more suitable for mainstream applications. The T5 is designed to be equally at home in PCs running Windows NT, workstations running Unix, or multiprocessor servers for transactional databases.

As the first single-chip superscalar processor from Mips, the T5 represents a significant step forward for the Rx000 architecture. Behind it is a strong RISC heritage — the original R2000 in 1985 was the first commercially available RISC chip, and one of Mips' founders was John L. Hennessy, a pioneer researcher in RISC technology. The T5 builds on preceding generations (R2000, R3000, R5000, and R4000) by incorporating five functional units, twice as much primary cache as the R4400, twice as many registers, dynamic-register renaming, dynamic-branch prediction, speculative execution, out-of-order execution, and multiprocessor support for up to four CPUs on a special cluster bus.

At its initial core speed of 200 MHz, the T5 is expected to deliver 250 SPECint92 and 350 SPECfp92 — and that's with R4000 binaries that haven't been optimized for the T5. Although not as fast as DEC's new Alpha 21164 (see "Alpha Rides High," October BYTE), the T5 offers a distinct price/performance ratio advantage over the Alpha and the Pentium. It will keep SGI's graphics workstations competitive and maintain the price/performance advantage of Mips-based NT boxes.

Mips says the T5 is even faster than synthetic benchmarks indicate because it's more tolerant of cache misses than other processors. The T5's ability to tolerate up to four misses without stalls should make it a particularly good CPU for busy database servers, because transaction processing is I/O-intensive and requires huge secondary caches to avoid the penalty of frequent misses.

Beyond Brute Force

By far the most impressive features of the T5 are its high degree of parallelism and dynamic instruction scheduling. The goal of this design is to combine the brute force of multiple functional units with the finesse of speculative, out-of-order execution.

Previous Rx000-series chips had scalar pipelines that always executed instructions in their original program order. In contrast, the T5 has five independently pipelined functional units, each of which can execute and complete one instruction per cycle without regard to program order. Furthermore, the T5 can predict the outcome of branches and speculatively execute the code that follows. Only when all dependencies are resolved are the results of the completed instructions retired (or graduated, in Mips parlance) and restored to their original sequence. Up to four results can be graduated per cycle.

Out-of-order execution is an increasingly common technique for managing the growing resources of advanced microprocessors. Brute force in the form of higher clock speeds, wider buses, larger caches, and more pipelines has its place but also its limits. Soon you reach a point of diminishing returns where the functional units are stalled because they don't have enough to do. Optimized compilers try to address this problem by reordering the instructions to take advantage of a particular CPU's requirements.

In effect, out-of-order execution is a partial substitute for optimized compilation, because it shifts the burden of instruction ordering from the compiler (or the assembly language programmer) to the CPU itself. The CPU, not the compiler, rearranges instructions to match the availability of resources — and it does so dynamically at run time, not statically at compile time.

If this technique is carried far enough in future designs, optimized compilers and hand-tuned assembly language may become as obsolete as punch cards and toggle switches. The T5 doesn't go quite that far, but it does represent the state of the art, matching or exceeding the sophistication of dynamic issue in DEC's Alpha 21164, IBM/Motorola's PowerPC 620, AMD's K5, and Cyrix's M1.

Wall-to-Wall Superscalar

Mips isn't taking a tentative step toward superscalar with the T5; this chip is thoroughly parallel from front to back (see the figure "T5 Block Diagram").

Each cycle, the T5 can fetch four 32-bit instructions from its 32-KB, two-way set-associative instruction cache, or I-cache. Not only is that cache twice as large as the R4000's, but its two-way set-associativity is also an improvement over the R4000's direct mapping, yielding an effective 4 increase in cache efficiency (see "Cache Advantage," August BYTE). An on-chip controller supports a secondary cache ranging in size from 512 KB to 16 MB, which is also two-way set-associative.

Cached instructions are actually 37 bits long, because five extra bits are appended during a predecode stage when the instructions are prefetched into the cache. The extra bits assist full decoding by classifying the instructions according to various attributes and by preassigning them to execution units. This prefetch/predecode stage is not counted as one of the main pipeline stages.

After the instructions are fetched from the I-cache, they pass through a two-stage decoder. Actual decoding takes only one stage, however; the second stage is for register renaming.

Dynamic-register renaming is a way of expanding a processor's register file without jeopardizing software compatibility. While previous Rx000 processors had 32 integer and 32 floating-point registers, the T5 has 64 integer and 64 floating-point registers, all 64 bits wide. In each of these register files, any of the 64 physical registers can be dynamically renamed to represent the architectural register file of 32 logical registers. Thus, programs continue to see only 32 registers, but the processor has twice as many registers for storing internal values.

This technique is critically important for speculative and out-of-order execution, because it allows the T5 to store intermediate results and speculative results in the "invisible" registers. The results then become visible to the program when all dependencies have been resolved and the speculative paths have been validated.

To keep track of what's going on, the T5 maintains an active list of occupied registers and a free list of available registers. Registers on the active list can have two states: active (i.e., currently in use by an executing instruction) or completed (i.e., the final result of an executed instruction). Up to 32 instructions can be active at a time. After a completed result is graduated and no longer needed, the register is removed from the active list and added to the free list. Speculative execution can continue as long as free registers are available, and register renaming takes only a single cycle (see the figure "Register Renaming in the T5").

Register renaming also plays a crucial role in branch prediction. It's part of a clever mechanism that lets the T5 quickly abort a speculative path if a branch was wrongly predicted.

Here's how it works. The T5 speculates on every branch, up to four branches deep. At each of these junctures, the T5 takes a snapshot of the register states — what Mips calls a shadow map of the register-rename map as it existed at that moment. If it later turns out that a branch was mispredicted, the T5 doesn't have to flush any buffers or clear any registers. It merely restores the appropriate shadow map as the working register-rename map and then adds any registers holding invalid results to the free list. This takes only one cycle.

Therefore, the penalty for mispredicting a branch varies from one to four cycles, depending on when the T5 realizes it has guessed wrong. The worst case is when the T5 pursues speculative paths through four nested branches and then discovers in reverse order that it mispredicted each one. Usually, however, the oldest mispredicted branch in a series will be discovered first, because the CPU has had time to execute more instructions along the speculative path to which it leads. In that case, the penalty for aborting all four branches is only a single cycle. Any branches that follow an invalid branch must themselves be invalid, so the T5 just restores the shadow map for the oldest valid branch in the speculative tree.

Branch prediction is also dynamic, adapting to the program as it runs. The T5 records the history of each branch by setting a 2-bit flag that defines four possible states: strongly taken, weakly taken, weakly not taken, and strongly not taken. According to Mips, the T5 correctly predicts branches more than 90 percent of the time — a significant factor in real-world performance, because the integer code typically found in mainstream applications averages a branch every six instructions.

Five-Way Execution

Even though the T5 can fetch four instructions per cycle and graduate four results per cycle, it has five execution units in between. Potentially, instructions can be issued to all five units at once, and each unit can execute and complete an instruction every cycle. The T5 is therefore something of a cross between a four-way and a five-way superscalar processor, but that apparent mismatch is no accident. By providing more peak bandwidth in the core, the T5 can allocate its resources more efficiently and has more headroom for future growth.

Functional units include two ALUs for integer instructions, a load/store unit, and two FPUs — one for addition and another for multiply/divide/square root calculations. The latter FPU is actually a pair of subunits (a multiplier and a divider/square root) that share the same issue and completion logic. They can execute single- or double-precision division and square-root operations in parallel.

The two ALUs are nearly identical, but one can handle integer multiplication/division (in multiple cycles) and the other has some logic for verifying predicted branches that depend on integer register values. The load/store unit handles all address calculation and translation, and it can calculate one address per cycle. It translates 44-bit virtual addresses into 40-bit physical addresses using a 64-entry, fully associative TLB (translation look-aside buffer). Each TLB entry can address two pages ranging in size from 4 KB to 16 MB.

All five pipelines have the fetch-decode-rename stages described above, as well as at least one execution stage. Counting the graduation stage at the end, therefore, the minimum pipeline depth is five stages.

Instructions remain in program order as they pass through the first three stages. Then they enter a trio of queues to await issue to their appropriate execution units. Each of these queues (one for the ALUs, one for the FPU, and another for the load/store unit) has 16 entries, and they can issue from any position in the queue. In other words, this is where the instruction stream starts to deviate from original program order.

Under certain conditions, the T5 can issue up to five instructions per cycle from these queues. In most cases, however, it will issue one to four instructions, depending on the instruction mix.

As usual for a RISC chip, the vast majority of instructions execute in a single cycle. Single- or double-precision floating-point multiply, add, subtract, compare, and convert operations require two cycles, although multiplications need a third cycle to transfer the result. More complex floating-point instructions that are computed iteratively may consume numerous cycles.

Load operations require two cycles if they hit the data cache, and they can be executed speculatively and out of order. This is important because loads account for about 20 percent of all instructions, and the speculative execution of other instructions will screech to a halt if they depend on data that isn't available yet.

This also explains why Mips doubled the size of the data cache, from 16 KB in the R4000 to 32 KB in the T5. A heavily pipelined processor like the T5 can gobble data at a prodigious rate. The cycle times of memory and microprocessors were about the same 10 years ago; today, we have 60-nanosecond DRAM mated to screamers such as the T5, which at 200 MHz has a cycle time of only 5 ns. And because the T5 graduates up to four instructions per cycle, at times it's equivalent to a scalar CPU running at 800 MHz.

During the final pipeline stage, completed instructions aren't allowed to graduate until any dependencies are resolved and speculative paths are verified. The T5 also maintains precise exceptions: Any completed instructions following the instruction that caused the exception aren't permitted to graduate, either.

During graduation, physical registers are renamed as logical registers to validate their results, and the oldest completed instructions always graduate first. This restores the instruction stream to its original program order.

Mips Dynamics

As long as the T5 doesn't trigger an exception, mispredict too many branches, run out of free registers, overflow its queues, or miss the cache, it will deliver something close to its peak throughput. Although that seems like a lot of ifs, simulations indicate that the T5 is very efficient. As with all CPUs running real-world software, it spends a fair amount of time recovering from cache misses, but when everything clicks, the pipelines gush oil. Mips says that existing R4000 binaries appear to run almost like optimized code, a tribute to the effectiveness of the T5's dynamic scheduling.

"There is less need to recompile for the T5 than there is for a classic superscalar," says John R. Mashey, director of systems technology at Mips and one of the 90 engineers who designed the T5. "The hardware is doing a lot of what compilers sometimes had to do. When you're designing a compiler for a classic static-issue superscalar, you sometimes work real hard to arrange things, like moving loads and branches further apart, being careful not to have two loads in a set of a couple instructions, because you know the chip is only going to single-issue. Here, you just kind of don't care."

MIPS T5: What's New

— 64-bit RISC microprocessor, Rx000-compatible design.
— Mips' first single-chip superscalar CPU.
— Dynamic-branch prediction and speculative execution up to four levels deep.
— Out-of-order execution.
— Five functional units: two integer, two floating point, and load/store.
— Executes up to five instructions per cycle, graduates (retires)
up to four per cycle.
— 64 integer registers and 64 floating-point registers, dynamically mapped
to 32 integer and 32 floating-point logical registers.
— 32-KB, two-way set-associative instruction and data caches.
— On-chip control for up to 16 MB of secondary cache.
— Multiprocessor cluster bus supports up to four CPUs.
— 200-MHz internal clock; external clock programmable for 200, 133, 100,
80, 67, 57, or 50 MHz.
— 64-entry TLB, 44-bit virtual addressing.
— Estimated performance: 250 SPECint92, 350 SPECfp92.
— More than six million transistors.
— Fully static CMOS, 3.3-V, four-layer metal, 0.35-micron process.
Approximately 290mm2 die.
— Estimated power consumption: 20 to 30 W maximum.
— Sampling scheduled for late this year or early 1995; volume production
by late 1995.
— Estimated price: $1000 to $1200.

Figure: T5 Block Diagram. The Mips T5 is a cross between a four-way and a five-way superscalar design. It can fetch up to four instructions per cycle, dispatch up to five instructions to its five execution units per cycle, and graduate four results per cycle. It supports out-of-order execution, branch prediction up to four levels deep, speculative execution, and dynamic-register renaming.

Figure: Register Renaming in the T5. Any of the T5's 64 integer and 64 floating-point physical registers can be renamed to represent any of the 32 integer or 32 floating-point logical (architectural) registers. Whenever an instruction needs a register, a physical register from the free list is added to the active list. When all dependencies are resolved and the instruction graduates, the physical register is renamed as a logical register. In this example, lines 1 and 2 load values into a pair of registers, R1 and R2. Line 3 adds these values together and stores the result in a third register, R3. Line 4 compares the result to zero; if it is less than or equal to zero, the program continues to line 5, which loads a value into R1. If the comparison in line 4 is greater than zero, the program branches to line 9, which adds the values in R1 and R3 and stores the result in R5. Note that execution hinges on several dependencies: Line 3 cannot complete until lines 1 and 2 complete; line 4 cannot complete until after line 3; and line 9 must await completion of lines 1 and 3. Lines 1 and 2 can complete out of order, but they will always graduate in order. Line 5 or 9 could be executed speculatively before the outcome of the branch in line 4 is known. If the branch is validated, the instructions in lines 1 to 4 will graduate, physical register P36 will be renamed as logical register R1, the instruction in line 50 will not graduate, and physical register P49 will be removed from the active list and returned to the free list.

Tom R. Halfhill is a BYTE senior editor. You can reach him on the Internet or BIX at thalfhill@bix.com.


Letters / January 1995

Developing Hardware-Independent Software

I have been following the discussions in BYTE on new microprocessors and systems, and I keep reading that "the code has to be recompiled to take advantage of the CPU." When I read about VLIW's (very large instruction word's) need for smart compilers, I wondered why I hadn't seen any mention of reassemblers. By reassemblers I mean programs that take a binary and generate a new binary optimized for another CPU. Here you have the K5 reconstructing CISC into RISC on the fly; you have processors spending huge numbers of transistors trying to execute code out of order to take advantage of superscalar pipelining; and you have PowerPC emulating binaries. These all seem to lead to an obvious approach of spending the time once in software to determine what one binary is doing and then to generate a new binary that takes advantage of another CPU's strengths.

Alan P. W. Hewett
Mt. Vernon, OH
apwh@cbpine.att.com

The actual term for your reassembler is binary translation, and the short answer is that it does exist. Echo Logic (Holmdel, NJ) has a technology called FlashPort that can translate among several different binary formats, including 680x0 to PowerPC and x86. DEC also has a translator that moves VAX binaries to Alpha. But binary translation is not an answer to the questions of RISC vs. CISC, optimized compilation, or VLIW. Also, there are some unanswered legal questions if you translate a binary without authorization from the original software developer.

I think there are other approaches yet to be explored that address these problems. For example, what if software was delivered in some form of pseudocode that could be compiled and optimized for a specific computer as part of the installation process? You could buy an application in a semicompiled format, and the installer would automatically compile and optimize the code. If you later upgrade to a completely different platform, the software would get recompiled again on installation, using a plug-in installation compiler. This arrangement would insulate the software from details of the hardware and still achieve optimum performance and compatibility without emulation. — Tom Halfhill

Copyright 1994-1998 BYTE

Return to Tom's BYTE index page