Cover Story (sidebar) / June 1994

NexGen Nx586 Straddles the RISC/CISC Divide

Bob Ryan

First and foremost, the NexGen Nx586 is an 80x86 architecture processor; thus, it supports all 80x86 instructions and programmer-visible registers. Programs run on the Nx586 behave just as they do on an Intel 80x86 processor such as the 486 or the Pentium.

What differentiates the Nx586 is its microarchitecture. The instructions that are fetched from memory are standard 80x86 instructions, but the instructions executed in the processing pipelines are RISC-like translations of the CISC 80x86 instructions. NexGen calls them RISC86 instructions and uses them to give the Nx586 Pentium-class performance.

Every processor designer must weigh different trade-offs in deciding what functions to incorporate on-chip. NexGen's decisions regarding the Nx586 differ significantly from Intel's decisions with the Pentium. The Nx586 contains three independent execution units (two integer units and one address unit), but it does not contain an FPU. NexGen notes that everyone, including Intel, acknowledges that almost all 80x86 code is integer. Thus, rather than an FPU, NexGen devoted die space to split instruction and data caches that are twice as big as the Pentium's and to an integrated level 2 (or L2) cache controller. The Nx587, a companion to the Nx586, provides hardware floating-point support on a separate chip.

The 32 KB of cache memory on-chip is divided into a 16-KB instruction cache and a 16-KB data cache. Both caches are four-way set-associative. In addition, the L2 cache controller supports a four-way organization. This relatively high level of set-associativity results in a higher hit rate on cache accesses. Unlike the primary caches, the secondary cache is unified.
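To make the four-way organization concrete, here is a minimal Python sketch of how a 16-KB, four-way set-associative cache might map an address to a set and look it up. The 16-KB size and the four ways come from the text above. The 32-byte line size and the least-recently-used replacement policy are assumptions made only for illustration, since NexGen's actual values are not given here, and the names (`split_address`, `SetAssociativeCache`) are hypothetical.

```python
from collections import OrderedDict

CACHE_BYTES = 16 * 1024   # from the article: each primary cache is 16 KB
WAYS = 4                  # from the article: four-way set-associative
LINE_BYTES = 32           # ASSUMPTION: the article does not state the line size
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 128 sets under that assumption


def split_address(addr):
    """Return (tag, set_index, byte_offset) for a byte address."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset


class SetAssociativeCache:
    """Toy model: each set holds up to WAYS tags, evicted in LRU order.

    ASSUMPTION: LRU replacement is chosen for illustration only.
    """

    def __init__(self):
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, set_index, _ = split_address(addr)
        ways = self.sets[set_index]
        if tag in ways:
            ways.move_to_end(tag)      # mark as most recently used
            return True
        if len(ways) == WAYS:
            ways.popitem(last=False)   # evict the least recently used tag
        ways[tag] = True
        return False
```

Under these assumptions, addresses 4,096 bytes apart fall into the same set. Four such lines can coexist in a four-way cache, whereas in a direct-mapped cache of the same size each would evict the last, which is the intuition behind the higher hit rate.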

The L2 cache controller communicates with the off-chip L2 cache via a dedicated L2 bus. This eliminates conflicts with the external address and data buses. The L2 cache is writeback, while the L1 caches are write-through. The write-through organization of the primary caches lets accesses to the primary and secondary caches occur in parallel. In the event of a miss in a primary cache, much of the secondary cache access has already completed in parallel.

Supporting an external bus, a separate L2 cache interface, and a dedicated FPU interface means that the Nx586 isn't pin-compatible with the Pentium. In fact, it requires a far bigger package — 463 pins versus 296 for the P54C Pentium. In addition, the Nx586's external bus is incompatible with that of the Pentium and 486. The Nx586 requires dedicated logic to interface with standard AT systems logic. NexGen currently supplies a chip to support VL-Bus, with PCI (Peripheral Component Interconnect) support due later this year.

CISC into RISC

During processing, the Nx586 fetches CISC instructions from the instruction cache and stores them in its prefetch buffer. The buffer is divided into three parts, letting the Nx586 manage three different instruction streams at once. This helps keep the execution pipelines filled when the processor is executing instructions speculatively.

From prefetch, instructions move into the decoder/scheduler, where on every cycle, one CISC instruction is translated into one or more RISC86 instructions. Unlike CISC instructions, the RISC86 instructions implement a load/store memory-access model. Also, they are fixed-length instructions, as opposed to variable-length CISC instructions. However, they are significantly longer than standard 32-bit RISC instructions.
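The translation step can be pictured with a small Python sketch. It is not NexGen's actual RISC86 encoding, which is not described here. The operation names (`load`, `store`), the temporary registers (`tmp0`, `tmp1`), and the `translate` function are all hypothetical. The sketch shows only the general idea: an 80x86 instruction with a memory operand splits into separate load, compute, and store operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """One RISC-like operation (hypothetical format, not the real RISC86)."""
    unit: str     # "address" for loads/stores, "integer" for ALU work
    op: str
    dest: str
    srcs: tuple


def is_mem(operand):
    """Memory operands are written in brackets, e.g. '[ebx]'."""
    return operand.startswith("[")


def translate(opcode, dest, src):
    """Split a two-operand 80x86-style instruction into load/store-model ops."""
    ops = []
    if is_mem(src):
        # Memory source: load it into a temporary first.
        ops.append(MicroOp("address", "load", "tmp0", (src,)))
        src = "tmp0"
    if is_mem(dest):
        # Memory destination: read, modify, then write back.
        ops.append(MicroOp("address", "load", "tmp1", (dest,)))
        ops.append(MicroOp("integer", opcode, "tmp1", ("tmp1", src)))
        ops.append(MicroOp("address", "store", dest, ("tmp1",)))
    else:
        ops.append(MicroOp("integer", opcode, dest, (dest, src)))
    return ops
```

In this model, `translate("add", "eax", "ebx")` yields a single integer operation, while `translate("add", "[ebx]", "eax")` yields three: a load, an add, and a store.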

In fact, because they aren't designed to reside in memory, RISC86 instructions bear a strong resemblance to microcode. The major difference is that they are not as in tune with the hardware as is microcode. They are flexible enough to work without modification in both the Nx586 and in future versions of the microarchitecture that might contain a different mix of functional units. You might consider them "microarchitecture instructions."

The decode process translates one CISC instruction per clock cycle and dispatches the resulting RISC86 instructions, one or more per clock cycle, to the three execution units (four, if you have added an FPU to your machine). Therefore, while the Nx586 is a scalar processor from the CISC point of view, it is superscalar on the RISC side. The main limitation to instruction issue is that no more than one RISC86 instruction can issue to a particular execution unit per cycle.
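The issue limitation lends itself to a short Python sketch. It models only the one rule stated above: at most one RISC86 instruction per execution unit per cycle, issued in program order. It ignores everything else the real decoder/scheduler does, including the one-CISC-instruction-per-cycle decode rate. The unit names (`address`, `int0`, `int1`) and the `schedule_issue` function are hypothetical.

```python
def schedule_issue(units):
    """Group in-order operations into issue cycles.

    `units` lists the target execution unit of each operation in program
    order. Operations issue in order, and a new cycle begins as soon as the
    next operation needs a unit already claimed in the current cycle.
    Returns a list of cycles, each a list of operation indices.
    """
    cycles = []
    current, busy = [], set()
    for index, unit in enumerate(units):
        if unit in busy:
            # Unit already taken this cycle: close the cycle, start another.
            cycles.append(current)
            current, busy = [], set()
        current.append(index)
        busy.add(unit)
    if current:
        cycles.append(current)
    return cycles
```

For the stream `["address", "int0", "address", "int1", "int0"]`, the sketch issues operations 0 and 1 in the first cycle and operations 2, 3, and 4 in the second, because the second address operation must wait for the address unit.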

The three execution units are different. One handles the generation of addresses for load/stores, while the other two handle integer instructions. One of the integer units has integral integer multiply and divide hardware, while the other can handle only simpler integer instructions.

Each execution unit, including the FPU, is fronted with a 14-entry instruction queue. All instructions must spend at least one cycle in the queue, even if the execution unit is not busy. Because of the variable length of time an instruction can wait in a queue, the pipelines themselves have an indeterminate depth. The shortest time from fetch to retirement is seven cycles.

Minimizing Penalties

Because the pipelines can get pretty long, the Nx586 devotes a lot of resources to minimizing conditions that can stall or flush them. It uses dynamic branch prediction and speculative execution to let execution continue before it knows the results of a conditional branch. It also uses register renaming and data forwarding to handle data dependencies within the pipelines.

The Nx586 uses 14 rename registers as destinations for writes to the 80x86 architectural registers, including the eight general-purpose registers. When a RISC86 instruction issues from the decoder/scheduler, it is assigned any required rename registers. When it is finally retired, it is allowed to update an architectural register.

The Nx586 fetches, decodes, and translates 80x86 instructions in order and issues RISC86 instructions in order. Instructions can execute and terminate out of order, but they are retired in order. The decoder/scheduler tracks instructions to ensure that they update architectural registers and trigger exceptions in program order.
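The interplay of renaming and in-order retirement can be sketched in Python. The count of 14 rename registers comes from the text. Everything else is a simplified, hypothetical model rather than NexGen's design: the `RetirementTracker` class, its method names, and the policy of stalling when no rename register is free.

```python
from collections import deque


class RetirementTracker:
    """Toy model: results complete out of order but retire in program order."""

    def __init__(self, rename_regs=14):       # 14 is the figure in the article
        self.free = deque(range(rename_regs))
        self.in_flight = deque()               # entries kept in program order
        self.arch = {}                         # architectural register file

    def issue(self, arch_reg):
        """Assign a rename register; return None if none is free (a stall)."""
        if not self.free:
            return None
        entry = {"arch": arch_reg, "rename": self.free.popleft(),
                 "value": None, "done": False}
        self.in_flight.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result in the rename register; may happen out of order."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Retire finished entries from the head only, updating arch state."""
        retired = []
        while self.in_flight and self.in_flight[0]["done"]:
            entry = self.in_flight.popleft()
            self.arch[entry["arch"]] = entry["value"]
            self.free.append(entry["rename"])
            retired.append(entry["arch"])
        return retired
```

If a write to `ebx` finishes before an earlier write to `eax`, `retire()` returns nothing until `eax` completes. Both then retire together in program order, so an exception raised by the older instruction would never see `ebx` already updated.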

The Nx586 is a fascinating design. It demonstrates that an 80x86 chip's microarchitecture need not resemble Intel's offerings. On a range of common benchmarks, it performed comparably to the Pentium. It might even presage some ideas that Intel has planned for the P6 and P7.

As a commercial product, however, the Nx586 has a way to go. It has garnered support from seven motherboard makers and four third-tier system vendors, but as of this writing, the lack of a definite fabrication agreement is troubling. The chip is priced competitively: $506 per chip in lots of one thousand for the 66-MHz version, versus $750 per chip in lots of one thousand for the comparable Pentium (the Nx587 costs $128). Its functional mix should also result in lower system prices. Overall, the Nx586 shows that NexGen can compete with Intel on the design side, but how NexGen plans to combat Intel's manufacturing muscle is still a mystery.

Illustration: Nx586 Microarchitecture. The Nx586 combines a CISC front end with a RISC-like processing core. It employs dynamic branch prediction and speculative execution to keep its deep pipelines busy.

Bob Ryan is a BYTE technical editor. You can reach him on the Internet or BIX at b.ryan@bix.com.

Copyright 1994-1998 BYTE
