Core Technologies / July 1996

x86 Enters the Multimedia Era

Intel's new MMX instructions bring faster multimedia processing to x86-compatible CPUs.

Tom R. Halfhill

MMX is the most significant revision of the x86 architecture since Intel introduced the 32-bit 386 chip in 1985. Programmers get eight new registers and 57 new instructions that are optimized for multimedia tasks. Users will get better performance with video, graphics, animation, and sound. Yet the new MMX-enabled CPUs will be compatible with existing x86 software and should cost about the same as regular x86 processors without MMX technology.

Intel says it will ship the first MMX Pentium (code-named the P55C) in the fourth quarter. Next year, Intel plans to integrate MMX into all its new x86 chips, including the sixth-generation Pentium Pro. By 1998, MMX will probably be as integral to the x86 architecture as the extended 32-bit instructions that Intel added to the 386 more than a decade ago.

Other x86 vendors are adopting MMX, too. Thanks to a recent cross-licensing agreement, future x86 processors from Advanced Micro Devices and AMD's subsidiary, NexGen, should be compatible with Intel's P55C. Cyrix, another x86 vendor, has not yet licensed MMX. However, Cyrix maintains that its future CPUs will be fully compatible with MMX, either by licensing or reverse-engineering the Intel technology.

Numerous software companies have announced support for MMX in upcoming versions of their products. These include key development tools such as Microsoft Visual C++, Watcom C++, Macromedia Director, and Criterion RenderWare. Microsoft says it will support MMX in its new Direct3D and ActiveMovie APIs.

Inside MMX

Adding new instructions to a microprocessor is easy: Define the new opcodes and add the necessary logic. But adding new instructions without disrupting software compatibility is another matter. It's a particular challenge with the x86 because backward compatibility isn't just advisable; it's mandatory.

Intel mapped the eight new MMX registers onto the existing stack of floating-point (FP) registers. There are eight general-purpose FP registers in an x86 FPU, and each one is 80 bits wide. FP values use 64 bits for the mantissa and 16 bits for the sign and exponent. MMX instructions use those 80-bit registers as a random-access file (not a push-pop stack) of eight 64-bit registers. In other words, MMX instructions use only the 64-bit mantissa portion of an FP register to store MMX operands.
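The aliasing can be pictured with a small toy model. The Python sketch below is illustrative only: the class and method names are hypothetical, registers are plain integers, and what the hardware does with the upper 16 bits during an MMX write is not described in this article, so the sketch simply leaves them untouched.

```python
MANTISSA_BITS = 64
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1


class FPRegisterFile:
    """Toy model of eight 80-bit FP registers, each held as a Python int."""

    def __init__(self):
        self.regs = [0] * 8

    def write_mmx(self, index, value64):
        # MMX view: address any register directly by index (random access),
        # and replace only the low 64 bits (the mantissa field).
        # Assumption: the upper 16 bits are left as they were.
        upper = self.regs[index] & ~MANTISSA_MASK
        self.regs[index] = upper | (value64 & MANTISSA_MASK)

    def read_mmx(self, index):
        # MMX view: only the 64-bit mantissa portion is visible.
        return self.regs[index] & MANTISSA_MASK


rf = FPRegisterFile()
rf.regs[3] = (0xABCD << 64) | 0x1111        # pretend an FP value lives here
rf.write_mmx(3, 0x0123456789ABCDEF)         # MMX write overwrites the mantissa
print(hex(rf.read_mmx(3)))
```

Because both views share the same storage, the FP value that was in register 3 is gone after the MMX write, which is the reason the two kinds of instructions cannot be freely interleaved.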

This trick gives programmers the virtual equivalent of eight new registers without radically altering the standard x86 architecture. OS vendors don't have to modify their code to save the state of MMX registers during context switches — MMX registers look like ordinary FP registers to the OS. Clever, eh?

But there's a catch. Programmers can use MMX and FP instructions in the same program, but they'd better not mix them because both kinds of instructions need the same registers. When a program finishes a sequence of MMX instructions, it must clear the registers with a new instruction (EMMS: Empty MMX state) to make way for subsequent FP instructions. FP instructions do likewise when they pop values off the FP stack and set the registers' tag bits. If a program mixes FP and MMX instructions, it will pay a performance penalty for these register-level "context switches."

Generally, though, it shouldn't be a problem. Developers of multimedia software should segregate MMX instructions in a subroutine or library that's called only after using the CPUID instruction to verify that the chip supports MMX. It makes sense to group MMX instructions into tight routines, anyway, because multimedia processing typically involves repetitive operations on long sequences of data.
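The detect-then-dispatch pattern might look like the following Python sketch. Python cannot execute CPUID itself, so the feature-flags word is passed in as a parameter. The function names are hypothetical, and the "MMX" routine is only a stand-in that computes the same result as the generic one. The bit position (bit 23 of the EDX feature flags returned by CPUID function 1) comes from Intel's documentation rather than from this article.

```python
MMX_FEATURE_BIT = 23  # bit in EDX after CPUID with EAX=1 (per Intel's documentation)


def supports_mmx(edx_feature_flags):
    """Return True if the MMX bit is set in the CPUID feature flags."""
    return bool((edx_feature_flags >> MMX_FEATURE_BIT) & 1)


def scale_generic(samples):
    # Fallback path: one value at a time.
    return [s * 2 for s in samples]


def scale_mmx_stub(samples):
    # Stand-in for a routine that would be built from MMX instructions.
    return [s * 2 for s in samples]


def select_scaler(edx_feature_flags):
    """Probe once, then hand back the routine to use for all later calls."""
    if supports_mmx(edx_feature_flags):
        return scale_mmx_stub
    return scale_generic


scaler = select_scaler(1 << MMX_FEATURE_BIT)
print(scaler.__name__, scaler([1, 2, 3]))
```

Doing the probe once and keeping a reference to the chosen routine keeps the check out of the inner loop.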

Packed Operands

Even though MMX instructions use FP registers, they're all integer-type instructions. Their 64-bit operands may contain eight packed bytes, four packed 16-bit words, two packed 32-bit doublewords, or a single 64-bit quadword.
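To make the packed layouts concrete, here is a minimal Python sketch that treats a 64-bit operand as an integer and splits it into lanes. The helper names `pack` and `unpack` are hypothetical, and placing element 0 in the lowest-order bits is an assumption made for the illustration.

```python
def pack(values, width_bits):
    """Pack 64 // width_bits unsigned values into one 64-bit integer."""
    count = 64 // width_bits
    if len(values) != count:
        raise ValueError(f"expected {count} values of {width_bits} bits")
    mask = (1 << width_bits) - 1
    operand = 0
    for i, v in enumerate(values):
        operand |= (v & mask) << (i * width_bits)  # element 0 in the low bits
    return operand


def unpack(operand, width_bits):
    """Split a 64-bit integer into 64 // width_bits unsigned lanes."""
    mask = (1 << width_bits) - 1
    return [(operand >> (i * width_bits)) & mask for i in range(64 // width_bits)]


q = pack([1, 2, 3, 4, 5, 6, 7, 8], 8)       # eight packed bytes
print(hex(q))
print([hex(w) for w in unpack(q, 16)])      # same bits viewed as four words
print([hex(d) for d in unpack(q, 32)])      # ... or as two doublewords
```

The same 64 bits can be read as bytes, words, or doublewords; the instruction chosen determines which lane width applies.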

Potentially, an MMX instruction could manipulate an 80-bit packed operand if it used a whole FP register. But Intel limited the operands to 64 bits because they match the Pentium's 64-bit data bus and internal data paths. Also, 80 isn't a power of 2, so it's more troublesome to handle.

As it is, the 64-bit operands are plenty long enough for typical multimedia jobs. Suppose a program is manipulating graphics in 8-bit color, which is often the case in games. An MMX instruction can pack eight pixels into a single operand and process them all at once. An ordinary x86 CPU can shuffle only one pixel at a time. Audio and communications programs often use 16-bit data types, so a single MMX instruction can process four of those values in a single chunk.
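As an illustration of the eight-pixels-at-once idea, the Python sketch below models two byte-wide packed adds on lists of eight values: one that wraps around and one that saturates, in the spirit of PADDB and PADDUSB. It only mimics the arithmetic lane by lane; the function names and pixel values are made up for the example.

```python
def paddb(a, b):
    """Lane-wise unsigned byte add with wraparound (models PADDB)."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def paddusb(a, b):
    """Lane-wise unsigned byte add with saturation at 255 (models PADDUSB)."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


pixels = [10, 100, 200, 250, 0, 255, 128, 64]   # eight 8-bit pixel values
brighten = [60] * 8                             # add 60 to every pixel

print(paddb(pixels, brighten))    # bright pixels wrap around to dark values
print(paddusb(pixels, brighten))  # bright pixels clamp at 255
```

For brightening an image, saturation is usually the behavior a programmer wants, since wraparound turns near-white pixels into near-black ones.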

Most MMX instructions follow this pattern of performing a single operation on a series of integer values. This technique is called single instruction, multiple data (SIMD), and it lends itself to the algorithms and data types frequently found in multimedia software. Examples include MPEG compression, wavelet compression, motion compensation, motion estimation, color space conversion, texture mapping, 2-D filtering, matrix multiplication, fast Fourier transforms, discrete cosine transforms, and phoneme matching.

Something else these processes have in common is a lot of potential parallelism. It's no coincidence that MMX instructions are integer operations; they're designed to exploit these characteristics. Like most other integer operations in a modern x86, the majority of MMX instructions can execute in a single cycle. MMX multiplication instructions require three cycles to execute, but the CPU can issue a new one every cycle.
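The difference between a three-cycle latency and a one-per-cycle issue rate can be worked through with simple arithmetic. The Python sketch below is an idealized model that assumes every multiply is independent of the others and ignores all other stalls; the function names are hypothetical.

```python
def pipelined_cycles(count, latency=3, issue_interval=1):
    """Cycles to finish `count` independent operations on a pipelined unit."""
    if count == 0:
        return 0
    # The first result appears after `latency` cycles; each later operation
    # completes `issue_interval` cycles after the one before it.
    return latency + (count - 1) * issue_interval


def unpipelined_cycles(count, latency=3):
    """Cycles if each operation had to finish before the next could start."""
    return count * latency


n = 100
print(pipelined_cycles(n), "cycles pipelined")
print(unpipelined_cycles(n), "cycles unpipelined")
```

For a long run of independent multiplies, the cost per operation approaches one cycle even though each individual multiply still takes three.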

Therefore, a superscalar CPU like the Pentium can execute multiple streams of MMX instructions in its parallel integer pipelines. An out-of-order CPU like the Pentium Pro can rearrange MMX instructions for maximum efficiency. The CPU doesn't need a special multimedia execution unit for MMX, so any advances that improve integer performance will benefit MMX performance as well.

One thing you won't find in the MMX instruction set is branch instructions. Branches would disrupt the instruction flow, and mispredicted branches would stall the pipelines — a particular hazard in the superpipelined Pentium Pro. Instead, there are new conditional-select instructions that perform logical operations on multiple operands. By using masks and bitwise comparisons, these instructions can achieve the same results as branches without the delays.
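A chroma-key overlay shows how a compare plus a few logical operations can stand in for a branch. The Python sketch below models the lane-wise behavior of PCMPEQB, PAND, PANDN, and POR on lists of bytes. The function names, pixel values, and key color are illustrative assumptions, and the conditional expression inside `pcmpeqb` exists only to emulate the mask that the hardware compare would produce directly.

```python
def pcmpeqb(a, b):
    """Lane-wise compare: 0xFF where bytes are equal, 0x00 elsewhere."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pand(a, b):
    return [x & y for x, y in zip(a, b)]


def pandn(a, b):
    """Lane-wise (NOT a) AND b."""
    return [(~x & 0xFF) & y for x, y in zip(a, b)]


def por(a, b):
    return [x | y for x, y in zip(a, b)]


def chroma_key(foreground, background, key):
    """Replace every foreground pixel equal to `key` with the background pixel."""
    mask = pcmpeqb(foreground, [key] * len(foreground))
    from_background = pand(mask, background)    # background where mask is set
    from_foreground = pandn(mask, foreground)   # foreground where mask is clear
    return por(from_background, from_foreground)


fg = [7, 7, 50, 60, 7, 80, 90, 7]   # 7 plays the role of the key color
bg = [1, 2, 3, 4, 5, 6, 7, 8]
print(chroma_key(fg, bg, key=7))
```

Every lane goes through the same four operations regardless of its value, so there is no per-pixel decision for the processor to predict.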

On balance, it appears that Intel has achieved its goal of updating the x86 to meet the demands of modern software without jeopardizing compatibility. Intel could have squeezed out more performance by making more radical changes — for example, by adding new MMX-specific registers instead of aliasing the FP stack — but such changes would slow down the adoption of MMX. The last time Intel extensively revised the x86 architecture was 11 years ago, and most PC users are only now making the transition to 32-bit software. Intel wants MMX to catch on a little faster.

Where to Find

Intel
Santa Clara, CA
Phone: (408) 765-8080
Internet: http://www.intel.com

What MMX Adds to Intel Instructions

Opcode Type	Mnemonic	Description
Arithmetic	PADD [B, W, D]	Packed add with wraparound on [byte, word, doubleword]
	PADDS [B, W]	Packed add signed with saturation on [byte, word]
	PADDUS [B, W]	Packed add unsigned with saturation on [byte, word]
	PSUB [B, W, D]	Packed subtract with wraparound on [byte, word, doubleword]
	PSUBS [B, W]	Packed subtract signed with saturation on [byte, word]
	PSUBUS [B, W]	Packed subtract unsigned with saturation on [byte, word]
	PMULHW	Packed multiply high on words
	PMULLW	Packed multiply low on words
	PMADDWD	Packed multiply on words and add resulting pairs
Comparison	PCMPEQ [B, W, D]	Packed compare for equality [byte, word, doubleword]
	PCMPGT [B, W, D]	Packed compare greater than [byte, word, doubleword]
Conversion	PACKUSWB	Pack words into bytes (unsigned saturation)
	PACKSS [WB, DW]	Pack [words into bytes, doublewords into words] signed with saturation
	PUNPCKH [BW, WD, DQ]	Unpack high-order [bytes, words, doublewords]
	PUNPCKL [BW, WD, DQ]	Unpack low-order [bytes, words, doublewords] from MMX register
Logical	PAND	Packed bitwise AND
	PANDN	Packed bitwise AND NOT
	POR	Packed bitwise OR
	PXOR	Packed bitwise XOR
Shift	PSLL [W, D, Q]	Packed shift left logical [word, doubleword, quadword] by MMX register or immediate value
	PSRL [W, D, Q]	Packed shift right logical [word, doubleword, quadword] by MMX register or immediate value
	PSRA [W, D]	Packed shift right arithmetic [word, doubleword] by MMX register or immediate value
Data transfer	MOV [D, Q]	Move [doubleword, quadword] to or from MMX register
FP/MMX state	EMMS	Empty MMX state
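
Of the arithmetic entries in the table, PMADDWD is the least self-explanatory. The Python sketch below models its behavior as Intel documents it: four signed 16-bit words are multiplied pairwise, and adjacent products are summed into two signed 32-bit results. The helper names are hypothetical, and the inputs are assumed to be already in signed 16-bit range.

```python
def wrap_signed32(value):
    """Wrap an integer into the signed 32-bit range."""
    return ((value + 2**31) % 2**32) - 2**31


def pmaddwd(a, b):
    """Multiply four signed 16-bit lanes, then add adjacent products.

    a, b: lists of four ints, each assumed to be in -32768..32767.
    Returns two signed 32-bit sums.
    """
    products = [x * y for x, y in zip(a, b)]
    return [
        wrap_signed32(products[0] + products[1]),
        wrap_signed32(products[2] + products[3]),
    ]


samples = [1, 2, 3, 4]
coeffs = [5, 6, 7, 8]
partial = pmaddwd(samples, coeffs)
print(partial)        # two partial sums
print(sum(partial))   # full four-element dot product
```

One such operation does most of the work of a four-element dot product, which is the inner step of filters and matrix multiplication.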
