Core Technologies / July 1996

x86 Enters the Multimedia Era

Intel's new MMX instructions bring faster
multimedia processing to x86-compatible CPUs.

Tom R. Halfhill

MMX is the most significant revision of the x86 architecture since Intel introduced the 32-bit 386 chip in 1985. Programmers get eight new registers and 57 new instructions that are optimized for multimedia tasks. Users will get better performance with video, graphics, animation, and sound. Yet the new MMX-enabled CPUs will be compatible with existing x86 software and should cost about the same as regular x86 processors without MMX technology.

Intel says it will ship the first MMX Pentium (code-named the P55C) in the fourth quarter. Next year, Intel plans to integrate MMX into all its new x86 chips, including the sixth-generation Pentium Pro. By 1998, MMX will probably be as integral to the x86 architecture as the extended 32-bit instructions that Intel added to the 386 more than a decade ago.

Other x86 vendors are adopting MMX, too. Thanks to a recent cross-licensing agreement, future x86 processors from Advanced Micro Devices and AMD's subsidiary, NexGen, should be compatible with Intel's P55C. Cyrix, another x86 vendor, has not yet licensed MMX. However, Cyrix maintains that its future CPUs will be fully compatible with MMX, either by licensing or reverse-engineering the Intel technology.

Numerous software companies have announced support for MMX in upcoming versions of their products. These include key development tools such as Microsoft Visual C++, Watcom C++, Macromedia Director, and Criterion RenderWare. Microsoft says it will support MMX in its new Direct3D and ActiveMovie APIs.

Inside MMX

Adding new instructions to a microprocessor is easy: Define the new opcodes and add the necessary logic. But adding new instructions without disrupting software compatibility is another matter. It's a particular challenge with the x86 because backward compatibility isn't just advisable; it's mandatory.

Intel mapped the eight new MMX registers into the existing stack of floating-point (FP) registers. There are eight general-purpose FP registers in an x86 FPU, and each one is 80 bits wide. FP values use 64 bits for the mantissa and 16 bits for the sign and exponent. MMX instructions use those 80-bit registers as a random-access file (not a push-pull stack) of eight 64-bit registers. In other words, MMX instructions use only the 64-bit mantissa portion of an FP register to store MMX operands.
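The aliasing can be pictured with a small C model. This is only an illustrative sketch, not Intel's implementation: the names fp_reg, fp_file, mmx_write, and mmx_read are hypothetical, and the model captures just one idea, that an MMX operand lives in the 64-bit mantissa field of an 80-bit FP register and is addressed by register number rather than by stack position.

```c
#include <stdint.h>

/* Hypothetical model of one 80-bit x87 register: a 64-bit mantissa
   plus 16 further bits that hold the sign and exponent. */
typedef struct {
    uint64_t mantissa;      /* the 64 bits an MMX instruction sees */
    uint16_t sign_exponent; /* not usable as MMX data; this model
                               simply leaves these bits untouched */
} fp_reg;

/* The eight physical registers shared by FP and MMX code. */
typedef struct {
    fp_reg r[8];
} fp_file;

/* MMX view: random access by register number (0 through 7). */
static void mmx_write(fp_file *f, unsigned n, uint64_t value)
{
    f->r[n & 7].mantissa = value;
}

static uint64_t mmx_read(const fp_file *f, unsigned n)
{
    return f->r[n & 7].mantissa;
}
```

Because both views share the same storage, anything an MMX routine writes through mmx_write replaces whatever FP mantissa was sitting in that register.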

This trick gives programmers the virtual equivalent of eight new registers without radically altering the standard x86 architecture. OS vendors don't have to modify their code to save the state of MMX registers during context switches — MMX registers look like ordinary FP registers to the OS. Clever, eh?

But there's a catch. Programmers can use MMX and FP instructions in the same program, but they'd better not mix them because both kinds of instructions need the same registers. When a program finishes a sequence of MMX instructions, it must clear the registers with a new instruction (EMMS: Empty MMX state) to make way for subsequent FP instructions. FP instructions do likewise when they pop values off the FP stack and set the registers' tag bits. If a program mixes FP and MMX instructions, it will pay a performance penalty for these register-level "context switches."
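The bookkeeping behind that rule can be sketched as a toy state machine. The names below (reg_tags, after_mmx_instruction, emms, fp_can_push) are hypothetical, and the model assumes the simplest reading of the rule: MMX code leaves every shared register marked as occupied, and only EMMS marks them empty again so that FP code can push values.

```c
#include <stdbool.h>

/* Hypothetical model of the tag bits on the eight shared registers. */
typedef struct {
    bool in_use[8];
} reg_tags;

/* Assumption of this model: after MMX code runs, all eight
   registers are marked as holding valid data. */
static void after_mmx_instruction(reg_tags *t)
{
    for (int i = 0; i < 8; i++)
        t->in_use[i] = true;
}

/* EMMS marks every register empty, making way for FP instructions. */
static void emms(reg_tags *t)
{
    for (int i = 0; i < 8; i++)
        t->in_use[i] = false;
}

/* FP code can push a value only into a register marked empty. */
static bool fp_can_push(const reg_tags *t, unsigned slot)
{
    return !t->in_use[slot & 7];
}
```

In this model, an FP push attempted between after_mmx_instruction and emms finds no empty register, which is why the EMMS at the end of an MMX sequence is not optional.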

Generally, though, it shouldn't be a problem. Multimedia developers should segregate MMX instructions in a subroutine or library that's called only after probing the chip's CPU_ID to verify that it supports MMX. It makes sense to group MMX instructions into tight routines, anyway, because multimedia processing typically involves repetitive operations on long sequences of data.
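A probe of this kind might look like the following sketch. It assumes a GCC- or Clang-style compiler, whose cpuid.h header provides __get_cpuid, and it relies on the CPUID instruction's standard feature flags, where bit 23 of EDX (function 1) reports MMX. The helper names edx_reports_mmx and cpu_has_mmx are made up for illustration; on other compilers or non-x86 targets the probe is compiled out and simply reports no MMX.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* CPUID function 1 returns feature flags in EDX; bit 23 reports MMX. */
#define CPUID_EDX_MMX (UINT32_C(1) << 23)

static int edx_reports_mmx(uint32_t edx)
{
    return (edx & CPUID_EDX_MMX) != 0;
}

/* Returns 1 if the running CPU reports MMX, 0 otherwise (including
   builds where the CPUID probe is not available). */
static int cpu_has_mmx(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return edx_reports_mmx(edx);
#endif
    return 0;
}
```

A program would call cpu_has_mmx once at startup and then dispatch to either its MMX routines or an ordinary x86 fallback, keeping the two code paths in separate functions.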

Packed Operands

Even though MMX instructions use FP registers, they're all integer-type instructions. Their 64-bit operands may contain eight packed bytes, four packed 16-bit words, two packed 32-bit doublewords, or a single 64-bit quadword.
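In C terms, the four packed formats are four ways of looking at the same 64 bits. The union below is a minimal sketch of that idea; the type name mmx_operand and the helper splat_byte are illustrative and not part of any Intel header.

```c
#include <stdint.h>

/* One 64-bit operand viewed in each of the four packed formats. */
typedef union {
    uint8_t  b[8];  /* eight packed bytes */
    uint16_t w[4];  /* four packed 16-bit words */
    uint32_t d[2];  /* two packed 32-bit doublewords */
    uint64_t q;     /* one 64-bit quadword */
} mmx_operand;

/* Fill all eight byte lanes with the same value. */
static mmx_operand splat_byte(uint8_t v)
{
    mmx_operand op;
    for (int i = 0; i < 8; i++)
        op.b[i] = v;
    return op;
}
```

Which view applies is decided by the instruction, not the data: the same register contents are treated as bytes by one opcode and as words by the next.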

Potentially, an MMX instruction could manipulate an 80-bit packed operand if it used a whole FP register. But Intel limited the operands to 64 bits because they match the Pentium's 64-bit data bus and internal data paths. Also, 80 isn't a power of 2, so it's more troublesome to handle.

As it is, the 64-bit operands are plenty long enough for typical multimedia jobs. Suppose a program is manipulating graphics in 8-bit color, which is often the case in games. An MMX instruction can pack eight pixels into a single operand and process them all at once. An ordinary x86 CPU can shuffle only one pixel at a time. Audio and communications programs often use 16-bit data types, so a single MMX instruction can process four of those values in a single chunk.
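To make the eight-at-a-time idea concrete, here is a plain C emulation of a packed unsigned saturating add on bytes (the PADDUSB operation). The loop does one lane at a time what the MMX instruction would do for all eight lanes at once. The brighten routine and its assumption that the pixels are 8-bit intensity samples are illustrative only, and for brevity the sketch assumes the pixel count is a multiple of eight.

```c
#include <stdint.h>

/* Scalar emulation of one PADDUSB: add eight unsigned bytes lane by
   lane, clamping each sum at 255 instead of wrapping around. */
static void paddusb_emulated(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* Brighten a row of 8-bit samples, eight per step.
   Assumes count is a multiple of 8. */
static void brighten(uint8_t *pixels, int count, uint8_t amount)
{
    uint8_t step[8];
    for (int i = 0; i < 8; i++)
        step[i] = amount;
    for (int i = 0; i < count; i += 8)
        paddusb_emulated(pixels + i, step);
}
```

On an MMX chip the inner loop of paddusb_emulated collapses into a single instruction, so the outer loop in brighten runs one-eighth as many add operations as a byte-at-a-time version.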

Most MMX instructions follow this pattern of performing a single operation on a series of integer values. This technique is called single instruction, multiple data (SIMD), and it lends itself to the algorithms and data types frequently found in multimedia software. Examples include MPEG compression, wavelet compression, motion compensation, motion estimation, color space conversion, texture mapping, 2-D filtering, matrix multiplication, fast Fourier transforms, discrete cosine transforms, and phoneme matching.

Something else these processes have in common is a lot of potential parallelism. It's no coincidence that MMX instructions are integer operations; they're designed to exploit these characteristics. Like most other integer operations in a modern x86, the majority of MMX instructions can execute in a single cycle. MMX multiplication instructions require three cycles to execute, but the CPU can issue a new one every cycle.

Therefore, a superscalar CPU like the Pentium can execute multiple streams of MMX instructions in its parallel integer pipelines. An out-of-order CPU like the Pentium Pro can rearrange MMX instructions for maximum efficiency. The CPU doesn't need a special multimedia execution unit for MMX, so any advances that improve integer performance will benefit MMX performance as well.

One thing you won't find in the MMX instruction set is branch instructions. Branches would disrupt the instruction flow, and mispredicted branches would stall the pipelines — a particular hazard in the superpipelined Pentium Pro. Instead, there are new conditional-select instructions that perform logical operations on multiple operands. By using masks and bitwise comparisons, these instructions can achieve the same results as branches without the delays.
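The mask technique can be sketched in portable C. The example below emulates a packed byte compare (PCMPEQB) and then uses AND, AND NOT, and OR to pick between two sources, using chroma keying as the job: wherever a foreground pixel matches the key color, the background pixel shows through. The function names are hypothetical, and the emulated compare uses an ordinary C loop and test only because C has no packed compare; the selection step itself contains no branch.

```c
#include <stdint.h>

/* Emulates PCMPEQB: each byte lane of the result is 0xFF where the
   corresponding bytes of a and b match, 0x00 where they differ. */
static uint64_t pcmpeqb_emulated(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint64_t byte_mask = UINT64_C(0xFF) << (8 * lane);
        if (((a ^ b) & byte_mask) == 0)
            mask |= byte_mask;
    }
    return mask;
}

/* Chroma key on eight 8-bit pixels: wherever the foreground pixel
   equals the key color, take the background pixel instead. */
static uint64_t chroma_key8(uint64_t fg, uint64_t bg, uint8_t key)
{
    /* Replicate the key color into every byte lane. */
    uint64_t key8 = UINT64_C(0x0101010101010101) * key;
    uint64_t mask = pcmpeqb_emulated(fg, key8);

    return (mask & bg)     /* like PAND: background where fg == key */
         | (~mask & fg);   /* like PANDN, then POR: foreground elsewhere */
}
```

Every pixel goes through the same four logical operations whether or not it matches the key, so there is nothing for a branch predictor to get wrong.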

On balance, it appears that Intel has achieved its goal of updating the x86 to meet the demands of modern software without jeopardizing compatibility. Intel could have squeezed out more performance by making more radical changes — for example, by adding new MMX-specific registers instead of aliasing the FP stack — but such changes would slow down the adoption of MMX. The last time Intel extensively revised the x86 architecture was 11 years ago, and most PC users are only now making the transition to 32-bit software. Intel wants MMX to catch on a little faster.

Where to Find

Intel
Santa Clara, CA

Phone: (408) 765-8080

Internet: http://www.intel.com



What MMX Adds to Intel Instructions

Opcode Type     Mnemonic                Description

Arithmetic      PADD [B, W, D]          Packed add with wraparound on [byte, word, doubleword]
                PADDS [B, W]            Packed add signed with saturation on [byte, word]
                PADDUS [B, W]           Packed add unsigned with saturation on [byte, word]
                PSUB [B, W, D]          Packed subtract with wraparound on [byte, word, doubleword]
                PSUBS [B, W]            Packed subtract signed with saturation on [byte, word]
                PSUBUS [B, W]           Packed subtract unsigned with saturation on [byte, word]
                PMULHW                  Packed multiply high on words
                PMULLW                  Packed multiply low on words
                PMADDWD                 Packed multiply on words and add resulting pairs

Comparison      PCMPEQ [B, W, D]        Packed compare for equality [byte, word, doubleword]
                PCMPGT [B, W, D]        Packed compare greater than [byte, word, doubleword]

Conversion      PACKUSWB                Pack words into bytes (unsigned saturation)
                PACKSS [WB, DW]         Pack [words into bytes, doublewords into words] signed with saturation
                PUNPCKH [BW, WD, DQ]    Unpack high-order [bytes, words, doublewords]
                PUNPCKL [BW, WD, DQ]    Unpack low-order [bytes, words, doublewords] from MMX register

Logical         PAND                    Packed bitwise AND
                PANDN                   Packed bitwise AND NOT
                POR                     Packed bitwise OR
                PXOR                    Packed bitwise XOR

Shift           PSLL [W, D, Q]          Packed shift left logical [word, doubleword, quadword] by MMX register or immediate value
                PSRL [W, D, Q]          Packed shift right logical [word, doubleword, quadword] by MMX register or immediate value
                PSRA [W, D]             Packed shift right arithmetic [word, doubleword] by MMX register or immediate value

Data transfer   MOV [D, Q]              Move [doubleword, quadword] to or from MMX register

FP/MMX state    EMMS                    Empty MMX state
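Three terms in the table are easier to see in code than in prose: wraparound, saturation, and the multiply-add of PMADDWD. The C functions below are single-lane emulations written for illustration, with hypothetical names; a real MMX instruction applies the same arithmetic to every lane of a 64-bit operand at once.

```c
#include <stdint.h>

/* One byte lane of PADDB: the sum wraps around modulo 256. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* One word lane of PADDSW: the sum is clamped to the signed
   16-bit range instead of wrapping. */
static int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* PMADDWD: multiply four pairs of signed 16-bit words, then add
   adjacent products to produce two 32-bit results. A 64-bit
   intermediate avoids signed overflow in C for the one input
   combination (all four words equal to -32768) whose sum does
   not fit in 32 bits. */
static void pmaddwd_emulated(int32_t out[2],
                             const int16_t a[4], const int16_t b[4])
{
    int64_t lo = (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1];
    int64_t hi = (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3];
    out[0] = (int32_t)lo;
    out[1] = (int32_t)hi;
}
```

Saturation is what keeps a too-bright pixel at white or a too-loud sample at full scale rather than flipping to the opposite extreme, and the multiply-add is the building block of the dot products found in filters and transforms.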



How MMX Does Chroma Keying Without Branching

[Figure: MMX chroma keying. Complex multimedia processing can be done without code branches by using MMX instructions.]


Tom R. Halfhill is a BYTE senior editor based in San Mateo, California.
You can reach him at thalfhill@bix.com.



Inbox / September 1996

Programming for MMX

Tom Halfhill's "x86 Enters the Multimedia Era" (July CPU column) was very informative. However, a problem he describes might not be real. On page 60 he says that "Programmers can use MMX and FP (floating-point) instructions in the same program, but they'd better not mix them because both kinds of instructions need the same registers"; and that multimedia developers "should segregate MMX instructions in a subroutine or library that's called only after probing the chip's CPU_ID to verify that it supports MMX." The first statement must have been directed to programmers writing compilers or x86-specific assembly-language (or machine-language) library routines. Programmers using a compiler or smart assembler that can generate code properly for an MMX-equipped CPU would not have to be aware of this: The tool would save and restore register contents properly to reuse the registers. Possibly, the programmer would only have to know to use the proper tool option for the target CPU. At run time, the library routines would do the CPU probing. I don't think one should assume a priori that a programmer would never want to execute a mixture of MMX and FP instructions on the same register data. Computing hardware-control-register mask bits might require it.

John Michael Williams, Ph.D.
Redwood City, CA

Unfortunately, no compilers currently available have high-level language support for MMX. To use MMX, programmers must embed in-line assembly code into their C or C++ source code. It's the programmer's responsibility to manage the x86 registers shared by MMX and FP instructions. If the programmer doesn't clear the MMX registers (by using the Empty MMX State instruction) before executing an FP operation, the results could be disastrous. Even if compilers do eventually provide high-level support for MMX, it would have to be an awfully smart compiler to reschedule FP instructions that a programmer carelessly mixed into an MMX routine. It's unlikely that MMX and FP instructions would need to operate on the same register data; they use the registers in different ways. FP instructions see a push-pull stack of 80-bit FP values; MMX instructions see a random-access file of 64-bit integer values. However, x86 programmers are infamous for their crazy coding, so I suppose anything's possible. — Tom R. Halfhill, senior editor

Inbox / October 1996

You Can Teach an Old Chip

Kudos (again) to Tom Halfhill for his detailed yet clear explanation of how the MMX instructions work with Intel-type CPUs ("x86 Enters the Multimedia Era," July). The article's nuts-and-bolts focus typifies why BYTE is a valuable resource for all of us.

First, if my browser requests <URL:http://www.some.company/directory>, an HTTP_error_302 is raised, resulting in an extra transaction between browser and server to correct the URL to <URL:http://www.some.company/directory/>.

It works, but it generates extra traffic, which contributes to an already crowded Net. Second, consider a Web server, like Zeus (http://www.zeus.co.uk/), that actually does content negotiation. You don't, for example, put extensions on image files. Instead of <IMG SRC="/images/mypic.gif">, you would write <IMG SRC="/images/mypic">.

The server looks at the HTTP_ACCEPT string and determines whether the browser can display GIF, JPEG, XBM, etc., then serves the best or smallest image accordingly. So, if we have a directory called /images/mypic/ but refer to it as /images/mypic, aren't we potentially in trouble?

Adam Shaun Nealis
adam@lbs.lon.ac.uk

I think you're right on both scores. I do try to include the trailing slash to avoid the extra bit of negotiation you describe. That's pretty well known. But the second case you bring up is really interesting. I've been only vaguely aware of progress in standards for content negotiation, but if it ends up working in the way you describe (i.e., no file extension means negotiate best type), then that would certainly be another reason we should all try to be more precise about trailing slashes. — Jon Udell, executive editor

Inbox / December 1996

No-Mix MMX

To the impressive technical detail Tom Halfhill presented in reply to John Michael Williams' letter about MMX programming (September Inbox), I would like to add one point: Using the Empty MMX State (EMMS) instruction costs 100 cycles. As you must perform this action to clear the in_use attribute of the FPU stack registers, embedding FPU code along with MMX is plain suicide for your program.

Eden Shochat
Senior programmer, Shells I.F.A.
Raanana, Israel
edens@netvision.net.il

Copyright 1994-1998 BYTE
