
226
Chapter 4: 128-Bit Media and Scientific Programming
AMD 64-Bit Technology
24593—Rev. 3.09—September 2003
PREFETCHx instructions very effectively in such cases, as
described in “Cache and Memory Management” on page 79.
Some of the best places to use prefetch instructions are inside
loops that process large amounts of data. If the loop goes
through less than one cache line of data per iteration, partially
unroll the loop. Try to use virtually all of the prefetched data.
This usually requires unit-stride memory accesses—those in
which all accesses are to contiguous memory locations. Exactly
one PREFETCHx instruction per cache line must be used.
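
The loop guidance above can be sketched in C with the SSE intrinsics. This is a minimal illustration, not a tuned implementation: the function name sum_with_prefetch, the 64-byte cache-line size, and the prefetch distance of four lines are assumptions made for the example and are not taken from this manual. The loop is partially unrolled so that each iteration consumes one full line of unit-stride data and issues one prefetch.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_prefetch, _mm_add_ps */

/* Assumptions (not from the text): 64-byte cache lines and a prefetch
   distance of 4 lines ahead. Tune both for the target processor. */
#define FLOATS_PER_LINE 16   /* 64 bytes / 4-byte float */
#define LINES_AHEAD     4

/* Sums n floats using unit-stride accesses. Each unrolled iteration
   processes one cache line's worth of data and issues one prefetch.
   The one-prefetch-per-line correspondence is exact only when data
   starts on a cache-line boundary. */
float sum_with_prefetch(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    float total;
    size_t i = 0;

    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        size_t ahead = i + LINES_AHEAD * FLOATS_PER_LINE;

        /* Prefetch only addresses that lie inside the array. */
        if (ahead < n)
            _mm_prefetch((const char *)(data + ahead), _MM_HINT_T0);

        /* Four 16-byte loads use all 64 bytes of the current line. */
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i + 4));
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i + 8));
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i + 12));
    }

    /* Combine the four partial sums, then handle leftover elements. */
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)
        total += data[i];
    return total;
}
```

The best prefetch distance depends on memory latency and on the work done per iteration, so the value used here should be treated as a starting point for measurement.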
4.12.8 Use 128-Bit Media Code for Moving Data
Movements of data between memory, GPR, XMM, and MMX
registers can take advantage of the parallel vector operations
supported by the 128-bit media MOVx instructions. Figure 4-6
on page 134 illustrates the range of move operations available.
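
As a rough sketch of such moves in C, the SSE2 intrinsics below typically compile to MOVDQU (memory to and from an XMM register) and MOVD (GPR to and from an XMM register). The function names copy_bytes_128 and roundtrip_gpr_xmm are hypothetical, and the copy assumes that the source and destination buffers do not overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_loadu_si128 */

/* Copies n bytes, 16 at a time, through an XMM register.
   Assumes dst and src do not overlap. */
void copy_bytes_128(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i)); /* unaligned load */
        _mm_storeu_si128((__m128i *)(d + i), v);               /* unaligned store */
    }
    if (i < n)
        memcpy(d + i, s + i, n - i);  /* remaining 1 to 15 bytes */
}

/* Moves a 32-bit value from a GPR into an XMM register and back. */
int32_t roundtrip_gpr_xmm(int32_t x)
{
    __m128i v = _mm_cvtsi32_si128(x);  /* low doubleword = x, upper bits zeroed */
    return _mm_cvtsi128_si32(v);       /* low doubleword back to a GPR */
}
```

When both buffers are known to be 16-byte aligned, the aligned forms _mm_load_si128 and _mm_store_si128 can be substituted for the unaligned ones.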
4.12.9 Retain Intermediate Results in XMM Registers
Keep intermediate results in the XMM registers as much as
possible, especially if the intermediate results are used shortly
after they have been produced. Avoid spilling intermediate
results to memory and reusing them shortly thereafter. In 64-bit
mode, the architecture provides 16 XMM registers, twice the
number of legacy XMM registers.
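
The following C sketch illustrates the idea with a chained computation, out[i] = (a[i] * b[i] + c[i]) * d[i]. The function name fused_chain is hypothetical. The intermediate product and sum are held in a single __m128 local, which a compiler would normally keep in an XMM register; only the final result is written to memory. Register allocation is ultimately the compiler's decision, so this shows the intent rather than a guarantee.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps */

/* Computes out[i] = (a[i] * b[i] + c[i]) * d[i] for n elements. */
void fused_chain(float *out, const float *a, const float *b,
                 const float *c, const float *d, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* t carries each intermediate result; nothing is stored to
           memory until the chain is complete. */
        __m128 t = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        t = _mm_add_ps(t, _mm_loadu_ps(c + i));
        t = _mm_mul_ps(t, _mm_loadu_ps(d + i));
        _mm_storeu_ps(out + i, t);
    }

    /* Scalar tail for the last 0 to 3 elements. */
    for (; i < n; i++)
        out[i] = (a[i] * b[i] + c[i]) * d[i];
}
```

A version that wrote the product to a temporary array, then read it back to add c and multiply by d, would produce the same values but would add a store and a reload for every intermediate result.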
4.12.10 Replace GPR Code with 128-Bit Media Code
In 64-bit mode, the AMD64 architecture provides twice the
number of general-purpose registers (GPRs) as the legacy x86
architecture, thereby reducing potential pressure on GPRs.
Nevertheless, general-purpose instructions do not operate in
parallel on vectors of elements, as do 128-bit media
instructions. Thus, 128-bit media code supports parallel
operations and can perform better with algorithms and data
that are organized for parallel operations.
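
To make the contrast concrete, the C sketch below adds two arrays of 32-bit integers in two ways: one element at a time with general-purpose arithmetic, and four elements at a time with the SSE2 intrinsic _mm_add_epi32, which corresponds to the PADDD instruction. The function names add_u32_scalar and add_u32_128 are hypothetical. Unsigned elements are used so that both versions wrap on overflow in the same way.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32 */

/* General-purpose version: one 32-bit addition per iteration. */
void add_u32_scalar(uint32_t *out, const uint32_t *a,
                    const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* 128-bit media version: four 32-bit additions per iteration. */
void add_u32_128(uint32_t *out, const uint32_t *a,
                 const uint32_t *b, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }

    /* Scalar tail for the last 0 to 3 elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The vector version pays off only when the data is laid out contiguously, as it is here. Modern compilers may also vectorize the scalar loop automatically, so any speedup should be confirmed by measurement.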
4.12.11 Replace x87 Code with 128-Bit Media Code
One of the most useful advantages of 128-bit media instructions
is the ability to intermix integer and floating-point instructions
in the same procedure, using a register set that is separate from
the GPR, MMX, and x87 register sets. Code written with 128-bit
media floating-point instructions can operate in parallel on four
times as many single-precision floating-point operands as can
x87 floating-point code. This can achieve up to four times the
computational work of x87 instructions that take single-precision
operands. Also, the higher density of 128-bit media
floating-point operands may make it possible to remove local
temporary variables that would otherwise be needed in x87
floating-point code. 128-bit media code is also easier to write
than x87 floating-point code, because the XMM register file is