
Chapter 5: 64-Bit Media Programming
24593—Rev. 3.09—September 2003
AMD 64-Bit Technology
5.15 Performance Considerations
In addition to typical code optimization techniques, such as
those affecting loops and the inlining of function calls, the
following considerations may help improve the performance of
application programs written with 64-bit media instructions.
These are implementation-independent performance
considerations. Other considerations depend on the hardware
implementation. For information about such implementation-
dependent considerations and for more information about
application performance in general, see the data sheets and the
software-optimization guides relating to particular hardware
implementations.
5.15.1 Use Small Operand Sizes
The performance advantages available with 64-bit media
operations are to some extent a function of the data sizes
operated upon. The smaller the data size, the more data
elements can be packed into a single 64-bit vector. The
parallelism of computation increases as the number of
elements per vector increases.
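The relationship between element size and parallelism can be illustrated with a small scalar model. The sketch below is portable C, not actual 64-bit media code: the function names (lanes_per_vector, packed_add_u8, packed_add_u16) are hypothetical, and the packed adds only imitate the wraparound behavior of a packed-add instruction by operating on a 64-bit integer. It assumes unsigned elements and wraparound (non-saturating) arithmetic.

```c
#include <stdint.h>

/* Number of elements of the given width that fit in one 64-bit vector. */
unsigned lanes_per_vector(unsigned element_bits) {
    return 64u / element_bits;
}

/* Scalar model of a packed add with wraparound on eight 8-bit elements.
   The top bit of each element is masked off before the add so that no
   carry can cross into the neighboring element, then restored with XOR. */
uint64_t packed_add_u8(uint64_t a, uint64_t b) {
    const uint64_t high = 0x8080808080808080ULL;
    uint64_t low_sum = (a & ~high) + (b & ~high);
    return low_sum ^ ((a ^ b) & high);
}

/* The same model for four 16-bit elements. */
uint64_t packed_add_u16(uint64_t a, uint64_t b) {
    const uint64_t high = 0x8000800080008000ULL;
    uint64_t low_sum = (a & ~high) + (b & ~high);
    return low_sum ^ ((a ^ b) & high);
}
```

In this model, one call to packed_add_u8 produces eight sums while one call to packed_add_u16 produces four, which is the sense in which smaller operand sizes increase the work done per vector operation.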
5.15.2 Reorganize Data for Parallel Operations
Much of the performance benefit from the 64-bit media
instructions comes from the parallelism inherent in vector
operations. It can be advantageous to reorganize data before
performing arithmetic operations so that its layout after
reorganization maximizes the parallelism of the arithmetic
operations.
The speed of memory access is particularly important for
certain types of computation, such as graphics rendering, that
depend on the regularity and locality of data-memory accesses.
For example, in matrix operations, performance is high when
operating on the rows of the matrix, because row bytes are
contiguous in memory, but lower when operating on the
columns of the matrix, because column bytes are not contiguous
in memory and accessing them can result in cache misses. To
improve performance for operations on such columns, the
matrix should first be transposed. Such transpositions can, for
example, be done using a sequence of unpacking or shuffle
instructions.
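As an illustration of such a transposition, the following sketch transposes a 4x4 matrix of 16-bit elements, with each row held in one 64-bit value and element 0 in the least-significant bits. It is a portable C model, not 64-bit media code: the helpers unpack_lo16, unpack_hi16, unpack_lo32, and unpack_hi32 are hypothetical names that imitate the interleaving performed by unpack-low and unpack-high operations on words and doublewords, and the particular sequence shown is one possible arrangement rather than a recommended one.

```c
#include <stdint.h>

/* Extract 16-bit element i (0 = least significant) from a 64-bit vector. */
static uint64_t lane16(uint64_t v, int i) {
    return (v >> (16 * i)) & 0xFFFFULL;
}

/* Interleave the two low words of a and b: [a0, b0, a1, b1]. */
uint64_t unpack_lo16(uint64_t a, uint64_t b) {
    return lane16(a, 0) | (lane16(b, 0) << 16) |
           (lane16(a, 1) << 32) | (lane16(b, 1) << 48);
}

/* Interleave the two high words of a and b: [a2, b2, a3, b3]. */
uint64_t unpack_hi16(uint64_t a, uint64_t b) {
    return lane16(a, 2) | (lane16(b, 2) << 16) |
           (lane16(a, 3) << 32) | (lane16(b, 3) << 48);
}

/* Combine the low doublewords of a and b: [a.lo, b.lo]. */
uint64_t unpack_lo32(uint64_t a, uint64_t b) {
    return (a & 0xFFFFFFFFULL) | (b << 32);
}

/* Combine the high doublewords of a and b: [a.hi, b.hi]. */
uint64_t unpack_hi32(uint64_t a, uint64_t b) {
    return (a >> 32) | (b & 0xFFFFFFFF00000000ULL);
}

/* Transpose a 4x4 matrix of 16-bit elements: in[r] holds row r,
   out[c] receives column c. */
void transpose4x4_u16(const uint64_t in[4], uint64_t out[4]) {
    /* Stage 1: interleave words of row pairs (0,1) and (2,3). */
    uint64_t t0 = unpack_lo16(in[0], in[1]);
    uint64_t t1 = unpack_hi16(in[0], in[1]);
    uint64_t t2 = unpack_lo16(in[2], in[3]);
    uint64_t t3 = unpack_hi16(in[2], in[3]);

    /* Stage 2: combine doublewords to form complete columns. */
    out[0] = unpack_lo32(t0, t2);
    out[1] = unpack_hi32(t0, t2);
    out[2] = unpack_lo32(t1, t3);
    out[3] = unpack_hi32(t1, t3);
}
```

After the transposition, each former column occupies a single 64-bit value, so column-wise arithmetic can proceed on contiguous data in the same way as row-wise arithmetic.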
5.15.3 Remove Branches
Branches can be replaced with 64-bit media instructions that
simulate predicated execution or conditional moves, as
described in “Branch Removal” on page 234. Where possible,