
Chapter 3: General-Purpose Programming
AMD 64-Bit Technology
24593—Rev. 3.09—September 2003
execution. This spreading out is not necessary for anti-
dependencies and output dependencies.
3.10.8 Avoid Store-to-Load Dependencies
Store-to-load dependencies occur when data is stored to
memory, only to be read back shortly thereafter. Hardware
implementations of the architecture may contain means of
accelerating such store-to-load dependencies, allowing the load
to obtain the store data before it has been written to memory.
However, this acceleration might be available only when the
addresses and operand sizes of the store and the dependent
load are matched, and when both memory accesses are aligned.
Performance is typically optimized by avoiding such
dependencies altogether and keeping the data, including
temporary variables, in registers.
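As a minimal sketch of the advice above, the following C fragment contrasts a running sum kept in memory with one kept in a local variable. The function names are hypothetical, and whether the first version actually incurs a store-to-load dependency depends on the compiler and its alias analysis; because `total` may alias `data`, the compiler generally cannot drop the per-iteration store.

```c
#include <stddef.h>
#include <stdint.h>

/* As written, every iteration reads *total, adds to it, and writes it
   back to memory, so each load depends on the preceding store. */
void sum_through_memory(const uint32_t *data, size_t n, uint32_t *total)
{
    *total = 0;
    for (size_t i = 0; i < n; i++) {
        *total += data[i];
    }
}

/* Keeps the running sum in a local variable, which the compiler can hold
   in a register, and stores the result to memory once at the end. */
void sum_in_register(const uint32_t *data, size_t n, uint32_t *total)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += data[i];
    }
    *total = acc;
}
```

Both functions produce the same result; only the second keeps the temporary out of memory inside the loop.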
3.10.9 Optimize Stack Allocation
When allocating space on the stack for local variables and/or
outgoing parameters within a procedure, adjust the stack
pointer and use moves rather than pushes. This method of
allocation allows random access to the outgoing parameters, so
that they can be set up when they are calculated instead of
being held in a register or memory until the procedure call. This
method also reduces stack-pointer dependencies.
3.10.10 Consider Repeat-Prefix Setup Time
The repeat instruction prefixes have a setup overhead. If the
repeat count is variable, the overhead can sometimes be
avoided by substituting a simple loop to move or store the data.
Repeated string instructions can be expanded into equivalent
sequences of inline loads and stores. For details, see “Repeat
Prefixes” in Volume 3.
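The trade-off can be sketched in C as a copy routine that uses a plain loop for short, variable counts and a repeated string instruction otherwise. This is an illustration under stated assumptions, not a tuned implementation: `copy_bytes`, `copy_loop`, `copy_rep`, and the `SMALL_COPY_LIMIT` cutoff are hypothetical, the inline assembly uses GCC/Clang syntax and assumes the direction flag is clear, and an optimizing compiler may itself replace the simple loop with a library call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff below which the simple loop is used; the real
   break-even point depends on the processor and must be measured. */
#define SMALL_COPY_LIMIT 32

/* Copy with the REP MOVSB string instruction (GCC/Clang on x86-64).
   Other targets fall back to memcpy so the sketch still builds. */
static void copy_rep(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Copy with a plain loop of byte loads and stores: no prefix setup. */
static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Choose the method from the (variable) count. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    if (n < SMALL_COPY_LIMIT) {
        copy_loop(dst, src, n);
    } else {
        copy_rep(dst, src, n);
    }
}
```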
3.10.11 Replace GPR with Media Instructions
Some integer-based programs can be made to run faster by
using 128-bit media or 64-bit media instructions. These
instructions have their own register sets, so using them
relieves register pressure on the GPRs. For loads, stores,
adds, shifts, and similar operations, media instructions may be
good substitutes for general-purpose integer instructions. The
media instructions also increase opportunities for parallel
operations.
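A minimal sketch of this substitution, assuming a compiler that provides the SSE2 intrinsics in `<emmintrin.h>`: the hypothetical function `add_u32` adds two integer arrays four elements at a time in 128-bit media (XMM) registers and falls back to ordinary integer code for the tail or on targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: 128-bit media integer ops */
#endif

/* dst[i] = a[i] + b[i] for n 32-bit integers. */
void add_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Four additions per instruction; operands live in XMM registers,
       leaving the GPRs free for addressing and loop control. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Scalar tail (and the whole loop on targets without SSE2). */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The unaligned load and store forms are used here so that the sketch makes no assumption about how the caller's arrays are aligned.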
3.10.12 Organize Data in Memory Blocks
Organize frequently accessed constants and coefficients into
cache-line-size blocks and prefetch them. Procedures that
access data arranged in memory-bus-sized blocks, or memory-
burst-sized blocks, can make optimum use of the available
memory bandwidth.
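One way to lay out coefficients as described is sketched below in C11. The 64-byte line size, the `coeff_block` type, the `PREFETCH` macro, and the `dot_blocks` function are assumptions made for illustration; the actual cache-line size should be taken from the target processor, and `__builtin_prefetch` is a GCC/Clang builtin rather than standard C.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; check the target processor */
#define COEFFS_PER_BLOCK (CACHE_LINE / sizeof(float))

/* One block of coefficients that exactly fills one aligned cache line. */
struct coeff_block {
    _Alignas(CACHE_LINE) float c[COEFFS_PER_BLOCK];
};

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))   /* no-op where the builtin is missing */
#endif

/* Dot product of block-organized coefficients with a flat input vector
   x of nblocks * COEFFS_PER_BLOCK elements. */
float dot_blocks(const struct coeff_block *blocks, size_t nblocks,
                 const float *x)
{
    float sum = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks) {
            PREFETCH(&blocks[b + 1]);   /* request the next line early */
        }
        for (size_t k = 0; k < COEFFS_PER_BLOCK; k++) {
            sum += blocks[b].c[k] * x[b * COEFFS_PER_BLOCK + k];
        }
    }
    return sum;
}
```

Because each block is exactly one line long and line-aligned, every prefetch brings in a full block and no block straddles two lines.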