
3.10.1 Use Large Operand Sizes

Loading, storing, and moving data with the largest relevant operand size maximizes the memory bandwidth of these instructions.
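
For example, a block copy that moves one quadword per iteration touches memory one-eighth as often as a byte-at-a-time copy. The following is a minimal sketch; the register assignments (RSI source, RDI destination, RCX quadword count) and the label copy_loop are illustrative, not prescribed:

        ; Copy RCX quadwords from [RSI] to [RDI] using 64-bit loads and stores.
  copy_loop:
        mov     rax, [rsi]        ; load 8 bytes in one access
        mov     [rdi], rax        ; store 8 bytes in one access
        add     rsi, 8
        add     rdi, 8
        dec     rcx
        jnz     copy_loop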

3.10.2 Use Short Instructions

Use the shortest possible form of an instruction (the form with the fewest opcode bytes). This increases the number of instructions that can be decoded at any one time, and it reduces overall code size.
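
For example, common operations often have a shorter equivalent encoding. The byte counts below are for the typical forms and can vary with the registers and assembler used; note that XOR modifies rFLAGS while MOV does not:

        xor     eax, eax          ; 2 bytes; zeroes EAX like MOV EAX, 0 (5 bytes)
        test    eax, eax          ; 2 bytes; same flag result as CMP EAX, 0 (3 bytes)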

3.10.3 Align Data

Data alignment directly affects memory-access performance. Data alignment is particularly important when accessing streaming (also called non-temporal) data—data that will not be reused and therefore should not be cached. Data alignment is also important in cases where data written by one instruction is read by a subsequent instruction soon after the write.
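
A minimal MASM-style sketch of aligning a data block to a 16-byte boundary; the directive spelling and the label name buffer are illustrative, and other assemblers use different syntax:

                ALIGN   16            ; place the data on a 16-byte boundary
  buffer        dq      8 dup (0)     ; 64 bytes of quadwords, so each 8-byte
                                      ; load or store is naturally aligned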

3.10.4 Avoid Branches

Branching can be very time-consuming. If the body of a branch is small, the branch may be replaceable with conditional move (CMOVcc) instructions, or with 128-bit or 64-bit media instructions that simulate predicated parallel execution or parallel conditional moves.
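
For example, a small branch body such as selecting the larger of two values can be expressed with CMOVcc; the register assignments below are illustrative:

        cmp     rax, rbx
        cmovl   rax, rbx          ; if RAX < RBX (signed), copy RBX into RAX
        ; RAX now holds the larger value, with no branch taken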

3.10.5 Prefetch Data

Memory latency can be substantially reduced—especially for data that will be used multiple times—by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx instructions very effectively in such cases. One PREFETCHx per cache line should be used.

Some of the best places to use prefetch instructions are inside
loops that process large amounts of data. If the loop goes
through less than one cache line of data per iteration, partially
unroll the loop. Try to use virtually all of the prefetched data.
This usually requires unit-stride memory accesses—those in
which all accesses are to contiguous memory locations.
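
The following sketch of a partially unrolled summation loop issues exactly one prefetch per 64-byte cache line. The register assignments (RSI data pointer, RCX line count, RAX accumulator), the label name, and the 256-byte prefetch distance are illustrative; the distance and line size should be tuned for the target processor:

        xor     eax, eax          ; clear the accumulator
  sum_loop:
        prefetcht0 [rsi+256]      ; one prefetch per line, several lines ahead
        add     rax, [rsi]        ; consume the entire 64-byte line
        add     rax, [rsi+8]      ; (unit-stride, contiguous accesses)
        add     rax, [rsi+16]
        add     rax, [rsi+24]
        add     rax, [rsi+32]
        add     rax, [rsi+40]
        add     rax, [rsi+48]
        add     rax, [rsi+56]
        add     rsi, 64
        dec     rcx
        jnz     sum_loop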
For data that will be used only once in a procedure, consider
using non-temporal accesses. Such accesses are not burdened
by the overhead of cache protocols.
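
Non-temporal stores are one such access; a minimal sketch using MOVNTI, assuming RDI points to the destination, RAX holds the fill value, and RCX holds the quadword count:

  fill_loop:
        movnti  [rdi], rax        ; streaming store; does not pollute the cache
        add     rdi, 8
        dec     rcx
        jnz     fill_loop
        sfence                    ; order the non-temporal stores before reuse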

3.10.6 Keep Common Operands in Registers

Keep frequently used values in registers rather than in memory. This avoids the comparatively long latencies for accessing memory.
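
As an illustration, the accumulator below lives in RAX for the entire loop and touches its memory home only once at the end. The register assignments (RSI array pointer, RCX element count, RDI result address) are assumptions for the sketch:

        xor     eax, eax          ; accumulator held in a register (zeroes RAX)
  acc_loop:
        add     rax, [rsi]        ; avoids an ADD with a memory destination,
        add     rsi, 8            ; which would read and write memory each pass
        dec     rcx
        jnz     acc_loop
        mov     [rdi], rax        ; write the accumulated result to memory once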

3.10.7 Avoid True Dependencies

Spread out true dependencies (write-read or flow dependencies) to increase the opportunities for parallel execution.
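
For example, splitting a single accumulator into two independent accumulators shortens the write-read chain that each addition must wait on. The register assignments are illustrative, and RCX is assumed to hold the number of quadword pairs:

        xor     eax, eax          ; first partial sum (clears RAX)
        xor     edx, edx          ; second partial sum (clears RDX)
  sum2_loop:
        add     rax, [rsi]        ; dependency chain 1
        add     rdx, [rsi+8]      ; dependency chain 2, independent of chain 1
        add     rsi, 16
        dec     rcx
        jnz     sum2_loop
        add     rax, rdx          ; combine the partial sums once at the end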