
Branches can be replaced with 128-bit media instructions that
simulate predicated execution or conditional moves. Figure 4-10 on
page 138 shows an example of a non-branching sequence that
implements a two-way multiplexer.
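To illustrate the idea (this sketch is not the manual's Figure 4-10,
and the function and variable names are hypothetical), a two-way
multiplexer can be expressed in C with SSE2 compiler intrinsics:

    #include <emmintrin.h>

    /* Branchless two-way multiplexer: for each bit position, select
       a where mask is 1 and b where mask is 0. The three intrinsics
       correspond to the PAND, PANDN, and POR instructions. */
    static __m128i mux2(__m128i mask, __m128i a, __m128i b)
    {
        return _mm_or_si128(_mm_and_si128(mask, a),
                            _mm_andnot_si128(mask, b));
    }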
Where possible, break long dependency chains into several shorter
chains that can execute in parallel. This is especially important
for floating-point instructions because of their longer latencies.
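As a hedged sketch of this technique (the function name is
hypothetical, and the array length is assumed to be a multiple of
16 floats), a summation can be split across four independent
accumulators so that one long chain of dependent additions becomes
four shorter chains:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Sum an array with four independent accumulators. Each s0..s3
       chain depends only on itself, so the four chains of additions
       can proceed in parallel. */
    float sum4(const float *a, size_t n)
    {
        __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
        __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();

        for (size_t i = 0; i < n; i += 16) {
            s0 = _mm_add_ps(s0, _mm_loadu_ps(a + i));
            s1 = _mm_add_ps(s1, _mm_loadu_ps(a + i + 4));
            s2 = _mm_add_ps(s2, _mm_loadu_ps(a + i + 8));
            s3 = _mm_add_ps(s3, _mm_loadu_ps(a + i + 12));
        }

        /* Combine the partial sums and reduce horizontally. */
        __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
        float out[4];
        _mm_storeu_ps(out, s);
        return out[0] + out[1] + out[2] + out[3];
    }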
4.12.4 Use Streaming Stores
The MOVNTDQ and MASKMOVDQU instructions store
streaming (non-temporal) data to memory. These instructions
indicate to the processor that the data they reference will be
used only once and is therefore not subject to cache-related
overhead (such as write-allocation). A typical case benefiting
from streaming stores occurs when data written by the
processor is never read by the processor, such as data written to
a graphics frame buffer.
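For illustration, the following C sketch uses the SSE2 intrinsic
_mm_stream_si128, which compiles to MOVNTDQ; the function name and
buffer layout are assumptions, and the destination is assumed to be
16-byte aligned, as MOVNTDQ requires:

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical frame-buffer fill: every store is non-temporal,
       so the written lines do not displace useful data in the
       caches and incur no write-allocation. */
    void fill_buffer(void *frame, size_t bytes, uint32_t pixel)
    {
        __m128i value = _mm_set1_epi32((int)pixel); /* pixel x 4   */
        __m128i *dst  = (__m128i *)frame;           /* 16B aligned */

        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&dst[i], value);       /* MOVNTDQ     */

        _mm_sfence(); /* order streaming stores before later stores */
    }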
4.12.5 Align Data
Data alignment is particularly important for performance when
data written by one instruction is read by a subsequent
instruction soon after the write, or when accessing streaming
(non-temporal) data. These cases may occur frequently in 128-
bit media procedures.
Accesses to data stored at unaligned locations may benefit from
on-the-fly software alignment or from repetition of data at
different alignment boundaries, as required by different loops
that process the data.
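As one hedged example of arranging for alignment (the table and
function names are hypothetical), statically aligning data to a
16-byte boundary allows the aligned load form, while the unaligned
form tolerates any address at a possible performance cost:

    #include <emmintrin.h>
    #include <stdalign.h>   /* C11 alignas */

    /* Hypothetical coefficient table, aligned so the inner loop can
       use aligned (MOVAPS-class) accesses. */
    static alignas(16) float coeffs[4] = { 0.25f, 0.5f, 0.75f, 1.0f };

    __m128 load_coeffs(void)
    {
        return _mm_load_ps(coeffs); /* aligned load; faults if misaligned */
    }

    __m128 load_unaligned(const float *p)
    {
        return _mm_loadu_ps(p);     /* any alignment; may be slower */
    }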
4.12.6 Organize Data for Cacheability
Pack small data structures into cache-line-sized blocks. Organize
frequently accessed constants and coefficients into cache-line-sized
blocks and prefetch them. Procedures that access data
arranged in memory-bus-sized blocks, or memory-burst-sized
blocks, can make optimum use of the available memory
bandwidth.
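A hedged sketch of this packing, assuming a 64-byte cache line (the
line size is implementation dependent) and hypothetical names:

    #include <emmintrin.h>   /* _mm_prefetch */
    #include <stdalign.h>

    /* Both arrays fit in one 64-byte region, so a single line fill
       brings in every coefficient at once. */
    typedef struct {
        alignas(64) float gain[8];  /* 32 bytes */
        float bias[8];              /* 32 bytes */
    } FilterConstants;

    static const FilterConstants k = {
        { 1, 1, 1, 1, 1, 1, 1, 1 },
        { 0, 0, 0, 0, 0, 0, 0, 0 },
    };

    /* Warm the line before the processing loop starts. */
    static inline void warm_constants(void)
    {
        _mm_prefetch((const char *)&k, _MM_HINT_T0);
    }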
For data that will be used only once in a procedure, consider
using non-cacheable memory. Accesses to such memory are not
burdened by the overhead of cache protocols.
4.12.7 Prefetch Data
Media applications typically operate on large data sets.
Because of this, they make intensive use of the memory bus.
Memory latency can be substantially reduced—especially for
data that will be used only once—by prefetching such data into
various levels of the cache hierarchy. Software can use the