21918B/0—October 1999
AMD-K6-III Processor Data Sheet
Chapter 2
Internal Architecture
The AMD-K6-III processor implements a two-level branch
prediction scheme based on an 8192-entry branch history table.
The branch history table stores prediction information that is
used for predicting conditional branches. Because the branch
history table does not store predicted target addresses, special
address ALUs calculate target addresses on-the-fly during
instruction decode. The branch target cache improves
predicted-branch performance by avoiding a one-clock
cache-fetch penalty: it supplies the first 16 bytes of target
instructions to the decoders when branches are predicted. The return address
stack is a unique device specifically designed for optimizing
CALL and RETURN pairs. In summary, the AMD-K6-III
processor uses dynamic branch logic to minimize delays due to
the branch instructions that are common in x86 software.
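The prediction structures described above can be illustrated with a minimal sketch. Only the 8192-entry table size comes from the text. The gshare-style index hash, the 9-bit global history, the 2-bit saturating counters, and the 16-entry return address stack depth are assumptions chosen for illustration, not the documented AMD-K6-III design.

```python
BHT_ENTRIES = 8192    # branch history table size stated in the text
HISTORY_BITS = 9      # assumption: global history length (not from the data sheet)


class TwoLevelPredictor:
    """Illustrative two-level predictor: a history register selects 2-bit counters."""

    def __init__(self):
        self.history = 0                     # first level: recent branch outcomes
        self.counters = [1] * BHT_ENTRIES    # second level: start weakly not-taken

    def _index(self, pc):
        # Assumption: gshare-style XOR of branch address and global history.
        return (pc ^ self.history) & (BHT_ENTRIES - 1)

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the selected counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


class ReturnAddressStack:
    """Illustrative return address stack pairing CALLs with RETURNs."""

    def __init__(self, depth=16):            # assumption: depth is hypothetical
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                # drop the oldest entry when full
        self.stack.append(return_address)

    def on_return(self):
        """Return the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

Note that the table holds only direction information. As in the text, the target address is not stored there, so a real design computes it separately during decode.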
3DNow! Technology.
AMD has taken a lead role in improving the
multimedia and 3D capabilities of the x86 processor family with
the introduction of 3DNow! technology, which uses a packed,
single-precision, floating-point data format and Single
Instruction Multiple Data (SIMD) operations based on the
MMX technology model.
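As a rough illustration of the packed single-precision format, the sketch below emulates one element-wise SIMD add in software. It assumes two 32-bit floats per 64-bit register, which is background knowledge about 3DNow! rather than something stated in this passage. The helper names `pack2`, `unpack2`, and `packed_add` are hypothetical and do not correspond to processor instructions.

```python
import struct


def pack2(low, high):
    """Pack two single-precision floats into one 64-bit value (8 bytes)."""
    return struct.pack("<2f", low, high)


def unpack2(reg):
    """Unpack a 64-bit packed value into its two single-precision elements."""
    return struct.unpack("<2f", reg)


def packed_add(a, b):
    """Add two packed values element by element, as one SIMD operation would."""
    a_low, a_high = unpack2(a)
    b_low, b_high = unpack2(b)
    # Repacking rounds each sum back to single precision.
    return pack2(a_low + b_low, a_high + b_high)
```

For example, `unpack2(packed_add(pack2(1.0, 2.0), pack2(0.5, 0.25)))` yields `(1.5, 2.25)`: one operation produces both sums.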
2.3 Cache, Instruction Prefetch, and Predecode Bits
The writeback level-one cache on the AMD-K6-III processor is
organized as a separate 32-Kbyte instruction cache and a
32-Kbyte data cache with two-way set associativity. The level-two
cache is 256 Kbytes, and is organized as a unified, four-way set-
associative cache. The cache line size is 32 bytes, and lines are
fetched from external memory using an efficient pipelined burst
transaction. As the level-one instruction cache is filled from the
level-two cache or from external memory, each instruction byte
is analyzed for instruction boundaries using predecoding logic.
Predecoding annotates each instruction byte with 5 bits of
information that later enable the decoders to efficiently
decode multiple instructions simultaneously.
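The sizes quoted in this paragraph imply a few derived quantities. The short calculation below works them out. The function names are hypothetical, and the arithmetic uses only the figures given in the text (32-byte lines, 5 predecode bits per instruction byte).

```python
KBYTE = 1024
LINE_BYTES = 32                  # cache line size from the text
PREDECODE_BITS_PER_BYTE = 5      # predecode annotation from the text


def line_count(cache_bytes, line_bytes=LINE_BYTES):
    """Number of cache lines held by a cache of the given size."""
    return cache_bytes // line_bytes


def predecode_storage_bytes(icache_bytes, bits_per_byte=PREDECODE_BITS_PER_BYTE):
    """Storage needed for predecode bits covering the whole instruction cache."""
    return icache_bytes * bits_per_byte // 8


l1_instruction_lines = line_count(32 * KBYTE)        # 1024 lines
l2_lines = line_count(256 * KBYTE)                   # 8192 lines
predecode_bytes = predecode_storage_bytes(32 * KBYTE)  # 20480 bytes (20 Kbytes)
```

In other words, annotating every byte of the 32-Kbyte instruction cache with 5 bits takes 20 Kbytes of additional storage.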
Cache
The processor cache design takes advantage of a sectored
organization (see Figure 2 on page 10). Each sector consists of
64 bytes configured as two 32-byte cache lines. The two cache
lines of a sector share a common tag but have separate pairs of
MESI (Modified, Exclusive, Shared, Invalid) bits that track the
state of each cache line.
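The sectored organization can be sketched as a small data structure: one tag per 64-byte sector and one MESI state per 32-byte line. This is a minimal illustration. The `num_sets` parameter and the address-field ordering are assumptions, since the passage gives neither the number of sets nor the exact bit layout.

```python
from dataclasses import dataclass, field
from enum import Enum

LINE_BYTES = 32
LINES_PER_SECTOR = 2
SECTOR_BYTES = LINE_BYTES * LINES_PER_SECTOR   # 64 bytes per sector


class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


@dataclass
class Sector:
    """One sector: a shared tag plus an independent MESI state per line."""
    tag: int = 0
    state: list = field(default_factory=lambda: [MESI.INVALID] * LINES_PER_SECTOR)


def split_address(addr, num_sets):
    """Split an address into (tag, set index, line select, byte offset).

    num_sets is a hypothetical parameter; the field ordering is an assumption.
    """
    byte_offset = addr % LINE_BYTES
    line_select = (addr // LINE_BYTES) % LINES_PER_SECTOR
    set_index = (addr // SECTOR_BYTES) % num_sets
    tag = addr // (SECTOR_BYTES * num_sets)
    return tag, set_index, line_select, byte_offset


def is_hit(sector, tag, line_select):
    """A hit needs a matching sector tag and a valid (non-Invalid) line."""
    return sector.tag == tag and sector.state[line_select] is not MESI.INVALID
```

Because the two lines share a tag, a sector can match an address while only one of its lines is valid. The per-line MESI state is what decides whether the access actually hits.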