MPC7400 RISC Microprocessor User's Manual
Differences between the MPC7400 and the MPC750
Table 1-7. Differences between the MPC7400 and the MPC750
Feature
Difference
Core
Sequencing
The MPC750 has a 6-entry instruction queue (IQ) and a 6-entry completion queue (CQ). Each clock, it can fetch four instructions,
dispatch two instructions, fold one branch, and complete two instructions. The MPC7400 is identical except for its
eight-entry CQ, as shown in Figure 1-1. Because an instruction cannot be dispatched without a free CQ entry, the extra
entries make dispatch bottlenecks less likely now that the MPC7400 has additional execution units to keep busy.
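Why a deeper CQ helps can be sketched with a deliberately simplified model. It assumes (these are illustrative assumptions, not documented behavior) that a long-latency instruction, such as the 17-cycle single-precision divide from the FPU row, sits at the head of the CQ; that in-order completion keeps every later entry occupied until that head completes; that the IQ always holds independent instructions; and that no other dispatch constraint applies. The function name `cq_full_stall_cycles` is hypothetical.

```python
DISPATCH_WIDTH = 2  # instructions dispatched per clock on both processors


def cq_full_stall_cycles(cq_entries: int, head_latency: int) -> int:
    """Count clocks in which dispatch stalls because the CQ is full.

    Hypothetical, simplified model: the CQ head is dispatched in cycle 0 and
    cannot complete before cycle `head_latency`. Completion is in order, so no
    CQ entry is freed before then. The IQ never runs dry, and all other
    dispatch constraints (rename registers, reservation stations) are ignored.
    """
    occupied = 0
    stalled = 0
    for _cycle in range(head_latency):
        if occupied == cq_entries:
            stalled += 1  # CQ full: nothing can be dispatched this clock
            continue
        occupied += min(DISPATCH_WIDTH, cq_entries - occupied)
    return stalled


if __name__ == "__main__":
    # A 17-cycle single-precision divide at the head of the CQ.
    for name, cq_entries in (("MPC750", 6), ("MPC7400", 8)):
        print(name, cq_full_stall_cycles(cq_entries, head_latency=17))
    # MPC750 14
    # MPC7400 13
```

In this model, because two instructions are dispatched per clock, each additional pair of CQ entries buys one more clock of useful dispatch behind the stalled instruction.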
FPU
On the MPC750, single-precision operations involving multiplication have a 3-cycle latency, while their
double-precision equivalents take an additional cycle. Because the MPC7400 has a full double-precision
FPU, double- and single-precision multiplies have the same latency: 3 cycles. Floating-point divides have
the same latency for both designs (17 cycles for single-precision, 31 for double-precision).
Processor   Operation                                    Latency
MPC750      Double-precision floating-point multiply     4 cycles
MPC750      All other floating-point add and multiply    3 cycles
MPC7400     All floating-point add and multiply          3 cycles
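In a chain of dependent operations, each one must wait for its predecessor's result, so the per-operation latency difference accumulates. The sketch below encodes the latencies listed above. The function names and the string labels for processors and operations are hypothetical, and the assumption that latencies simply add up ignores dispatch and issue overhead.

```python
# Latencies in cycles, from Table 1-7. Divides are identical on both parts.
DIV_LATENCY = {"single": 17, "double": 31}


def fp_latency(cpu: str, op: str, precision: str) -> int:
    """Latency of one floating-point add ("add"), multiply ("mul"), or divide ("div")."""
    if op == "div":
        return DIV_LATENCY[precision]
    if cpu == "MPC750" and op == "mul" and precision == "double":
        return 4  # MPC750: double-precision multiply takes one extra cycle
    return 3  # every other add/multiply, and all of them on the MPC7400


def dependent_chain_cycles(cpu: str, op: str, precision: str, n: int) -> int:
    """Cycles for n operations where each one needs the previous result.

    Assumes the latencies simply add up (no extra dispatch or forwarding
    overhead), which is an idealization.
    """
    return n * fp_latency(cpu, op, precision)


if __name__ == "__main__":
    for cpu in ("MPC750", "MPC7400"):
        print(cpu, dependent_chain_cycles(cpu, "mul", "double", 100))
    # MPC750 400
    # MPC7400 300
```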
AltiVec technology
The MPC7400 implements all instructions defined by the AltiVec specification. Two dispatchable AltiVec
functional units were added: a vector permute unit (VPU) and a vector ALU (VALU). The VALU
comprises a simple integer unit, a complex integer unit, and a floating-point unit. As shown in Figure 1-1,
the MPC7400 also adds 32 128-bit vector registers (VRs) and 6 VR rename registers.
The VPU handles permute and shift operations, and the VALU handles calculations. The LSU handles
AltiVec load and store operations. To support AltiVec operations, all memory subsystem (MSS) data
buses are 128 bits wide (as opposed to 64 bits in the MPC750). Queues have been added and existing queues
enlarged to sustain heavy use of AltiVec instructions.
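The permute operation handled by the VPU is the least familiar of these. The sketch below models the AltiVec `vperm` instruction in scalar Python to show what one permute does with two 128-bit source registers and a control vector. It illustrates the instruction's semantics, not how the VPU is built; in C on real hardware the operation is reached through the `vec_perm` intrinsic, and the Python function name here is only a model.

```python
def vperm(a: bytes, b: bytes, c: bytes) -> bytes:
    """Scalar model of the AltiVec vperm (vector permute) operation.

    a and b are the 16-byte contents of two VRs, in big-endian byte order.
    Each byte of the control vector c selects, by its low 5 bits, one byte
    of the 32-byte concatenation a || b.
    """
    if not (len(a) == len(b) == len(c) == 16):
        raise ValueError("vector registers are 128 bits (16 bytes) wide")
    source = a + b
    return bytes(source[ctl & 0x1F] for ctl in c)


if __name__ == "__main__":
    a = bytes(range(16))           # bytes 0x00 .. 0x0F
    b = bytes(range(16, 32))       # bytes 0x10 .. 0x1F
    # Byte reversal of a: control bytes 15, 14, ..., 0.
    print(vperm(a, b, bytes(range(15, -1, -1))).hex())
    # 16 bytes starting at byte 4 of a || b (a shift across two registers).
    print(vperm(a, b, bytes(range(4, 20))).hex())
```

A single permute can therefore express byte reversal, cross-register shifts, and arbitrary shuffles, depending only on the control vector.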
The AltiVec technology is designed to improve the performance of vector-intensive code in applications
such as multimedia and digital signal processing. Code targeted at AltiVec can speed up 2D and 3D
graphics functions by a factor of 3 to 5, especially core functions in 3D engines and game-related 2D functions.
Memory Subsystem (MSS)
The MPC7400 has a new memory subsystem designed to support AltiVec technology loads, the new MPX bus protocol,
and a five-state (MERSI) cache coherency protocol for multiprocessing. Queues and queue sizes are designed for more
efficient data flow. For example, the MPC750 has a three-entry LSU store queue, while the MPC7400 has a six-entry
LSU store queue. The MPC7400 adds an eight-entry reload buffer, where L1 data cache misses wait for their data to be
loaded. This buffer enables load miss folding and store miss merging.
Load miss folding
In the MPC750, if a second load misses to the same cache block, it must wait for the
critical word of the first load before it can access its data, and subsequent accesses are also stalled. In
the MPC7400, the first load or store miss causes an entry to be allocated in the reload buffer. A subsequent
load to the same cache block is placed aside in the load fold queue (LFQ), and it can return its data
as soon as that data is available. Meanwhile, subsequent accesses to the cache are not blocked and can be
processed.
For example, on the MPC750, if a load or store (access A) misses in the data cache, a subsequent
load (access B) to the same cache block must wait until the critical word for A is retired. As a result,
no load or store after access B can access the data cache until the reload for
access A completes.
On the MPC7400, by contrast, when load or store access A misses in the data cache, up to four subsequent
misses to the same cache block can be folded into the LFQ while the data is coming back,
and subsequent instructions can still access the data cache. Loads are blocked only when the reload table or
the LFQ is full.
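The folding rules above can be sketched as a toy model of load-miss handling. Assumptions, all labeled: the LFQ is modeled as one shared four-entry queue (the text only says "up to four" folded misses); the 32-byte block size comes from the store-miss-merging row; cache hits, which proceed normally while misses are outstanding, are not modeled; and the class, method, and return-value names are hypothetical. In the real design the allocating access may be a load or a store, while this model only issues loads.

```python
BLOCK_BYTES = 32     # cache block size
RELOAD_ENTRIES = 8   # eight-entry reload buffer
LFQ_ENTRIES = 4      # assumption: one shared four-entry load fold queue


def block_of(addr: int) -> int:
    return addr & ~(BLOCK_BYTES - 1)


class ReloadModel:
    """Toy model of L1 data cache load-miss handling on the MPC7400."""

    def __init__(self):
        self.pending = set()  # blocks that own a reload buffer entry
        self.lfq = []         # addresses of folded loads awaiting data

    def load_miss(self, addr: int) -> str:
        block = block_of(addr)
        if block in self.pending:
            # Same block already being reloaded: fold if the LFQ has room.
            if len(self.lfq) < LFQ_ENTRIES:
                self.lfq.append(addr)
                return "folded"
            return "blocked"  # LFQ full
        if len(self.pending) < RELOAD_ENTRIES:
            self.pending.add(block)
            return "allocated"
        return "blocked"  # reload buffer full

    def reload_done(self, addr: int) -> None:
        """Block data returned: free its entry and the loads folded onto it."""
        block = block_of(addr)
        self.pending.discard(block)
        self.lfq = [a for a in self.lfq if block_of(a) != block]


if __name__ == "__main__":
    m = ReloadModel()
    print(m.load_miss(0x1000))                                 # allocated
    print([m.load_miss(0x1000 + 4 * i) for i in range(1, 5)])  # four folds
    print(m.load_miss(0x1018))                                 # blocked
```

On the MPC750, the second access in this sequence would instead stall until the critical word for the first miss arrived.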
Store miss merging
In the MPC750, if a second store misses to the same cache block, it must wait for the critical word of the
first store before it can write its data. The MPC7400 can merge several stores to the same cache block
into the same entry in its reload buffer. If enough stores merge to write all 32 bytes of the cache block
(usually via two back-to-back AltiVec store misses), then no data needs to be loaded from the bus and an
address-only transaction (KILL) is broadcast instead.
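The merge-then-decide behavior can be sketched with a toy reload buffer entry that tracks which of the 32 bytes have been written. The class and method names are hypothetical, and "read" is only a placeholder label for "the block must be loaded from the bus"; this table does not name the MPX bus transaction used in that case.

```python
BLOCK_BYTES = 32                    # cache block size on the MPC7400
FULL_MASK = (1 << BLOCK_BYTES) - 1  # one valid bit per byte of the block


class MergeEntry:
    """Toy model of a reload buffer entry that merges store misses."""

    def __init__(self, block_addr: int):
        self.block = block_addr & ~(BLOCK_BYTES - 1)
        self.valid = 0  # bit i set => byte i was written by a merged store

    def merge_store(self, addr: int, size: int) -> None:
        offset = addr - self.block
        if offset < 0 or offset + size > BLOCK_BYTES:
            raise ValueError("store does not fall inside this cache block")
        self.valid |= ((1 << size) - 1) << offset

    def bus_transaction(self) -> str:
        """Which kind of bus transaction this entry needs."""
        if self.valid == FULL_MASK:
            return "KILL"  # every byte written: address-only, no data fetched
        return "read"  # some bytes missing: the block must be loaded from the bus


if __name__ == "__main__":
    entry = MergeEntry(0x4000)
    entry.merge_store(0x4000, 16)   # first 16-byte AltiVec store miss
    print(entry.bus_transaction())  # read
    entry.merge_store(0x4010, 16)   # back-to-back second AltiVec store miss
    print(entry.bus_transaction())  # KILL
```

Two 16-byte AltiVec stores cover the whole 32-byte block, which is why that pattern is the usual way to reach the KILL case.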