Chapter 6. Instruction Pipeline and Timing
For More Information On This Product,
Go to: www.freescale.com
Instruction Fetch Pipeline (IFP)
128-entry, direct-mapped prediction table unit (PTU).
Predicts Bcc instructions that miss in the branch cache. If the PTU predicts the
branch as taken, the Bcc is accelerated in the same manner as in the Version 3
processor. This acceleration is implemented in the IED stage of the prefetch pipeline
and consists of the hardware required to calculate the target instruction address,
which is then fed back into the IFP's IAG stage. This mechanism is also used for
certain unconditional change-of-flow instructions. Because the IFP and OEP are
decoupled, correctly predicted accelerated branches usually execute in 1 cycle.
Again, a hashed address is generated to index into the prediction table. This hashed
address is defined as follows:
hashedPtuAddress[6:0] = IfpAddr[15:9] XOR IfpAddr[8:2]
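The hash above can be sketched in C as follows. This is a minimal illustration of the bit-field XOR only, not a description of the hardware implementation; the function name ptu_index and the constant PTU_ENTRIES are hypothetical names chosen for this example.

```c
#include <stdint.h>

/* Number of PTU entries, from the text: 128-entry, direct-mapped. */
#define PTU_ENTRIES 128u

/*
 * hashedPtuAddress[6:0] = IfpAddr[15:9] XOR IfpAddr[8:2]
 * ptu_index is a hypothetical helper name used only for illustration.
 */
uint8_t ptu_index(uint32_t ifp_addr)
{
    uint32_t upper = (ifp_addr >> 9) & 0x7Fu;  /* IfpAddr[15:9] */
    uint32_t lower = (ifp_addr >> 2) & 0x7Fu;  /* IfpAddr[8:2]  */
    return (uint8_t)(upper ^ lower);           /* 7-bit index, 0..127 */
}
```

For example, ptu_index(0x00001234) evaluates to 0x04. Address bits above bit 15 and bits [1:0] do not take part in the hash, and different addresses can produce the same index: 0x0004 and 0x0200 both map to entry 1.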
4-entry LIFO hardware return stack. Accelerates subroutine return instructions.
Because an RTS can return control to any number of target addresses, RTS opcodes
do not benefit from traditional branch cache structures, so the four-entry stack
greatly improves performance of these instructions. This stack is invisible to
application software. When a subroutine call is executed, the IFP pushes the return
program counter (PC) onto the stack. When a subroutine return is encountered in the
prefetch stream, the top entry of the LIFO stack is popped (if valid) and used to
establish a new prefetch stream. The OEP subsequently verifies that the predicted
target address matches the return address on top of the memory-based system stack.
If the addresses differ, the processor aborts processing and reestablishes control at the
address defined by the memory stack. Table 6-2 lists RTS execution times.
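The push, pop, and verify sequence described for the return stack can be modeled with a small C sketch. It is a behavioral illustration under stated assumptions, not the hardware design: all names (return_stack, ras_push, ras_pop, rts_cycles) are hypothetical, and the text does not say what happens when a fifth call arrives with the stack full, so dropping the oldest entry is an assumption. The cycle counts returned are the three cases listed in Table 6-2.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAS_DEPTH 4u  /* four-entry hardware return stack, per the text */

typedef struct {
    uint32_t pc[RAS_DEPTH];  /* saved return PCs, pc[count - 1] is the top */
    unsigned count;          /* number of valid entries, 0..RAS_DEPTH */
} return_stack;

/* Subroutine call: push the return PC.
 * ASSUMPTION: when the stack is full, the oldest entry is discarded. */
void ras_push(return_stack *s, uint32_t return_pc)
{
    if (s->count == RAS_DEPTH) {
        for (unsigned i = 1; i < RAS_DEPTH; i++)
            s->pc[i - 1] = s->pc[i];
        s->count--;
    }
    s->pc[s->count++] = return_pc;
}

/* Subroutine return in the prefetch stream: pop the top entry if valid. */
bool ras_pop(return_stack *s, uint32_t *predicted_pc)
{
    if (s->count == 0)
        return false;            /* no valid entry, so no prediction */
    *predicted_pc = s->pc[--s->count];
    return true;
}

/* Model one RTS: compare the prediction with the return address held on
 * the memory-based system stack and return the cycle count of Table 6-2. */
unsigned rts_cycles(return_stack *s, uint32_t memory_stack_pc)
{
    uint32_t predicted;
    if (!ras_pop(s, &predicted))
        return 8;                                   /* not predicted */
    return (predicted == memory_stack_pc) ? 2 : 9;  /* correct : incorrect */
}
```

Under the overflow assumption used here, calls nested five deep would see the four innermost returns predicted and the outermost one fall back to the not-predicted case.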
V4 core performance has been evaluated across a large suite of compiled (no assembly
language optimizations) embedded benchmarks, from which the following IFP
branch-related performance parameters have been measured:
• 64% of all Bcc instructions are folded so they execute in 0 cycles.
• 87% of the predictions provided by the BCU + PTU on Bcc instructions are correct.
• Conditional branches typically account for 11% of the dynamic pathlength.
• 99% of all RTS opcodes are predicted correctly by the hardware return stack.
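Two of the figures above can be combined to estimate how much of the executed instruction stream consists of folded branches. The sketch below is simple arithmetic on the quoted percentages, assuming both are measured over the same dynamic instruction count; the function name folded_bcc_share is hypothetical.

```c
/* Share of the dynamic instruction stream made up of folded (0-cycle)
 * Bcc instructions: (Bcc share of pathlength) x (fraction of Bcc folded).
 * ASSUMPTION: both inputs refer to the same dynamic instruction count. */
double folded_bcc_share(double bcc_share, double folded_fraction)
{
    return bcc_share * folded_fraction;
}
```

With the measured values, folded_bcc_share(0.11, 0.64) is about 0.07, that is, roughly 7 of every 100 executed instructions are conditional branches that take 0 cycles.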
Table 6-2. V4 RTS Execution Times

Execution Time   Condition
2 (1/0)          Predicted and correct
8 (1/0)          Not predicted
9 (1/0)          Predicted but incorrect

The decoupled IFP and OEP and the Harvard architecture of the V4 core efficiently handle the
variable-length ColdFire instruction set. Performance measurements indicate that a base
CPI degradation factor of 0.06 cycles per instruction is caused by the OEP waiting for
opwords or extension words to be supplied by the IFP. In some cases, this factor can be