reduced by forcing branch target instructions to be aligned on 0-modulo-4 addresses at the
cost of increased code size. In all cases, the 16-bit TPF instruction (0x51FC) should be used
for text fill, rather than a NOP instruction (0x4E71). NOP synchronizes the pipeline as it
begins execution, producing a 6-cycle minimum latency versus the 1-cycle TPF opcode.
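For reference, the fragment below shows one way to apply this guideline. The use of GNU as for ColdFire and its .balignw directive is an assumption here; the fill opcodes are the TPF (0x51FC) and NOP (0x4E71) encodings given above.

    | A minimal sketch of branch-target alignment, assuming GNU as (m68k/ColdFire)
    | syntax; .balignw pads to the requested boundary with a 16-bit fill word, so
    | the padding is the TPF opcode (0x51FC) rather than NOP (0x4E71) and avoids
    | the pipeline synchronization a NOP would impose if the fill were executed.
            .balignw 4, 0x51fc      | branch target aligned to a 0-modulo-4 address
    loop_top:
            addq.l  #1, %d0         | loop body
            cmp.l   %d1, %d0
            bne     loop_top        | backward branch to the aligned target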
6.3 Operand Execution Pipeline (OEP)
The two instruction registers in the decode stage (DS) of the OEP are loaded from the FIFO
instruction buffer or are bypassed directly from the instruction early decode (IED). The
OEP consists of two traditional two-stage RISC compute engines, each with dual-ported
register file access feeding an arithmetic logic unit (ALU).
The compute engine at the top of the OEP (the address ALU) is typically used for operand
address calculations; the execution ALU at the bottom is used for instruction execution. The
resulting structure provides almost 24 Mbytes/MHz of bandwidth to the two compute
engines and supports single-cycle execution speeds for most instructions, including all load
and store operations and most embedded-load operations. The V4 OEP supports the
ColdFire Revision B instruction set, which adds a few new instructions to improve
performance and code density.
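For example, the instructions below (GNU as m68k/ColdFire syntax is assumed) illustrate the operation classes mentioned above; they are illustrative only, and the single-cycle behavior described in this section is a property of the pipeline rather than of these particular encodings.

            move.l  (%a0), %d0      | load: memory operand read into a register
            move.l  %d0, (%a1)      | store: register written to a memory operand
            add.l   (%a2), %d1      | embedded-load operation: the memory source
                                    | operand feeds the ALU within one instruction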
The OEP also implements the following advanced performance features:
• Stalls are minimized by dynamically choosing between the address ALU and the
execution ALU for instruction execution, based on the pipeline state.
• The address ALU and register renaming resources together can execute heavily used
opcodes and forward results to subsequent instructions with no pipeline stalls.
• Instruction folding involving MOVE instructions allows two instructions to be
issued in one cycle. The resulting microarchitecture approaches full superscalar
performance at a much lower silicon cost.
Unrolling the OEP into five stages improves V4 performance. The resulting structure is
termed ‘limited superscalar’ because certain heavily used instruction constructs support
multiple-instruction dispatch. In particular, instruction folding, in which two consecutive
operations are combined into a single issue, effectively creates zero-cycle latency for some
instructions.
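As an illustration of the folding concept, the sequence below (GNU as syntax assumed) shows the kind of consecutive MOVE plus dependent-operation pair that folding targets; whether any particular pair actually folds is determined by the pipeline state, and this pairing is hypothetical.

            move.l  %d2, %d0        | first instruction of the candidate pair: a MOVE
            add.l   %d1, %d0        | second instruction; when the pair is folded,
                                    | both are issued in a single cycle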
6.3.1 V4 OEP Conceptual Pipeline Model
The basic compute engine for the V4 ColdFire processor consists of a two-stage
pipeline—a register file with dual read ports feeding an arithmetic/logic unit (ALU). This
compute engine follows the traditional RISC model and is a three-terminal device: two
input operands and a result. Because the ColdFire ISA is not a pure load/store model, the
OEP consists of two of these compute engines, one for operand address generation in the
DS/OAG stages and one for instruction execution located in the OC2/EX stages. The resulting port
list defines the following set of resources associated with each compute engine: