
makes the register available for reuse. Later, the stored data is reloaded for further
processing. The better RISC compilers try to keep data in registers longer, and use
register–to–register copying rather than register–to–memory copying.
Memory Shadowing
The performance impact of a memory access is reduced when the access is per-
formed to a copy–back data cache. However, most processors do not have this advan-
tage available to them. The term “memory shadowing” refers to the increased use of
registers for data variable storage. Again, directing accesses to registers rather than
off–chip memory has significant performance advantages. Of course, if a variable is
defined volatile, it cannot be held in a register.
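To make the idea concrete, the following sketch (the names and the device address are
invented for illustration) shows data that can be shadowed in registers alongside data
that cannot. The summation variables can live in registers for the whole loop, while
the volatile status register must be re-read from memory on every test:

    /* Sketch only: names and the device address are invented. */
    #define STATUS_REG (*(volatile unsigned int *)0x80000000)

    unsigned int sum_samples(const unsigned int *buf, int n)
    {
        unsigned int sum = 0;
        int i;

        for (i = 0; i < n; i++)   /* sum, i and n can all be shadowed in registers */
            sum += buf[i];
        return sum;
    }

    void wait_ready(void)
    {
        /* STATUS_REG is declared volatile: each test below is a real
           off-chip memory access; the compiler may not keep a copy of
           the value in a register. */
        while ((STATUS_REG & 0x1) == 0)
            ;
    }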
Memory References Are Coalesced and Aligned
Data memory can be most efficiently accessed using burst–mode addressing.
This requires the use of load– and store–multiple instructions. When a sufficiently
large data object is being moved between memory and registers, it is best to use the
burst–mode supported instructions. The compiler can also arrange for frequently ac-
cessed data to be located (coalesced) in adjacent memory locations, even if the data
variables were not consecutively defined.
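As a sketch of the kind of source code that gives the compiler this opportunity (the
structure and routine below are invented for illustration), a structure assignment
moves a block of consecutive words and can be compiled into load–multiple and
store–multiple sequences rather than a series of single–word accesses:

    /* Sketch only: the structure and routine are invented. */
    struct packet {
        unsigned int header;
        unsigned int addr;
        unsigned int data[6];     /* 8 words in total */
    };

    void copy_packet(struct packet *dst, const struct packet *src)
    {
        /* The assignment moves eight consecutive words; the compiler is
           free to implement it with load-multiple and store-multiple
           instructions, which use burst-mode addressing. */
        *dst = *src;
    }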
There are also performance benefits to be had by aligning target instructions on
cache block boundaries. For example, a procedure can be aligned to start on a 4–word
boundary. This improves cache utilization and performance, particularly with
caches which do not support partially filled cache blocks.
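The alignment is normally chosen by the compiler and linker rather than written by
the programmer, but the effect can be illustrated at the source level. With a compiler
that accepts GNU–style function attributes (an assumption made for this sketch, not a
statement about any particular 29K toolchain), a time–critical routine can be forced
onto a 4–word (16–byte) boundary:

    /* Illustration only: force the procedure to start on a 16-byte
       (4-word) boundary, i.e. at the beginning of a cache block. */
    void critical_routine(void) __attribute__((aligned(16)));

    void critical_routine(void)
    {
        /* ... time-critical code ... */
    }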
Delay Slot Filling
The compilers perform “delay slot filling” (see section 3.1.8). Delay slots occur
whenever a 29K processor experiences a disruption in consecutive instruction execu-
tion. The processor always executes the instruction in the decode pipeline stage, even
if the execute stage contains a jump instruction. Delay slot is the term given to the
instruction following the jump or conditional branch instruction. Effectively, the
branch instruction is delayed by one cycle. Unlike an assembly language programmer, the
compiler can usually find a useful instruction to place after a branch instruction.
Because such an instruction is executed regardless of the branch condition, it
effectively executes at no cost. Typically, an instruction that is invariant to the
branch outcome is moved into the delay slot just after the branch or jump instruction.
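As a simple illustration (the routine below is invented), consider a loop in which one
of the updates does not affect the loop test. The add instruction that implements
len++ does not feed the end–of–string comparison, so the compiler can move it from in
front of the loop's backward branch into the branch's delay slot without changing the
program's behavior:

    /* Sketch only: an invented routine showing a branch-invariant
       instruction.  len++ executes once per iteration whether or not
       the backward branch is taken, so its add instruction is a
       natural candidate for the branch's delay slot. */
    int string_length(const char *s)
    {
        int len = 0;

        while (*s != '\0') {
            s++;
            len++;
        }
        return len;
    }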
Jump Optimizations
Because of the pipeline stalling effects of jump instructions, scheduling these
instructions can achieve significant performance improvements. The objective is to
reduce the number of taken branches. For example, code loops typically have a condi-
tional test at the top of the loop to check for loop completion, plus a jump at the
bottom back to the top. This results in two branch instructions being executed on each
iteration of the loop. If the conditional branch is moved to the bottom of the loop,
only one branch is executed per iteration.
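The transformation is performed by the compiler, but it corresponds at the source
level to rewriting a while loop as a guarded do–while loop (the routine below is
invented for illustration; the two versions behave identically):

    /* Sketch only: two equivalent versions of an invented routine. */

    void clear_words(unsigned int *p, int n)
    {
        int i;

        /* Test at the top: the generated loop has a conditional branch
           at the top and an unconditional jump at the bottom, so two
           branches are executed on every iteration. */
        for (i = 0; i < n; i++)
            p[i] = 0;
    }

    void clear_words_rotated(unsigned int *p, int n)
    {
        int i = 0;

        if (i < n) {              /* guard test, executed once */
            do {
                p[i] = 0;
                i++;
            } while (i < n);      /* one conditional branch per iteration */
        }
    }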