
RISC Microprocessor Division
In this slide, we depict a potential stall that can occur with branches. The code fragments demonstrate
how, in some cases, one can use branches that are foldable to attain better performance than using
non-foldable branches.
The two loops repeat for COUNT iterations. The first code fragment initializes the CTR and uses only one instruction to control the looping, bdnz. (bdnz is a simplified mnemonic for a conditional branch which decrements the CTR and branches if CTR is not zero.) This branch cannot be folded and must be dispatched. Since branches that dispatch are required to retire from the last stage of the completion unit, any loop involving a branch that dispatches may need an extra clock (in addition to the loop body time) to complete execution.
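To make the bdnz loop-control semantics concrete, the following is a minimal functional sketch in Python. It models only the behavior described above (decrement CTR, then branch if CTR is not zero), not the pipeline timing. The function name run_bdnz_loop, the body callback, and the 32-bit register width are assumptions made for illustration and do not come from the slide.

```python
def run_bdnz_loop(count, body):
    """Model a CTR-controlled loop: run the body, then decrement CTR and
    branch back while CTR is not zero. Returns the number of iterations."""
    mask = 0xFFFFFFFF            # assumption: CTR modeled as a 32-bit register
    ctr = count & mask           # loop setup: CTR is initialized to COUNT
    iterations = 0
    while True:
        body(iterations)         # loop body
        iterations += 1
        ctr = (ctr - 1) & mask   # bdnz: decrement CTR first...
        if ctr == 0:             # ...and fall through once CTR reaches zero
            break
    return iterations
```

For any COUNT of at least 1, the body runs exactly COUNT times. In this model a COUNT of 0 would wrap around rather than skip the loop, so the sketch should only be called with positive counts.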
It is possible to avoid the additional latency by using a foldable branch instead of the bdnz. The bgt and the subi. instructions in the second code fragment can be used to obtain the same functionality as the bdnz. The subi. instruction is a single-cycle instruction that can retire paired with almost any other instruction; thus, in most loops, subi. adds no time to the execution of that loop. The bgt can also be folded out of the pipeline so that it is never dispatched. Therefore, code that uses the subi./bgt combination will likely be a clock faster each time through the loop than code that uses bdnz. However, the exact timing difference, if any, depends on the actual composition of the loop body.
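The functional equivalence of the subi./bgt pair can be sketched in Python as follows. This is an illustrative model only: run_subi_bgt_loop and estimated_clocks_saved are hypothetical names, and the one-clock-per-iteration saving is an assumption taken from the "likely" case described above, not a measured or guaranteed figure.

```python
def run_subi_bgt_loop(count, body):
    """Model a GPR-controlled loop: run the body, subtract 1 from the counter
    while recording the result's comparison with zero, then branch back if the
    result is greater than zero. Returns the number of iterations."""
    r = count                    # loop counter held in a general-purpose register
    iterations = 0
    while True:
        body(iterations)         # loop body
        iterations += 1
        r -= 1                   # subi.: decrement and record the result vs. zero
        if not r > 0:            # bgt: branch back only while the result is > 0
            break
    return iterations


def estimated_clocks_saved(count, clocks_per_iteration=1):
    """Toy estimate of total clocks saved over the whole loop.
    Assumption: a fixed saving per iteration; the real figure depends on
    the composition of the loop body and may be zero."""
    return count * clocks_per_iteration
```

For any COUNT of at least 1, this loop also runs its body exactly COUNT times, which is the sense in which the pair replaces bdnz. Under the stated assumption, a loop of 100 iterations would save on the order of 100 clocks.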