19/231
uPSD33xx
Pre-Fetch Queue (PFQ) and Branch Cache (BC)
The PFQ is always working to minimize the idle
bus time inherent to 8032 MCU architecture, to
eliminate wasted memory fetches, and to maxi-
mize memory bandwidth to the MCU. The PFQ
does this by running asynchronously in relation to
the MCU, looking ahead to pre-fetch code from
program memory during any idle bus periods. Only
necessary bytes are fetched (no dummy fetches,
unlike a standard 8032). The PFQ will queue up to
six code bytes in advance of execution, which sig-
nificantly optimizes sequential program perfor-
mance. However, when program execution
becomes non-sequential (program branch), a typ-
ical pre-fetch queue will empty itself and reload
new code, causing the MCU to stall. The Turbo
uPSD33xx diminishes this problem by using a
Branch Cache with the PFQ. The BC is a four-way,
fully associative cache, meaning that when a
program branch occurs, its branch destination
address is compared simultaneously with four
recent branch destinations stored in the BC.
Each of the four cache entries contains up to six
bytes of code related to a branch. If there is a hit
(a match), then all six code bytes of the matching
program branch are transferred immediately and
simultaneously from the BC to the PFQ, and exe-
cution on that branch continues with minimal de-
lay. This greatly reduces the chance that the MCU
will stall from an empty PFQ, and improves perfor-
mance in embedded control systems where it is
quite common to branch and loop in relatively
small code localities.
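The lookup described above can be sketched as a small software model. This
is only an illustration, not the device's implementation: the class and
function names are hypothetical, and the oldest-entry replacement policy is
an assumption, since the text does not say how an entry is chosen for
replacement on a miss.

```python
from collections import deque

ENTRY_BYTES = 6   # each entry holds up to six code bytes (from the text)
NUM_ENTRIES = 4   # four entries, all compared at once in hardware


class BranchCache:
    """Toy model of a four-entry, fully associative branch cache."""

    def __init__(self):
        # ASSUMPTION: the oldest entry is replaced first (FIFO). The text
        # does not state the real replacement policy.
        self.entries = deque(maxlen=NUM_ENTRIES)  # (dest_addr, code_bytes)

    def lookup(self, dest_addr):
        """Return the cached bytes for dest_addr, or None on a miss."""
        for addr, code in self.entries:  # sequential here, parallel in hardware
            if addr == dest_addr:
                return code
        return None

    def fill(self, dest_addr, code):
        """Store up to six bytes fetched from a branch destination."""
        self.entries.append((dest_addr, bytes(code[:ENTRY_BYTES])))


def take_branch(bc, program, dest_addr):
    """Return (hit, bytes handed to the PFQ) for a branch to dest_addr."""
    cached = bc.lookup(dest_addr)
    if cached is not None:
        return True, cached            # hit: bytes come straight from the BC
    fetched = bytes(program[dest_addr:dest_addr + ENTRY_BYTES])
    bc.fill(dest_addr, fetched)        # miss: fetch from memory, then cache
    return False, fetched
```

A tight loop that branches back to the same address misses on the first
pass and hits on every later pass, which is the small-locality case the
paragraph describes.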
By default, the PFQ and BC are enabled after
power-up or reset. The 8032 can disable the PFQ
and BC at runtime if desired by writing to a specific
SFR (BUSCON).
The memory in the PSD module operates with
variable wait states depending on the value spec-
ified in the SFR named BUSCON. For example, a
5V uPSD33xx device operating at a 40MHz crystal
frequency requires four memory wait states (equal
to four MCU clocks). In this example, once the
PFQ has one or more bytes of code, the wait
states become transparent and a full 10 MIPS is
achieved when the program stream consists of se-
quential one-byte, one machine-cycle instructions
as shown in Figure 7., page 18 (transparent
because a machine-cycle is four MCU clocks which
equals the memory pre-fetch wait time that is also
four MCU clocks). But it is also important to
understand PFQ operation on multi-cycle instructions.
PFQ Example, Multi-cycle Instructions
Let us look at a string of two-byte, two-cycle
instructions in Figure 9., page 20. There are three
instructions executed sequentially in this example,
instructions A, B, and C. Each of the time divisions
in the figure is one machine-cycle of four clocks,
and there are six phases to reference in this dis-
cussion. Each instruction is pre-fetched into the
PFQ in advance of execution by the MCU. Prior to
Phase 1, the PFQ has pre-fetched the two instruc-
tion bytes (A1 and A2) of instruction A. During
Phase 1, both bytes are loaded into the MCU
execution unit. Also in Phase 1, the PFQ is pre-
fetching the first byte (B1) of instruction B from
program memory. In Phase 2, the MCU is pro-
cessing Instruction A internally while the PFQ is
pre-fetching the second byte (B2) of Instruction B.
In Phase 3, both bytes of instruction B are loaded
into the MCU execution unit and the PFQ begins
to pre-fetch bytes for the third instruction C. In
Phase 4, Instruction B is processed and the
pre-fetching continues, eliminating idle bus cycles and
feeding a continuous flow of operands and op-
codes to the MCU execution unit.
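The six phases above can be reproduced with a short table-building sketch.
It assumes the 5V, 40MHz case from the text, where the four memory wait
states equal one four-clock machine-cycle, so the PFQ fetches one byte per
phase. The function name and the "load"/"process" labels are illustrative
only.

```python
def pfq_timeline(names, bytes_per_instr=2, cycles_per_instr=2):
    """Return one (phase, mcu_activity, pfq_fetch) tuple per machine-cycle."""
    # Bytes still to be fetched. The first instruction is assumed to be in
    # the queue already, as stated for "prior to Phase 1" in the text.
    pending = [f"{n}{i}" for n in names[1:]
               for i in range(1, bytes_per_instr + 1)]
    rows = []
    phase = 1
    for name in names:
        for cycle in range(cycles_per_instr):
            mcu = f"load {name}" if cycle == 0 else f"process {name}"
            # ASSUMPTION: one byte is fetched per machine-cycle.
            fetch = pending.pop(0) if pending else None
            rows.append((phase, mcu, fetch))
            phase += 1
    return rows


for phase, mcu, fetch in pfq_timeline(["A", "B", "C"]):
    print(f"Phase {phase}: MCU {mcu:<10} PFQ fetches {fetch}")
```

The sketch does not model a stall. It only shows that, for this instruction
stream, each instruction's bytes arrive before the phase in which they are
loaded.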
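The uPSD33xx MCU instructions are an exact 1/3
scale of all standard 8032 instructions with regard
to the number of clocks per instruction: each takes
the same number of machine-cycles, but a
machine-cycle is four MCU clocks instead of the
twelve used by a standard 8032. Figure 10., page 20
shows the equivalent instruction sequence from the
example above on a standard 8032 for comparison.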
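As a rough illustration of the 1/3 scaling, the sketch below converts
machine-cycles to execution time for both cores at the same clock. The
four-clock machine-cycle comes from the text; the twelve-clock machine-cycle
is the usual figure for a standard 8032. The function name is hypothetical,
and fetch stalls are ignored.

```python
# Clocks per machine-cycle for each core.
UPSD33XX_CLOCKS_PER_CYCLE = 4    # from the text
STD_8032_CLOCKS_PER_CYCLE = 12   # usual value for a standard 8032


def instruction_time_ns(machine_cycles, clocks_per_cycle, clock_mhz):
    """Execution time of one instruction in nanoseconds (no stalls)."""
    return machine_cycles * clocks_per_cycle * 1000.0 / clock_mhz


for cycles in (1, 2, 4):
    turbo = instruction_time_ns(cycles, UPSD33XX_CLOCKS_PER_CYCLE, 40)
    std = instruction_time_ns(cycles, STD_8032_CLOCKS_PER_CYCLE, 40)
    print(f"{cycles}-cycle: uPSD33xx {turbo:.0f} ns, standard 8032 {std:.0f} ns")
```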
Aggregate Performance
The stream of two-byte, two-cycle instructions in
Figure 9., page 20, running on a 40MHz, 5V
uPSD33xx, will yield 5 MIPS. And we saw the
stream of one-byte, one-cycle instructions in
Figure 7., page 18, on the same MCU yield 10 MIPS.
Effective performance will depend on a number of
things: the MCU clock frequency; the mixture of in-
structions types (bytes and cycles) in the applica-
tion; the amount of time an empty PFQ stalls the
MCU (mix of instruction types and misses on
Branch Cache); and the operating voltage. A 5V
uPSD33xx device operates with four memory wait
states, but a 3.3V device operates with five
memory wait states, yielding 8 MIPS peak at 40MHz
compared to 10 MIPS peak for a 5V device. The same number of
wait states will apply to both program fetches and
to data READ/WRITEs unless otherwise specified
in the SFR named BUSCON.
In general, a 3X aggregate performance increase
is expected over any standard 8032 application
running at the same clock frequency.
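The peak figures quoted in this section (10, 5, and 8 MIPS) can be
reproduced with a simplified model. The model is an assumption made for
illustration, not a formula from the datasheet: in a steady sequential
stream, each instruction is taken to cost the larger of its execution time
(machine-cycles times four clocks) and its fetch time (bytes times wait
states). Branch stalls and Branch Cache misses are ignored, and the function
names are hypothetical.

```python
CLOCKS_PER_MACHINE_CYCLE = 4  # from the text


def clocks_per_instruction(wait_states, instr_bytes=1, machine_cycles=1):
    """ASSUMED model: limited by the slower of execution and fetching."""
    exec_clocks = machine_cycles * CLOCKS_PER_MACHINE_CYCLE
    fetch_clocks = instr_bytes * wait_states
    return max(exec_clocks, fetch_clocks)


def peak_mips(clock_mhz, wait_states, instr_bytes=1, machine_cycles=1):
    """Peak MIPS for a sequential stream of identical instructions."""
    return clock_mhz / clocks_per_instruction(
        wait_states, instr_bytes, machine_cycles)


def mix_mips(clock_mhz, wait_states, mix):
    """Estimate for a mix given as (fraction, bytes, machine_cycles) tuples."""
    avg_clocks = sum(frac * clocks_per_instruction(wait_states, nbytes, cycles)
                     for frac, nbytes, cycles in mix)
    return clock_mhz / avg_clocks


print(peak_mips(40, 4))         # 5V, one-byte, one-cycle stream
print(peak_mips(40, 4, 2, 2))   # 5V, two-byte, two-cycle stream
print(peak_mips(40, 5))         # 3.3V, one-byte, one-cycle stream
print(mix_mips(40, 4, [(0.5, 1, 1), (0.5, 2, 2)]))  # 50/50 mix at 5V
```

Real applications will fall below these estimates whenever a branch empties
the PFQ and misses the Branch Cache.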