uPSD34xx - 8032 MCU CORE PERFORMANCE ENHANCEMENTS
Pre-Fetch Queue (PFQ) and Branch Cache (BC)
The PFQ works continuously to minimize the idle bus time inherent to the
8032 MCU architecture, to eliminate wasted memory fetches, and to
maximize memory bandwidth to the MCU. It does this by running
asynchronously in relation to the MCU, looking ahead to pre-fetch two
bytes (one word) of code from program memory during any idle bus period.
Only necessary words are fetched (there are no dummy fetches as on a
standard 8032). The PFQ queues up to four code bytes in advance of
execution, which significantly improves sequential program performance.
However, when program execution becomes non-sequential (a program
branch), a typical pre-fetch queue empties itself and reloads new code,
causing the MCU to stall. The Turbo uPSD34xx reduces this problem by
pairing the PFQ with a Branch Cache. The BC is a four-way, fully
associative cache: when a program branch occurs, its branch destination
address is compared simultaneously with four recent branch destinations
stored in the BC. Each of the four cache entries contains up to four
bytes of code related to a branch. If there is a hit (a match), all four
code bytes of the matching program branch are transferred immediately
and simultaneously from the BC to the PFQ, and execution on that branch
continues with minimal delay. This greatly reduces the chance that the
MCU will stall on an empty PFQ, and improves performance in embedded
control systems, where it is common to branch and loop within relatively
small code localities.
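
The Branch Cache lookup described above can be illustrated with a small
software model. This is only a sketch: the class and function names are
hypothetical, and the least-recently-used replacement policy is an
assumption, since the text says only that four recent branch destinations
are kept. The hardware compares all four entries at once; the dictionary
lookup here stands in for that parallel compare.

```python
from collections import OrderedDict


class BranchCache:
    """Toy model of a four-entry, fully associative branch cache."""

    def __init__(self, entries=4, bytes_per_entry=4):
        self.entries = entries
        self.bytes_per_entry = bytes_per_entry
        # Maps branch destination address -> code bytes at that address.
        # Insertion order doubles as recency order (oldest first).
        self._lines = OrderedDict()

    def lookup(self, dest):
        """Return the cached code bytes on a hit, or None on a miss."""
        code = self._lines.get(dest)
        if code is not None:
            self._lines.move_to_end(dest)  # mark as most recently used
        return code

    def fill(self, dest, code):
        """Store code for a branch destination, evicting the oldest entry.

        LRU eviction is an assumption; the source does not state a policy.
        """
        if dest in self._lines:
            self._lines.move_to_end(dest)
        elif len(self._lines) >= self.entries:
            self._lines.popitem(last=False)
        self._lines[dest] = bytes(code[: self.bytes_per_entry])


def take_branch(cache, program, dest):
    """Return (code_bytes, hit) for a branch to address dest.

    On a hit the bytes come straight from the cache; on a miss they are
    re-read from program memory (the slow path that stalls the MCU) and
    the cache is updated.
    """
    code = cache.lookup(dest)
    if code is not None:
        return code, True
    code = bytes(program[dest : dest + cache.bytes_per_entry])
    cache.fill(dest, code)
    return code, False


if __name__ == "__main__":
    program = bytes(range(64))          # stand-in for program memory
    bc = BranchCache()
    print(take_branch(bc, program, 0x10))   # first visit: miss
    print(take_branch(bc, program, 0x10))   # loop back: hit
```

A tight loop that keeps branching to the same few destinations hits on
every iteration after the first, which is the small-code-locality case
the text describes.
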
By default, the PFQ and BC are enabled after
power-up or reset. The 8032 can disable the PFQ
and BC at runtime if desired by writing to a specific
SFR (BUSCON).
The memory in the PSD module operates with a variable number of wait
states, set by the value in the SFR named BUSCON. For example, a 5V
uPSD34xx device operating at a 40MHz crystal frequency requires four
memory wait states (equal to four MCU clocks). In this example, once the
PFQ holds one word of code, the wait states become transparent and a
full 10 MIPS is achieved when the program stream consists of sequential
one- or two-byte, one machine-cycle instructions, as shown in Figure 7,
page 19. The wait states are transparent because a machine-cycle is four
MCU clocks, which equals the memory pre-fetch wait time of four MCU
clocks. It is also important to understand PFQ operation on multi-cycle
instructions.
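
The arithmetic behind the "transparent wait states" claim can be checked
with a few lines of code. This is a simplified sketch, not a timing
specification: the function names are hypothetical, and it assumes that
each one-cycle instruction needs exactly one word fetch and that a word
fetch costs as many MCU clocks as there are wait states, as in the
example above.

```python
CLOCKS_PER_MACHINE_CYCLE = 4  # from the text: one machine-cycle is four clocks


def wait_states_hidden(wait_states, clocks_per_cycle=CLOCKS_PER_MACHINE_CYCLE):
    """True when a word pre-fetch completes within one machine-cycle."""
    return wait_states <= clocks_per_cycle


def sequential_peak_mips(clock_mhz, wait_states,
                         clocks_per_cycle=CLOCKS_PER_MACHINE_CYCLE):
    """Peak MIPS for a stream of one-cycle, one-word instructions.

    Assumption: execution and pre-fetch overlap, so each instruction
    costs whichever is longer, the machine-cycle or the word fetch.
    """
    clocks_per_instruction = max(clocks_per_cycle, wait_states)
    return clock_mhz / clocks_per_instruction


if __name__ == "__main__":
    # 5V device at 40MHz with four wait states, as in the example above.
    print(wait_states_hidden(4))         # True
    print(sequential_peak_mips(40, 4))   # 10.0
```
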
PFQ Example, Multi-cycle Instructions
Consider a string of two-byte, two-cycle instructions, shown in Figure
9, page 21. Three instructions, A, B, and C, are executed sequentially
in this example. Each time division in the figure is one machine-cycle
of four clocks, and there are six phases to reference in this
discussion. Each instruction is pre-fetched into the PFQ in advance of
execution by the MCU. Prior to Phase 1, the PFQ has pre-fetched the two
instruction bytes (A1 and A2) of Instruction A. During Phase 1, both
bytes are loaded into the MCU execution unit. Also in Phase 1, the PFQ
pre-fetches Instruction B (bytes B1 and B2) from program memory. In
Phase 2, the MCU processes Instruction A internally while the PFQ
pre-fetches Instruction C. In Phase 3, both bytes of Instruction B are
loaded into the MCU execution unit and the PFQ begins to pre-fetch bytes
for the next instruction. In Phase 4, Instruction B is processed.
The uPSD34xx MCU instructions are an exact 1/3 scale of all standard
8032 instructions with regard to the number of clocks per instruction: a
uPSD34xx machine-cycle is four MCU clocks, whereas a standard 8032
machine-cycle is twelve. Figure 10, page 21 shows the equivalent
instruction sequence from the example above on a standard 8032 for
comparison.
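
To put numbers on the 1/3 scaling, the sketch below computes the
execution time of a two-cycle instruction on both cores at the same
clock frequency. The names are hypothetical. The four-clock machine-cycle
comes from the text; the twelve-clock machine-cycle is the usual figure
for a standard 8032. PFQ stalls are ignored.

```python
UPSD_CLOCKS_PER_CYCLE = 4        # uPSD34xx machine-cycle, from the text
STD_8032_CLOCKS_PER_CYCLE = 12   # classic 8032 machine-cycle


def instruction_time_ns(machine_cycles, clock_mhz, clocks_per_cycle):
    """Execution time in nanoseconds for one instruction."""
    return machine_cycles * clocks_per_cycle * 1000.0 / clock_mhz


if __name__ == "__main__":
    # A two-cycle instruction at 40MHz on each core.
    upsd = instruction_time_ns(2, 40, UPSD_CLOCKS_PER_CYCLE)
    std = instruction_time_ns(2, 40, STD_8032_CLOCKS_PER_CYCLE)
    print(upsd, std, std / upsd)   # 200.0 600.0 3.0
```
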
Aggregate Performance
The stream of two-byte, two-cycle instructions in Figure 9, page 21,
running on a 40MHz, 5V uPSD34xx, yields 5 MIPS. The stream of one- or
two-byte, one-cycle instructions in Figure 7, page 19, yields 10 MIPS on
the same MCU. Effective performance depends on several factors: the MCU
clock frequency; the mixture of instruction types (bytes and cycles) in
the application; the amount of time an empty PFQ stalls the MCU (which
depends on the mix of instruction types and on Branch Cache misses); and
the operating voltage. A 5V uPSD34xx device operates with four memory
wait states, but a 3.3V device operates with five memory wait states,
yielding 8 MIPS peak compared to 10 MIPS peak for a 5V device. The same
number of wait states applies to both program fetches and data
READ/WRITEs unless otherwise specified in the SFR named BUSCON.
In general, a 3X aggregate performance increase
is expected over any standard 8032 application
running at the same clock frequency.
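
A rough way to estimate effective performance from the factors listed
above is to average the clocks per instruction over the instruction mix
and add an allowance for PFQ stalls. The function below is a hypothetical
back-of-the-envelope model, not a formula from this document. It assumes
a 5V device, where the wait states are hidden behind the four-clock
machine-cycle, and it leaves the stall allowance as a number the user
must measure or guess.

```python
def effective_mips(clock_mhz, mix, stall_clocks_per_instruction=0.0,
                   clocks_per_cycle=4):
    """Estimate MIPS for an instruction mix.

    mix: iterable of (weight, machine_cycles) pairs, for example
         [(0.7, 1), (0.3, 2)] for 70% one-cycle, 30% two-cycle.
    stall_clocks_per_instruction: average clocks lost per instruction to
         an empty PFQ (assumed input; depends on Branch Cache misses).
    """
    mix = list(mix)
    total_weight = sum(weight for weight, _ in mix)
    if total_weight <= 0:
        raise ValueError("instruction mix must have positive total weight")
    avg_clocks = sum(weight * cycles * clocks_per_cycle
                     for weight, cycles in mix) / total_weight
    return clock_mhz / (avg_clocks + stall_clocks_per_instruction)


if __name__ == "__main__":
    print(effective_mips(40, [(1.0, 1)]))   # 10.0, all one-cycle
    print(effective_mips(40, [(1.0, 2)]))   # 5.0, all two-cycle
    # Half and half, with one clock lost per instruction to stalls.
    print(effective_mips(40, [(0.5, 1), (0.5, 2)], 1.0))
```

The two pure streams reproduce the 10 MIPS and 5 MIPS figures quoted
above; real applications fall between them and lose a little more to
stalls.
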