Preliminary
...the world's most energy friendly microcontrollers
2011-05-19 - d0034_Rev0.91
31
www.energymicro.com
7.3.5.2 Zero Wait-state Access
At 16 MHz and below, read operations from flash may be performed without any wait-states. Zero
wait-state access greatly improves code execution performance at these frequencies.
By default, the Cortex-M3 uses speculative prefetching and If-Then block folding to maximize code
execution performance at the cost of additional flash accesses and energy consumption.
7.3.5.3 Suppressed Conditional Branch Target Prefetch (SCBTP)
The MSC offers a special instruction fetch mode which optimizes energy consumption by cancelling
Cortex-M3 conditional branch target prefetches. Normally, the Cortex-M3 core prefetches both the next
sequential instruction and the instruction at the branch target address when a conditional branch
instruction reaches the pipeline decode stage. This prefetch scheme improves performance, but one
extra instruction is fetched from memory at each conditional branch, regardless of whether the branch is
taken or not. To optimize for low energy, the MSC can be configured to cancel these speculative branch
target prefetches. With this configuration, energy consumption is lower, as the branch target
instruction fetch is delayed until the branch condition is evaluated.
The performance penalty with this mode enabled depends on the source code, but is normally less than
1% for core frequencies of 16 MHz and below. To enable the mode at frequencies of 16 MHz and
below, write WS0SCBTP to the MODE field of the MSC_READCTRL register. For frequencies above 16
MHz, use the WS1SCBTP mode. A higher performance penalty per clock cycle must be expected
in this mode than in WS0SCBTP mode. The performance penalty in WS1SCBTP mode depends
greatly on the density and organization of conditional branch instructions in the code.
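The mode selection described above can be sketched in C as follows. This is only an illustrative sketch: the MODE field position and the WS0SCBTP/WS1SCBTP encodings used here are assumptions that should be checked against the MSC_READCTRL register description, and the helper names are hypothetical. The register is passed as a pointer so the same code can be exercised on a host with an ordinary variable.

```c
#include <stdint.h>

/* Assumed layout of MSC_READCTRL; verify against the register description. */
#define READCTRL_MODE_MASK      0x7u  /* assumption: MODE occupies bits [2:0] */
#define READCTRL_MODE_WS0SCBTP  0x2u  /* assumption: encoding of WS0SCBTP */
#define READCTRL_MODE_WS1SCBTP  0x3u  /* assumption: encoding of WS1SCBTP */

/* Highest core frequency that allows zero wait-state flash access. */
#define ZERO_WS_MAX_HZ 16000000u

/* Pick the SCBTP mode that matches the core frequency (hypothetical helper). */
static uint32_t scbtp_mode_for(uint32_t core_hz)
{
    return (core_hz <= ZERO_WS_MAX_HZ) ? READCTRL_MODE_WS0SCBTP
                                       : READCTRL_MODE_WS1SCBTP;
}

/* Read-modify-write only the MODE field, leaving all other bits untouched. */
static void readctrl_set_scbtp(volatile uint32_t *readctrl, uint32_t core_hz)
{
    uint32_t value = *readctrl;
    value &= ~READCTRL_MODE_MASK;
    value |= scbtp_mode_for(core_hz);
    *readctrl = value;
}
```

On the target, the pointer would be the address of MSC_READCTRL as given in the device's register map.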
7.3.5.4 Cortex-M3 If-Then Block Folding
The Cortex-M3 offers a mechanism known as if-then block folding. This is a form of speculative
prefetching where small if-then blocks are collapsed in the prefetch buffer if the condition evaluates to
false. The instructions in the block then appear to execute in zero cycles. With this scheme, performance
is optimized at the cost of higher energy consumption as the processor fetches more instructions from
memory than it actually executes. To disable the mode, write a 1 to the DISFOLD bit in the NVIC Auxiliary
Control Register; see the Cortex-M3 Technical Reference Manual for details. This feature is normally
expected to be most efficient at core frequencies above 16 MHz. Folding is enabled by default.
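As a minimal sketch of the DISFOLD write described above, the helpers below set, clear, and test the bit through a pointer to the Auxiliary Control Register. The bit position (bit 2) and the helper names are assumptions made for illustration; the Cortex-M3 Technical Reference Manual remains the authority for the register address and layout on a given core revision.

```c
#include <stdint.h>

/* Assumption: DISFOLD is bit 2 of the Auxiliary Control Register.
   Verify against the Cortex-M3 Technical Reference Manual. */
#define ACTLR_DISFOLD (1u << 2)

/* Disable if-then block folding by setting DISFOLD. */
static void folding_disable(volatile uint32_t *actlr)
{
    *actlr |= ACTLR_DISFOLD;
}

/* Re-enable folding (the default state) by clearing DISFOLD. */
static void folding_enable(volatile uint32_t *actlr)
{
    *actlr &= ~ACTLR_DISFOLD;
}

/* Returns 1 when folding is enabled, i.e. DISFOLD is clear. */
static int folding_is_enabled(const volatile uint32_t *actlr)
{
    return (*actlr & ACTLR_DISFOLD) == 0u;
}
```

Passing the register by pointer keeps the other bits of the register intact and lets the logic be checked on a host before it is pointed at the real register.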
7.3.5.5 Instruction Cache
The MSC includes an instruction cache. The instruction cache for the internal flash memory is enabled
by default, but can be disabled by setting IFCDIS in MSC_READCTRL. When enabled, the instruction
cache typically reduces the number of flash reads significantly, thus saving energy. In most cases, a
cache hit rate of more than 70% is achievable. When a 32-bit instruction fetch hits in the cache, the data
is returned to the processor in one clock cycle. Thus, performance is also improved when wait-states
are used (i.e. when running at frequencies above 16 MHz).
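To make the effect of the hit rate concrete, the sketch below computes a hit rate from hit and miss counts (such as those kept by the cache's performance counters) and a rough average fetch cost. The cost model is an assumption for illustration only: a hit is taken to cost one cycle, as stated above, and a miss is assumed to cost one cycle plus the configured number of wait-states. The function names are hypothetical.

```c
#include <stdint.h>

/* Hit rate in whole percent, rounded down; 0 when nothing was counted. */
static uint32_t cache_hit_rate_percent(uint32_t hits, uint32_t misses)
{
    uint64_t total = (uint64_t)hits + misses;
    if (total == 0u) {
        return 0u;
    }
    return (uint32_t)(((uint64_t)hits * 100u) / total);
}

/* Average fetch cost in hundredths of a cycle.
   Assumption: hit = 1 cycle, miss = 1 + wait_states cycles. */
static uint32_t avg_fetch_centicycles(uint32_t hits, uint32_t misses,
                                      uint32_t wait_states)
{
    uint64_t total = (uint64_t)hits + misses;
    if (total == 0u) {
        return 0u;
    }
    uint64_t cycles = (uint64_t)hits
                    + (uint64_t)misses * ((uint64_t)wait_states + 1u);
    return (uint32_t)((cycles * 100u) / total);
}
```

Under this simplified model, 700 hits and 300 misses with one wait-state average 1.30 cycles per fetch, compared with 2.00 cycles if every fetch went to flash.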
The instruction cache is connected directly to the ICODE bus on the Cortex-M3 and functions as a
memory access filter between the processor and the memory system, as illustrated in
Figure 7.2 (p.32). The cache consists of an access filter, lookup logic, a 128x32 SRAM (512 bytes) and two
performance counters. The access filter checks that the address for the access is to on-chip flash
memory (instructions in RAM are not cached). If the address matches, the cache lookup logic and SRAM
are enabled. Otherwise, the cache is bypassed and the access is forwarded to the memory system.
The cache is then updated when the memory access completes. The access filter also disables cache
updates for interrupt context accesses if caching in interrupt context is disabled. The performance
counters, when enabled, keep track of the number of cache hits and misses. The cache consists of 16
8-word cachelines organized as 4 sets with 4 ways. The cachelines are filled continuously, one word
at a time, as the individual words are requested by the processor. Thus, not all words of a cacheline
are necessarily valid at a given time.
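The geometry above (4 sets, 4 ways, 8-word lines, 512 bytes in total) and the word-by-word line fill can be modelled with a few lines of C. This is a simplified host-side model, not the hardware implementation: the way the address is split into word offset, set index, and tag follows the conventional scheme and is an assumption, the type and function names are hypothetical, and way selection and replacement are not modelled.

```c
#include <stdint.h>

#define CACHE_SETS  4u
#define CACHE_WAYS  4u
#define LINE_WORDS  8u
#define WORD_BYTES  4u
#define CACHE_BYTES (CACHE_SETS * CACHE_WAYS * LINE_WORDS * WORD_BYTES)

/* One cacheline: a tag plus one valid bit per word (bit n = word n). */
typedef struct {
    uint32_t tag;
    uint32_t valid;
} cache_line_t;

/* Assumed address split: word offset, then set index, then tag. */
static uint32_t addr_word(uint32_t addr)
{
    return (addr / WORD_BYTES) % LINE_WORDS;
}

static uint32_t addr_set(uint32_t addr)
{
    return (addr / (WORD_BYTES * LINE_WORDS)) % CACHE_SETS;
}

static uint32_t addr_tag(uint32_t addr)
{
    return addr / (WORD_BYTES * LINE_WORDS * CACHE_SETS);
}

/* A fetch hits only if the tag matches and that word has been filled. */
static int line_hit(const cache_line_t *line, uint32_t addr)
{
    return line->tag == addr_tag(addr)
        && ((line->valid >> addr_word(addr)) & 1u) != 0u;
}

/* Fill a single word; a new tag invalidates the words held so far. */
static void line_fill_word(cache_line_t *line, uint32_t addr)
{
    uint32_t tag = addr_tag(addr);
    if (line->tag != tag) {
        line->tag = tag;
        line->valid = 0u;
    }
    line->valid |= 1u << addr_word(addr);
}
```

The per-word valid bits capture the point made above: after one word of a line has been fetched, a fetch of a neighbouring word in the same line still misses until that word has been requested as well.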