
File: cstm.fm5, modified 7/26/99
PRELIMINARY INFORMATION
Chapter 4
Custom Operations for Multimedia
by Gert Slavenburg, Pieter v.d. Meulen, Yong Cho, Sang-Ju Park
4.1 CUSTOM OPERATION OVERVIEW
Custom operations in the TM1100 DSPCPU architecture
are specialized, high-function operations designed to
dramatically improve performance in important multime-
dia applications. When properly incorporated into appli-
cation source code, custom operations enable an appli-
cation to take advantage of the highly parallel TM1100
microprocessor implementation. Achieving a similar per-
formance increase through other means—e.g., execut-
ing a higher number of traditional microprocessor in-
structions per cycle—would be prohibitively expensive
for TM1100’s low-cost target applications.
Custom operations are simple to understand and consis-
tent in their definition, but their unusual functions make it
difficult for automatic code generation algorithms to use
them effectively. Consequently, custom operations are
inserted into source code by the programmer. To make
this process as painless as possible, custom operation
syntax is consistent with the C programming language,
and, just as with all other operations generated by the
compiler, the scheduler takes care of register allocation,
operation packing, and flow analysis.
4.1.1 Custom Operation Motivation
For both general-purpose and embedded microproces-
sor-based applications, programming in a high-level lan-
guage is desirable. To effectively support optimizing
compilers and a simple programming model, certain mi-
croprocessor architecture features are needed, such as
a large, linear address space, general-purpose registers,
and register-to-register operations that directly support
the manipulation of linear address pointers. A common
choice in microprocessor architectures is 32-bit linear
addresses, 32-bit registers, and 32-bit integer opera-
tions. TM1100 is such a microprocessor architecture.
For the data manipulation in many algorithms, however,
32-bit data and operations are wasteful of expensive sil-
icon resources. Important multimedia applications, such
as the decompression of MPEG video streams, spend
significant amounts of execution time dealing with eight-
bit data items. Using 32-bit operations to manipulate
small data items makes inefficient use of 32-bit execution
hardware in the implementation. If these 32-bit resources
could be used instead to operate on four eight-bit data
items simultaneously, performance would be improved
by a significant factor with only a tiny increase in imple-
mentation cost.
Getting the highest execution rate from standard micro-
processor resources is one of the motivations behind
custom operations in TM1100. A range of custom opera-
tions is provided that each process—simultaneously—
four eight-bit or two sixteen-bit data items. There is little
cost difference between a standard 32-bit ALU and one
that can process either one pair of 32-bit operands or
four pairs of eight-bit operands, but there is a big perfor-
mance difference for TM1100’s target applications.
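The idea of treating one 32-bit word as four independent eight-bit lanes can be sketched in portable C. The sketch below is only an illustration of the principle, not a TM1100 operation: the function names packed_add_u8x4 and bytewise_add_u8x4 are hypothetical, and the lane-wise add (each lane wrapping modulo 256) is an assumed example function. A custom operation would perform this kind of work in a single operation rather than through the masking shown here.

```c
#include <stdint.h>

/* Hypothetical helper (not a TM1100 operation): adds four unsigned 8-bit
   lanes packed into 32-bit words. Each lane wraps modulo 256 and no carry
   leaks into the neighbouring lane. */
uint32_t packed_add_u8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the cleared top bits absorb
       any carry, so lanes cannot disturb each other. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Fold the top bit of each lane back in with XOR (a carry-less add). */
    return low ^ ((a ^ b) & 0x80808080u);
}

/* Baseline for comparison: the same result computed one byte at a time,
   which is what plain 32-bit operations on small data items amount to. */
uint32_t bytewise_add_u8x4(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        result |= (uint32_t)(uint8_t)(x + y) << (8 * lane);
    }
    return result;
}
```

The byte-at-a-time version needs four extract, add, and insert sequences to produce what the packed version produces in a handful of word-wide steps, which is the gap that four-lane hardware closes.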
TM1100’s custom operations go beyond simply making
the best use of standard resources. Custom operations
that combine several simple operations are provided.
These combinations of operations are tailored specifical-
ly to the needs of important multimedia applications.
Some high-function custom operations eliminate condi-
tional branches, which helps the scheduler make effec-
tive use of all five operation slots in each TM1100 in-
struction. Filling up all five slots is especially important in
the inner loops of computationally intensive multimedia
applications.
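As a rough illustration of why removing conditional branches matters, consider clipping a value to a range. The sketch below is plain C, not TM1100 code; clip_with_branches, clip_branch_free, and select_i32 are hypothetical names, and the mask-based selection is only one assumed way to express the computation without control flow. A high-function custom operation would fold such a computation into one operation, leaving the scheduler a straight-line loop body to pack into the issue slots.

```c
#include <stdint.h>

/* Conventional version: two data-dependent branches per value. */
int32_t clip_with_branches(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Hypothetical helper: returns a when cond is nonzero, otherwise b,
   using an all-ones or all-zeros mask instead of a branch. */
static int32_t select_i32(int cond, int32_t a, int32_t b)
{
    uint32_t mask = 0u - (uint32_t)(cond != 0);   /* 0xFFFFFFFF or 0 */
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/* Same result expressed as straight-line selections. */
int32_t clip_branch_free(int32_t x, int32_t lo, int32_t hi)
{
    x = select_i32(x < lo, lo, x);
    x = select_i32(x > hi, hi, x);
    return x;
}
```

A conventional compiler is still free to emit branches for either version; the point of the sketch is only that the second form has no control dependence in the source.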
In short, custom operations help TM1100 reach its goals
of extremely high multimedia performance at the lowest
possible cost.
4.1.2 Introduction to Custom Operations
Table 4-1 and Table 4-2 list the custom operations available in
the TM1100 architecture. Table 4-1 groups the custom operations
by type of function, while Table 4-2 lists the operations by
operand size.
For more detailed information about the custom opera-
Some operations exist in several versions that differ in
the treatment of their operands and results, and the mnemonics
for these versions make it easy to select the appropriate
operation. For example, the sum of products operations all have
“fir” in their mnemonics; the prefix and suffix of the mnemonic
express the treatment of the operands and result. The ifir8ii
operation treats both of its operands as signed (the “ii” suffix)
and produces a signed result (the “i” prefix). The ifir8iu
operation treats its first operand as signed (the first suffix
letter, “i”), the second as unsigned (the second suffix letter,
“u”), and produces a signed result (the “i” prefix). The ume8ii
operation implements an eight-bit motion estimation; it treats
both operands as signed (“ii”) but produces an unsigned result
(the “u” prefix).
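The naming rules can be made concrete with portable C reference models. These are sketches under stated assumptions, not the architecture's definitions: the function names model_ifir8ii, model_ifir8iu, and model_ume8ii are hypothetical, the "fir" operations are assumed to sum the four products of corresponding eight-bit lanes, and the motion-estimation operation is assumed to sum the absolute differences of corresponding lanes. The detailed operation descriptions remain the authority on exact behavior.

```c
#include <stdint.h>
#include <stdlib.h>

/* Extract lane 0..3 of a packed word as a signed 8-bit value. */
static int32_t lane_s8(uint32_t word, int lane)
{
    return (int8_t)(uint8_t)(word >> (8 * lane));
}

/* Extract lane 0..3 of a packed word as an unsigned 8-bit value. */
static int32_t lane_u8(uint32_t word, int lane)
{
    return (uint8_t)(word >> (8 * lane));
}

/* Assumed model of ifir8ii: signed result, both operands signed. */
int32_t model_ifir8ii(uint32_t a, uint32_t b)
{
    int32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += lane_s8(a, lane) * lane_s8(b, lane);
    return sum;
}

/* Assumed model of ifir8iu: signed result, first operand signed,
   second operand unsigned. */
int32_t model_ifir8iu(uint32_t a, uint32_t b)
{
    int32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += lane_s8(a, lane) * lane_u8(b, lane);
    return sum;
}

/* Assumed model of ume8ii: unsigned result, both operands signed;
   sums the absolute lane differences. */
uint32_t model_ume8ii(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += (uint32_t)abs(lane_s8(a, lane) - lane_s8(b, lane));
    return sum;
}
```

Feeding the same packed words to model_ifir8ii and model_ifir8iu gives different results whenever a lane of the second operand has its top bit set, which is exactly the distinction the suffix letters encode.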
The operations beginning with “dsp” implement a clip-
ping (sometimes called saturating) function before stor-
ing the result(s) in the destination register. Otherwise,
their naming follows the rules given above where appro-
priate. For example, the dspuquadaddui operation imple-