TM1300 Data Book
Philips Semiconductors
4-4
PRODUCT SPECIFICATION
ed for further computations (the TM1300 optimizing C
compiler performs this analysis automatically). In this ex-
ample, the transpose matrix is placed in registers R18,
R19, R20, and R21. The final four store-word operations
put the transposed matrix back into memory.
Thus, using the TM1300 custom operations, the byte-
matrix transposition requires four load-word operations
and four store-word operations (the minimum possible)
and eight register-to-register data-manipulation opera-
tions. The result is 16 operations, or byte-matrix transpo-
sition at the rate of one operation per byte.
While the advantage of the custom-operation-based
algorithm over the brute-force code that uses 24 load-byte
and store-byte instructions seems to be only eight
operations (a 33% reduction), the advantage is actually
much greater. First, using custom operations, the number
of memory references is reduced from 24 to eight (a factor
of three). Since memory references are slower than
register-to-register operations (such as the custom
operations in this example), the reduction in memory
references is significant.
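
The count of 24 byte-wide memory references can be checked with a short host-side sketch. It assumes the brute-force code swaps the six off-diagonal pairs of the 4x4 matrix in place and leaves the diagonal untouched. The names mem_count and transpose_bytewise are hypothetical and are not part of the TM1300 tools.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of byte-wide memory references. */
struct mem_count {
    int loads;
    int stores;
};

/* Brute-force in-place transpose of a 4x4 byte matrix.
 * Assumption: only the six off-diagonal pairs are swapped, and every
 * byte load and byte store is counted as one memory reference. */
static struct mem_count transpose_bytewise(unsigned char m[4][4])
{
    struct mem_count n = {0, 0};
    assert(m != NULL);
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            unsigned char upper = m[i][j];  n.loads++;   /* load-byte  */
            unsigned char lower = m[j][i];  n.loads++;   /* load-byte  */
            m[i][j] = lower;                n.stores++;  /* store-byte */
            m[j][i] = upper;                n.stores++;  /* store-byte */
        }
    }
    return n;
}
```

Under that assumption the tally is 12 loads and 12 stores, against the four load-word and four store-word operations of the custom-operation version.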
Further, the ability of the TM1300 VLIW compilation sys-
tem to exploit the performance potential of the TM1300
microprocessor hardware is enhanced by the custom-
operation-based code. This is because it is easier for the
compilation system to produce an optimal schedule (ar-
rangement) of the code when the number of memory ref-
erences is in balance with the number of register-to-reg-
ister operations. The TM1300 CPU (like all high-
performance microprocessors) has a limit on the number
of memory references that can be processed in a single
cycle (two is the current limit). A long sequence of code
that contains only memory references can result in emp-
ty operation slots in the long TM1300 instructions. Empty
operation slots waste the performance potential of the
TM1300 hardware.
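
The effect of the memory-reference limit on slot usage can be illustrated with a rough lower-bound model. This is only a sketch: it ignores data dependences, operation latencies, and any restrictions on which slot an operation may occupy. The limit of two memory references per cycle comes from the text above. The figure of five operation slots per instruction is an assumption brought in from outside this section, and the function names are hypothetical.

```c
#include <assert.h>

/* Two memory references per cycle is stated in the text.
 * Five operation slots per instruction is an assumption. */
#define MEM_REFS_PER_CYCLE    2
#define SLOTS_PER_INSTRUCTION 5

static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Lower bound on the number of long instructions needed for a block
 * with the given operation mix (dependences and latencies ignored). */
static int min_instructions(int mem_ops, int reg_ops)
{
    assert(mem_ops >= 0 && reg_ops >= 0);
    int by_memory = ceil_div(mem_ops, MEM_REFS_PER_CYCLE);
    int by_slots  = ceil_div(mem_ops + reg_ops, SLOTS_PER_INSTRUCTION);
    return by_memory > by_slots ? by_memory : by_slots;
}

/* Operation slots left empty if the block is scheduled at that bound. */
static int empty_slots(int mem_ops, int reg_ops)
{
    return min_instructions(mem_ops, reg_ops) * SLOTS_PER_INSTRUCTION
           - (mem_ops + reg_ops);
}
```

With these assumptions, the brute-force code (24 memory references, no register-to-register operations) is bounded at 12 instructions with 36 empty slots. The custom-operation code (eight memory references, eight register-to-register operations) is bounded at four instructions with four empty slots.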
As this example has shown, careful use of custom
operations can not only reduce the absolute number of
operations needed to perform a computation but also help
the compilation system produce code that fully exploits
the performance potential of the TM1300 CPU.
4.3 EXAMPLE 2: MPEG IMAGE RECONSTRUCTION
The complete MPEG video decoding algorithm is composed
of many different phases, each with computationally
intensive kernels. One important kernel deals with
reconstructing a single image frame given that the
forward- and backward-predicted frames and the inverse
discrete cosine transform (IDCT) results have already been
computed. This kernel provides an excellent opportunity to
illustrate the power of TM1300's specialized custom
operators.
In the code fragments that follow, the backward-predict-
ed block is assumed to have been computed into an ar-
ray back[], the forward-predicted block is assumed to
have been computed into forward[], and the IDCT results
are assumed to have been computed into idct[].
[Figure 4-2 diagram]
Row Major (source words):      a b c d | e f g h | i j k l | m n o p
mergemsb results:              a e b f | i m j n
mergelsb results:              c g d h | k o l p
pack16msb / pack16lsb applied to each pair of merge results
Column Major (result words):   a e i m | b f j n | c g k o | d h l p
Figure 4-2. Application of merge and pack instructions to the byte-matrix transposition of Figure 4-1.
ld32d(0) r100 → r10
ld32d(4) r100 → r11
ld32d(8) r100 → r12
ld32d(12) r100 → r13
mergemsb r10 r11 → r14
mergemsb r12 r13 → r15
mergelsb r10 r11 → r16
mergelsb r12 r13 → r17
pack16msb r14 r15 → r18
pack16lsb r14 r15 → r19
pack16msb r16 r17 → r20
pack16lsb r16 r17 → r21
st32d(0) r101 r18
st32d(4) r101 r19
st32d(8) r101 r20
st32d(12) r101 r21
char matrix[4][4];
.
.
.
int *m = (int *) matrix;
temp0 = MERGEMSB(m[0], m[1]);
temp1 = MERGEMSB(m[2], m[3]);
temp2 = MERGELSB(m[0], m[1]);
temp3 = MERGELSB(m[2], m[3]);
m[0] = PACK16MSB(temp0, temp1);
m[1] = PACK16LSB(temp0, temp1);
m[2] = PACK16MSB(temp2, temp3);
m[3] = PACK16LSB(temp2, temp3);
.
.
.
Figure 4-3. On the left is a complete list of operations to perform the byte-matrix transposition of Figure 4-1 and Figure 4-2. On the right is an equivalent C-language fragment.
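
The C fragment in Figure 4-3 depends on the TM1300 custom operations, so it cannot be traced directly on a host machine. The sketch below models the four operations in portable C, with semantics inferred from the byte patterns shown in Figure 4-2. The lower-case function names and transpose_words are hypothetical stand-ins, not the compiler's custom-operation interface. Each word is assumed to hold the first byte of a matrix row in its most-significant byte.

```c
#include <stdint.h>

/* Interleave the two most-significant bytes of x and y:
 * (a b c d), (e f g h) -> (a e b f). */
static uint32_t mergemsb(uint32_t x, uint32_t y)
{
    return (x & 0xFF000000u) | ((y >> 8) & 0x00FF0000u)
         | ((x >> 8) & 0x0000FF00u) | ((y >> 16) & 0x000000FFu);
}

/* Interleave the two least-significant bytes of x and y:
 * (a b c d), (e f g h) -> (c g d h). */
static uint32_t mergelsb(uint32_t x, uint32_t y)
{
    return ((x << 16) & 0xFF000000u) | ((y << 8) & 0x00FF0000u)
         | ((x << 8) & 0x0000FF00u) | (y & 0x000000FFu);
}

/* Concatenate the upper halfwords of x and y. */
static uint32_t pack16msb(uint32_t x, uint32_t y)
{
    return (x & 0xFFFF0000u) | (y >> 16);
}

/* Concatenate the lower halfwords of x and y. */
static uint32_t pack16lsb(uint32_t x, uint32_t y)
{
    return (x << 16) | (y & 0x0000FFFFu);
}

/* Same sequence as the C fragment of Figure 4-3, applied to four
 * row words held in m[0..3]. */
static void transpose_words(uint32_t m[4])
{
    uint32_t temp0 = mergemsb(m[0], m[1]);
    uint32_t temp1 = mergemsb(m[2], m[3]);
    uint32_t temp2 = mergelsb(m[0], m[1]);
    uint32_t temp3 = mergelsb(m[2], m[3]);
    m[0] = pack16msb(temp0, temp1);
    m[1] = pack16lsb(temp0, temp1);
    m[2] = pack16msb(temp2, temp3);
    m[3] = pack16lsb(temp2, temp3);
}
```

If the rows are loaded as the ASCII words "abcd", "efgh", "ijkl", and "mnop", the model leaves "aeim", "bfjn", "cgko", and "dhlp" in m[0] through m[3], matching the column-major layout of Figure 4-2.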