
Philips Semiconductors
Custom Operations for Multimedia
File: cstm.fm5, modified 7/26/99
PRELIMINARY INFORMATION
4-9
Though the steps are presented here in detail, a pro-
grammer with a even a little experience can often per-
form these transformations by visual inspection.
If we hope to use a custom operation that processes four
pixel values simultaneously, we first need to create four
parallel pixel computations.
Figure 4-10 shows the loop
of
Figure 4-9 unrolled by a factor of four. Unfortunately,
the code in the unrolled loop is not parallel because each
line depends on the one above it.
from
Figure 4-10. By simply giving each computation its
own cost variable and then summing the costs all at
once, each cost computation is completely independent.
Excluding the array accesses, the loop body in
Figure 4-11 is recognizable now as exactly the function
performed by the ume8uu custom operation: the sum of
four absolute values of four differences. To use the
ume8uu operation, however, the code must access the
arrays with 32-bit word pointers instead of with 8-bit byte
pointers.
B[][] as one-dimensional instead of as two-dimensional
arrays. We take advantage of our knowledge of C-lan-
guage array storage conventions to perform this code
transformation. Recoding to use one-dimensional arrays
prepares the code for the transformation to 32-bit array
accesses.
(From here on, until the final code is shown, the declara-
tions of the A and B arrays will be omitted from the code
fragments for the sake of brevity.)
use ume8uu. Once again taking advantage of our knowl-
edge of the C-language array storage conventions, the
one-dimensional byte array is now accessed as a one-di-
mensional 32-bit-word array. The declarations of the
pointers IA and IB as pointers to integers is the key, but
also notice that the multiplier in the expression for rowoff-
set has been scaled from 16 to four to account for the fact
that there are four bytes in a 32-bit word.
Of course, since we are now using one-dimensional ar-
rays to access the pixel data, it is natural to use a single
for loop instead of two.
Figure 4-14 shows this stream-
lined version of the code without the inner loop. Since C-
language arrays are stored as a linear vector of values,
we can simply increase the number of iterations of the
outer loop from 16 to 64 to traverse the entire array.
The recoding and use of the ume8uu operation has re-
sulted in a substantial improvement in the performance
of the match-cost loop. In the original version, the code
unsigned char A[16][16];
unsigned char B[16][16];
.
for (row = 0; row < 16; row += 1)
{
for (col = 0; col < 16; col += 4)
{
cost += abs(A[row][col+0] – B[row][col+0]);
cost += abs(A[row][col+1] – B[row][col+1]);
cost += abs(A[row][col+2] – B[row][col+2]);
cost += abs(A[row][col+3] – B[row][col+3]);
Figure 4-10. Unrolled, but not parallel, version of the loop from Figure 4-9. unsigned char A[16][16];
unsigned char B[16][16];
.
for (row = 0; row < 16; row += 1)
{
for (col = 0; col < 16; col += 4)
{
cost0 = abs(A[row][col+0] – B[row][col+0]);
cost1 = abs(A[row][col+1] – B[row][col+1]);
cost2 = abs(A[row][col+2] – B[row][col+2]);
cost3 = abs(A[row][col+3] – B[row][col+3]);
cost += cost0 + cost1 + cost2 + cost3;
loop eliminated.
unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
for (i = 0; i < 64; i += 1)
cost += UME8UU(IA[i], IB[i]);