`#define RIDX(i,j,n) ((i)*(n)+(j))`

**Naive Rotate Function**(Click to show)

```
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
```

**My Rotate Function**(Click to show)

```
void rotate(int dim, pixel *src, pixel *dst)
{
int i, j, dimm_i, dimm, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14, i15;
for (i = 0; i < dim; i+=16) {
i1 = i + 1;
i2 = i + 2;
i3 = i + 3;
i4 = i + 4;
i5 = i + 5;
i6 = i + 6;
i7 = i + 7;
i8 = i + 8 ;
i9 = i + 9 ;
i10 = i + 10;
i11 = i + 11;
i12 = i + 12;
i13 = i + 13;
i14 = i + 14;
i15 = i + 15;
for (j = 0; j < dim; j++) {
dimm = (dim - 1 - j)*dim;
dimm_i = dimm + i;
dst[dimm_i] = src[i * dim + j];
dst[dimm_i + 1] = src[i1 * dim + j];
dst[dimm_i + 2] = src[i2 * dim + j];
dst[dimm_i + 3] = src[i3 * dim + j];
dst[dimm_i + 4] = src[i4 * dim + j];
dst[dimm_i + 5] = src[i5 * dim + j];
dst[dimm_i + 6] = src[i6 * dim + j];
dst[dimm_i + 7] = src[i7 * dim + j];
dst[dimm_i + 8] = src[i8 * dim + j];
dst[dimm_i + 9] = src[i9 * dim + j];
dst[dimm_i + 10] = src[i10 * dim + j];
dst[dimm_i + 11] = src[i11 * dim + j];
dst[dimm_i + 12] = src[i12 * dim + j];
dst[dimm_i + 13] = src[i13 * dim + j];
dst[dimm_i + 14] = src[i14 * dim + j];
dst[dimm_i + 15] = src[i15 * dim + j];
}
}
}
```

I just unrolled the outer loop by 16 so there were a lot of "parallel" operations, and that hit our required improvement. It's ugly, but it worked.

**Naive Smooth Function**(Click to show)

```
void naive_smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
```

**Helper Code for Smooth**(Click to show)

```
/* A struct used to compute averaged pixel value */
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
/* Compute min and max of two integers, respectively */
static int min(int a, int b) { return (a < b ? a : b); }
static int max(int a, int b) { return (a > b ? a : b); }
/*
* initialize_pixel_sum - Initializes all fields of sum to 0
*/
static void initialize_pixel_sum(pixel_sum *sum)
{
sum->red = sum->green = sum->blue = 0;
sum->num = 0;
return;
}
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum(pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
current_pixel->red = (unsigned short) (sum.red/sum.num);
current_pixel->green = (unsigned short) (sum.green/sum.num);
current_pixel->blue = (unsigned short) (sum.blue/sum.num);
return;
}
/*
* avg - Returns averaged pixel value at (i,j)
*/
static pixel avg(int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(¤t_pixel, sum);
return current_pixel;
}
```

Basically, the base version does checks for every pixel to see the i and j bounds for adjacent pixels, adds the r, g, b, values to a local pixel_sum object that has a counter for how many pixels it accumulated, then it divides at end to get the average and returns it such that the destination matrix holds the averaged values.

**My Current Smooth Function**(Click to show)

```
void smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
int dimSub1 = dim - 1;
int dimSub1xDim = dimSub1 * dim;
dst[0] = avg_top_left_corner(dim, 0, 0, src);
dst[dimSub1] = avg_top_right_corner(dim, 0, dimSub1, src);
dst[dimSub1xDim] = avg_bottom_left_corner(dim, dimSub1, 0, src);
dst[(dim * dim) - 1] = avg_bottom_right_corner(dim, dimSub1, dimSub1, src);
for (j = 1; j < dimSub1; j++) {
dst[j] = avg_top(dim, 0, j, src);
dst[dimSub1xDim + j] = avg_bottom(dim, dimSub1, j, src);
}
for (i = 1; i < dimSub1; i++) {
dst[i * dim] = avg_left(dim, i, 0, src);
dst[i * dim + dimSub1] = avg_right(dim, i, dimSub1, src);
}
for (i = 1; i < dimSub1; i++)
for (j = 1; j < dimSub1; j++)
dst[i * dim + j] = avg(dim, i, j, src);
}
```

A big part of the slowdown for the original avg function was all the checks it had to do for finding the bounds around a given pixel, so I just broke it up into different functions for each corner, and the four edges between the corners, explicitly setting the bounds for each case (e.g. for the top left corner, the bounds are i, i+1, j, j+1 such that it would average using the three neighbors and itself). The last loop goes through the remaining "bulk" of the matrix inside of the edges/corners.

This gave me a fairly good improvement, but I'm still behind. I've tried unrolling on the outer loop by 2, the inner loop by 2, keeping all the operations in one set of loops, but it gets slower every time.

Any insight? This goes pretty low level, dealing with available registers, cache hits/misses and memory accessing, function calls, etc.

**EfemaN**

I'm going to go ahead and post what my solution ended up being, in case someone stumbles upon this down the road. These are just in the order that I remember them.

- The biggest speedup came from removing checks by splitting the code up into special cases. For example, I had different functions for each of the four corners, and the edges between the corners, and then left the inner bulk of the matrix to one function. This let me define the bounds explicitly, so the function could just run. I didn't use conditionals to keep it compact, as mis-predictions can become very costly.
- There's a portion that requires division; this is by far the most expensive arithmetic. I knew that the values would always be zero or positive, so I changed the ints in the pixel_sum struct to unsigned and got a huge boost.
- Per Replica's recommendation, I switched the loops to decrement from the top end, with the "check" being "not equal to zero". This also gave me a significant speed-up.
- If there were any repeated operations, I moved them into a variable and used that variable in place of the operation (i.e. with parameters x and y, say there was a portion of the code that did x * y over and over; I would make a variable xy = x * y and use that instead).
- Setting frequently used variables (counters, indices) as register variables gave me small speedups.
- I also switched all array indexing to pointers (e.g. array[1] would become *(array + 1)). I didn't get a good handle on how much it helped, but it certainly didn't hurt.

The compiler's own optimizations did about 5 times faster than the default code, and then I was able to get about another 2x improvement on that, so roughly 10x faster than the default code!

Edited by EfemaN - 12/6/12 at 10:09pm