Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › [Q] Matrix Operation Optimization in C [Done!]
New Posts  All Forums:Forum Nav:

[Q] Matrix Operation Optimization in C [Done!]

post #1 of 8
Thread Starter 
We're going through optimization in my machine architecture class. This last lab for the semester has us going through the algorithms for rotating an image 90 degrees, and the other is smoothing by taking the average of all 4, 6, or 9 pixels (depending on where on the square matrix it is) around a given pixel (including that pixel).
Code:
#define RIDX(i,j,n) ((i)*(n)+(j))
Naive Rotate Function (Click to show)
Code:
void naive_rotate(int dim, pixel *src, pixel *dst) 
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
My Rotate Function (Click to show)
Code:
void rotate(int dim, pixel *src, pixel *dst) 
{
    int i, j, dimm_i, dimm, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14, i15;

        for (i = 0; i < dim; i+=16) {
                i1 = i + 1; 
                i2 = i + 2; 
                i3 = i + 3; 
                i4 = i + 4; 
                i5 = i + 5; 
                i6 = i + 6; 
                i7 = i + 7;
                i8 = i + 8 ;
                i9 = i + 9 ; 
                i10 = i + 10;
                i11 = i + 11; 
                i12 = i + 12; 
                i13 = i + 13; 
                i14 = i + 14; 
                i15 = i + 15; 
                for (j = 0; j < dim; j++) {
                        dimm = (dim - 1 - j)*dim;
                        dimm_i = dimm + i;
                        dst[dimm_i]       = src[i * dim + j];
                        dst[dimm_i + 1] = src[i1 * dim + j];
                        dst[dimm_i + 2] = src[i2 * dim + j];
                        dst[dimm_i + 3] = src[i3 * dim + j];
                        dst[dimm_i + 4] = src[i4 * dim + j];
                        dst[dimm_i + 5] = src[i5 * dim + j];
                        dst[dimm_i + 6] = src[i6 * dim + j];
                        dst[dimm_i + 7] = src[i7 * dim + j];
                        dst[dimm_i + 8] = src[i8 * dim + j];
                        dst[dimm_i + 9] = src[i9 * dim + j];
                        dst[dimm_i + 10] = src[i10 * dim + j];
                        dst[dimm_i + 11] = src[i11 * dim + j];
                        dst[dimm_i + 12] = src[i12 * dim + j];
                        dst[dimm_i + 13] = src[i13 * dim + j];
                        dst[dimm_i + 14] = src[i14 * dim + j];
                        dst[dimm_i + 15] = src[i15 * dim + j];
                }
        }
}

I just unrolled the outer loop by 16 so there were a lot of "parallel" operations, and that hit our required improvement. It's ugly, but it worked.

Naive Smooth Function (Click to show)
Code:
void naive_smooth(int dim, pixel *src, pixel *dst) 
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
Helper Code for Smooth (Click to show)
Code:
/* A struct used to compute averaged pixel value */
typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;

/* Compute min and max of two integers, respectively */
static int min(int a, int b) { return (a < b ? a : b); }
static int max(int a, int b) { return (a > b ? a : b); }

/* 
 * initialize_pixel_sum - Initializes all fields of sum to 0 
 */
static void initialize_pixel_sum(pixel_sum *sum) 
{
    sum->red = sum->green = sum->blue = 0;
    sum->num = 0;
    return;
}

/* 
 * accumulate_sum - Accumulates field values of p in corresponding 
 * fields of sum 
 */
static void accumulate_sum(pixel_sum *sum, pixel p) 
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}

/* 
 * assign_sum_to_pixel - Computes averaged pixel value in current_pixel 
 */
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum) 
{
    current_pixel->red = (unsigned short) (sum.red/sum.num);
    current_pixel->green = (unsigned short) (sum.green/sum.num);
    current_pixel->blue = (unsigned short) (sum.blue/sum.num);
    return;
}

/* 
 * avg - Returns averaged pixel value at (i,j) 
 */
static pixel avg(int dim, int i, int j, pixel *src) 
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;

    initialize_pixel_sum(&sum);
    for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++) 
        for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++) 
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}

Basically, the base version does checks for every pixel to see the i and j bounds for adjacent pixels, adds the r, g, b, values to a local pixel_sum object that has a counter for how many pixels it accumulated, then it divides at end to get the average and returns it such that the destination matrix holds the averaged values.

My Current Smooth Function (Click to show)
Code:
void smooth(int dim, pixel *src, pixel *dst) 
{
    int i, j;
        int dimSub1 = dim - 1;
        int dimSub1xDim = dimSub1 * dim;

        dst[0] = avg_top_left_corner(dim, 0, 0, src);
        dst[dimSub1] = avg_top_right_corner(dim, 0, dimSub1, src);
        dst[dimSub1xDim] = avg_bottom_left_corner(dim, dimSub1, 0, src);
        dst[(dim * dim) - 1] = avg_bottom_right_corner(dim, dimSub1, dimSub1, src);

        for (j = 1; j < dimSub1; j++) {
                dst[j] = avg_top(dim, 0, j, src);
                dst[dimSub1xDim + j] = avg_bottom(dim, dimSub1, j, src);
        }

        for (i = 1; i < dimSub1; i++) {
                dst[i * dim] = avg_left(dim, i, 0, src);
                dst[i * dim + dimSub1] = avg_right(dim, i, dimSub1, src);
        }


        for (i = 1; i < dimSub1; i++)
                for (j = 1; j < dimSub1; j++)
                        dst[i * dim + j] = avg(dim, i, j, src);

}

A big part of the slowdown for the original avg function was all the checks it had to do for finding the bounds around a given pixel, so I just broke it up into different functions for each corner, and the four edges between the corners, explicitly setting the bounds for each case (e.g. for the top left corner, the bounds are i, i+1, j, j+1 such that it would average using the three neighbors and itself). The last loop goes through the remaining "bulk" of the matrix inside of the edges/corners.

This gave me a fairly good improvement, but I'm still behind. I've tried unrolling on the outer loop by 2, the inner loop by 2, keeping all the operations in one set of loops, but it gets slower every time.

Any insight? This goes pretty low level, dealing with available registers, cache hits/misses and memory accessing, function calls, etc.
Quote:
Originally Posted by EfemaN View Post

I'm going to go ahead and post what my solution ended up being, in case someone stumbles upon this down the road. These are just in the order that I remember them.
  1. The biggest speedup came from removing checks by splitting the code up into special cases. For example, I had different functions for each of the four corners, and the edges between the corners, and then left the inner bulk of the matrix to one function. This let me define the bounds explicitly, so the function could just run. I didn't use conditionals to keep it compact, as mis-predictions can become very costly.
  2. There's a portion that requires division; this is by far the most expensive arithmetic. I knew that the values would always be zero or positive, so I changed the ints in the pixel_sum struct to unsigned and got a huge boost.
  3. Per Replica's recommendation, I switched the loops to decrement from the top end, with the "check" being "not equal to zero". This also gave me a significant speed-up.
  4. If there were any repeated operations, I moved them into a variable and used that variable in place of the operation (i.e. with parameters x and y, say there was a portion of the code that did x * y over and over; I would make a variable xy = x * y and use that instead).
  5. Setting frequently used variables (counters, indices) as register variables gave me small speedups.
  6. I also switched all array indexing to pointers (e.g. array[1] would become *(array + 1)). I didn't get a good handle on how much it helped, but it certainly didn't hurt.

The compiler's own optimizations did about 5 times faster than the default code, and then I was able to get about another 2x improvement on that, so roughly 10x faster than the default code!

Edited by EfemaN - 12/6/12 at 10:09pm
post #2 of 8

Try the following

 

  1. Change your loop counters to unsigned ints, make them count down to zero and change the loop conditions to "not equal to zero"
  2. Try using register ints and see if that helps with loop unrolling
  3. Use a profiler to find out where the bottleneck exists
  4. Find out if you're allowed to use multiple threads
post #3 of 8
Thread Starter 
Quote:
Originally Posted by TFL Replica View Post

Try the following
  1. Change your loop counters to unsigned ints, make them count down to zero and change the loop conditions to "not equal to zero"
  2. Try using register ints and see if that helps with loop unrolling
  3. Use a profiler to find out where the bottleneck exists
  4. Find out if you're allowed to use multiple threads

Thanks, Replica.

1. I'll try this tonight. Why would this work? Does the instruction for "not equal" run more quickly than "less than"?
2. So, set the variables defined up top to register ints? I don't know how many are available on the CPUs these run on; where would I draw the line so that there are still enough to not bottleneck the matrix operations themselves?
3. I haven't used one before, wouldn't know where to begin. The programs are run on the school Linux machines for consistency, don't know if that presents a problem.
4. They haven't specified any restrictions; it's basically, "do whatever you want". That being said, I wouldn't know where to begin for threading.
post #4 of 8

1. It has been known to perform better on some compilers. In theory, the compiler should be doing all this stuff but it never hurts to try.

 

2. Try the loop counter variables (i and j) first and see if it makes any difference, then keep adding until see how the program reacts. As with the previous suggestion it's possible that the compiler may already be doing this and/or it wouldn't affect such a small program.

 

3. If you're using Linux I'd recommend gprof as a quick and easy to use profiler: http://sourceware.org/binutils/docs/gprof/

Consider using the "hot" attribute after you've identified the bottleneck: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html

 

4. If threads are allowed you're missing on a huge opportunity to parallelize those for loops. Try OpenMP. For the purpose of getting a free performance boost in this assignment, it would be ridiculously easy to use.

post #5 of 8
Quote:
Originally Posted by TFL Replica View Post

Try the following
[*] Use a profiler to find out where the bottleneck exists

Start with this one. I can recommend callgrind, which is a tool included in valgrind. You can then visualize the data using kcachegrind. Iirc it's easier to use then gprof. Have a look here:

http://www.baptiste-wicht.com/2011/09/profile-c-application-with-callgrind-kcachegrind/
buka
(17 items)
 
  
Reply
buka
(17 items)
 
  
Reply
post #6 of 8
Thread Starter 
Quote:
Originally Posted by TFL Replica View Post

1. It has been known to perform better on some compilers. In theory, the compiler should be doing all this stuff but it never hurts to try.

2. Try the loop counter variables (i and j) first and see if it makes any difference, then keep adding until see how the program reacts. As with the previous suggestion it's possible that the compiler may already be doing this and/or it wouldn't affect such a small program.

3. If you're using Linux I'd recommend gprof as a quick and easy to use profiler: http://sourceware.org/binutils/docs/gprof/
Consider using the "hot" attribute after you've identified the bottleneck: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html

4. If threads are allowed you're missing on a huge opportunity to parallelize those for loops. Try OpenMP. For the purpose of getting a free performance boost in this assignment, it would be ridiculously easy to use.

1/2: It's one of the versions of gcc that we're using. Will try these tonight.

3. I won't be able to install any packages on the school machines, and I don't have quick access to a linux distro, and limited time to get this done. But good for the future.

4. I should have clarified that I can't include any libraries or external files; it all strictly in the code. Not to mention, threading would sort of skip the point of the lab, which is to optimize the code itself.

Quote:
Originally Posted by poroboszcz View Post


Start with this one. I can recommend callgrind, which is a tool included in valgrind. You can then visualize the data using kcachegrind. Iirc it's easier to use then gprof. Have a look here:

http://www.baptiste-wicht.com/2011/09/profile-c-application-with-callgrind-kcachegrind/

Thanks for the input. I'll use this in the future, but unfortunately not in the time that I have available for this lab.
post #7 of 8
Thread Starter 
I'm going to go ahead and post what my solution ended up being, in case someone stumbles upon this down the road. These are just in the order that I remember them.
  1. The biggest speedup came from removing checks by splitting the code up into special cases. For example, I had different functions for each of the four corners, and the edges between the corners, and then left the inner bulk of the matrix to one function. This let me define the bounds explicitly, so the function could just run. I didn't use conditionals to keep it compact, as mis-predictions can become very costly.
  2. There's a portion that requires division; this is by far the most expensive arithmetic. I knew that the values would always be zero or positive, so I changed the ints in the pixel_sum struct to unsigned and got a huge boost.
  3. Per Replica's recommendation, I switched the loops to decrement from the top end, with the "check" being "not equal to zero". This also gave me a significant speed-up.
  4. If there were any repeated operations, I moved them into a variable and used that variable in place of the operation (i.e. with parameters x and y, say there was a portion of the code that did x * y over and over; I would make a variable xy = x * y and use that instead).
  5. Setting frequently used variables (counters, indices) as register variables gave me small speedups.
  6. I also switched all array indexing to pointers (e.g. array[1] would become *(array + 1)). I didn't get a good handle on how much it helped, but it certainly didn't hurt.

The compiler's own optimizations did about 5 times faster than the default code, and then I was able to get about another 2x improvement on that, so roughly 10x faster than the default code!
post #8 of 8

Good to know. thumb.gif

New Posts  All Forums:Forum Nav:
  Return Home
  Back to Forum: Coding and Programming
Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › [Q] Matrix Operation Optimization in C [Done!]