=Loop Unrolling=

===Motivation===

One area of computer science that interests me is neural networks, especially feed-forward neural networks. Because of the way neural networks learn, they must evaluate a large number of different weight combinations over a large number of samples. Since evaluating the value of each node is just a dot product of the node's weights with the values of the previous layer, fast NN evaluation comes down to fast dot products. The dot product of x and y is <math>\sum_{i=1}^n x_iy_i</math>, the sum of the products of corresponding components.

The classic C code for computing a dot product is:

#include <stddef.h>  /* size_t */

double dot(double *a, double *b, size_t s)
{
    double sum = 0.0;
    /* One multiply and one add per element; the index is size_t so the
       comparison against s stays unsigned. */
    for(size_t i = 0; i < s; i++)
    {
        sum += a[i]*b[i];
    }
    return sum;
}
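
For concreteness, a short driver like the following (my own example, not from the original article) checks the routine on a small vector pair, assuming the dot definition above is in the same file:

#include <stdio.h>

int main(void)
{
    double a[] = {1.0, 2.0, 3.0};
    double b[] = {4.0, 5.0, 6.0};
    /* 1*4 + 2*5 + 3*6 = 32 */
    printf("dot = %f\n", dot(a, b, 3));
    return 0;
}
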
Assuming that the floating-point operations are handled by a math co-processor (i.e., on any non-embedded CPU), most of the per-iteration overhead comes from incrementing the index and comparing it against the end of the loop. In optimizing this loop, how can we reduce that counting overhead? How can we write code that hints to the compiler which operations vector units (SSE, AltiVec, etc.) can handle more efficiently? And how do the optimization flags passed to the compiler affect the code's performance?
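
The section's namesake technique addresses the first question directly. As a sketch of what manual unrolling might look like, here is the same dot product with an unroll factor of four; the factor, the name dot_unrolled4, the four separate accumulators, and the cleanup loop are illustrative choices of mine, not code from the original article:

/* Four elements per iteration, so the index update and the bounds
   check are amortized over four multiply-adds. */
double dot_unrolled4(double *a, double *b, size_t s)
{
    /* Separate accumulators break the serial dependency chain between
       additions; note that regrouping floating-point sums can change
       the rounded result slightly. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for(; i + 4 <= s; i += 4)
    {
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    /* Cleanup loop for the 0-3 leftover elements when s is not a
       multiple of four. */
    double sum = (s0 + s1) + (s2 + s3);
    for(; i < s; i++)
    {
        sum += a[i]*b[i];
    }
    return sum;
}

On the other two questions: declaring the parameters as double *restrict a, double *restrict b (C99) promises the compiler that the arrays do not alias, which is one of the preconditions GCC's vectorizer checks before emitting SSE code; and GCC will unroll and vectorize on its own with flags such as -O3 (which enables -ftree-vectorize), -funroll-loops, and -march=native. Whether the hand-unrolled version beats what the compiler produces on its own is an empirical question.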
