Brian Articles

From New IAC Wiki
Revision as of 01:26, 30 March 2010 by Oborn (talk | contribs)
Jump to navigation Jump to search

Benchmarking Code

Quick-n-Dirty with time

The easiest way to benchmark the time of a program is simply using the time command:

brian@winnebago:~/code/cpi$ time ./cpi 
Running 100000000 trials
Pi is: 3.14133
real	0m5.845s
user	0m5.830s
sys	0m0.010s
  • "real" is the actual elapsed time the program took to run (often referred to as the wall-clock time).
  • "user" is the amount of time that the OS spent running you actual code (this is the one that you usually pay the most attention to).
  • "sys" is the amount of time that the OS spent on your behalf doing things in the kernel, like I/O.

This is useful to test in the real-world, including the effects of program initialization, hardware limitations, and network time. However, all of these factors can also add significantly noise of the measurement as they can vary according to system load, network load, systems caches, any many, many other factors. You will want to test a program many times to gain confidence in any numbers returned. As an example, compare these runs of exactly the same command:

brian@winnebago:~/gumstix$ time find ./ | wc -l
275404
real	0m53.946s
user	0m0.710s
sys	0m2.990s
brian@winnebago:~/gumstix$ time find ./ | wc -l
275404
real	0m1.241s
user	0m0.340s
sys	0m0.610s

brian@winnebago:~/gumstix$ time find ./ | wc -l
275404
real	0m0.891s
user	0m0.290s
sys	0m0.510s
brian@winnebago:~/gumstix$ time find ./ | wc -l
275404
real	0m0.806s
user	0m0.190s
sys	0m0.550s

The first search took almost 54s as the filesystem tree wasn't cached in RAM. Subsequent searches of the same file tree are faster as the data is now in RAM and the hard drive no longer needs to be accessed. This admittedly contrived example none-the-less shows how there can be unforeseen side-effects when benchmarking your code.

Loop Unrolling

Motivation

One area of Computer Science interest to me is neural networks, especially feed-forward neural networks. Because of the way that neural networks learn, they need to evaluate a large number of different weight combinations over a large number of samples. Since evaluating the value of each node is just a dot product of a node's weights and the values of the previous layer, the nexus of fast NN evaluation come down to fast dot products. The dot product of x and y is [math]\sum_{i=1}^n x_iy_i[/math], or the sum of all the product of the ith components.

The classic C code for computing a dot product is:

double dot(double *a, double *b, size_t s)
{
    double sum = 0.0;
    for(int i=0; i<s; i++)
    {
        sum += a[i]*b[i];
    }
    return sum;
}

Assuming that the floating-point operations are done with a math co-processor (ie. on any non-embedded CPU), most of the overhead comes from counting the index and comparing against the end of the loop. In optimizing this looping procedure, how can we reduce the loop counting overhead? How can we write code to give the compiler hints about what operations vector processors (SSE, AltiVec, etc.) can handle more efficiently. How do the optimization arguments given to the compiler affect code performance?