Difference between revisions of "Brian Articles"
Line 1: | Line 1: | ||
+ | =Benchmarking Code= | ||
+ | |||
+ | ====Quick-n-Dirty with time==== | ||
+ | The easiest way to benchmark the time of a program is simply using the time command: | ||
+ | brian@winnebago:~/code/cpi$ time ./cpi | ||
+ | Running 100000000 trials | ||
+ | Pi is: 3.14133 | ||
+ | real 0m5.845s | ||
+ | user 0m5.830s | ||
+ | sys 0m0.010s | ||
+ | *"real" is the actual elapsed time the program took to run (often referred to as the wall-clock time). | ||
+ | *"user" is the amount of time that the OS spent running you actual code (this is the one that you usually pay the most attention to). | ||
+ | *"sys" is the amount of time that the OS spent on your behalf doing things in the kernel, like I/O. | ||
+ | |||
+ | This is useful to test in the real-world, including the effects of program initialization, hardware limitations, and network time. However, all of these factors can also add significantly noise of the measurement as they can vary according to system load, network load, systems caches, any many, many other factors. You will want to test a program many times to gain confidence in any numbers returned. As an example, compare these runs of exactly the same command: | ||
+ | brian@winnebago:~/gumstix$ time find ./ | wc -l | ||
+ | 275404 | ||
+ | real 0m53.946s | ||
+ | user 0m0.710s | ||
+ | sys 0m2.990s | ||
+ | |||
+ | brian@winnebago:~/gumstix$ time find ./ | wc -l | ||
+ | 275404 | ||
+ | real 0m1.241s | ||
+ | user 0m0.340s | ||
+ | sys 0m0.610s | ||
+ | |||
+ | brian@winnebago:~/gumstix$ time find ./ | wc -l | ||
+ | 275404 | ||
+ | real 0m0.891s | ||
+ | user 0m0.290s | ||
+ | sys 0m0.510s | ||
+ | |||
+ | brian@winnebago:~/gumstix$ time find ./ | wc -l | ||
+ | 275404 | ||
+ | real 0m0.806s | ||
+ | user 0m0.190s | ||
+ | sys 0m0.550s | ||
+ | |||
+ | The first search took almost 54s as the filesystem tree wasn't cached in RAM. Subsequent searches of the same file tree are faster as the data is now in RAM and the hard drive no longer needs to be accessed. This admittedly contrived example none-the-less shows how there can be unforeseen side-effects when benchmarking your code. | ||
+ | |||
=Loop Unrolling= | =Loop Unrolling= | ||
Revision as of 01:26, 30 March 2010
Benchmarking Code
Quick-n-Dirty with time
The easiest way to benchmark the time of a program is simply using the time command:
brian@winnebago:~/code/cpi$ time ./cpi Running 100000000 trials Pi is: 3.14133 real 0m5.845s user 0m5.830s sys 0m0.010s
- "real" is the actual elapsed time the program took to run (often referred to as the wall-clock time).
- "user" is the amount of time that the OS spent running you actual code (this is the one that you usually pay the most attention to).
- "sys" is the amount of time that the OS spent on your behalf doing things in the kernel, like I/O.
This is useful to test in the real-world, including the effects of program initialization, hardware limitations, and network time. However, all of these factors can also add significantly noise of the measurement as they can vary according to system load, network load, systems caches, any many, many other factors. You will want to test a program many times to gain confidence in any numbers returned. As an example, compare these runs of exactly the same command:
brian@winnebago:~/gumstix$ time find ./ | wc -l 275404 real 0m53.946s user 0m0.710s sys 0m2.990s
brian@winnebago:~/gumstix$ time find ./ | wc -l 275404 real 0m1.241s user 0m0.340s sys 0m0.610s brian@winnebago:~/gumstix$ time find ./ | wc -l 275404 real 0m0.891s user 0m0.290s sys 0m0.510s
brian@winnebago:~/gumstix$ time find ./ | wc -l 275404 real 0m0.806s user 0m0.190s sys 0m0.550s
The first search took almost 54s as the filesystem tree wasn't cached in RAM. Subsequent searches of the same file tree are faster as the data is now in RAM and the hard drive no longer needs to be accessed. This admittedly contrived example none-the-less shows how there can be unforeseen side-effects when benchmarking your code.
Loop Unrolling
Motivation
One area of Computer Science interest to me is neural networks, especially feed-forward neural networks. Because of the way that neural networks learn, they need to evaluate a large number of different weight combinations over a large number of samples. Since evaluating the value of each node is just a dot product of a node's weights and the values of the previous layer, the nexus of fast NN evaluation come down to fast dot products. The dot product of x and y is
, or the sum of all the product of the ith components.The classic C code for computing a dot product is:
double dot(double *a, double *b, size_t s) { double sum = 0.0; for(int i=0; i<s; i++) { sum += a[i]*b[i]; } return sum; }
Assuming that the floating-point operations are done with a math co-processor (ie. on any non-embedded CPU), most of the overhead comes from counting the index and comparing against the end of the loop. In optimizing this looping procedure, how can we reduce the loop counting overhead? How can we write code to give the compiler hints about what operations vector processors (SSE, AltiVec, etc.) can handle more efficiently. How do the optimization arguments given to the compiler affect code performance?