How to benchmark fairly and accurately. Written in particular for benchmarking OpenOffice.org performance, but these guidelines should work well for other applications too.
THIS PAGE IS A WORK IN PROGRESS! :)
Define the purpose of your benchmark. For example: to evaluate the performance of different editions of OpenOffice.org, so a user can make an informed decision about which edition to install on their system.
Limit variables. For example, if evaluating multiple versions of OpenOffice.org, don't also change the operating system.
The results should be reproducible: the user should see similar results on their own system.
Use a clean install of the operating system with the default installation settings. Install the latest OS and driver updates.
Install the application as the user would. With OpenOffice.org that means installing the version from www.openoffice.org, from a Linux repository, from the Go-OOo.org web site, etc. (State the source.) Don't build from source with obscure settings (unless the purpose is to measure a particular build setting) because your build environment and settings may not represent the package the user would install. For example, the vanilla OpenOffice.org builds Linux binaries with an old version of GCC and includes libraries that are common on many Linux systems.
Minimize noise: background processes that run intermittently and can skew results unpredictably. For example, disable the operating system's automatic software update service and the screensaver.
Do not optimize the operating system; doing so yields results that typical users will not see. For example, don't reduce the visual effects (Windows or Compiz). On Linux, don't add the noatime mount option. On Windows, don't 'clean' the registry, change the pagefile settings, or clear the prefetcher.
Measure meaningful operations. People wait for an application to open and complain about how long it takes, but scrolling a window is fast on any system.
Never take measurements in a virtual machine such as VMware, VirtualBox, QEMU, etc. (unless your purpose is to measure the overhead added by the virtual machine).
If available, use older (but not ancient) hardware. Operations take longer, so small differences are easier to measure. Most users also don't run the newest hardware, so new hardware is less representative.
Before publishing, evaluate the variance. If the variance is high, identify the cause, fix it, and test again.
Display a boxplot to illustrate variance.
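Evaluating variance doesn't require special tooling. As a minimal sketch, Python's standard statistics module can compute the relative spread and the five-number summary a boxplot displays (the timing numbers below are made up for illustration):

```python
import statistics

def summarize(times):
    """Return the five-number summary a boxplot displays, plus the
    coefficient of variation, for a list of timings in seconds."""
    times = sorted(times)
    q1, median, q3 = statistics.quantiles(times, n=4)
    cv = statistics.stdev(times) / statistics.mean(times)
    return {
        "min": times[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": times[-1],
        # Rule of thumb: a coefficient of variation above a few percent
        # suggests noise worth investigating before publishing.
        "cv": cv,
    }

# Five hypothetical startup timings from repeated runs:
runs = [12.1, 11.9, 12.4, 12.0, 12.2]
print(summarize(runs))
```

If the coefficient of variation is high, rerun the benchmark after hunting down the noise source rather than averaging the problem away.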
The only way to measure cold start times accurately is to use the real thing: a reboot. Simulating a cold start by flushing the disk cache does not reproduce a real reboot; too many caches are involved, including the operating system's disk cache and the drive's own cache. And even if you could flush them all, booting the operating system itself loads certain data (such as common libraries) back into the cache. Just reboot.
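To confirm that a measurement really did happen shortly after a reboot, you can record the system uptime alongside each timing. A minimal Linux-only sketch (it reads /proc/uptime, which does not exist on Windows):

```python
def seconds_since_boot():
    """Linux-only: return system uptime in seconds, so each measurement
    can be tagged as cold (taken shortly after reboot) or warm."""
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

uptime = seconds_since_boot()
# Flag runs that clearly did not follow a fresh reboot.
is_cold = uptime < 300  # threshold is an arbitrary example
print(uptime, is_cold)
```

Logging this value with every result makes it easy to spot a "cold start" number that was accidentally taken from a machine that had been running for hours.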
Clearly state the conditions of the benchmark and provide some analysis of the results.
Measure using automated tools. Human reaction time with a stopwatch is too imprecise for operations that take less than a minute.
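As a sketch of such a tool, the script below times a command with Python's time.perf_counter, which is far more precise than a hand-operated stopwatch. It only measures commands that run to completion; timing a GUI application's start-up usually needs an automation macro that exits the application once the document is loaded. The soffice invocation in the comment is an illustrative assumption, not a tested command line:

```python
import subprocess
import sys
import time

def time_command(argv, runs=5):
    """Launch a command repeatedly and return a list of wall-clock
    durations in seconds, one per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        timings.append(time.perf_counter() - start)
    return timings

# Hypothetical example for OpenOffice.org; adjust paths and flags
# for your own setup:
# timings = time_command(["soffice", "path/to/test.odt"])

# Self-contained demonstration: time a trivial Python process.
timings = time_command([sys.executable, "-c", "pass"], runs=3)
print(timings)
```

Running several iterations per configuration, as above, is what makes the variance analysis in the later steps possible.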
Use the same screen resolution.
Install the operating system on the same part of the hard drive (e.g., the first 20 GB of the disk) instead of dual booting, where the tests would run on different areas of the disk with different transfer speeds.