author: melinte
title: Performance goggles

Overview

The Performance Application Programming Interface (PAPI) offers the programmer access to the performance counter hardware found in most major microprocessors. Measuring branch mispredictions and cache misses (and much more) is literally one line of code away.

Goggles on

Let's take a look at these lines:

   const int nlines = 196608;
   const int ncols  = 64;
   char ctrash[nlines][ncols];
   {
       int x;
       papi::counters pc("by column");
       for (int c = 0; c < ncols; ++c) {
           for (int l = 0; l < nlines; ++l) {
               x = ctrash[l][c];
           }
       }
   }

The code just loops over an array, but in the wrong order: the innermost loop iterates over the outer index. The result is the same whichever index we loop over first but, in theory, to preserve cache locality the innermost loop should iterate over the innermost index, because elements that differ only in the last index sit next to each other in memory. This should make a big difference to the time it takes to iterate over the array:

   {
       int x;
       papi::counters pc("By line");
       for (int l = 0; l < nlines; ++l) {
           for (int c = 0; c < ncols; ++c) {
               x = ctrash[l][c];
           }
       }
   }
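To get a feel for why the order should matter, here is a rough model rather than a measurement. It walks the same index space both ways and counts how often two consecutive accesses fall on different cache lines. The 64-byte line size, the line-aligned start of the array, and the names line_changes and count_line_changes are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model, not a measurement: counts how many accesses land on
// a different cache line than the access just before them.
// Assumes a row-major char array that starts on a cache-line boundary.
struct line_changes {
    std::size_t by_column;
    std::size_t by_line;
};

inline line_changes count_line_changes(std::size_t nlines, std::size_t ncols,
                                       std::size_t line_bytes = 64) {
    assert(line_bytes > 0);
    const std::size_t none = static_cast<std::size_t>(-1);
    line_changes r{0, 0};

    // Inner loop over lines: each step jumps ncols bytes ahead.
    std::size_t prev = none;
    for (std::size_t c = 0; c < ncols; ++c) {
        for (std::size_t l = 0; l < nlines; ++l) {
            const std::size_t line = (l * ncols + c) / line_bytes;
            if (line != prev) { ++r.by_column; prev = line; }
        }
    }

    // Inner loop over columns: each step moves one byte ahead.
    prev = none;
    for (std::size_t l = 0; l < nlines; ++l) {
        for (std::size_t c = 0; c < ncols; ++c) {
            const std::size_t line = (l * ncols + c) / line_bytes;
            if (line != prev) { ++r.by_line; prev = line; }
        }
    }
    return r;
}
```

For the 196608 x 64 array, count_line_changes(196608, 64) gives 12582912 line changes by column and 196608 by line. A line change is not the same thing as a cache miss (real caches keep recently used lines around and prefetch), so this is only an upper-bound style intuition.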

papi::counters is a class wrapping PAPI functionality. It takes a snapshot of some performance counters (in our case, cache misses and branch mispredictions) when a counters object is instantiated and another snapshot when the object is destroyed, then prints the differences.
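The snapshot-on-construction, snapshot-on-destruction idea can be sketched without PAPI at all. The class below is a hypothetical stand-in, not the actual papi::counters: scoped_delta, its reader callback and the demo function are made up for illustration, and the output only approximates the format shown in this article. With PAPI, the reader would presumably wrap a call such as PAPI_read on a started event set.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for papi::counters. The reader is any callable
// that returns the current counter values.
class scoped_delta {
public:
    using reader = std::function<std::vector<long long>()>;

    scoped_delta(std::string name, reader read, std::ostream& out = std::cout)
        : name_(std::move(name)), read_(std::move(read)), out_(out),
          start_(read_()) {}  // first snapshot, taken on construction

    ~scoped_delta() {
        const std::vector<long long> end = read_();  // second snapshot
        assert(end.size() == start_.size());
        out_ << "Delta " << name_ << ":\n";
        for (std::size_t i = 0; i < end.size(); ++i) {
            out_ << "  counter " << i << ": " << (end[i] - start_[i])
                 << " (" << end[i] << "-" << start_[i] << ")\n";
        }
    }

    scoped_delta(const scoped_delta&) = delete;
    scoped_delta& operator=(const scoped_delta&) = delete;

private:
    std::string name_;
    reader read_;
    std::ostream& out_;
    std::vector<long long> start_;
};

// Usage with a fake counter, so the sketch runs without PAPI installed.
inline std::string demo() {
    std::ostringstream os;
    long long fake = 10;
    {
        scoped_delta d("by line",
                       [&fake] { return std::vector<long long>{fake}; }, os);
        fake = 25;  // stands in for the measured work
    }
    return os.str();
}
```

The scope braces do the same job as in the loops above: the object dies at the closing brace, and that is when the delta gets printed.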

A first measurement, with non-optimized code (-O0), shows the following:

Delta by column:
  PAPI_TOT_INS (Total instructions): 188744788 (380506167-191761379)
  PAPI_TOT_CYC (Total cpu cycles): 92390347 (187804288-95413941)
  PAPI_L1_DCM (L1 load  misses): 28427 (30620-2193)
  PAPI_L2_DCM (L2 load  misses): 102 (1269-1167)
  PAPI_BR_MSP (Branch mispredictions): 176 (207651-207475)

Delta By line:
  PAPI_TOT_INS (Total instructions): 190909841 (191734047-824206)
  PAPI_TOT_CYC (Total cpu cycles): 94460862 (95387664-926802)
  PAPI_L1_DCM (L1 load  misses): 403 (2046-1643)
  PAPI_L2_DCM (L2 load  misses): 21 (1081-1060)
  PAPI_BR_MSP (Branch mispredictions): 205934 (207350-1416)

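While cache misses have indeed improved, branch mispredictions exploded. Not exactly a good tradeoff. Down in the processor's pipeline, a comparison operation translates into a branch operation, and the by-line misprediction count (205934) is close to nlines (196608), which would be consistent with roughly one misprediction each time the short inner loop exits. Either way, something is funny with the unoptimized code the compiler generated.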
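One way to weigh such a tradeoff is to normalize the raw deltas. The sketch below derives instructions per cycle and branch mispredictions per thousand instructions from the -O0 numbers above. The struct and the function names are hypothetical helpers written for this article, not part of PAPI.

```cpp
#include <cassert>

// Hypothetical helper type holding three of the counter deltas.
struct counter_delta {
    long long instructions;   // PAPI_TOT_INS
    long long cycles;         // PAPI_TOT_CYC
    long long branch_misses;  // PAPI_BR_MSP
};

// Instructions retired per cycle.
inline double ipc(const counter_delta& d) {
    assert(d.cycles > 0);
    return static_cast<double>(d.instructions) /
           static_cast<double>(d.cycles);
}

// Branch mispredictions per thousand instructions.
inline double branch_mpki(const counter_delta& d) {
    assert(d.instructions > 0);
    return 1000.0 * static_cast<double>(d.branch_misses) /
           static_cast<double>(d.instructions);
}

// Deltas copied from the -O0 run above.
const counter_delta by_column_O0{188744788, 92390347, 176};
const counter_delta by_line_O0{190909841, 94460862, 205934};
```

With these figures both orders come out at about two instructions per cycle, and the by-line order at roughly one misprediction per thousand instructions. Seen this way, the two unoptimized loops cost nearly the same number of cycles despite the very different miss counts.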

Maybe the optimized code (-O2) behaves better? Or maybe not:

Delta by column:
  PAPI_TOT_INS (Total instructions): 329 (229368-229039)
  PAPI_TOT_CYC (Total cpu cycles): 513 (186217-185704)
  PAPI_L1_DCM (L1 load  misses): 2 (1523-1521)
  PAPI_L2_DCM (L2 load  misses): 0 (993-993)
  PAPI_BR_MSP (Branch mispredictions): 7 (1287-1280)

Delta By line:
  PAPI_TOT_INS (Total instructions): 330 (209614-209284)
  PAPI_TOT_CYC (Total cpu cycles): 499 (173487-172988)
  PAPI_L1_DCM (L1 load  misses): 2 (1498-1496)
  PAPI_L2_DCM (L2 load  misses): 0 (992-992)
  PAPI_BR_MSP (Branch mispredictions): 7 (1225-1218)

This time the compiler optimized the loops out! It figured out that we never really use the data in the array, so it got rid of the loops. Completely! Let's see how this code behaves:

   {
       int x;
       papi::counters pc("by column");
       for (int c = 0; c < ncols; ++c) {
           for (int l = 0; l < nlines; ++l) {
               x = ctrash[l][c];
               ctrash[l][c] = x + 1;
           }
       }
   }

Delta by column:
  PAPI_TOT_INS (Total instructions): 62918492 (63167552-249060)
  PAPI_TOT_CYC (Total cpu cycles): 224705473 (224904307-198834)
  PAPI_L1_DCM (L1 load  misses): 12415661 (12417203-1542)
  PAPI_L2_DCM (L2 load  misses): 9654638 (9655632-994)
  PAPI_BR_MSP (Branch mispredictions): 14217 (15558-1341)

Delta By line:
  PAPI_TOT_INS (Total instructions): 51904854 (115092642-63187788)
  PAPI_TOT_CYC (Total cpu cycles): 25914254 (250864272-224950018)
  PAPI_L1_DCM (L1 load  misses): 197104 (12614449-12417345)
  PAPI_L2_DCM (L2 load  misses): 6330 (9662090-9655760)
  PAPI_BR_MSP (Branch mispredictions): 296 (16066-15770)

Both cache misses and branch mispredictions improved by at least an order of magnitude, and the by-line loop needs about 8.7 times fewer cycles (25914254 versus 224705473). A run with unoptimized code will show the same order of improvement.
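Writing back into the array is one way to keep the optimizer from throwing the loops away. Another common trick, sketched here under my own naming (sum_by_line, sum_by_column, consume and g_sink are all hypothetical), is to accumulate the loaded values and store the total into a volatile object, so the loads feed something the compiler must produce. The array is a flat std::vector here, to keep the sketch off the stack and initialized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Store target the compiler may not optimize away.
volatile long long g_sink = 0;

// Writing to a volatile object is an observable side effect, so the
// value stored here has to be computed.
inline void consume(long long value) { g_sink = value; }

// Sums all elements with the inner loop over columns (cache friendly).
inline long long sum_by_line(const std::vector<char>& data,
                             std::size_t nlines, std::size_t ncols) {
    assert(data.size() == nlines * ncols);
    long long sum = 0;
    for (std::size_t l = 0; l < nlines; ++l) {
        for (std::size_t c = 0; c < ncols; ++c) {
            sum += data[l * ncols + c];
        }
    }
    return sum;
}

// Same sum with the inner loop over lines (strided access).
inline long long sum_by_column(const std::vector<char>& data,
                               std::size_t nlines, std::size_t ncols) {
    assert(data.size() == nlines * ncols);
    long long sum = 0;
    for (std::size_t c = 0; c < ncols; ++c) {
        for (std::size_t l = 0; l < nlines; ++l) {
            sum += data[l * ncols + c];
        }
    }
    return sum;
}
```

A measured region would then look like consume(sum_by_column(data, nlines, ncols)) inside the scope of a counters object. The optimizer is still free to vectorize or otherwise reshape the loops, so the counters will not necessarily match the read-then-write version above.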

Resources