8.2 Efficiency

In this section, let us examine the design choices that affect the efficiency of the Math.h++ class library.

8.2.1 Slices

A Math.h++ vector view describes a slice of data. This slice is defined by a pointer to the beginning of the data, a length, and the increment between successive data items. Indexing into a vector thus requires a multiply as well as an add in order to account for the stride between successive data items. This contrasts with C, where only an add is necessary.

Accessing data through slices can be just as efficient as with a C-style vector. To scan through a regular vector of contiguous elements, a pointer must be incremented by the size of an element; for example, 8 in the case of a double on a typical machine. To scan through a slice requires incrementing by the size of the element times the stride length. The only difference is that the size of the element is known at compile time (always 8), while the stride length is known at run time (8 times stride length). Hence, the former can be held as an immediate operand in a move statement, and the latter must be held as an indirect operand, requiring one more memory access. If the machine has enough registers, the memory access is only required at the start of the scan. From then on, it can be held in a register, and there is no time penalty beyond the initial memory access. The net effect is that the time it takes to multiply two vector slices together, for example, is essentially the same as to multiply two vectors of contiguous elements. Furthermore, much of Math.h++ is based on the BLAS package, which was written in terms of slices. (See Section 8.3, "Basic Linear Algebra (BLA) Package.")

With access times essentially the same, the challenge is to make the actual construction of a slice as cheap as possible. The Math.h++ class library does this by making every vector a slice. This makes the construction of a helper class to keep track of the slice unnecessary. Hence, there is very little performance penalty for working with slices instead of vectors of contiguous elements.

Indexing, however, is a different story. This requires a multiply by the stride length with every access, not just the first one. Tests have shown about a 20% slowdown to index random elements in a slice versus a vector of contiguous elements.

8.2.2 Overall Efficiency

The following comparative example demonstrates the efficiency you can achieve with Math.h++. Consider the Fortran program given below:

       
double precision a(8100), b(8100), c(8100)
       double precision second, start, dt

       Niter = 1000

       do 500 N=500,4000,500
          do 100 i=1,N
             a(i) = 3
             b(i) = 5
 100      continue

c         Second returns elapsed time
          start = second()

          do 200 iter=1, Niter
             do 200 i=1, N
                c(i) = a(i) * b(i)
 200      continue

          dt = second() - start

          write(6,2000) N, dt, dt/Niter
 2000     format(i6, 5x, f8.5, 12x, f8.5)
 500   continue
       end

Here is the C++ program, using Math.h++:

#include <rw/math/mathvec.h>
#include <iostream.h>

double second();         // Returns elapsed time

void main() {
  int Niter = 1000;

  for(int N=500; N<=4000; N+=500){
    RWMathVec<double> a(N, 3);
    RWMathVec<double> b(N, 5);

    double start = second();
    
    for(int iter=0; iter<Niter; iter++){
      RWMathVec<double> c = a*b;
    }
    
    double dt = second() - start;

    cout << N << " " << dt << " " << dt/Niter << endl;
  }
}

Notice that the C++ program is much simpler. Also, the Fortran program uses static memory; the memory is declared and defined at compile time. This greatly simplifies the job for the compiler, since the addresses of the arrays are known at compile time, but greatly complicates the job for the programmer, since you must know the maximum size of your arrays in advance. The Math.h++ version uses dynamic memory; it is obtained at run time.

Despite the use of dynamic memory and built-in slices, Math.h++ is faster than Fortran. Because C++ allows all of the code for the various binary and unary operators to be localized in one spot, it becomes easy to write optimization in the critical spots. This is what we have done. With Fortran, you are at the mercy of the compiler.

In this example, the critical code resides inside the binary multiply operator:

template <class T>
RWMathVec<T> operator*(const RWMathVec<double>&, 
                       const RWMathVec<double>&);

This operator requires that three addresses be juggled: the vector operands and the return value. Two of these, the operands, can potentially be slices (that is, their stride may not be 1), which requires that their strides be held as well. Note that other operators, such as:

RWMathVec<double>& RWMathVec<double>::operator+=(const 
                                            RWMathVec<double>&);

involve only two operands, and so are much easier to optimize.