Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon. Entire thread

Fast positive integer power function

Name: Anonymous 2012-04-15 6:00

I'm trying to figure out how to compute an integer power of a double efficiently with as little overhead as possible (trying to avoid loops and recursion to minimize the amount of calls and jumps). Calculation speed is the priority.
So far I got this trivial piece of code:

double f(double a,char b){
if(b==2)return a*a;
if(b==3)return a*a*a;
if(b==4)return a*a*a*a;
if(b==5)return a*a*a*a*a;
if(b==6)return a*a*a*a*a*a;
if(b==7)return a*a*a*a*a*a*a;
if(b==8)return a*a*a*a*a*a*a*a;
if(b==9)return a*a*a*a*a*a*a*a*a;
        return a*a*a*a*a*a*a*a*a*a;
}


...which of course assumes the exponent is at most 10. The code can easily be expanded if higher exponents are needed.
I'm thinking this is probably optimal, but I know you /prog/riders have some magic tricks up your sleeve, so hit me with them.

Name: Anonymous 2012-04-16 23:05

>>62
wow.

If you use the exponentiation by squaring technique, and implement it using this:


#define NUM_BITS sizeof(long long)*8

double powerlonglong(double x, long long n) {
  double powers[NUM_BITS]; // powers[i] = x^(2^i)
  double mul;
  int index;
  powers[0] = x; // x^(2^0) = x^1 = x
  for(index = 1; index < NUM_BITS; index++) {
    powers[index] = powers[index-1]*powers[index-1]; // x^(2^i) = x^((2^i-1)*2) = (x^(2^i-1))^2
  }
  // x^n = x^(sum c_i2^i) = mul x^(c_i2^i) = mul x_i where x_i = 1 if c_i = 0,
                                                        // x_i = x^(2^i) = powers[i] if c_i = 1
  mul = 1.0;
  for(index = 0; index < NUM_BITS; index++) {
    if(n&(1<<index)) {
      mul *= powers[index];
    }
  }
  return mul;
}


NUM_BITS is the number of bits taken by the exponent.

It will take NUM_BITS multiplications to generate the powers array. Then the needed elements of the powers array are multiplied together to form the final product. In the worst case (if the exponent is 0b1111111...), then every term in the powers array will be multiplied together again, giving NUM_BITS multiplications again.

So the whole thing can be done usng a worst case 2*NUM_BITS amount of multiplications. So accounting for time in multiplication alone, with would take 10*NUM_BITS clock cycles in the worst case, which would hypothetically give better performance than 150 clock cycles when NUM_BITS < 15, and it could beat sandy bridge when NUM_BITS < 80. So Sandy bridge seems like it would definitely be possible, although you'd have to get the control flow to be efficient.

Newer Posts
Don't change these.
Name: Email:
Entire Thread Thread List