Further BigInteger performance improvements

Alan Eliasen eliasen at mindspring.com
Thu Jun 4 22:58:48 UTC 2009

```Andrew Haley wrote:
> You give examples of the speedup for very large bignums, but you don't say
> the size of numbers at which your approach becomes faster than the current
> code.  Of course any asymptotic improvement helps with numbers that are
> half a million decimal digits long, but where's the crossover?

Andrew,

Good question.  Like most Karatsuba and Toom-Cook implementations,
the optimal crossover point is vague, as it can depend on the size of
both operands, and the curves don't separate fast at that point.  If you
look at the updated file (again, at
http://futureboy.us/temp/BigInteger.java ) you'll see the crossover points:

/**
* The threshold value for using Karatsuba multiplication.  If the
number
* of ints in both mag arrays are greater than this number, then
* Karatsuba multiplication will be used.   This value is found
* experimentally to work well.
*/
private static final int KARATSUBA_THRESHOLD = 50;

/**
* The threshold value for using 3-way Toom-Cook multiplication.
* If the number of ints in both mag arrays are greater than this
number,
* then Toom-Cook multiplication will be used.   This value is found
* experimentally to work well.
*/
private static final int TOOM_COOK_THRESHOLD = 75;

/**
* The threshold value for using Karatsuba squaring.  If the number
* of ints in the number are larger than this value,
* Karatsuba squaring will be used.   This value is found
* experimentally to work well.
*/
private static final int KARATSUBA_SQUARE_THRESHOLD = 90;

/**
* The threshold value for using Toom-Cook squaring.  If the number
* of ints in the number are larger than this value,
* Karatsuba squaring will be used.   This value is found
* experimentally to work well.
*/
private static final int TOOM_COOK_SQUARE_THRESHOLD = 140;

Thus, the first crossover for Karatsuba multiplication is at 50*32
bits (1600 bits), or about 482 decimal digits.  Toom-Cook at 2400 bits.
(Hint: divide the number of bits by 3.32 to get the approximate number
of decimal digits.)

These crossovers are determined experimentally, of course, and
constants are chosen that work well for a variety of number forms and
operand combinations.  These numbers were found to work well after a lot
of timing runs with a lot of different number sizes and forms.  Of
course, it's possible that different thresholds can work better on
different architectures and VM settings.  I'm not proposing we tune for
each architecture, but rather choose conservative crossover thresholds
which work well, which I believe these are.

It generally isn't damaging if you're just "in the ballpark" when
setting your thresholds; the algorithms will still work fine, and
performance will still be very similar for most operands.  The curves
don't diverge too quickly around the crossover point, and there's a lot
of random variation for different size operands, and all of your
threshold points interact with each other, so it's a multi-dimensional
optimization problem.

In fact, the effects of choosing different thresholds is hard to
predict.  There may be little local minima and maxima in performance
whether you choose, say, 45 or 48 or 50 or 52 for the crossover, and
operands of different lengths may behave quite differently with
different crossovers.  You won't determine a good crossover point by
asymptotic analysis.  You also have to see if a certain combination of
crossovers exhibits anomalously bad behavior for certain combinations of
operand sizes and avoid those, which is why it's as much black art as
science.

implementation is (and there are *lots* of different ways to split and
combine numbers in the Toom-Cook method.)  My method for Toom-Cook is
the "optimal" algorithm found by Marco Bodrato, (see citations in source
code) who is perhaps the leading researcher into Toom-Cook
multiplication.  He also analyzed this patch, corrected my original
Toom-Cook to a slightly different, more optimal variation, (and this
changed the Toom-Cook threshold downwards somewhat, as it was a bit faster.)

There are also possible optimizations in the future for "unbalanced"
operand sizes, where one operand is significantly larger than the other.
There are unbalanced Toom-Cook multiplies that can achieve better
performance than the straight 3-way Toom-Cook.  And of course there are
higher-order Toom-Cook multiplications possible, for even larger
numbers, but the marginal benefits of higher orders diminishes, and the
code is more complex.

Again, these crossovers were found experimentally to work well.  They
still work well with tests of Xiaobin's improvements to multiplication,
but I may run yet another set of exhaustive tuning tests to see if the
thresholds can be again improved after my massive regression test
finishes, hopefully in the next 24 hours.

Note that my optimizations for the pow() function give vastly better
performance at even small bit sizes for many operands, as they factor
out powers of 2 in the exponent and perform these very rapidly as
bit-shifts.

--
Alan Eliasen
eliasen at mindspring.com
http://futureboy.us/

```