Math trig intrinsics and compiler options
Joseph D. Darcy
Joe.Darcy at Sun.COM
Thu Aug 6 08:58:59 PDT 2009
gustav trede wrote:
> 2009/7/16 Christian Thalinger <Christian.Thalinger at sun.com>
> Azeem Jiva wrote:
> > Joe,
> > Gustav sent me an email asking for help with the intrinsification of
> > the trig functions and a suggestion I gave him was to not call
> > fsin/fcos/ftan since those instructions are microcoded on Intel/AMD
> > hardware and very slow. Slower than the call to
> > sharedRuntimeTrig.cpp, and in all cases it's best to stay away from
> > the hardware instructions.
> I just did some micro-benchmarking on an Intel Core2 Duo, and in the
> range of [0,2pi) inlining the hardware instructions is slightly faster
> (about 2.5%). Limiting the range to [0,pi/4) (which means no runtime
> calls), the hardware instructions are 1.5x faster.
> I think we should keep the current approach.
> -- Christian
> Neither the Linux nor the Windows platform has compiler optimizations
> enabled; only Solaris does. It seems that when this was evaluated many
> years ago, no other platform had a working compiler.
> That fact alone is likely to make the fsin/fcos path faster than the C
> version for the +-PI/4 range on those platforms.
> It's some work to check whether the compilers on the different
> platforms still produce bad code with optimizations enabled;
> it is, however, reasonable to expect the compilers to improve over the years.
The code from the non-Sun C compilers is not "bad" per se; it is bad only
in that it does not implement the desired semantics of the FDLIBM code,
which is very sensitive to optimizations that are legal in C but defeat
the purpose of the code. The Sun C compiler can be sufficiently attuned to
such floating-point needs under optimization; the other C compilers were
not, and I suspect still are not.
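
As an illustration of that sensitivity (a minimal sketch, not code taken from FDLIBM; the class and method names are hypothetical), consider an error-compensated addition of the kind used when a value is carried as a head plus a small tail. The correction term is algebraically zero, so a compiler that simplifies floating-point expressions as if they were real arithmetic would remove exactly the information the code exists to capture. The sketch is in Java, where the language does not permit that kind of re-association:

```java
public class CompensatedSum {
    /**
     * Returns {s, err}, where s is the rounded sum a + b and err is the
     * rounding error of that addition. Assumes |a| >= |b| and no overflow.
     */
    static double[] fastTwoSum(double a, double b) {
        double s = a + b;    // rounded sum
        double t = s - a;    // the part of b that actually made it into s
        double err = b - t;  // the part of b that was rounded away
        // In real arithmetic err == b - ((a + b) - a) == 0, so an optimizer
        // that simplifies algebraically would fold this to a constant zero.
        return new double[] { s, err };
    }

    public static void main(String[] args) {
        double[] r = fastTwoSum(1.0, 1e-16);
        System.out.println("s = " + r[0] + ", err = " + r[1]);
    }
}
```

With a = 1.0 and b = 1e-16, the sum rounds to 1.0 and err carries the lost 1e-16; under an algebraic simplification err would be 0 and the low-order bits would be silently dropped.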
My preferred long-term approach is to port the FDLIBM C code to Java,
which I've wanted to do for a while, but it has never bubbled to the top
of my to-do list.
> Regarding the proposed patch, which uses sharedRuntimeTrig.cpp for the
> entire input range without external rounding:
> I compared against 3 input,output pairs that have leaked from the JCK,
> and against the current Math impl for many input,output pairs, and I did
> not manage to detect any differences.
What is many? There are on the order of 2^64 inputs to check!
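
For scale: at a billion evaluations per second, covering all 2^64 bit patterns would take on the order of 580 years, so any practical comparison is a sample. A minimal sketch of such a sampled comparison follows. The class and method names are hypothetical, and StrictMath.sin is used as the reference only because it is specified in terms of the fdlibm algorithms. A clean run proves nothing about the inputs that were not drawn, and a nonzero distance is not by itself a spec violation, since Math.sin is only required to be within 1 ulp of the exact result.

```java
import java.util.SplittableRandom;

public class SinSampleCheck {
    /** Maps a double to a long whose ordering matches the double's numeric ordering. */
    static long ordered(double d) {
        long bits = Double.doubleToLongBits(d);
        return bits >= 0 ? bits : Long.MIN_VALUE - bits;
    }

    /**
     * Number of representable doubles between a and b.
     * Intended for values in [-1, 1], where the subtraction cannot overflow.
     */
    static long ulpDistance(double a, double b) {
        return Math.abs(ordered(a) - ordered(b));
    }

    /** Largest distance seen between Math.sin and StrictMath.sin over random bit patterns. */
    static long maxUlpDistance(long samples, long seed) {
        SplittableRandom rnd = new SplittableRandom(seed);
        long worst = 0;
        for (long i = 0; i < samples; i++) {
            // Draw a raw 64-bit pattern so that every exponent is exercised.
            double x = Double.longBitsToDouble(rnd.nextLong());
            if (Double.isNaN(x) || Double.isInfinite(x)) {
                continue; // sin is NaN for these inputs; nothing to compare
            }
            worst = Math.max(worst, ulpDistance(Math.sin(x), StrictMath.sin(x)));
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("max distance in ulps: " + maxUlpDistance(1_000_000L, 42L));
    }
}
```

Drawing raw bit patterns rather than uniformly distributed values is a deliberate choice here: it spreads the samples across all magnitudes, including the huge arguments where argument reduction is hardest.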
> There is a consistent performance improvement for all input ranges; I
> get up to 40% improvement for Intel Core2 on Solaris.
> It's hard for me to know if there are some corner cases that do require
> the external rounding in order to stay within the spec; that's the
> reason I asked for help here.
> gustav trede