[aarch64-port-dev ] RFR(M): 8212043: Add floating-point Math.min/max intrinsics
Pengfei Li (Arm Technology China)
Pengfei.Li at arm.com
Mon Oct 29 09:03:10 UTC 2018
> > I got a reason why consecutive fmins are slower. The fmin sequence
> > generated by the nested min() calls has RaW data dependencies. One fmin
> > writes an fp register and the next fmin reads the same one. It leads the
> > instruction pipeline to stall frequently.
> Wouldn't that also be true for a non-intrinsic fmin too? Each fp register
> output would be the input for a following comparison and conditional
In the non-intrinsic generated code (pasted below), the fmovs that follow the branches are never executed, because the branches are biased to TAKEN.
0x0000ffff94d08e14: fmov d28, d19
0x0000ffff94d08e18: fcmp s28, s20
0x0000ffff94d08e1c: b.lt 0x0000ffff94d08e24
0x0000ffff94d08e20: fmov d28, d20
0x0000ffff94d08e24: fcmp s28, s17
0x0000ffff94d08e28: b.lt 0x0000ffff94d08e30
0x0000ffff94d08e2c: fmov d28, d17
0x0000ffff94d08e30: fcmp s28, s18
0x0000ffff94d08e34: b.lt 0x0000ffff94d08e3c
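The code sequence actually executed is: fcmp, b.lt, fcmp, b.lt, fcmp, b.lt, ...
None of these instructions writes an fp register, so consecutive comparisons do not wait on each other's results through an fp register.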
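
For illustration, here is a minimal Java sketch of the kind of nested min() call discussed in this thread. The class and method names are hypothetical, and this is not the benchmark that was measured. In the chained form, each Math.min takes the previous result as an input, so with the intrinsic each fmin would have to wait for the one before it. The pairwise form is included only for contrast, since its two inner calls do not depend on each other.

```java
// Hypothetical illustration only; names are not taken from the patch or the benchmark.
final class NestedMin {

    // Chained form: every call consumes the result of the previous call,
    // so the operations form one serial dependency chain.
    static float chainedMin(float a, float b, float c, float d) {
        return Math.min(Math.min(Math.min(a, b), c), d);
    }

    // Pairwise form: the two inner calls are independent of each other,
    // and only the outer call depends on both results.
    static float pairwiseMin(float a, float b, float c, float d) {
        return Math.min(Math.min(a, b), Math.min(c, d));
    }
}
```

Both methods return the same value for the same inputs; they differ only in how the calls depend on each other.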