Performance issue with Nashorn and C2's global code motion
martin.doerr at sap.com
Fri Sep 11 13:58:44 UTC 2015
Thanks for your quick response.
I've uploaded a small subset of the Octane benchmark with a simple launcher here:
It can be run by the following command line:
openjdk_8/bin/java -agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n OctaneLauncher
It only contains the "EarleyBoyer" benchmark, which is good for stressing Node_Backward_Iterator::next().
The DUIterator_Fast may iterate over several billion edges in total, consuming the large majority of the CPU time.
Even without the parameter that enables can_access_local_variables, Node_Backward_Iterator::next() consumes a noticeable (but not dominant) amount of CPU time.
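To illustrate where those edge visits come from, here is a toy model (my own sketch with invented names — `Node`, `walk_stateless`, `walk_stateful` — not HotSpot code) of the stateless "scan all users each time" walk, next to the variant that records the scan index on the stack so each def-use edge is touched only once:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of the backward walk (invented names, not HotSpot code):
// a node may only be handled once all of its users are handled.
struct Node {
  std::vector<int> users;  // indices of user nodes (def-use edges)
  bool visited = false;
};

// Stateless variant, like the current code: every time a node reaches
// the top of the stack, its WHOLE user list is rescanned to find an
// unvisited user, so a node with N users costs O(N^2) edge touches.
long walk_stateless(std::vector<Node>& g, int root, long& touches) {
  std::vector<int> stack{root};
  long done = 0;
  while (!stack.empty()) {
    int n = stack.back();
    int pending = -1;
    for (int u : g[n].users) {        // rescans from the start each time
      ++touches;
      if (!g[u].visited) { pending = u; break; }
    }
    if (pending >= 0) { stack.push_back(pending); continue; }
    if (!g[n].visited) { g[n].visited = true; ++done; }
    stack.pop_back();
  }
  return done;
}

// Stateful variant (the possible optimization): the scan index 'i' is
// kept on the stack, so each def-use edge is touched exactly once.
long walk_stateful(std::vector<Node>& g, int root, long& touches) {
  std::vector<std::pair<int, std::size_t>> stack;
  stack.push_back({root, 0});
  long done = 0;
  while (!stack.empty()) {
    auto& [n, i] = stack.back();
    if (i < g[n].users.size()) {
      int u = g[n].users[i++];        // resume where we left off
      ++touches;
      if (!g[u].visited) stack.push_back({u, 0});
    } else {
      if (!g[n].visited) { g[n].visited = true; ++done; }
      stack.pop_back();
    }
  }
  return done;
}
```

For a node with N users, the stateless walk performs on the order of N^2/2 user scans versus N for the stateful one, which is how the extra edges kept alive by can_access_local_variables turn into billions of iterator steps.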
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
Sent: Donnerstag, 10. September 2015 22:16
To: Doerr, Martin; hotspot-compiler-dev at openjdk.java.net
Subject: Re: Performance issue with Nashorn and C2's global code motion
This is the first time I am hearing about a performance problem with this method.
That code has not been changed for a very long time since we never thought we needed to optimize it.
// '_stack' is emulating a real _stack. The 'visit-all-users' loop has been
// made stateless, so I do not need to record the index 'i' on my _stack.
// Instead I visit all users each time, scanning for unvisited users.
Maybe we can optimize it by not going through all users each time. We should file an RFE for this.
We would need to know how to reproduce it.
On 9/10/15 5:17 AM, Doerr, Martin wrote:
> we were running the Octane benchmark and noticed a very significant performance drop with JVMTI.
> VTune measurement showed that the JVM has spent the majority of the whole CPU time in Node_Backward_Iterator::next
> during PhaseCFG::schedule_late when JvmtiExport::_can_access_local_variables is on
> (see http://cr.openjdk.java.net/~mdoerr/OctaneVTune.jpg).
> We were using openjdk 8 with/without the following option:
> This option activates the JVMTI capability can_access_local_variables, which prevents C2 from killing dead locals,
> leading to a higher number of edges in the graph.
> If we don't use this option, PhaseCFG::schedule_late no longer plays a significant role regarding the CPU time.
> Have you noticed this before? Is this of interest to you?
> For us, this is a significant issue, as we have can_access_local_variables on by default.
> As a solution, we could think of limiting the node iterations in schedule_late and generating a quicker and less
> optimized schedule in extreme cases.
> Best regards,
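The mitigation proposed in the quoted message (limiting the node iterations in schedule_late) could be sketched roughly like this; `schedule_with_budget`, `Placement`, and the cost model are all invented for illustration, and the real change would of course live inside PhaseCFG::schedule_late:

```cpp
#include <vector>

// Invented sketch, not HotSpot code: 'late_cost[i]' stands for the
// amount of user-scanning work node i would need in schedule_late.
// While the global budget lasts we pay for the precise late placement;
// afterwards we keep the quicker, less optimized early placement.
struct Placement { bool late; };

std::vector<Placement> schedule_with_budget(const std::vector<long>& late_cost,
                                            long budget) {
  std::vector<Placement> out;
  for (long c : late_cost) {
    if (budget >= c) {
      budget -= c;
      out.push_back({true});          // precise late schedule
    } else {
      out.push_back({false});         // fallback: keep early schedule
    }
  }
  return out;
}
```

This would bound compile time in the extreme cases while leaving typical methods fully optimized.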