deduplicating lambda methods
john.r.rose at oracle.com
Tue Mar 6 02:27:53 UTC 2018
On Mar 5, 2018, at 3:37 AM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> Now, with indy, I believe we can still optimize them fully, given that indy has some knowledge/state about the site making the call, so the two calls won't be treated as 'identical' by the JIT and the profiling info won't be merged (I guess). But what if, someday, we were to replace indy with condy here? Would we lose performance?
Deduplication of code always has the risk of lower performance,
if the deduplicated code has a profile that the JIT might rely on.
Duplicated code can sometimes collect crucial profile data which
deduplicated code would miss.
You are right about indy-generated lambdas being distinguished
by the JIT. This is especially true when (as is currently the case) each LMF
invocation makes a fresh body of code. It's a clear invitation to
the JIT to optimize that chunk of logic separately.
Suppose the LMF were to de-duplicate dynamically. That would
have a similar effect to a condy-based use of the LMF, in that
the JIT would see fewer distinctions between lambdas from
distinct capture sites.
The most important distinctions are static types and dynamic
type profiles. Monomorphic type profiles are especially favorable
since they allow devirtualization and (usually) subsequent inlining.
The worst performance cliff to fall down is when a monomorphic type
profile point in a duplicated bytecode becomes polluted by being
merged with another duplicate that contributes incompatible type
profile data. If there are virtual calls to the profiled type, they can
fail to inline when a profile is polluted.
This can happen when a generic algorithm is reused on two unrelated
types, and it gets refactored into shared code.
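To make the refactoring hazard concrete, here is a minimal sketch. All the names in it (SharedHelperDemo, sumStringHashes, sumIntegerHashes, sumHashes) are made up for illustration. The two private copies each have a hashCode call that only ever sees one receiver type, while the deduplicated helper has a single call whose profile is fed by both callers. The code only shows the shape of the problem; it does not measure what any particular JIT does with it.

```java
final class SharedHelperDemo {
    // Duplicate #1: used only with Strings, so the hashCode() call here
    // only ever sees String receivers (and String is final, so the static
    // type alone already pins down the target).
    static int sumStringHashes(String[] items) {
        int sum = 0;
        for (String s : items) {
            sum += s.hashCode();
        }
        return sum;
    }

    // Duplicate #2: used only with Integers.
    static int sumIntegerHashes(Integer[] items) {
        int sum = 0;
        for (Integer i : items) {
            sum += i.hashCode();
        }
        return sum;
    }

    // Deduplicated version: one body and one hashCode() call, reached from
    // both callers. Its static receiver type is just Object, and any type
    // profile collected at the call would mix String and Integer.
    static int sumHashes(Object[] items) {
        int sum = 0;
        for (Object o : items) {
            sum += o.hashCode();
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] names = {"a", "b"};
        Integer[] numbers = {1, 2, 3};
        // The shared helper computes the same results as the private copies.
        System.out.println(sumStringHashes(names) == sumHashes(names));
        System.out.println(sumIntegerHashes(numbers) == sumHashes(numbers));
    }
}
```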
There's an important mitigating effect: If the generic code is
inlined into a caller which has a monomorphic profile and/or
verifiable strong static types, then the better types from the
caller can override the polluted profile of the callee.
In effect, a successful inlining decision makes new copies
of the old code, which can sometimes reverse bad effects from
deduplication or genericity. But no real profiling happens
after the JIT splits out a copy of some bytecode for inlining.
This means JIT inlining does not clean up polluted profiles.
(That is a current limitation, not a fundamental problem,
since HotSpot could collect split profiles from early tiers
of compilation for use by later tiers. See JDK-8015416.)
In the case of generic code, if a generic algorithm like
ArrayList::indexOf is inlined into a caller which knows that
the argument to indexOf is always a String, then the call
to Object.equals inside the algorithm can be devirtualized
to String.equals and inlined. The knowledge of the String
type could come from the caller's static types or the caller's
profile.
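A rough sketch of that situation follows. The indexOf below is a simplified stand-in written for this note, not the JDK source (it skips null handling), and findName is a hypothetical caller. The point is only that the caller's signature supplies a String type for the key, which an inlining JIT could use to sharpen the equals call inside the generic loop.

```java
final class IndexOfDemo {
    // Simplified stand-in for a generic search such as ArrayList::indexOf.
    // Assumes key is non-null. Seen in isolation, key.equals(...) is a
    // virtual call on a receiver of static type Object.
    static int indexOf(Object[] elements, Object key) {
        for (int i = 0; i < elements.length; i++) {
            if (key.equals(elements[i])) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical caller with strong static types: the key is always a
    // String here. If indexOf is inlined into this method, the receiver of
    // equals is known to be a String in this copy of the code.
    static int findName(String[] names, String name) {
        return indexOf(names, name);
    }

    public static void main(String[] args) {
        String[] names = {"ann", "bob", "cat"};
        System.out.println(findName(names, "bob"));
        System.out.println(findName(names, "dan"));
    }
}
```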
Currently each lambda indy is a distinct lambda factory call,
so that lambda gets its own adapter code and hence its own
profile and static types. If the desugared body is simple and
inlines, then the JVM has a good chance at optimizing it,
regardless of the quality of the desugared body's profile.
But if the body is complex, the adapter's
static types and profile will apply only "around the edges".
So if two lambdas share a desugared body (whether locally or
globally), the profile of that body might be polluted, but some
(not all) of the pollution can be cleaned up if the JIT finds
useful context in each indy site. The lambda bodies that
are best suited for this are ones which mainly call methods
on argument objects already seen (and maybe profiled)
by the lambda's adapter object (FI implementation).
A lambda body which creates a temporary value of
a statically unpredictable type might fall off the cliff sooner,
if it is shared and the sharing makes it harder to get
a monomorphic profile on the unpredictable type.
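Here is a small sketch of the two body shapes, using hand-written static methods as stand-ins for desugared lambda bodies that two capture sites share. Every name in it (LambdaBodyShapes, lengthOf, render, and the four site constants) is invented for the example. The first body only calls a method on its argument, whose type each site's functional interface signature already narrows. The second pulls a temporary out of a map, and nothing at the capture site says what that temporary's class will be.

```java
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.ToIntFunction;

final class LambdaBodyShapes {
    // Shape 1: calls a method only on its argument.
    static int lengthOf(CharSequence s) {
        return s.length();
    }

    // Shape 2: creates a temporary whose dynamic type depends on the data.
    // Assumes the key is present in the table.
    static String render(Map<String, ?> table, String key) {
        Object tmp = table.get(key);
        return tmp.toString(); // receiver type is not visible to the caller
    }

    // Two capture sites sharing shape 1; each site's type argument tells
    // the adapter what kind of argument to expect.
    static final ToIntFunction<String> STRING_SITE = LambdaBodyShapes::lengthOf;
    static final ToIntFunction<StringBuilder> BUILDER_SITE = LambdaBodyShapes::lengthOf;

    // Two capture sites sharing shape 2; after erasure both just take a Map,
    // so the sites say nothing about the class of the temporary.
    static final BiFunction<Map<String, Integer>, String, String> COUNT_SITE =
            LambdaBodyShapes::render;
    static final BiFunction<Map<String, StringBuilder>, String, String> TEXT_SITE =
            LambdaBodyShapes::render;

    public static void main(String[] args) {
        System.out.println(STRING_SITE.applyAsInt("abc"));
        System.out.println(BUILDER_SITE.applyAsInt(new StringBuilder("hello")));
        System.out.println(COUNT_SITE.apply(Map.of("k", 42), "k"));
        System.out.println(TEXT_SITE.apply(Map.of("k", new StringBuilder("hi")), "k"));
    }
}
```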
These rules of thumb are generally valid, but specific
results are not robust, since JVM optimization techniques
change over time.
A new language construct like streams or lambdas
or patterns always introduces new code shapes which
the JIT has to respond to, and there is a teething
period where the JIT can't optimize the construct
as nicely as we want. JIT engineers are full of
tricks, and they can often find a tweak that will
make the code digest well. Sometimes that's hard,
in which case it might be years before a new feature
performs as well as hand-optimized code. The
low-level stuff we are talking about here is probably
in the easier category, of tweakable code.
One way to preserve more contextuality from translation
of lambdas is to retain indy in the translation strategy,
but use condy to hold shared resources. Maybe the
LMF can make a tunable, on-the-fly decision how to use
the context to "hook in" a profile to the shared stuff
in the condy. It would be very easy to wrap a trivial
bit of bytecode around a method which does nothing
except profile arguments and call the method on those
arguments. In this way most class structure generated
by LMFs could be reused across compilation units,
while still collecting contextual profile information.
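In source form, the wrapper idea might look something like the sketch below. The names (TrampolineDemo, sharedBody, siteA, siteB) are hypothetical, and a real LMF would spin such wrappers as bytecode rather than have them written by hand. Whether a given JVM actually records argument type profiles at these points is an implementation detail that I am assuming here, not demonstrating.

```java
final class TrampolineDemo {
    // The shared stuff: one copy, reusable from any number of sites.
    static int sharedBody(Object o) {
        return o.hashCode();
    }

    // Per-site wrappers: trivial methods that do nothing except receive the
    // arguments (giving the JVM a distinct place to profile them) and
    // forward to the shared body.
    static int siteA(Object o) {
        return sharedBody(o);
    }

    static int siteB(Object o) {
        return sharedBody(o);
    }

    public static void main(String[] args) {
        System.out.println(siteA("a")); // this site only ever passes Strings
        System.out.println(siteB(7));   // this site only ever passes Integers
    }
}
```

If the JIT inlines sharedBody into each wrapper, each copy can be specialized using what is known at that wrapper, which is the effect we are after.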
And when I say "shared stuff in condy", it could be
either locally deduplicated stuff, or dynamically and
globally deduplicated stuff.
Perhaps a well balanced design would manage to
create distinct profile points for every distinct spot in the
source code, while sharing everything else as globally
as possible. It would also tell the JIT about this, so that the JIT
never fails to inline the shared stuff into each distinctly
profiled context. Our current system approximates this.
More information about the amber-dev