optimizing acmp in L-World

Karen Kinnear karen.kinnear at oracle.com
Wed Jan 31 15:26:10 UTC 2018


Really glad to see the potential performance paths here.

A note - the way I understand this, not only does clever_life 
depend on .Equals() that would never return false for exact
pointer matches, new_acmp has the same assumption.

I have been assuming that value classes would be able to override 
Object.equals() (the “new” implementation of Object.equals() that
handles value types with component-wise comparisons).

Should we add a requirement that Equals can not ever return
false for exact pointer matches?

Do we assume that most value types will be able to take advantage
of an intrinsic for the new Object.equals()?


> On Jan 29, 2018, at 3:42 PM, John Rose <john.r.rose at oracle.com> wrote:
> Tobias, Roland, and I had a chat in Santa Clara last week
> about optimizing the acmp operation in the JVM.  Here are
> my notes on it.
> Background:  The acmp operation, also known as reference
> comparison, is overloaded in L-World to handle values as well,
> with a specific semantics of returning false if either operand is
> a value (similar to NaN behavior for fcmp).  The rationale for
> this is that acmp is reference comparison, even in L-World,
> and since values have no references, they can never be
> equal as references.
> (Alternatively, it is *as if* both operands were converted
> *to references* before the acmp operation, which means
> that either operand, if it is a value, *would be* boxed to a
> reference value, which then, being a new reference, is
> automatically unequal to any other reference.  This tricky
> account may be appealing in some circumstances.)
> OK, so how do we make it fast?  Ideally, we would like
> to strength-reduce the enhanced acmp instruction down
> to the original acmp instruction, which after all is just a
> single instruction (such as cmpq) on all platforms.
> The first thing to notice, before going into details, is
> that in L-world all Q-values are buffered in the interpreter.
> This means that, at least for interpreter-generated values,
> all values can be treated as physical pointers, even
> if they are logically Q-values.
> (Crib sheet:  In L-world, all kinds of values except primitives
> are formally carried by L-values, under L-descriptors.  True
> references can be referred to as R-values of R-type, reference
> type, or object class.  New value types are referred to as Q-values
> of Q-type, value type, or value class.  When you don't know what
> you have, you can say it's a U-type, which just like a Q-type or
> R-type is carried by an L-value under an L-descriptor.)
> This means that the old acmp semantics are almost correct,
> in many cases.  If you have two L-values and do a pointer
> comparison on their physical pointers, and if the pointers
> differ, you are done.  If the physical pointers compare equal,
> then there is one more thing you have to do:  Make sure that
> the L-value carried by both physical pointers is in fact a
> reference, an R-value.
> So the logic looks like this, and the new semantics are in
> the second "if" statement:
>  bool new_acmp(ptr x, ptr y) {
>    if (!old_acmp(x, y))  return false;  // x != y => false
>    if (is_value_ptr(x))  return false;  // x.is_value => false
>    return true;
>  }
>  bool old_acmp(ptr x, ptr y) {
>    return x == y;  // cmpq etc.
>  }
> Note that since acmp is symmetric, the JVM has the option
> to silently test either x or y for "is_value_ptr" if the two physical
> pointers compare equal.
> This leads immediately to the first set of optimizations, which
> probably are enough to get us where we want to go.  (There
> is a second set we can add later.)  All the JVM has to do is
> prove or detect that one or the other physical pointer either
> *is certainly* a Q-value, or *is certainly not* a Q-value.
> If either physical pointer is certainly a Q-value, then the
> new_acmp instruction is a constant false.  If either physical
> pointer is certainly not a Q-value, then the new_acmp
> instruction strength reduces to the old_acmp instruction.
> The strength reduction to false can be done in any place
> where the type of either operand is known to be a Q-type.
> The interpreter doesn't track types, and the acmp instruction
> isn't resolved so it can't be quickened to a type context,
> but the JIT has lots of type information.  Thus, if either
> operand of acmp is statically known (or accurately speculated)
> to be a value type, then the acmp instruction constant folds
> to false.
> Likewise, if either operand is known to be an R-type
> (true reference) the acmp may be strength reduced to
> a simple reference comparison.  This also is likely to
> happen in the JIT.  It also happens in the interpreter
> in the acmp_null instruction, where (obviously) the
> second operand is a reference (i.e., null).
> (Note that the type Object is a U-type in L-world, so
> Object-to-Object comparisons such as are found in
> generic collection code do not benefit from this
> optimization.  More later on that.)  
> The JIT can also track, in its profile, whether various
> instructions "see" only Q-values or only R-values.
> (The classic HotSpot already tracks "sightings" of
> null for similar purposes.)  In that case, speculative
> narrowings (only Q-types or only R-types) can be
> used to short circuit new_acmp down to old_acmp.
> This is useful because the speculation can sometimes
> be verified out of a loop body, where the actual acmp
> instruction is executed in a loop.  Remember that the
> JIT can choose which operand of acmp to test for
> value-ness; if it chooses a loop invariant, and the
> invariant is always Q or always R, then the acmp
> inside the loop folds up, no matter what the dynamic
> value of the other operand.
> For that matter, even if speculation fails, loop
> unswitching can get to a similar result, at the cost
> of two copies of the loop.
> So, in many cases, contextual information can allow
> the JVM to fold up new_acmp to old_acmp or constant
> false.
> In cases such folding fails, there is as simple trick
> which can make things easier.  Recall that folding
> fails when *both* operands are statically U-types,
> *neither* operand folding to a Q-type or R-type.
> In that case, the JVM must pick one operand or
> the other (it doesn't matter which) and perform
> the is_value_ptr operation on it, in such a way
> that the result of that operand affects the final
> result in the correct way.  A naive implementation
> of new_acmp adds an extra test-and-branch
> to fold in the is_value_ptr test, but a clever might
> be able to avoid it.
> Suppose there is a branch free technique for
> implementing is_value_ptr, such as extracting
> a bit from the physical pointer to the U-value,
> or (after a null check) extracting a bit from the
> physical pointer of the class metadata pointer
> in the value's header, or (after two indirections)
> extracting a bit from the structure of the class
> metadata.  (Mr. Simm's prototype puts such
> a bit in the metadata pointer, in the object
> header.)
> If such a bit can be obtained in a branch-free
> manner, then it can be used to perturb the result
> of the old_acmp in a useful manner.  For example:
>  bool new_acmp(ptr x, ptr y) {
>    if (x == null)  return y == null;
>    int bit = (x->class_in_header & Q_TYPE_BIT);
>    ptr x1 = x | swar_broadcast_bit(bit != 0);
>    return old_acmp(x1, y);
>  }
>  int swar_broadcast_bit(bool z) {
>    return z ? -1 : 0;  // use ALU ops
>  }
> The idea is to extract a bit which signals the presence
> of a Q-type (in either operand) and use it to "mess up"
> the equality comparison, using and, or, or xor as needed.
> This perturbation technique has two costs:  First,
> it requires a null check (unless the Q-bit is in the
> actual physical reference).  Second, it requires
> an ALU operation or two to spread the perturbation.
> But these costs will probably be negligible, except
> (perhaps) in very tight loops in generic code.
> Generic code loops performing frequent acmp operations
> on untyped (Object) operands will need to perform
> extra null checks and/or value/reference detections
> if they are not already present (by luck) in the loop
> context.  In this case, loop unswitching may be useful.
> There is one more contextual operation which may be
> helpful in such cases, if all else fails (and loop unswitching
> is not desirable).  Honestly, I haven't seen any real-world
> code yet that needs it, but it is comforting to have as a
> fall-back.  The idea is this:  If the acmp is contextually
> followed by a call to Object.equals, in the usual Legacy
> Idiom For Equality (L.I.F.E.), then there is one more trick
> we can play.  We'd like to force the new_acmp down to
> an old_acmp in these cases.  Can we?  Here's the choice:
>  bool dutiful_life(ptr x, ptr y) {
>    if (new_acmp(x, y))  return true;
>    return x->Object_equals(y);
>  }
>  bool clever_life(ptr x, ptr y) {
>    if (old_acmp(x, y))  return true;
>    return x->Object_equals(y);
>  }
> Note that the only semantic difference between old_acmp
> and new_acmp arises when *both* operands are the *same*
> value.  (Work out the cases:  In every other case, the simple
> pointer comparison detects the difference and produces the
> correct "false" answer.)  Thus, in the L.I.F.E., the only case
> where old_acmp gives the wrong answer produces an
> immediate answer of "true", instead of following up with
> a call to x.equals(x) on the Q-value.
> To put it in a nutshell, the only difference between dutiful_life
> and clever_life is that the former always calls Object::equals
> if either operand is a value, and the the latter calls
> Object::equals in a subset of those cases (when the
> physical pointers differ).  When the physical pointers
> do *not* differ, there is no call to Object::equals.  But
> clever_life always delivers the same answer as dutiful_life.
> So is there a problem?  Not much, although it requires us to treat
> Object::equals as an intrinsic with some predictable semantics.
> The crucial inside that allows us to adopt clever_life is that,
> if you call x.equals(x) on a non-null x, the result must always
> be true.  Now, this is not enforced by the JVM, but it is clearly
> documented in the API specification for Object::equals, and
> lots of stuff would be already broken if that were ever false.
> Thus, if we are willing to rule out implementations of Object::equals
> for which a value can ever compare *not equal* to itself, then
> we can substitute clever_life for dutiful_life.  We also have to
> allow one more thing:  That we can silently drop calls to equals
> (at least in the context of L.I.F.E.) when the JVM can prove that
> the contract of equals allows the JVM to predict the outcome.
> This means that if you put side effects (like print statements)
> into an equals method on a value type, you might not see the
> side effects, after the JVM has optimized things.  This corner
> case may require further thought in order to straighten it out.
> Note that L.I.F.E. is frequently found in generic code, where
> instead of operating on Object values it operates on type
> parameter types.  (The JVM just sees Object, after erasure.)
> In that case, if the JVM were to specialize the generic code
> separately for some types (such as value types), along the
> lines of the template class proposals, then the L.I.F.E.
> code would fold up in the JIT using the rules given at the
> top of this note.  The generic expression "a == b", where
> a or b was a value of a specialized type parameter T,
> would fold up to false in every specialization where T was
> a Q-type.  Likewise, as noted above, if L.I.F.E. is applied
> to a true R-type (or specialized to such a type), then any
> new_acmp on that type can be short-circuited to the faster
> old_acmp.
> All in all, it appears that the cost of the new acmp semantics
> can, for all practical purposes, be pushed down into the noise.
> — John

More information about the valhalla-dev mailing list