Possible subtle memory model error in ClassValue

John Rose john.r.rose at oracle.com
Fri Aug 7 21:35:16 UTC 2020

On Aug 7, 2020, at 9:52 AM, Andrew Haley <aph at redhat.com> wrote:
> On 8/7/20 4:39 PM, David Lloyd wrote:
> > However, I'm wondering if this isn't a side effect of what appears
> > to be an incorrect double-checked lock at lines 374-378 of
> > ClassValue.java [1].  In order for the write to the non-volatile
> > `cache` field of ClassValueMap, it is my understanding that there
> > must be some acquire/release edge from where the variable is
> > published to guarantee that all of the writes are visible to the
> > read site, and I'm not really sure that the exit-the-constructor
> > edge is going to match up with with the synchronized block which
> > protects a different non-volatile field.  And even if my feeling
> > that this is dodgy is valid, I'm not sure whether this NPE is a
> > possible outcome of that problem!
> It certainly doesn't look right to me. AFAICS this is classic broken
> double-checked locking. It'd need some sort of fence after
> constructing the ClassValueMap instance and before publishing it.

Y’all are calling my baby ugly but he’s really cute if you look
at him from the right angle.

(First, a quick question:  This bug shows up on x86, right?
If so, we probably are not looking at a hardware reordering
problem but something stemming from Graal’s reordering
of operations compared to C1 and C2, perhaps after a different
set of inlining decisions uncharacteristic of C2.)

The bug indeed looks like a dropped fence at the end of the
constructor for ClassValueMap.  (I don’t see an easier way to
get a null at the NPE point.)  The fence might not actually be
dropped, but a write which the fence is supposed to fence
might have been reordered after the fence, and after an
unlock operation, allowing someone to see the initial
null after the unlock.

The double-check idiom breaks when the outer check
(not synchronized) gets half-baked data and acts on it.
This much-maligned idiom would not be broken in the
present case under the assumption that a data dependency
on type.classValueMap (set in initializeMap) will see
a fully initialized value of ClassValueMap.cacheArray.

Now we get into the fine print, and I agree there is a bug
here.  The easy fix is to repair the logic under which that
ugly everybody’s-hating-on-it double check idiom would
be in fact correct.

The fine print requires a couple different things that are not
quite fulfilled by the present code, and Graal has either
reordered the memory accesses to expose the problem, or
it (itself) has a complementary bug.  (Or it has an unrelated
bug, and you people are *staring* at my baby!)

First, as Paul noted, you need a final variable in your class
to get the benefit of a fence at the end of the constructor.
Oops #1:  ClassValueMap doesn’t have *any* finals.  That’s
awkward.  Two fixes for that:  (a) add a dummy final, and
(b) add a real fence in the constructor.  For better luck,
maybe put the fence at the end of sizeCache.

On machines (not x86) which need fences *before* final
reads, another explicit fence should be placed before the
unsynchronized read (or else inside the getCache method).

By the letter of the JMM, the helpful effects of the fence
at the end of the constructor are only guaranteed for
references which depend on final variables set by that
constructor, because only those final variables get a
“freeze” operation which stabilizes them (and values
dependently loaded via them).  Throwing in a dummy
final is therefore not guaranteed to work.  But throwing
in a fence will work.  One downside of the fence is that
the JMM is silent about fences, but the JMM is widely
recognized as a partial story, not the full or final story.

(Here’s a tidbit of JMM politics:  I once heard Doug Lea
considering whether maybe all fields should be treated
more like final fields.  I don’t know if this is still a live
idea, but it would make this bug go way, since nearly all
constructors would then get fences of some sort.)

Here’s a bit of explanatory (non-normative) text from the draft
POPL paper for JMM, where the parenthetic comment indicates
(I think) the sort of thing that is happening here:

> Let us now assume that the acquire and release do happen. As long as
> these actions take place after object has been constructed (and there
> is no code motion around the end of the constructor), the diffs that
> the second processor acquires are guaranteed to reflect the correctly
> constructed object.

Basically, the synchronization statement is expected to do a
release *after* the non-null value is stored.  It looks like this
is failing due to a lost and/or reordered fence.

Second, David says:

> I’m not really sure that the exit-the-constructor edge is going to match
> up with with the synchronized block which protects a different
> non-volatile field.

By the letter of the JMM (AFAIUI), this code is coloring way outside
the lines.  (Yes, I admit it.)  The problem is pretty fundamental.
I went and swapped in some JMM spec just now (don’t have it in
short term memory; go figure), and here’s what I re-learned.

In JMM terms, a “release” is the earlier end of any edge in relation
called “synchronizes-with”.  An “acquire” is the latter end of such
an edge.  In the JMM this relation applies only to lock, unlock,
and volatile reads and writes.  Your guess is as good as mine whether
other operations (CAS, fences, etc.) participate; the JMM is silent.
Crucially, acquires and release do not touch plain fields.  This is
counter-intuitive for those of us who like to reason with fences.

The JMM prevents regular writes from floating past what we like
to think of as “release points” (unlocks) by appealing to processor
order inside lock/unlock pairs, and also by appealing to global
ordering of multiple lock/unlock pairs (or stuff with volatiles
and other corner cases which we will neglect here).  In order
to work right, a normal write has to come before a release *and*
a matching normal read has to come after an acquire, and such
an acquire is nothing other than the lock of a later lock/unlock
pair, typically in another thread.

So the rules which reliably connect normal writes to normal
reads in other threads work differently from acquire fences and
release fences as we (or just I?) usually think of them.  Yes, there
are “acquires” and “releases” in the JMM.  No, they are not
primitives but rather descriptions of the way the happens-before
relation is constrained by (among other things) lock and unlock

On x86, you don’t have to worry about acquires; you just set up
the right release fences.  In the JMM, there’s no way to get either
an acquire or a release without a synchronized statement (or
other fancy stuff like volatiles).  That’s why double-check is
broken in the pure JMM, and why it can be repaired if you
add some (non-JMM) fences.

So why is this showing up suddenly with Graal?  Possibly it’s
got a really aggressive interpretation of the JMM, relative to
the standard HotSpot JITs.  Possibly it’s got a bug.  Graal might
be allowing the initialization of ClassValueMap.cacheArray
to float down past publication of the ClassValueMap on
type.classValueMap (in initializeMap).  Less likely, it is allowing
the initialization to totally escape the lock/unlock pair, something
the JMM might allow but the Cookbook advises against (in its
“Can Reorder” table).

Perhaps Graal is modeling the “freeze” operations on finals more
accurately (and the Cookbook gives instructions for this).  That
would allow the ClassValueMap to correctly initialize its frozen
finals, but not (unfortunately) the not-frozen cacheArray field.
It appears that the new JIT on the block is exposing my neglect
for my poor baby, in not putting the right fences around its playpen.

Bottom line:  Add a release fence before type.cVM is set, and add
a (no-op on x86) acquire fence in getCache.  Quick test:  Add a dummy
final to CVM, to see if that makes the bug let go.

— John

More information about the core-libs-dev mailing list