Possible subtle memory model error in ClassValue
galder at redhat.com
Mon Aug 10 06:12:55 UTC 2020
I'm David's colleague. I'm the one who's spotted this issue twice while
GraalVM is doing its points-to analysis with jdk11u-dev. The fuller
exception is here:
Caused by: java.lang.NullPointerException
... 5 more
On Fri, Aug 7, 2020 at 11:36 PM John Rose <john.r.rose at oracle.com> wrote:
> On Aug 7, 2020, at 9:52 AM, Andrew Haley <aph at redhat.com> wrote:
> > On 8/7/20 4:39 PM, David Lloyd wrote:
> > > However, I'm wondering if this isn't a side effect of what appears
> > > to be an incorrect double-checked lock at lines 374-378 of
> > > ClassValue.java . In order for the write to the non-volatile
> > > `cache` field of ClassValueMap, it is my understanding that there
> > > must be some acquire/release edge from where the variable is
> > > published to guarantee that all of the writes are visible to the
> > > read site, and I'm not really sure that the exit-the-constructor
> > > edge is going to match up with with the synchronized block which
> > > protects a different non-volatile field. And even if my feeling
> > > that this is dodgy is valid, I'm not sure whether this NPE is a
> > > possible outcome of that problem!
> > It certainly doesn't look right to me. AFAICS this is classic broken
> > double-checked locking. It'd need some sort of fence after
> > constructing the ClassValueMap instance and before publishing it.
> Y’all are calling my baby ugly but he’s really cute if you look
> at him from the right angle.
> (First, a quick question: This bug shows up on x86, right?
Yes, the issue is happening on x86.
> If so, we probably are not looking at a hardware reordering
> problem but something stemming from Graal’s reordering
> of operations compared to C1 and C2, perhaps after a different
> set of inlining decisions uncharacteristic of C2.)
To the best of my knowledge, GraalVM points-to analysis is just a JVM
process that figures out which code is used by the end-user application.
The result of this process is then used to create the native executable.
I'm not aware of the Graal compiler having an impact here.
>From what I've learnt looking at this code and more recent JVMCI
implementations, jdk11u-dev relies on reflection to understand the end-user
application, whereas more modern JVMCI stacks avoid the use of reflection
and query the JVM directly.
I've been trying to create a vanilla JDK stress test that would trigger
such NPE but I've not had luck so far. I don't know whether JVMCI has a
part to play here.
> The bug indeed looks like a dropped fence at the end of the
> constructor for ClassValueMap. (I don’t see an easier way to
> get a null at the NPE point.) The fence might not actually be
> dropped, but a write which the fence is supposed to fence
> might have been reordered after the fence, and after an
> unlock operation, allowing someone to see the initial
> null after the unlock.
> The double-check idiom breaks when the outer check
> (not synchronized) gets half-baked data and acts on it.
> This much-maligned idiom would not be broken in the
> present case under the assumption that a data dependency
> on type.classValueMap (set in initializeMap) will see
> a fully initialized value of ClassValueMap.cacheArray.
> Now we get into the fine print, and I agree there is a bug
> here. The easy fix is to repair the logic under which that
> ugly everybody’s-hating-on-it double check idiom would
> be in fact correct.
> The fine print requires a couple different things that are not
> quite fulfilled by the present code, and Graal has either
> reordered the memory accesses to expose the problem, or
> it (itself) has a complementary bug. (Or it has an unrelated
> bug, and you people are *staring* at my baby!)
> First, as Paul noted, you need a final variable in your class
> to get the benefit of a fence at the end of the constructor.
> Oops #1: ClassValueMap doesn’t have *any* finals. That’s
> awkward. Two fixes for that: (a) add a dummy final, and
> (b) add a real fence in the constructor. For better luck,
> maybe put the fence at the end of sizeCache.
> On machines (not x86) which need fences *before* final
> reads, another explicit fence should be placed before the
> unsynchronized read (or else inside the getCache method).
> By the letter of the JMM, the helpful effects of the fence
> at the end of the constructor are only guaranteed for
> references which depend on final variables set by that
> constructor, because only those final variables get a
> “freeze” operation which stabilizes them (and values
> dependently loaded via them). Throwing in a dummy
> final is therefore not guaranteed to work. But throwing
> in a fence will work. One downside of the fence is that
> the JMM is silent about fences, but the JMM is widely
> recognized as a partial story, not the full or final story.
> (Here’s a tidbit of JMM politics: I once heard Doug Lea
> considering whether maybe all fields should be treated
> more like final fields. I don’t know if this is still a live
> idea, but it would make this bug go way, since nearly all
> constructors would then get fences of some sort.)
> Here’s a bit of explanatory (non-normative) text from the draft
> POPL paper for JMM, where the parenthetic comment indicates
> (I think) the sort of thing that is happening here:
> > Let us now assume that the acquire and release do happen. As long as
> > these actions take place after object has been constructed (and there
> > is no code motion around the end of the constructor), the diffs that
> > the second processor acquires are guaranteed to reflect the correctly
> > constructed object.
> Basically, the synchronization statement is expected to do a
> release *after* the non-null value is stored. It looks like this
> is failing due to a lost and/or reordered fence.
> Second, David says:
> > I’m not really sure that the exit-the-constructor edge is going to match
> > up with with the synchronized block which protects a different
> > non-volatile field.
> By the letter of the JMM (AFAIUI), this code is coloring way outside
> the lines. (Yes, I admit it.) The problem is pretty fundamental.
> I went and swapped in some JMM spec just now (don’t have it in
> short term memory; go figure), and here’s what I re-learned.
> In JMM terms, a “release” is the earlier end of any edge in relation
> called “synchronizes-with”. An “acquire” is the latter end of such
> an edge. In the JMM this relation applies only to lock, unlock,
> and volatile reads and writes. Your guess is as good as mine whether
> other operations (CAS, fences, etc.) participate; the JMM is silent.
> Crucially, acquires and release do not touch plain fields. This is
> counter-intuitive for those of us who like to reason with fences.
> The JMM prevents regular writes from floating past what we like
> to think of as “release points” (unlocks) by appealing to processor
> order inside lock/unlock pairs, and also by appealing to global
> ordering of multiple lock/unlock pairs (or stuff with volatiles
> and other corner cases which we will neglect here). In order
> to work right, a normal write has to come before a release *and*
> a matching normal read has to come after an acquire, and such
> an acquire is nothing other than the lock of a later lock/unlock
> pair, typically in another thread.
> So the rules which reliably connect normal writes to normal
> reads in other threads work differently from acquire fences and
> release fences as we (or just I?) usually think of them. Yes, there
> are “acquires” and “releases” in the JMM. No, they are not
> primitives but rather descriptions of the way the happens-before
> relation is constrained by (among other things) lock and unlock
> On x86, you don’t have to worry about acquires; you just set up
> the right release fences. In the JMM, there’s no way to get either
> an acquire or a release without a synchronized statement (or
> other fancy stuff like volatiles). That’s why double-check is
> broken in the pure JMM, and why it can be repaired if you
> add some (non-JMM) fences.
> So why is this showing up suddenly with Graal? Possibly it’s
> got a really aggressive interpretation of the JMM, relative to
> the standard HotSpot JITs. Possibly it’s got a bug. Graal might
> be allowing the initialization of ClassValueMap.cacheArray
> to float down past publication of the ClassValueMap on
> type.classValueMap (in initializeMap). Less likely, it is allowing
> the initialization to totally escape the lock/unlock pair, something
> the JMM might allow but the Cookbook advises against (in its
> “Can Reorder” table).
> Perhaps Graal is modeling the “freeze” operations on finals more
> accurately (and the Cookbook gives instructions for this). That
> would allow the ClassValueMap to correctly initialize its frozen
> finals, but not (unfortunately) the not-frozen cacheArray field.
> It appears that the new JIT on the block is exposing my neglect
> for my poor baby, in not putting the right fences around its playpen.
> Bottom line: Add a release fence before type.cVM is set, and add
> a (no-op on x86) acquire fence in getCache. Quick test: Add a dummy
> final to CVM, to see if that makes the bug let go.
> — John
More information about the core-libs-dev