Possible subtle memory model error in ClassValue
paul.sandoz at oracle.com
Fri Aug 7 23:22:28 UTC 2020
Here’s some pertinent information from Doug’s excellent document of memory order modes :
"As a delicate (but commonly exploited) special case of the above considerations, acquire-style reads of immutable fields of newly constructed objects do not require an explicit fence or moded access -- Plain mode reads suffice: If the consumer has not seen the new object before, it cannot have stale values that it must otherwise ignore or discard, and it cannot perform the read until it knows the address. On subsequent encounters, reusing old values is OK, because they cannot have changed. This reasoning rests on the only known defensible exception to the rule of never making assumptions about local precedence order: The reference (address) of a new object is assumed never to be known and impossible to speculate until creation. This assumption is also relied on by other Java security requirements.
The resulting techniques are used in Java final field implementations, and are the reason why specified guarantees for final fields are conditional upon constructors not leaking references to objects before constructor return. Classes with final fields are normally implemented by issuing a form of Release fence upon constructor return. Further, because nothing need be guaranteed about interactions with reads by the constructor, a StoreStoreFence suffices. Similar techniques may apply in other contexts, but can be unacceptably fragile. For example, code that works when the associated objects are always newly constructed may, without further safeguards, fail upon later changes to instead recycle the objects from pools."
I thought that by virtue of ClassValueMap extending WeakHashMap, which contains final fields, it would all work out. But, I was uncertain so I ran a jcstress test modifying the sample JMMSample_06_Finals .
See gist here:
I ran in tough mode and observed this failure on my MacBook:
*** FAILED tests
Strong asserts were violated. Correct implementations should have no assert failures here.
1 matching test results.
(JVM args: [-XX:-TieredCompilation, -XX:+UnlockDiagnosticVMOptions, -XX:+StressLCM, -XX:+StressGCM])
Observed state Occurrences Expectation Interpretation
-1 2,420,284,404 ACCEPTABLE Object is not seen yet.
3 320 FORBIDDEN Seeing partially constructed object.
4 53 FORBIDDEN Seeing partially constructed object.
5 3,486 FORBIDDEN Seeing partially constructed object.
6 8,409 FORBIDDEN Seeing partially constructed object.
7 11,282 FORBIDDEN Seeing partially constructed object.
8 181,774,476 ACCEPTABLE Seen the complete object.
So for HotSpot it seems to occur only for C2 under certain stress options, -XX:+StressLCM, -XX:+StressGCM, which might explain why an issue is not generally being observed. I don’t understand enough about these options to know why this is so.
However, this makes me nervous enough that it might be best to add a dummy final field to ClassMapValue, and to review other similar code.
David, perhaps you could add a dummy final field to ClassValueMap and see if that fixes the problem with Graal?
 http://gee.cs.oswego.edu/dl/html/j9mm.html <http://gee.cs.oswego.edu/dl/html/j9mm.html>
 http://hg.openjdk.java.net/code-tools/jcstress/file/1dd2cca36fa9/jcstress-samples/src/main/java/org/openjdk/jcstress/samples/JMMSample_06_Finals.java <http://hg.openjdk.java.net/code-tools/jcstress/file/1dd2cca36fa9/jcstress-samples/src/main/java/org/openjdk/jcstress/samples/JMMSample_06_Finals.java>
> On Aug 7, 2020, at 2:35 PM, John Rose <john.r.rose at oracle.com> wrote:
> On Aug 7, 2020, at 9:52 AM, Andrew Haley <aph at redhat.com> wrote:
>> On 8/7/20 4:39 PM, David Lloyd wrote:
>>> However, I'm wondering if this isn't a side effect of what appears
>>> to be an incorrect double-checked lock at lines 374-378 of
>>> ClassValue.java . In order for the write to the non-volatile
>>> `cache` field of ClassValueMap, it is my understanding that there
>>> must be some acquire/release edge from where the variable is
>>> published to guarantee that all of the writes are visible to the
>>> read site, and I'm not really sure that the exit-the-constructor
>>> edge is going to match up with with the synchronized block which
>>> protects a different non-volatile field. And even if my feeling
>>> that this is dodgy is valid, I'm not sure whether this NPE is a
>>> possible outcome of that problem!
>> It certainly doesn't look right to me. AFAICS this is classic broken
>> double-checked locking. It'd need some sort of fence after
>> constructing the ClassValueMap instance and before publishing it.
> Y’all are calling my baby ugly but he’s really cute if you look
> at him from the right angle.
> (First, a quick question: This bug shows up on x86, right?
> If so, we probably are not looking at a hardware reordering
> problem but something stemming from Graal’s reordering
> of operations compared to C1 and C2, perhaps after a different
> set of inlining decisions uncharacteristic of C2.)
> The bug indeed looks like a dropped fence at the end of the
> constructor for ClassValueMap. (I don’t see an easier way to
> get a null at the NPE point.) The fence might not actually be
> dropped, but a write which the fence is supposed to fence
> might have been reordered after the fence, and after an
> unlock operation, allowing someone to see the initial
> null after the unlock.
> The double-check idiom breaks when the outer check
> (not synchronized) gets half-baked data and acts on it.
> This much-maligned idiom would not be broken in the
> present case under the assumption that a data dependency
> on type.classValueMap (set in initializeMap) will see
> a fully initialized value of ClassValueMap.cacheArray.
> Now we get into the fine print, and I agree there is a bug
> here. The easy fix is to repair the logic under which that
> ugly everybody’s-hating-on-it double check idiom would
> be in fact correct.
> The fine print requires a couple different things that are not
> quite fulfilled by the present code, and Graal has either
> reordered the memory accesses to expose the problem, or
> it (itself) has a complementary bug. (Or it has an unrelated
> bug, and you people are *staring* at my baby!)
> First, as Paul noted, you need a final variable in your class
> to get the benefit of a fence at the end of the constructor.
> Oops #1: ClassValueMap doesn’t have *any* finals. That’s
> awkward. Two fixes for that: (a) add a dummy final, and
> (b) add a real fence in the constructor. For better luck,
> maybe put the fence at the end of sizeCache.
> On machines (not x86) which need fences *before* final
> reads, another explicit fence should be placed before the
> unsynchronized read (or else inside the getCache method).
> By the letter of the JMM, the helpful effects of the fence
> at the end of the constructor are only guaranteed for
> references which depend on final variables set by that
> constructor, because only those final variables get a
> “freeze” operation which stabilizes them (and values
> dependently loaded via them). Throwing in a dummy
> final is therefore not guaranteed to work. But throwing
> in a fence will work. One downside of the fence is that
> the JMM is silent about fences, but the JMM is widely
> recognized as a partial story, not the full or final story.
> (Here’s a tidbit of JMM politics: I once heard Doug Lea
> considering whether maybe all fields should be treated
> more like final fields. I don’t know if this is still a live
> idea, but it would make this bug go way, since nearly all
> constructors would then get fences of some sort.)
> Here’s a bit of explanatory (non-normative) text from the draft
> POPL paper for JMM, where the parenthetic comment indicates
> (I think) the sort of thing that is happening here:
>> Let us now assume that the acquire and release do happen. As long as
>> these actions take place after object has been constructed (and there
>> is no code motion around the end of the constructor), the diffs that
>> the second processor acquires are guaranteed to reflect the correctly
>> constructed object.
> Basically, the synchronization statement is expected to do a
> release *after* the non-null value is stored. It looks like this
> is failing due to a lost and/or reordered fence.
> Second, David says:
>> I’m not really sure that the exit-the-constructor edge is going to match
>> up with with the synchronized block which protects a different
>> non-volatile field.
> By the letter of the JMM (AFAIUI), this code is coloring way outside
> the lines. (Yes, I admit it.) The problem is pretty fundamental.
> I went and swapped in some JMM spec just now (don’t have it in
> short term memory; go figure), and here’s what I re-learned.
> In JMM terms, a “release” is the earlier end of any edge in relation
> called “synchronizes-with”. An “acquire” is the latter end of such
> an edge. In the JMM this relation applies only to lock, unlock,
> and volatile reads and writes. Your guess is as good as mine whether
> other operations (CAS, fences, etc.) participate; the JMM is silent.
> Crucially, acquires and release do not touch plain fields. This is
> counter-intuitive for those of us who like to reason with fences.
> The JMM prevents regular writes from floating past what we like
> to think of as “release points” (unlocks) by appealing to processor
> order inside lock/unlock pairs, and also by appealing to global
> ordering of multiple lock/unlock pairs (or stuff with volatiles
> and other corner cases which we will neglect here). In order
> to work right, a normal write has to come before a release *and*
> a matching normal read has to come after an acquire, and such
> an acquire is nothing other than the lock of a later lock/unlock
> pair, typically in another thread.
> So the rules which reliably connect normal writes to normal
> reads in other threads work differently from acquire fences and
> release fences as we (or just I?) usually think of them. Yes, there
> are “acquires” and “releases” in the JMM. No, they are not
> primitives but rather descriptions of the way the happens-before
> relation is constrained by (among other things) lock and unlock
> On x86, you don’t have to worry about acquires; you just set up
> the right release fences. In the JMM, there’s no way to get either
> an acquire or a release without a synchronized statement (or
> other fancy stuff like volatiles). That’s why double-check is
> broken in the pure JMM, and why it can be repaired if you
> add some (non-JMM) fences.
> So why is this showing up suddenly with Graal? Possibly it’s
> got a really aggressive interpretation of the JMM, relative to
> the standard HotSpot JITs. Possibly it’s got a bug. Graal might
> be allowing the initialization of ClassValueMap.cacheArray
> to float down past publication of the ClassValueMap on
> type.classValueMap (in initializeMap). Less likely, it is allowing
> the initialization to totally escape the lock/unlock pair, something
> the JMM might allow but the Cookbook advises against (in its
> “Can Reorder” table).
> Perhaps Graal is modeling the “freeze” operations on finals more
> accurately (and the Cookbook gives instructions for this). That
> would allow the ClassValueMap to correctly initialize its frozen
> finals, but not (unfortunately) the not-frozen cacheArray field.
> It appears that the new JIT on the block is exposing my neglect
> for my poor baby, in not putting the right fences around its playpen.
> Bottom line: Add a release fence before type.cVM is set, and add
> a (no-op on x86) acquire fence in getCache. Quick test: Add a dummy
> final to CVM, to see if that makes the bug let go.
> — John
More information about the core-libs-dev