RFR: 8080293: AARCH64: Remove unnecessary dmbs from generated CAS code

Andrew Dinn adinn at redhat.com
Mon Aug 24 14:31:29 UTC 2015

The following webrev against hs-comp head fixes 8080293


It is a follow on to the prior volatile object patch

  8078743: AARCH64: Extend use of stlr to cater for volatile object stores

and requires that previous patch to be applied first.


The patch is sensitive to GC configuration so it was tested against 5
relevant configs


The validity of the transformation was verified by:

  generating and eyeballing compiled code for simple test programs
  successfully running a fairly large program (netbeans)
  generating and eyeballing HashMap code compiled during a run of a
  fairly large program

The fix was performance tested on 2 implementations of the AArch64
architecture (more details below). On an out-of-order (O-O-O) CPU it
gave no noticeable benefit. On a simple pipeline CPU it gave a very
significant benefit in specific cases.


Andrew Dinn
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in UK and Wales under Company Registration No. 3798903
Directors: Michael Cunningham (USA), Matt Parson (USA), Charlie Peters
(USA), Michael O'Neill (Ireland)

The Test

As with the prior patch, I tested the original vs new code generation
strategy by running a jmh test first with -XX:+UseBarriersForVolatile
and then with -XX:-UseBarriersForVolatile. Four different test programs
were run in each of the 5 GC configs. Each test executed repeated CAS
operations on an object field in a single thread with a Blackhole
backoff between CASes varying from 0 to 64.

Test one performed a CAS guaranteed to fail; test two performed a
successful CAS from a fixed object to null and then back; test three
performed a successful CAS from a fixed object to another fixed object
and then back; test four performed a successful CAS from a fixed
object to a newly allocated object and then back. The average time per
CAS operation (ns/op) -- actually per 2 CAS operations for the latter 3
tests -- was used as a score.
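
The four CAS patterns described above can be sketched in plain Java,
without the jmh harness, roughly as follows. This is only an
illustration of what each variant stores: the class name CasPatterns,
the method names and the objects A and B are made up for this sketch
and are not taken from the actual test sources.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the four CAS variants; not the real tests.
public class CasPatterns {
    static final Object A = new Object(); // fixed object initially held by ref
    static final Object B = new Object(); // second fixed object

    // Test one: ref holds A but we expect B, so the CAS always fails.
    static boolean failingCas(AtomicReference<Object> ref) {
        return ref.compareAndSet(B, A);
    }

    // Test two: fixed object -> null -> fixed object.
    static boolean fixedToNullAndBack(AtomicReference<Object> ref) {
        return ref.compareAndSet(A, null) && ref.compareAndSet(null, A);
    }

    // Test three: fixed object -> another fixed object -> back.
    static boolean fixedToFixedAndBack(AtomicReference<Object> ref) {
        return ref.compareAndSet(A, B) && ref.compareAndSet(B, A);
    }

    // Test four: fixed object -> newly allocated object -> back.
    static boolean fixedToNewAndBack(AtomicReference<Object> ref) {
        Object fresh = new Object();
        return ref.compareAndSet(A, fresh) && ref.compareAndSet(fresh, A);
    }

    public static void main(String[] args) {
        AtomicReference<Object> ref = new AtomicReference<>(A);
        System.out.println("fail:         " + failingCas(ref));          // false
        System.out.println("fixed<->null: " + fixedToNullAndBack(ref));  // true
        System.out.println("fixed<->fixed:" + fixedToFixedAndBack(ref)); // true
        System.out.println("fixed<->new:  " + fixedToNewAndBack(ref));   // true
    }
}
```

Each of the last three methods performs 2 CAS operations per call,
which is why their scores are per pair of CASes rather than per CAS.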

The Results

On an O-O-O CPU there was no significant difference in the time taken.

On a simple pipeline CPU the optimization gave a very significant
benefit for the Fail tests on all GC configurations except CMS
+ UseCondCardMark. In all other cases there was no significant
measurable benefit.

Example Test

package org.openjdk;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

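@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
public class CasNull {

    // fixed object swapped out of and back into the reference
    Object tombstone;

    AtomicReference<Object> ref;

    // Blackhole backoff applied before each CAS
    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        tombstone = new Object();
        // start at tombstone so that the first CAS succeeds
        ref = new AtomicReference<>(tombstone);
    }

    @Benchmark
    public boolean test() {
        // CAS from the fixed object to null and then back
        Blackhole.consumeCPU(backoff);
        ref.compareAndSet(tombstone, null);
        Blackhole.consumeCPU(backoff);
        ref.compareAndSet(null, tombstone);
        return true;
    }
}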