VarHandles & LMAX Disruptor

Vitaly Davidovich vitalyd at gmail.com
Mon Aug 3 02:52:58 UTC 2015


I think any approach that requires reading the length of an array, whether
user or compiler inserted, is going to run afoul of possible false sharing;
only approach #3 would work since the mask field could be isolated to its
own cacheline.

sent from my phone
On Aug 2, 2015 9:54 PM, "Michael Barker" <mikeb01 at gmail.com> wrote:

> Hi,
>
> Having played with it a bit more, I can get the bound check removed under
> one specific arrangement of the code, the following is for the
> RingBufferFields.elementAt:
>
> 1)  return (E) entries[((int) sequence) & (entries.length - 1)];
>
> This works and I get the pattern mentioned Paul's example.  There is a
> drawback to implementing it this way, details below.
>
> 2)  return (E) entries[(int) sequence & (entries.length - 1)];
>
> Note the position of the brackets.  In this case entries.length is being
> promoted to a long before applying the mask, then being cast down to an
> int, which seems to trip up the compiler.
>
> 3)  return (E) entries[((int) sequence) & indexMask];  // indexMask is a
> final field of value entries.length - 1
>
> I suspect this is because the compiler can't assume that final fields are
> actually final.
>
> 4)  return (E) entries[BUFFER_PAD + (((int) sequence) & (entries.length -
> ((BUFFER_PAD * 2) + 1)))];
>
> I'm guessing it gets just too complex for the compiler to figure our that
> the resulting value will be less that entries.length.
>
> One of the niggles with approach #1, is that I need to reference
> entries.length at the call site and can't use a separate field to store the
> value.  This can cause issue in a concurrent ring-buffer-like structures as
> the array length can end up sharing a cache line with the array itself.  So
> if a producer thread writes into one of the first 15 array slots (worst
> case), the whole line including the part containing the length field will
> be invalid on the consumer's core resulting in an additional cache miss in
> certain cases.  This is probably less of an issue for the default set up of
> the disruptor as we don't actually replace the references inside of the
> entries array, but a number of fast queue implementations would suffer.
> The developer is caught having to make a trade-off between reducing false
> sharing and removing bounds checking.
>
> Mike.
>
>
> On 1 August 2015 at 00:18, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> >
> > On 31 Jul 2015, at 07:43, Michael Barker <mikeb01 at gmail.com> wrote:
> >
> > > Hi Paul,
> > >
> > > I've had a look at the patch that you mentioned and AFAICT it doesn't
> > seem to be able to eliminate the bounds check, even if I remove the array
> > padding.  Just to confirm my analysis the assembler around the aaload
> > instruction looks like:
> > >
> > > 0x00007f627d6b3aea: cmp    %ebx,%edx
> > > 0x00007f627d6b3aec: jae    0x00007f627d6b3e0c
> > > 0x00007f627d6b3af2: mov    %r8,%r13
> > > 0x00007f627d6b3af5: mov    %r11d,%r8d
> > > 0x00007f627d6b3af8: mov    0x10(%rax,%rdx,4),%r11d  ;*aaload
> > >
> > > I'm guessing that the cmp/jae pair is the bounds check?
> > >
> >
> > Yes, i am assuming that is generated code of RingBuffer.elementAt
> (aaload)
> > and not MultiProducerSequencer. I suspect the padding is throwing the
> > compiler off the scent (it does not know that indexMask is less than
> > entries.length - BUFFER_INDEX).
> >
> > For an array access such as:
> >
> >   int j = i & (a.length - 1);
> >   X x = a[j];
> >
> > I would expect to observe something like:
> >
> >   test   %edi,%edi
> >   jbe    …
> >
> > The bound check should get strength reduced to checking if the array
> > length is zero, and i would expect such a test to be hoisted out of any
> > loop (and get folded into another dominating array length check, if any).
> >
> > Here is an example, given this benchmark method:
> >
> > @Benchmark
> > public int relaxed_r_aa() {
> >     Value[] _receiver = receiver;
> >     int sum = 0;
> >     for (int i = start; i < end; i++) {
> >         int j = i & (_receiver.length - 1);
> >         sum += _receiver[j].i;
> >     }
> >     return sum;
> > }
> >
> > Without the patch the following code is generated (when loop unrolling is
> > switched off):
> >
> > : mov    0xc(%r12,%rbx,8),%r9d  ;*arraylength
> > : mov    %r9d,%r11d
> > : dec    %r11d
> > : lea    (%r12,%rbx,8),%rcx
> > : xor    %eax,%eax          ;*iload_3
> >
> > <loop>: mov    %r11d,%ebp
> > : and    %r10d,%ebp         ;*iand
> > : cmp    %r9d,%ebp
> > : jae    <bounds-fail>
> > : mov    0x10(%rcx,%rbp,4),%edi  ;*aaload
> > : add    0xc(%r12,%rdi,8),%eax  ;*iadd
> > : inc    %r10d              ;*iinc
> > : cmp    %r8d,%r10d
> > : jl     <loop>  ;*if_icmpge
> >
> >
> > When the patch is applied the following code is generated:
> >
> > : mov    0xc(%r12,%rbp,8),%r9d  ;*arraylength
> > : test   %r9d,%r9d
> > : jbe    <bounds-fail>
> > : dec    %r9d
> > : lea    (%r12,%rbp,8),%r8
> > : data32 nopw 0x0(%rax,%rax,1)
> > : data32 data32 xchg %ax,%ax  ;*iload_3
> >
> > <loop>: mov    %r9d,%ebx
> > : and    %r11d,%ebx
> > : mov    0x10(%r8,%rbx,4),%ebx  ;*aaload
> > : add    0xc(%r12,%rbx,8),%eax  ;*iadd
> > : inc    %r11d              ;*iinc
> > : cmp    %r10d,%r11d
> > : jl     <loop>  ;*if_icmpge
> >
> >
> >
> > > A quick note on the patch, it merged, but didn't compile.  There seemed
> > to be a signature change on the signature of ConINode::make, so I changed
> > line 1311 in subnode.cpp from 'return ConINode::make(phase->C, 1);' to
> > 'return ConINode::make(1);'.  That let it compile, but I don't really
> > understand the code so I'm not sure if it is semantically correct.
> > >
> >
> > Oops sorry about that, i uploaded a new revision of the patch the same
> fix
> > that you applied.
> >
> > Paul.
> >
> >
>


More information about the valhalla-dev mailing list