Numerical Stream code

Brian Goetz brian.goetz at
Thu Feb 14 08:06:04 PST 2013

As with all things, the answer is ... it depends.

But this code, where each task is reading from and writing to a shared 
array, is at high risk for "cache-line ping-pong".  Especially because 
the first thing each thread does is write to the first and last element 
of its chunk.  If the partitions don't line up perfectly with cache-line 
boundaries, you've now got two threads both wanting to write to the same 
cache line.  Which works, but is slow.  Slower than sequential, usually.

Bottom line, you need to think about locality when writing code like 
this.  "Don't write to shared arrays" is simply an approximation for 
"let the library think about locality for you."
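To make that concrete, here is a rough sketch of the two patterns (class and method names are mine, not from the benchmark under discussion): parallel tasks writing into a caller-supplied shared array versus letting the stream library produce the array itself.

```java
import java.util.stream.IntStream;

public class FalseSharingSketch {

    // Pattern at issue: every parallel task writes into a slice of one
    // shared array.  Adjacent slices can straddle a cache line at their
    // boundary, so two cores end up writing the same line ("ping-pong").
    static long[] sharedTarget(int n) {
        long[] u0 = new long[n];
        IntStream.range(0, n).parallel().forEach(i -> u0[i] = (long) i * i);
        return u0;
    }

    // Remedy: don't hand the tasks a shared target; let the library
    // build the result (toArray), so it can think about locality for you.
    static long[] libraryManaged(int n) {
        return IntStream.range(0, n).parallel()
                        .mapToLong(i -> (long) i * i)
                        .toArray();
    }

    public static void main(String[] args) {
        long[] a = sharedTarget(1_000);
        long[] b = libraryManaged(1_000);
        System.out.println(java.util.Arrays.equals(a, b)); // prints "true"
    }
}
```

Both produce the same result; the difference is purely in who owns the memory each worker writes to while the computation runs.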

On 2/14/2013 10:56 AM, Peter Levart wrote:
> On 02/14/2013 03:45 PM, Brian Goetz wrote:
>>> The parallel version is almost certainly suffering false cache line
>>> sharing when adjacent tasks are writing to the shared arrays u0, etc.
>>> Nothing to do with streams, just a standard parallelism gotcha.
>> Cure: don't write to shared arrays from parallel tasks.
> Hi,
> I would like to discuss this a little bit (hence the cc:
> concurrency-interest - the conversation can continue on this list only).
> Is it really important to avoid writing to shared arrays from multiple
> threads (of course without synchronization, not even volatile
> writes/reads) when the indexes are not shared (each thread writes/reads
> its own disjoint subset)?
> Do element sizes matter (byte vs. short vs. int  vs. long)?
> I had a (false?) feeling that cache lines are not invalidated when
> writes are performed without fences.
> Also, I don't know how narrow (byte, char) writes are combined into
> memory words by the hardware when they come from different cores, and
> whether this is connected to any performance issues.
> Thanks,
> Peter
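On the correctness half of Peter's question, a small sketch (names are mine): under the Java Memory Model, distinct array elements are distinct variables, so plain unsynchronized writes to disjoint indexes are not a data race, and Thread.join() makes them visible to the joining thread.  The performance question is separate: with byte-sized elements, many indexes share each cache line, so this can be safe yet slow.

```java
public class DisjointWrites {

    // Two threads fill disjoint halves of one byte[] with plain
    // (non-volatile) writes.  Per the JMM, elements of an array are
    // distinct variables, so these disjoint writes are not a data race.
    static byte[] fill() throws InterruptedException {
        byte[] buf = new byte[1024];
        Thread t1 = new Thread(() -> { for (int i = 0; i < 512; i++) buf[i] = 1; });
        Thread t2 = new Thread(() -> { for (int i = 512; i < 1024; i++) buf[i] = 2; });
        t1.start(); t2.start();
        t1.join(); t2.join();  // join() gives happens-before: caller sees all writes
        return buf;
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] buf = fill();
        System.out.println(buf[0] + " " + buf[1023]); // prints "1 2"
        // Correct -- but byte-granularity writes from two cores still share
        // cache lines near the boundary, which is exactly the slow case above.
    }
}
```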

More information about the lambda-dev mailing list