Numerical Stream code
brian.goetz at oracle.com
Thu Feb 14 08:06:04 PST 2013
As with all things, the answer is ... it depends.
But, this code, where each task is reading and writing from a shared
array, is at high risk for "cache-line ping-pong". Especially because
the first thing each thread does is write to the first and last element
of its chunk. If the partitions don't perfectly line up with cache line
boundaries, now you've got two threads both wanting to write to the same
cache line. Which works, but is slow. Slower than sequential, usually.
Bottom line, you need to think about locality when writing code like
this. "Don't write to shared arrays" is simply an approximation for
"let the library think about locality for you."
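To make the advice concrete, here is a minimal sketch (class and method names are illustrative, not from the original code): the first method hand-writes results into one shared array from every worker thread, the pattern at risk of cache-line ping-pong when chunk boundaries don't align with cache lines; the second maps the same computation through the stream and lets `toArray()` assemble the result, leaving partitioning and locality to the library.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class LocalityDemo {
    static final int N = 1 << 20;

    // Risky pattern: every worker thread writes into one shared array.
    // Adjacent chunks rarely line up with cache-line boundaries, so two
    // threads can end up repeatedly writing the same line.
    static double[] sharedWrites() {
        double[] out = new double[N];
        IntStream.range(0, N).parallel().forEach(i -> out[i] = Math.sqrt(i));
        return out;
    }

    // Preferable pattern: compute values in the stream and let the
    // library produce the array, so it can manage locality itself.
    static double[] libraryAssembled() {
        return IntStream.range(0, N).parallel()
                        .mapToDouble(Math::sqrt)
                        .toArray();
    }

    public static void main(String[] args) {
        double[] a = sharedWrites();
        double[] b = libraryAssembled();
        System.out.println(Arrays.equals(a, b)); // prints "true": same values either way
    }
}
```

Both versions compute identical results; the difference is purely who decides how writes are laid out across threads.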
On 2/14/2013 10:56 AM, Peter Levart wrote:
> On 02/14/2013 03:45 PM, Brian Goetz wrote:
>>> The parallel version is almost certainly suffering false cache line
>>> sharing when adjacent tasks are writing to the shared arrays u0, etc.
>>> Nothing to do with streams, just a standard parallelism gotcha.
>> Cure: don't write to shared arrays from parallel tasks.
> I would like to discuss this a little bit (hence the cc:
> concurrency-interest - the conversation can continue on this list only).
> Is it really important to avoid writing to shared arrays from multiple
> threads (of course without synchronization, not even volatile
> writes/reads) when indexes are not shared (each thread writes/reads its
> own disjoint subset)?
> Do element sizes matter (byte vs. short vs. int vs. long)?
> I had a (false?) feeling that cache lines are not invalidated when
> writes are performed without fences.
> Also I don't know how short (byte, char) writes are combined into memory
> words on the hardware when they come from different cores and whether
> this is connected to any performance issues.
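On the element-size question, a bit of arithmetic helps (assuming a typical 64-byte cache line; the constant and class below are illustrative): the smaller the element, the more of them share a single line, so per-thread partitions of a byte[] are far more likely to straddle a line than partitions of a long[].

```java
public class CacheLineMath {
    static final int LINE = 64; // typical x86 cache line size, in bytes (assumption)

    // Number of array elements sharing one cache line for a given element
    // size in bytes (ignoring the array object header and alignment).
    static int elementsPerLine(int elementSizeBytes) {
        return LINE / elementSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(elementsPerLine(1)); // byte[]  -> 64 elements per line
        System.out.println(elementsPerLine(2)); // short[] -> 32
        System.out.println(elementsPerLine(4)); // int[]   -> 16
        System.out.println(elementsPerLine(8)); // long[]  -> 8
    }
}
```

So yes, element size matters for false sharing: with byte elements, up to 64 "disjoint" indexes still contend for the same cache line.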