[concurrency-interest] Numerical Stream code
howard.lovatt at gmail.com
Fri Feb 15 12:26:09 PST 2013
Thanks for all the replies. This is largely a holding email: I am travelling for work and don't have my laptop. When I get home I will post some more code.
@Jin: I did warm up the code, but I do agree that benchmarks are tricky. As I said, I was expecting some overhead, but I was surprised by how much.
@Brian: I factored t0 and tg0 out into methods because they are common to the serial and parallel versions and I thought the code read better. I don't think it makes any difference, but I will check.
@Others: To avoid writing over an old array, I will have to allocate a new one each time round the t loop. I will give this a try and see if it helps. The discussion about the parallel problems is interesting, but how come the serial version is so slow? Could a problem with the Stream code in general also be the underlying cause of the slow parallel version?
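A minimal sketch of the allocate-per-step idea, assuming a 1-D stencil update. The original t0/tg0 code is not shown in this thread, so the hypothetical `step` method below simply averages each interior point with its two neighbours. Building the next time step with `IntStream.range(...).mapToDouble(...).toArray()` gives every step a freshly allocated array, so no parallel task writes into an array that another task is reading or writing:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class FreshArrayPerStep {
    // Hypothetical explicit update: each interior point becomes the average of
    // itself and its two neighbours; the two boundary points are held fixed.
    static double[] step(double[] u, boolean parallel) {
        int n = u.length;
        IntStream idx = IntStream.range(0, n);
        if (parallel) {
            idx = idx.parallel();
        }
        // toArray() allocates a new result array, so no task writes into u
        // or into any array shared with other tasks.
        return idx.mapToDouble(i -> (i == 0 || i == n - 1)
                        ? u[i]
                        : (u[i - 1] + u[i] + u[i + 1]) / 3.0)
                  .toArray();
    }

    public static void main(String[] args) {
        double[] serial = new double[1000];
        serial[500] = 1.0;
        double[] par = serial.clone();
        for (int t = 0; t < 100; t++) {   // the "t loop": one new array per step
            serial = step(serial, false);
            par = step(par, true);
        }
        System.out.println(Arrays.equals(serial, par));
    }
}
```

Because each element is computed by the same expression in both modes, the serial and parallel results should match exactly; the extra allocation per step is the cost that would need to be measured against any false-sharing savings.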
On 15/02/2013, at 3:48 AM, Stanimir Simeonoff <stanimir at riflexo.com> wrote:
>> > Do element sizes matter (byte vs. short vs. int vs. long)?
>> I don't think so. All of this assumes that the proper instruction is used. For example, if 2 threads are writing to adjacent bytes, then the "mov" instruction has to write only the byte. If the compiler decides to read 32 bits, mask in the 8 bits and write 32 bits, then the data will be corrupted.
> The JLS mandates no corruption for neighbor writes (JLS §17.6, "Word Tearing").
>> I believe that HotSpot will only generate the write byte mov instruction.
> That would be the correct one. The case affects only boolean/byte/short/char, as simple primitive fields are always at least 32 bits.
>> Nathan Reynolds | Architect | 602.333.9091
>> Oracle PSR Engineering | Server Technology
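A small check of the no-word-tearing guarantee discussed above (the class and method names are hypothetical). Two threads write to interleaved, adjacent elements of one `byte[]`; if a byte store were compiled as a wider read-modify-write, one thread could write a stale value back over the other's byte. Under JLS §17.6 this must not happen, so the final array should hold 1 at every even index and 2 at every odd index. A passing run cannot prove the guarantee, but a failure would expose a violation. Note that this interleaved layout is also the worst case for false sharing: it is correct, just slow.

```java
public class AdjacentByteWrites {
    // Two threads write to interleaved (adjacent) elements of one byte[].
    // JLS 17.6 forbids word tearing, so neither thread may clobber the other's bytes.
    static byte[] run(int length, int rounds) {
        byte[] data = new byte[length];
        Thread even = new Thread(() -> {
            for (int r = 0; r < rounds; r++)
                for (int i = 0; i < length; i += 2) data[i] = 1;   // even indices only
        });
        Thread odd = new Thread(() -> {
            for (int r = 0; r < rounds; r++)
                for (int i = 1; i < length; i += 2) data[i] = 2;   // odd indices only
        });
        even.start();
        odd.start();
        joinAll(even, odd);   // join() also makes both threads' writes visible here
        return data;
    }

    static void joinAll(Thread... threads) {
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        byte[] d = run(1024, 10_000);
        boolean ok = true;
        for (int i = 0; i < d.length; i++) ok &= d[i] == (i % 2 == 0 ? 1 : 2);
        System.out.println(ok ? "no torn writes" : "corruption detected");
    }
}
```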
>> On 2/14/2013 8:56 AM, Peter Levart wrote:
>>> On 02/14/2013 03:45 PM, Brian Goetz wrote:
>>>>> The parallel version is almost certainly suffering false cache line
>>>>> sharing when adjacent tasks are writing to the shared arrays u0, etc.
>>>>> Nothing to do with streams, just a standard parallelism gotcha.
>>>> Cure: don't write to shared arrays from parallel tasks.
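To illustrate the "don't write to shared arrays" advice, here is a hedged sketch with hypothetical names. `sumSharedSlots` has each parallel task accumulate into its own slot of a shared `long[]`; the slots are disjoint, so the result is correct, but adjacent slots usually sit on the same cache line. `sumLocal` keeps each task's running total in a local variable and lets the stream combine the per-task results with `sum()`. Whether the first version is measurably slower depends on the JIT (it may keep `partial[t]` in a register) and on the hardware, so it needs to be timed rather than assumed.

```java
import java.util.stream.IntStream;

public class NoSharedWrites {
    // Pattern to avoid: parallel tasks accumulate into neighbouring slots of one
    // shared array, so their slots usually share a cache line (false sharing).
    static long sumSharedSlots(int[] values, int tasks) {
        long[] partial = new long[tasks];                    // adjacent slots, one per task
        int chunk = (values.length + tasks - 1) / tasks;
        IntStream.range(0, tasks).parallel().forEach(t -> {
            int from = t * chunk, to = Math.min(values.length, from + chunk);
            for (int i = from; i < to; i++) partial[t] += values[i];   // shared-array write per element
        });
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    // Cure: each task accumulates in a local variable and returns its result;
    // the stream framework combines the per-task results.
    static long sumLocal(int[] values, int tasks) {
        int chunk = (values.length + tasks - 1) / tasks;
        return IntStream.range(0, tasks).parallel().mapToLong(t -> {
            int from = t * chunk, to = Math.min(values.length, from + chunk);
            long acc = 0;                                    // task-local, never shared
            for (int i = from; i < to; i++) acc += values[i];
            return acc;
        }).sum();
    }

    public static void main(String[] args) {
        int[] values = IntStream.rangeClosed(1, 1_000_000).toArray();
        System.out.println(sumSharedSlots(values, 8) + " " + sumLocal(values, 8));
    }
}
```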
>>> I would like to discuss this a little (hence the cc: concurrency-interest; the conversation can continue on this list only).
>>> Is it really important to avoid writing to shared arrays from multiple threads (without any synchronization, of course, not even volatile writes/reads) when indexes are not shared (each thread writes/reads its own disjoint subset)?
>>> Do element sizes matter (byte vs. short vs. int vs. long)?
>>> I had a (false?) feeling that cache lines are not invalidated when writes are performed without fences.
>>> Also I don't know how short (byte, char) writes are combined into memory words on the hardware when they come from different cores and whether this is connected to any performance issues.
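On the question of fences: on mainstream multicore CPUs, cache coherence operates on whole cache lines whether or not fences are used. A plain store still needs exclusive ownership of the line, which invalidates the copies held by other cores; fences only constrain the ordering of memory operations. When per-thread slots in a shared array cannot be avoided, a common workaround is to space them a cache line or more apart. The sketch below assumes 64-byte lines and uses a hypothetical `PAD` of 16 longs (128 bytes); since the JVM does not guarantee that array data starts on a line boundary, padding reduces rather than strictly eliminates sharing.

```java
import java.util.Arrays;

public class PaddedCounters {
    // Assumption: 64-byte cache lines. Spacing slots 16 longs (128 bytes) apart
    // also keeps them off the neighbouring line that some CPUs prefetch in pairs.
    static final int PAD = 16;

    static long[] count(int threads, int incrementsPerThread) {
        long[] slots = new long[threads * PAD];              // thread t owns slots[t * PAD]
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * PAD;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) slots[base]++;   // only this thread touches it
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                                    // join() makes the writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        long[] result = new long[threads];
        for (int t = 0; t < threads; t++) result[t] = slots[t * PAD];
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(count(4, 1_000_000)));
    }
}
```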
>>> Concurrency-interest mailing list
>>> Concurrency-interest at cs.oswego.edu
More information about the lambda-dev mailing list