accumulate locally and mutatively but combine concurrently (a use case)

John Rose john.r.rose at
Fri May 10 14:19:11 PDT 2013

On May 10, 2013, at 9:05 AM, Brian Goetz <brian.goetz at> wrote:

> ...The basic idea is that you only want to do the finishing step at the very last minute.  Let's say your mutable representation is an "unboxed" form (e.g., a mutable counter) and your final representation is a boxed form.  There's no value in boxing the leaf results; this is all wasted effort.  Better to combine the counters as we go up the tree, and box at the end.

Yes.  That's an example where the collector size is O(1) or not much bigger.  For such problems it seems you just need one type R, even if there are two phases.
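A minimal sketch of that O(1) counter case, using java.util.stream.Collector (names like `Counter` and `summing` are mine, not from the API): one mutable type serves both phases, and boxing happens exactly once, in the finisher at the root.

```java
import java.util.stream.Collector;
import java.util.stream.IntStream;

public class CounterCollector {
    static class Counter { long value; }           // "unboxed" mutable form

    static Collector<Integer, Counter, Long> summing() {
        return Collector.of(
            Counter::new,                          // supplier: fresh leaf state
            (c, i) -> c.value += i,                // accumulator: mutate locally
            (l, r) -> { l.value += r.value; return l; },  // combiner: merge counters up the tree
            c -> c.value                           // finisher: box exactly once at the root
        );
    }

    public static void main(String[] args) {
        long sum = IntStream.rangeClosed(1, 100).boxed()
                            .parallel()
                            .collect(summing());
        System.out.println(sum);                   // prints 5050
    }
}
```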

Things get harder to manage when you are combining O(N) data.  (Examples: flat filtering results, frequency data (histograms, n-grams), or scans (parallel prefix ops), segmented reduce (group-by).)  Bulk communication (i.e., memory effects greater than cache sizes) shows up as a cost to manage.  That's what leads me to think about memory fences and copy-less algorithms and freezing.
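For the group-by flavor, a tiny word-frequency histogram via Collectors.groupingBy shows where the O(N) cost lives: each leaf builds its own map, and combining leaves means merging maps, whose cost grows with the number of distinct keys rather than staying O(1).

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Histogram {
    public static void main(String[] args) {
        // Each parallel leaf accumulates into its own HashMap; the
        // framework merges those per-leaf maps pairwise up the tree.
        Map<String, Long> freq = Stream.of("a", "b", "a", "c", "a", "b")
            .parallel()
            .collect(Collectors.groupingBy(Function.identity(),
                                           Collectors.counting()));
        System.out.println(freq.get("a"));   // prints 3
    }
}
```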

> And here, I think, is where our assumptions might part ways.  I'm thinking "fork join", and you're thinking "gpu".  The "combine the counters" step is trivially easy to do in FJ, but involves much more complexity when mapping to a GPU substrate.

In the GPU counter-combining scenario, I think the GPU dispatch would end with O(P) counters (P = number of processors), which would then be rolled up by a CPU or two.  The communication costs are light: O(P).

> The Collector API offers a choice here -- combine the containers mutably (dump the right into the left) or create a new combined container that might just be a shallow copy.  But that choice belongs to the Collector writer (do I do leftList.addAll(rightList), or create a new view of concat(left, right)), not the framework implementor.

Yes, I think I like that feature.  ("How do you implement overloaded op+= when you don't know if your accumulators are mutative or functional?"  Ans: "Compile a = a.'op+='(b), and let a choose what to do.")
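That dispatch idiom can be sketched in plain Java (`Acc` and `plusEquals` are hypothetical names standing in for 'op+='): the framework always reassigns the result of the combine, so each accumulator type is free to mutate in place and return itself, or to build a fresh value.

```java
interface Acc {
    Acc plusEquals(Acc other);   // hypothetical name for 'op+='
    long value();
}

final class MutativeAcc implements Acc {
    long v;
    MutativeAcc(long v) { this.v = v; }
    public Acc plusEquals(Acc other) { v += other.value(); return this; }  // mutate, return self
    public long value() { return v; }
}

final class FunctionalAcc implements Acc {
    final long v;
    FunctionalAcc(long v) { this.v = v; }
    public Acc plusEquals(Acc other) { return new FunctionalAcc(v + other.value()); }  // fresh value
    public long value() { return v; }
}

public class OpPlusEquals {
    public static void main(String[] args) {
        Acc a = new MutativeAcc(1);
        a = a.plusEquals(new MutativeAcc(2));      // "compile a = a.'op+='(b)" -- always reassign
        Acc f = new FunctionalAcc(1);
        f = f.plusEquals(new FunctionalAcc(2));
        System.out.println(a.value() + " " + f.value());  // prints 3 3
    }
}
```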

>> 3c. (For some mutable aggregates, the freezing step could also include operations like right-sizing or balancing.)
>> 3d. The freezing operation includes a suitable memory fence operation to settle all side effects to the aggregate.
> Basically, Collector is more "stripped down" than this; it's a straightforward extension of fold to mutable builders.  We did look into adding additional hook points, but it's not clear what the right ones are.  The one everyone wants is a transform that is applied to the root result (StringBuilder::toString), which is obvious and desirable but painful with cascaded reductions like group-by, and in the end seemed to be more complexity than it was worth.
> But, a per-leaf transform which would handle (3), and whose default was a no-op, might be perfectly tractable.  Again, though, the choices here accrue to the Collector writer, not the reduction implementor.

I think I understand that.  The border between those two fellows is subtle.
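The StringBuilder::toString case mentioned above can be written today as an explicit Collector, with the transform applied only once at the root (a sketch; `joining` here is my own stripped-down stand-in for Collectors.joining):

```java
import java.util.stream.Collector;
import java.util.stream.Stream;

public class JoiningSketch {
    static Collector<CharSequence, StringBuilder, String> joining() {
        return Collector.of(
            StringBuilder::new,                    // fresh builder per leaf
            StringBuilder::append,                 // accumulate mutatively
            (l, r) -> { l.append(r); return l; },  // combine: dump right into left
            StringBuilder::toString                // root-only transform
        );
    }

    public static void main(String[] args) {
        String s = Stream.of("a", "b", "c").parallel().collect(joining());
        System.out.println(s);                     // prints abc
    }
}
```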

>> 4. The guy(s) that passed out stream chunks waits for the resulting aggregate and combines them, using O(log N) pointer pasting operations.
> This choice is distributed among log n "combiner" invocations.
>> It seems to me that step 2 is well represented by a STRICTLY_MUTATIVE but not CONCURRENT Collector.
> The concurrent case is never going to map well to a GPU; this is the case where we have a data structure for which concurrent inserts are cheaper than merging (which is true for many map operations, which again are not going to map well to GPUs).

There is a literature on segmented or nested data parallelism which discusses how maps can play a surprisingly deep role in GPU-ish computations.

>> Do we need a collector mode which says, "I accumulate locally and mutatively but I combine concurrently"?
> When you say "combine concurrently", do you really mean "combine functionally"?

Yes, I mean "functionally" in the sense you and I both mean here.  I said "combine concurrently" because I was striving to use terms already defined or used in the Collector API spec.

But, the term "CONCURRENT" (as a mode bit, as defined in Collector) doesn't seem to fit into my use case at all, so I see how that word hurts the clarity of my question.

The use case could be called "accumulate locally and mutatively but combine functionally and non-mutatively".
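A sketch of that mode (names are mine): each leaf mutates its own private ArrayList, but the combiner never touches either input; it builds a fresh list, so the combined inputs can be treated as frozen (and safely published) after the merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class FreezeAndConcat {
    static Collector<String, List<String>, List<String>> collecting() {
        return Collector.of(
            ArrayList::new,
            List::add,                              // local, mutative accumulation
            (l, r) -> {                             // functional, non-mutative combine
                List<String> merged = new ArrayList<>(l.size() + r.size());
                merged.addAll(l);
                merged.addAll(r);
                return merged;                      // neither input is mutated
            }
        );
    }

    public static void main(String[] args) {
        List<String> out = Stream.of("x", "y", "z").parallel().collect(collecting());
        System.out.println(out);                    // prints [x, y, z]
    }
}
```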

— John

More information about the lambda-dev mailing list