determining when to offload to the gpu

Jani Väinölä jani.vainola at
Fri Sep 19 08:22:21 UTC 2014

(I am new here, so please bear with me. I just couldn't stay silent.)

About part b): as a Java developer, I would love it if there were some kind
of expert API that gives me control of this feature. I would like to be able
to control where my code is executed (if that is possible).

For instance, say that I write a custom collection class. I want to be able
to write a member function that performs a few operations on the data in a
data-parallel manner, and therefore always (if possible) runs those parts on
the GPU. That is, I want to make the decisions inside my collection but hide
them from the users. In my opinion, they should only call my collection and
not have any clue about where or how the execution is done.

I guess I could slice up the code into a few private functions that use the
parallel functionality on the data, and call those from the public function,
but it would be very neat to have a good API for this instead.
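
To make the workaround above concrete, here is a minimal sketch of what I
mean. The class name ScaledVector and its methods are made up for
illustration. It uses only the standard Stream API, so on a stock JDK the
private steps run on CPU threads; the assumption is that a Sumatra-enabled
JDK could choose to offload the same parallel().forEach calls.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical collection; the name and methods are illustrative only. */
final class ScaledVector {
    private final double[] data;

    ScaledVector(double[] values) {
        // Defensive copy so callers cannot mutate the backing array.
        this.data = Arrays.copyOf(values, values.length);
    }

    /** Public entry point: callers never see where or how the work runs. */
    void scaleAndShift(double factor, double offset) {
        scaleInParallel(factor);
        shiftInParallel(offset);
    }

    // Private data-parallel step. Each index is written by exactly one
    // workitem, so there is no shared mutable state between iterations.
    private void scaleInParallel(double factor) {
        IntStream.range(0, data.length).parallel().forEach(i -> data[i] *= factor);
    }

    // Second private data-parallel step, kept separate so that each one
    // is a single candidate lambda for offload.
    private void shiftInParallel(double offset) {
        IntStream.range(0, data.length).parallel().forEach(i -> data[i] += offset);
    }

    double get(int index) {
        return data[index];
    }

    int size() {
        return data.length;
    }
}
```

A caller would only write something like
new ScaledVector(values).scaleAndShift(2.0, 1.0) and never learn whether the
work ran on the CPU or the GPU. What this sketch cannot do is force the GPU
path, which is exactly the control I am asking for.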


2014-09-18 17:04 GMT+02:00 Deneau, Tom <tom.deneau at>:

> Ryan --
>
> So I believe you are saying:
>
>    a) Given a lambda marked parallel to execute across a range, the
>       decision of where to run it does not have to be an all-CPU or
>       all-GPU decision.  It may be possible to subdivide the problem
>       and run part of it on the GPU and part on the CPU.  This
>       subdividing could be part of the framework.
>    b) There should be an API that allows the expert user to break up
>       the problem and control which parts run on the CPU and GPU.
>
> I think solving part a) in the JVM or JDK is an interesting (and
> difficult) problem for the future but may be beyond the scope of the
> current Sumatra.  I will definitely open an issue on this once we get
> the Sumatra project in place on
>
> Meanwhile, for now I think we will limit the automatic decision of
> where to run to all-GPU or all-CPU.  I think there is a middle ground
> of problems that either may or may not gain thru offloading (for
> example depending on GPU or CPU hardware capabilities) and where the
> programmer wants to leave that decision up to the framework.
>
> I will also enter an issue for Part b).  I agree this is something
> that an expert user might want.
>
> -- Tom
> -------------------------------------------------
> -----Original Message-----
> From: LaMothe, Ryan R [mailto:Ryan.LaMothe at]
> Sent: Tuesday, September 09, 2014 7:03 PM
> To: Deneau, Tom; sumatra-dev at; graal-dev at
> Subject: Re: determining when to offload to the gpu
> Hi Tom,
>
> I thought this may be a good point to jump in and make a quick comment on
> some thoughts.
>
> A question: At what level is it better to encapsulate this in the JVM and
> at what level is this better left to the user/utility functions?
>
> For example, in the Aparapi project there is an example project named
> correlation-matrix that gives a pretty good idea about what it takes to
> realistically decide in code whether to run a specific matrix computation
> on CPU or GPU and how to split up the work. This is a very basic example
> and is only a sample of the real code base from which it was derived, but
> should help highlight the issue.
>
> Instead of the JVM trying to figure out how to decompose the lambda
> functions optimally and offload to HSA automatically for all possible
> cases, might it be better to take the following approach:
>
> - Implement the base functionality in the JVM for HSA offload and then
> search the entire JDK for places where offloading may be obvious or easily
> achieved (e.g. Matrix Math, etc.)? Maybe this even means implementing new
> base classes for specific packages that are HSA-enabled.
>
> - For non-obvious cases, allow the developer to somehow indicate in the
> lambda that they want the execution to occur via HSA/offload, if possible,
> and provide some form of annotations or other functionality to give the JVM
> hints about how they would like it done?
>
> Maybe that seems like steps backwards, but thought it was worth mentioning.
>
> -Ryan
>
> On 9/9/14, 3:02 PM, "Deneau, Tom" <tom.deneau at> wrote:
> >The following is an issue we need to resolve in Sumatra.  We intend to
> >file this in the openjdk bugs system once we get the Sumatra project
> >set up as a project there.  Meanwhile, comments are welcome.
> >
> >
> >In the current prototype, a config flag enables offload and if a Stream
> >API parallel().forEach call is encountered which meets the other
> >criteria for being offloaded, then on its first invocation it is
> >compiled for the HSA target and executed.  The compilation happens
> >once, the compiled kernel is saved and can be reused on subsequent
> >invocations of the same lambda.  (Note: if for any reason the lambda
> >cannot be compiled for an HSA target, offload is disabled for this
> >lambda and the usual CPU parallel path is used).  The logic for
> >deciding whether to offload or not is all in the special
> >Sumatra-modified JDK classes in java/util/stream.
> >
> >The above logic could be improved:
> >
> >   a) instead of being offloaded on the first invocation, the lambda
> >      should first be executed thru the interpreter so that profiling
> >      information is gathered which could then be useful in the
> >      eventual HSAIL compilation step.
> >
> >   b) instead of being offloaded unconditionally, it would be good if
> >      the lambda would be offloaded only if the offload is determined
> >      profitable when compared to running parallel on the CPU.  We
> >      assume that in general it is not possible to predict the
> >      profitability of GPU offload statically and that measurement
> >      will be necessary.
> >
> >So how to meet the above needs?  Our current thought is that, at the
> >JDK level where we decide whether to offload, a particular parallel
> >lambda invocation would go thru a number of stages:
> >
> >   * Interpreted (to gather profiling information)
> >   * Compiled and executed on Parallel CPU and timed
> >   * Compiled and executed on Parallel GPU and timed
> >
> >And then at that point make some decision about which way is faster and
> >use that going forward.
> >
> >Do people think making these measurements back at the JDK API level is
> >the right place? (It seems to fit there since that is where we decide
> >whether or not to offload)
> >
> >Some concerns
> >-------------
> >This comparison works well if the work per stream call is similar for
> >all invocations.  However, even the range may not be the same from
> >invocation to invocation.  We should try to compare parCPU and parGPU
> >runs with the same range.  If we can't find runs with the same range,
> >we could derive a time per workitem measurement and compare those.
> >However, time per workitem for a small range may be quite different from
> >time per workitem for a large range, so these would be difficult to compare.
> >Even then the work per run may be different (might take different paths
> >thru the lambda).
> >
> >How to detect that we are in the "Compiled" stage for the Parallel CPU
> >runs?  I guess knowing the range of each forEach call we should be able
> >to estimate this, or just see a reduction in the runtime.
> >
> >-- Tom Deneau
> >
> >
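
The measurement step in Tom's original message (time a parallel CPU run,
time a parallel GPU run, then pick the faster one, falling back to time per
workitem when the ranges differ) could look roughly like the sketch below.
OffloadChooser and everything in it are hypothetical names, not part of
Sumatra or the JDK, and the two paths are passed in as plain callbacks
because a stock JDK has no GPU path to call.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper; not part of Sumatra or the JDK. */
final class OffloadChooser {
    enum Target { CPU, GPU }

    /** Runs one path over the range [0, range) and returns elapsed nanoseconds. */
    static long timeRun(IntConsumer path, int range) {
        long start = System.nanoTime();
        path.accept(range);
        return System.nanoTime() - start;
    }

    /** Time per workitem, used when the two runs did not share a range. */
    static double nanosPerWorkitem(long elapsedNanos, int range) {
        if (range <= 0) {
            throw new IllegalArgumentException("range must be positive");
        }
        return (double) elapsedNanos / range;
    }

    /** Picks the target with the lower time per workitem; ties stay on the CPU. */
    static Target choose(long cpuNanos, int cpuRange, long gpuNanos, int gpuRange) {
        double cpu = nanosPerWorkitem(cpuNanos, cpuRange);
        double gpu = nanosPerWorkitem(gpuNanos, gpuRange);
        return gpu < cpu ? Target.GPU : Target.CPU;
    }
}
```

This only covers the comparison itself. It does not address the concerns
raised in the message: a single timed run is noisy, the per-workitem figure
for a small range may not carry over to a large one, and the CPU run may
have been measured before the lambda reached the compiled stage.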
