JNI-performance - Is it really that fast?
Dave.Dice at Sun.COM
Wed Mar 26 14:02:35 PDT 2008
> Well, ok, a stop-the-world thing "just" to revoke the bias would be
> really expensive ... not sure if it would be a win even under optimal
> conditions for this use.
It speaks to how (relatively) expensive those atomics are that biased
locking is profitable even though we might have to occasionally
perform revocation via stop-the-world safepoints.
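For context, this is the kind of uncontended-locking pattern biased locking targets: one thread repeatedly acquiring the same monitor. The class and names below are purely illustrative, not from the thread; the point is that with biasing the JVM can skip the CAS on each reacquisition, at the cost of a possible safepoint if the bias must later be revoked.

```java
// Illustrative only: repeated uncontended monitor acquisition by a single
// thread. Biased locking lets the JVM bias the monitor toward this thread
// and avoid an atomic on every reacquisition; revoking that bias (if another
// thread ever contends) is what may require a stop-the-world safepoint.
public class UncontendedLock {
    private int count;

    public synchronized void increment() { // same thread locks every time
        count++;
    }

    public synchronized int get() {
        return count;
    }

    public static void main(String[] args) {
        UncontendedLock c = new UncontendedLock();
        for (int i = 0; i < 1_000_000; i++) {
            c.increment();
        }
        System.out.println(c.get()); // prints 1000000
    }
}
```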
> Well, that's an argument I hadn't really thought of, and you're right, of course.
> Most likely the situations where this could be a win are seldom, and a
> lot of work would be needed to implement and maintain it.
> But don't you think the situation will only get worse the more cores
> the current "design" is stretched to?
> Let's just hope the situation will improve :)
The atomics proper shouldn't be any more expensive on a 256-way than
on a 2-way. For the most part they're accomplished locally in the
cache. (There was a time when atomics "locked the bus" and impeded
scalability, but that hasn't been true for many years). Conceptually
a compare-and-swap (CAS) or other atomic shouldn't have any more
impact on the system than a store. In fact there has historically been
a rough relationship between pipeline depth and CAS latency, as most
CPUs implemented CAS by draining the pipeline, letting the store buffer
empty, and otherwise quiescing the processor, as well as stalling
out-of-order execution. Those are CPU-local effects. Furthermore
they're largely an implementation artifact as until recently processor
designers didn't pay too much attention to atomic latency. Thankfully
that appears to be changing and we're seeing more efficient atomics.
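For reference, a plain CAS in Java looks like this; the latency being discussed above is what one of these operations costs relative to an ordinary store. The class name is illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    public static void main(String[] args) {
        AtomicLong value = new AtomicLong(0);

        // A CAS succeeds only if the current value matches the expected one.
        // On modern CPUs this is resolved locally in the cache line rather
        // than by "locking the bus".
        boolean first = value.compareAndSet(0, 1);   // succeeds: 0 -> 1
        boolean second = value.compareAndSet(0, 2);  // fails: value is now 1

        System.out.println(first + " " + second + " " + value.get());
        // prints: true false 1
    }
}
```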
> Thanks a lot for listening and explaining everything that detailed,
> lg Clemens
> I did some micro-benchmarking on my machine again for a JNI-
> 180ms - jni per call, no locking
> 240ms - command-buffer(32k), locked JNI call every 1600 calls,
> native-side buffer interpreter (a switch statement)
> 629ms - jni per call, locked
> locking was done with a ReentrantLock.
> So the command-buffering and interpreting seems to pay off, although
> on my machine it's still slower than an unlocked JNI call.
If I'm interpreting the data correctly, that doesn't seem too
surprising. Was there a sufficiently long warmup period, and did the
warmup exercise the code in precisely the same way it would execute
during the benchmark interval?
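For reference, the command-buffer scheme being benchmarked might look roughly like the sketch below: accumulate commands on the Java side and cross the locked boundary once per batch rather than once per call. This is a pure-Java sketch; the "native" interpreter is stubbed out as an ordinary method, and the opcodes, record layout, and buffer handling are assumptions, not the original code (only the 32k size and the ReentrantLock come from the thread).

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the command-buffer idea: batch commands Java-side, then make one
// locked "boundary crossing" to interpret the whole batch, instead of taking
// the lock and crossing JNI on every call.
public class CommandBuffer {
    private static final int CAPACITY = 32 * 1024;   // 32k buffer, as in the thread
    private final ByteBuffer buf = ByteBuffer.allocateDirect(CAPACITY);
    private final ReentrantLock lock = new ReentrantLock();
    private long executed;                            // commands seen by the "native" side

    // Enqueue one command record (1 opcode byte + 4 argument bytes);
    // flush when the buffer is full.
    public void submit(byte opcode, int arg) {
        if (buf.remaining() < 5) {
            flush();
        }
        buf.put(opcode).putInt(arg);
    }

    // One locked crossing interprets the whole batch (the "switch statement"
    // interpreter from the thread, stubbed here in plain Java).
    public void flush() {
        lock.lock();
        try {
            buf.flip();
            while (buf.hasRemaining()) {
                byte op = buf.get();
                int arg = buf.getInt();
                switch (op) {
                    case 1:                           // hypothetical "draw" command
                        executed++;
                        break;
                    default:
                        throw new IllegalStateException("bad op " + op);
                }
            }
            buf.clear();
        } finally {
            lock.unlock();
        }
    }

    public long executedCount() { return executed; }

    public static void main(String[] args) {
        CommandBuffer cb = new CommandBuffer();
        for (int i = 0; i < 10_000; i++) {
            cb.submit((byte) 1, i);
        }
        cb.flush();                                   // drain the partial tail
        System.out.println(cb.executedCount());       // prints 10000
    }
}
```

With 5-byte records, the 32k buffer amortizes the lock and boundary crossing over roughly 6,500 commands per flush, which is in the same spirit as the "locked JNI call every 1600 calls" figure quoted above.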