Performance regression in

Clemens Eisserer linuxhippy at
Mon Jan 7 00:11:24 UTC 2008

Hello again,

I implemented two prototypes of the striding to see how they perform
and how complex the code would be. Both prototypes implement the
striding on the Java side (one JNI call per stride), which I plan to
change in order to minimize overhead and hide the striding (unless Sun
would prefer to have it in Java).

The first prototype uses two direct ByteBuffers and copies the data
to/from the input/output arrays through them; this way the whole
input/output data is copied only once.
The second prototype uses striding (1kb chunks) inside the critical
section. I also did some measurements to see how long the critical
section is held in the worst case.
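
A minimal sketch of the first (copying) idea, not the prototype code
itself. It assumes Java 11 or later, where Deflater accepts ByteBuffers
directly via setInput(ByteBuffer) and deflate(ByteBuffer); that API did
not exist when the prototypes were written. The class name, the helper
names and the 2k/1k sizes are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compress a byte[] by copying it in strides
// through two fixed-size direct buffers (assumes Java 11+).
public class DirectBufferStrideSketch {
    static final int IN_STRIDE = 2048;   // input stride buffer (assumed size)
    static final int OUT_STRIDE = 1024;  // output stride buffer (assumed size)

    static byte[] compress(byte[] src, int level) {
        Deflater deflater = new Deflater(level);
        ByteBuffer in = ByteBuffer.allocateDirect(IN_STRIDE);
        ByteBuffer out = ByteBuffer.allocateDirect(OUT_STRIDE);
        byte[] tmp = new byte[OUT_STRIDE];
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try {
            for (int pos = 0; pos < src.length; ) {
                int n = Math.min(IN_STRIDE, src.length - pos);
                in.clear();
                in.put(src, pos, n);          // copy one input stride
                in.flip();
                pos += n;
                deflater.setInput(in);
                while (!deflater.needsInput()) {
                    drain(deflater, out, tmp, result);
                }
            }
            deflater.finish();
            while (!deflater.finished()) {
                drain(deflater, out, tmp, result);
            }
        } finally {
            deflater.end();
        }
        return result.toByteArray();
    }

    // Runs one deflate step into the direct output buffer and copies
    // whatever was produced into the result stream.
    static void drain(Deflater deflater, ByteBuffer out, byte[] tmp,
                      ByteArrayOutputStream result) {
        out.clear();
        int produced = deflater.deflate(out);
        out.flip();
        out.get(tmp, 0, produced);
        result.write(tmp, 0, produced);
    }

    // Helper used only to check the round trip.
    static byte[] decompress(byte[] packed, int originalLength) {
        Inflater inflater = new Inflater();
        byte[] restored = new byte[originalLength];
        int off = 0;
        try {
            inflater.setInput(packed);
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(restored, off, originalLength - off);
                if (n == 0) {
                    break;                    // no progress: stop instead of spinning
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return Arrays.copyOf(restored, off);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i % 251);
        }
        byte[] packed = compress(data, Deflater.BEST_COMPRESSION);
        System.out.println(data.length + " -> " + packed.length + " bytes, round trip ok: "
                + Arrays.equals(data, decompress(packed, data.length)));
    }
}
```

The stride buffers are allocated once per compress() call here; a real
implementation would probably keep them for the lifetime of the
Deflater.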

Buffers / 2k/1k stride size (input-buffer: 2k, output-buffer: 1k):
1.) Compress 50mb with level=0 / 100-byte output-array: 603ms
2.) Compress 50mb with level=1 / 100-byte output-array: 277ms
3.) Compress 50mb with level=9 / 1kb output-array:      784ms

Critical / 1k stride size (no copying):
1.) Compress 50mb with level=0 / 100-byte output-array: 720ms
2.) Compress 50mb with level=1 / 100-byte output-array: 270ms
3.) Compress 50mb with level=9 / 1kb output-array:      778ms

The first two measurements are worst-case scenarios which measure the
overhead of striding when the output-buffer is way too small. In the
level=0 case the copying approach is even faster (maybe
GetPrimitiveArrayCritical has more overhead than
GetDirectBufferAddress).
Measurement 3.) shows a real-world example with high compression, where
the copying overhead should not be high; it does show up, but only as
a few percent. I did many more measurements (I don't remember exactly
what I measured, it was some time ago) and my conclusion was that,
especially for somewhat larger buffers (e.g. 8k/4k), the copying
overhead is really low. oprofile also showed only ~2-5% in memcpy.
Because the non-copying critical-section approach has to use small
strides, the two are almost equally fast; in real-world use-cases the
non-copying approach was a few ms faster.
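
The worst-case setup (an output array that is far too small) can be
reproduced with a small harness like the following. This is only a
sketch using the public Deflater API, not the benchmark code behind the
numbers above; the class name, the input data and the returned triple
are assumptions made for illustration.

```java
import java.util.zip.Deflater;

// Hypothetical harness: compress one input with a tiny output array and
// report how many deflate() calls that takes.
public class SmallOutputArraySketch {

    /** Returns {number of deflate() calls, total compressed bytes, elapsed ms}. */
    static long[] run(byte[] input, int level, int outSize) {
        Deflater deflater = new Deflater(level);
        byte[] out = new byte[outSize];
        long calls = 0;
        long bytesOut = 0;
        long start = System.nanoTime();
        try {
            deflater.setInput(input);
            deflater.finish();
            while (!deflater.finished()) {
                bytesOut += deflater.deflate(out);   // at most outSize bytes per call
                calls++;
            }
        } finally {
            deflater.end();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] {calls, bytesOut, elapsedMs};
    }

    public static void main(String[] args) {
        // 50 MB of a simple repeating pattern; the input used for the
        // measurements in this mail is not known, so this is an assumption.
        byte[] input = new byte[50 * 1024 * 1024];
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) (i % 251);
        }
        for (int level : new int[] {0, 1, 9}) {
            int outSize = (level == 9) ? 1024 : 100;
            long[] r = run(input, level, outSize);
            System.out.println("level=" + level + " outSize=" + outSize
                    + " calls=" + r[0] + " bytes=" + r[1] + " ms=" + r[2]);
        }
    }
}
```

With level=0 the output is slightly larger than the input, so a
100-byte output array forces at least one call per 100 bytes of data,
which is why per-call overhead dominates in that case.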

However, there is one thing I don't like about the copying solution:
it's quite complex, whereas the critical-section approach is quite clean.

I did some benchmarks of how long the critical section is held with
compression level 9 and incompressible data (I assumed this is the
worst case) and 1kb strides, in µs:
3470 (worst case over all runs)

So on my Core2Duo (2GHz) I see worst cases of about 3ms including
JNI overhead with 1kb strides. Making the strides smaller won't help, as
zlib waits until it has enough data to compress (that's why there are
2µs calls, which I assume are only used to move data inside of zlib's
compression buffer).
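
The buffering effect can be observed from plain Java as well. The
sketch below feeds pseudo-random (practically incompressible) data in
1kb strides and records how many deflate() calls return no output and
how long the slowest call takes. It times the whole deflate() call, not
just a critical section, and the class name and the returned triple are
assumptions for illustration.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical sketch: per-call timing of Deflater with 1kb input strides.
public class StrideTimingSketch {

    /** Returns {deflate() calls, calls that produced no output, longest call in µs}. */
    static long[] measure(int strides, int level) {
        byte[] stride = new byte[1024];
        byte[] out = new byte[1024];
        Random random = new Random(42);          // fixed seed for repeatable input
        Deflater deflater = new Deflater(level);
        long calls = 0;
        long emptyCalls = 0;
        long worstNanos = 0;
        try {
            for (int i = 0; i < strides; i++) {
                random.nextBytes(stride);        // incompressible stride
                deflater.setInput(stride);
                while (!deflater.needsInput()) {
                    long t0 = System.nanoTime();
                    int produced = deflater.deflate(out);
                    long dt = System.nanoTime() - t0;
                    calls++;
                    if (produced == 0) {
                        emptyCalls++;            // input was only buffered by zlib
                    }
                    worstNanos = Math.max(worstNanos, dt);
                }
            }
        } finally {
            deflater.end();
        }
        return new long[] {calls, emptyCalls, worstNanos / 1000};
    }

    public static void main(String[] args) {
        long[] r = measure(50 * 1024, Deflater.BEST_COMPRESSION);
        System.out.println("calls=" + r[0] + " emptyCalls=" + r[1]
                + " worstCallMicros=" + r[2]);
    }
}
```

Most calls should return without producing output, and only the calls
in which zlib actually emits a block should take a noticeable time.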

On the hotspot-runtime list I started a thread about "how evil"
GetPrimitiveArrayCritical is; they said it only blocks the GC. I
don't know whether 3ms are problematic. However, keeping in mind that
Deflater is quite slow anyway, the copying overhead is not relevant, I
think.

So to sum it up, for Deflater I would recommend either the
non-copying/critical solution or a copying solution, both of which
work in strides. The copying solution would allocate the stride-buffers
in deflater_init() and free them in deflater_end(), doing the looping
and copying on the native side.

However, for Inflater, which is a lot faster (and has more predictable
pause-times), I would not recommend a copying approach. The remaining
question seems to be how long tolerable pauses are. Any ideas?

I would be interested in some ideas and feedback. What do you think
would be a good solution?

Thank you in advance, best regards, Clemens

PS: The striding+GetPrimitive... is even used by NIO for copying
between java-arrays and direct-ByteBuffers:
    while (length > 0) {
        size = (length > MBYTE ? MBYTE : length);
        GETCRITICAL(bytes, env, dst);
        memcpy(bytes + dstPos, (void *)srcAddr, size);
        RELEASECRITICAL(bytes, env, dst, 0);
        length -= size;
        srcAddr += size;
        dstPos += size;
    }
